
Inference API Pricing: Cost Per Token Across 10 Providers

OpenAI and Anthropic charge 5-20x more per token than open-source alternatives on platforms like Together AI, Groq, or Fireworks. For most managed-models businesses, the right answer is a tiered strategy: GPT-4o or Claude for tasks where output quality directly drives revenue, and a fast open-source model at $0.50-$0.90/1M tokens for classification, extraction, and high-volume preprocessing. The cost difference is not marginal — it determines whether your margins are positive.


If you are running a managed-models business — building products on top of inference APIs rather than training your own models — pricing is not an afterthought. It is the variable that most directly determines whether your margins hold up at scale. A $0.002/request cost that looks fine at 10,000 monthly calls becomes a $2,000/month line item at 1 million calls, and that math catches a lot of teams off guard.

This article covers the actual per-token costs across 10 major inference providers as of January 2026, breaks down what is included at each price point, flags the hidden costs that don’t show up in the rate card, and gives you a framework for deciding which models belong in which tiers of your stack.

All prices here come from public pricing pages. This market moves fast — verify before you budget.


How inference pricing works

Most providers bill on a per-token basis, split into input tokens (the prompt you send) and output tokens (the completion you receive). Output tokens cost more than input tokens at every major provider because generating text requires more compute than processing it.

The key numbers to track:

  • Price per 1 million input tokens — what you pay to send prompts
  • Price per 1 million output tokens — what you pay for generated text
  • Context window — maximum tokens per request; longer contexts cost more at some providers
  • Batch pricing — some providers offer 50% discounts for asynchronous batch jobs

A 1,000-token prompt that generates a 500-token response at $2.50/$10.00 per 1M tokens (the GPT-4o rate) costs roughly $0.0075 per request. At $0.54/$0.54 per 1M tokens (Together AI’s Llama 3.3 70B rate), the same request costs about $0.00081. That is a 9x cost difference on an apples-to-apples request.
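The per-request arithmetic above is easy to script. A minimal helper (rates taken from the table below; the function name is illustrative):

```python
def request_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost of one request in dollars, with rates in dollars per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# GPT-4o at $2.50 input / $10.00 output per 1M tokens
gpt4o = request_cost(1_000, 500, 2.50, 10.00)   # $0.0075
# Together AI Llama 3.3 70B at $0.54 / $0.54
llama = request_cost(1_000, 500, 0.54, 0.54)    # $0.00081
print(f"GPT-4o ${gpt4o:.5f}  Llama ${llama:.5f}  ratio {gpt4o / llama:.1f}x")
```

Running your own traffic shape through a helper like this is usually more informative than comparing headline rates.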


Pricing across 10 providers

These are the standard on-demand rates for flagship and mid-tier models at each provider, verified January 2026.

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- | --- |
| OpenAI | GPT-4o | $2.50 | $10.00 |
| OpenAI | GPT-4o mini | $0.15 | $0.60 |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 |
| Google | Gemini 1.5 Pro (≤128k) | $1.25 | $5.00 |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 |
| AWS Bedrock | Claude 3.5 Sonnet | $3.00 | $15.00 |
| Azure OpenAI | GPT-4o | $2.50 | $10.00 |
| Together AI | Llama 3.3 70B | $0.54 | $0.54 |
| Fireworks AI | Llama 3.1 70B | $0.90 | $0.90 |
| Groq | Llama 3.3 70B | $0.59 | $0.79 |
| DeepSeek | DeepSeek R1 | $0.55 | $2.19 |
| Mistral AI | Mistral Large | $2.00 | $6.00 |

What you get at each tier

Frontier models: OpenAI, Anthropic, Google Pro

At $2.50-$15.00 per 1M output tokens, you are paying for models that consistently perform well on tasks requiring multi-step reasoning, nuanced instruction-following, long-context synthesis, and structured output generation. GPT-4o handles function calling and JSON mode reliably. Claude 3.5 Sonnet is the current leader on coding benchmarks and produces better-formatted long-form output than most alternatives. Gemini 1.5 Pro supports a 1 million token context window, which is genuinely useful for large document analysis.

None of these are cheap for high-volume use. The ceiling on Claude 3.5 Sonnet output cost ($15.00/1M tokens) means that a product generating roughly 600 output tokens per user per day (a few short responses), running at 100,000 daily users, costs around $900/day in output tokens alone. That is $27,000/month before you count input tokens, infrastructure, or any other operating cost.

Mid-tier: Claude Haiku, GPT-4o mini, Gemini Flash

The $0.075-$4.00 per 1M token range gives you significantly faster and cheaper models that still perform well on most production tasks. Claude 3.5 Haiku in particular punches above its price: it handles structured extraction, classification, and summarization tasks with accuracy close to Sonnet at roughly a quarter of the cost ($0.80/$4.00 versus $3.00/$15.00). GPT-4o mini is faster than GPT-4o and costs 94% less per output token ($0.60 versus $10.00). Gemini 1.5 Flash is the cheapest option at $0.075 input / $0.30 output and is strong on factual retrieval tasks.

For most managed-models products, this tier should handle the majority of production traffic. Reserve the frontier models for tasks where the quality delta is measurable and revenue-relevant.

Open-source inference: Together AI, Fireworks, Groq

At $0.54-$0.90 per 1M tokens for both input and output, these platforms run the latest open-source models — primarily Llama 3.3/3.1 70B, Mistral 7B/8x7B, and DeepSeek variants — without the overhead of managing your own GPU infrastructure.

The performance difference versus frontier models on knowledge-intensive or reasoning-heavy tasks is real. Llama 3.3 70B is genuinely capable on classification, extraction, and instruction-following, but it will not match Claude 3.5 Sonnet on complex multi-step reasoning or nuanced writing tasks. For the right use cases, the 5-10x cost savings justify the capability trade-off.

Groq stands out for raw speed: it routinely delivers 200-500 tokens/second on Llama models due to its LPU hardware architecture. If your product is latency-sensitive and the task is well-defined, Groq can reduce time-to-first-token from 300-800ms to under 100ms.

DeepSeek R1 deserves a separate mention. At $0.55/$2.19 per 1M tokens with a chain-of-thought reasoning architecture, it is the most cost-effective option for tasks that require explicit reasoning steps — significantly cheaper than OpenAI’s o1 series while achieving competitive performance on coding and math benchmarks.


Hidden costs to watch for

The per-token rate is only part of the bill. These costs show up in invoices but not rate cards:

Context window length. Several providers charge more for requests over a certain context length. Google Gemini 1.5 Pro doubles its rate to $2.50/$10.00 per 1M tokens for requests over 128k tokens. If you are processing long documents and not tracking context length per request, your actual costs may be 2-4x your estimate.
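Tiered context pricing is easy to encode and worth modeling before you ship long-document workloads. A sketch for Gemini 1.5 Pro, assuming (per the paragraph above) that the doubled rate applies to the whole request once the prompt crosses 128k tokens:

```python
def gemini_pro_cost(input_tokens, output_tokens):
    """Per-request cost in dollars for Gemini 1.5 Pro.
    Assumption: the long-context rate ($2.50/$10.00 per 1M tokens)
    applies to the entire request above 128k input tokens; the
    standard rate ($1.25/$5.00) applies below it."""
    long_context = input_tokens > 128_000
    in_rate, out_rate = (2.50, 10.00) if long_context else (1.25, 5.00)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(gemini_pro_cost(100_000, 1_000))  # standard tier
print(gemini_pro_cost(200_000, 1_000))  # long-context tier, ~4x the cost
```

Note the nonlinearity: doubling the prompt from 100k to 200k tokens roughly quadruples the cost, because both the token count and the rate double.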

Batch vs. real-time pricing. OpenAI charges 50% less for batch API requests (asynchronous, results within 24 hours). Anthropic offers a similar batch discount. For any workload that is not latency-sensitive — nightly data processing, bulk classification, report generation — not using the batch API means paying double. Many teams default to the synchronous API everywhere and miss this.

Egress and data transfer fees. AWS Bedrock adds data transfer charges on top of token costs. If you are running Bedrock in us-east-1 and your application servers are in us-west-2, the transfer overhead adds up. Azure OpenAI has similar regional transfer pricing. Dedicated inference platforms like Together AI and Fireworks do not charge for data transfer.

Reserved capacity and commitments. Some enterprise agreements offer lower per-token rates in exchange for monthly spend commitments. These can cut costs by 20-40% at scale, but they require predictable volume forecasting. Committing to $50,000/month when your usage drops to $20,000/month eliminates any savings.

Rate limit overage handling. When you hit rate limits and retry failed requests, you are paying twice for the same work. At high volume, retry rates of 1-5% from rate limiting or transient errors add real cost. Track your retry rates per provider.
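The retry overhead compounds in a simple way. If failed requests are retried until they succeed and failures are independent with probability p, the expected number of billed attempts per successful request is 1/(1-p). A quick model (an assumption about retry behavior, not any provider's billing logic):

```python
def effective_cost(base_cost, failure_rate):
    """Expected billed cost per *successful* request when failed
    requests are retried until success. With independent failure
    probability p, expected attempts = 1 / (1 - p)."""
    return base_cost / (1 - failure_rate)

# GPT-4o at $0.00375/request with a 5% retry rate
print(effective_cost(0.00375, 0.05))  # ~5.3% overhead per request
```

At a 1-5% retry rate the overhead is small per request but visible at 1M+ requests/month, which is why tracking it per provider matters.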


Cost comparison: 1 million production requests

To make the pricing concrete, here is what 1 million requests look like at a typical managed-models workload: 500 tokens input, 250 tokens output per request.

| Provider + Model | Cost per Request | Cost per 1M Requests |
| --- | --- | --- |
| Anthropic Claude 3.5 Sonnet | $0.00525 | $5,250 |
| OpenAI GPT-4o | $0.00375 | $3,750 |
| Mistral Large | $0.00250 | $2,500 |
| Anthropic Claude 3.5 Haiku | $0.00140 | $1,400 |
| DeepSeek R1 | $0.000823 | $823 |
| Groq Llama 3.3 70B | $0.000493 | $493 |
| Together AI Llama 3.3 70B | $0.000405 | $405 |
| OpenAI GPT-4o mini | $0.000225 | $225 |
| Google Gemini 1.5 Flash | $0.000113 | $112.50 |

The range from Gemini 1.5 Flash ($112.50) to Claude 3.5 Sonnet ($5,250) is roughly 47x. Choosing the right model for each workload type is the most direct lever you have on COGS.
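These per-request numbers fall out of simple arithmetic on the rate table. A quick Python sanity check (rates as listed earlier, in dollars per 1M tokens):

```python
RATES = {  # (input, output) in dollars per 1M tokens
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Mistral Large": (2.00, 6.00),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "DeepSeek R1": (0.55, 2.19),
    "Groq Llama 3.3 70B": (0.59, 0.79),
    "Together Llama 3.3 70B": (0.54, 0.54),
    "GPT-4o mini": (0.15, 0.60),
    "Gemini 1.5 Flash": (0.075, 0.30),
}

def cost_per_million_requests(in_rate, out_rate, in_tok=500, out_tok=250):
    """Dollars per 1M requests at the workload shape used in this article.
    1M requests x (in_tok * in_rate + out_tok * out_rate) / 1M tokens
    simplifies to in_tok * in_rate + out_tok * out_rate."""
    return in_tok * in_rate + out_tok * out_rate

for model, (i, o) in RATES.items():
    print(f"{model:25s} ${cost_per_million_requests(i, o):>9,.2f}")
```

Re-running this with your own input/output token averages is the fastest way to see which rows of the table actually matter for your workload.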

[Chart: cost per 1M requests (500 input / 250 output tokens) across major inference providers, January 2026]

ROI framework for managed-models businesses

Picking a model is a unit economics decision. The question is not which model is cheapest — it is which model generates the most revenue per dollar of inference cost.

Step 1: Categorize your request types

Most managed-models products have 3-5 distinct request patterns with different value profiles:

  • High-value, low-volume: Complex reasoning tasks that directly generate customer value (analysis reports, personalized recommendations, code generation)
  • High-volume, commodity: Classification, extraction, summarization tasks that process data but are not the primary customer-facing output
  • Latency-critical: User-facing chat or autocomplete where response time affects conversion and retention

Each category should have its own model selection.
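Per-category model selection usually ends up as a small routing layer in front of your inference calls. A minimal sketch; the category names and model identifiers below are illustrative placeholders, not a real SDK:

```python
# Hypothetical tier map: one model per request category.
TIER_MODEL = {
    "reasoning": "claude-3-5-sonnet",   # high-value, low-volume
    "extraction": "gpt-4o-mini",        # high-volume commodity
    "chat": "groq/llama-3.3-70b",       # latency-critical
}

def route(request_type: str) -> str:
    """Map a categorized request to the model serving that tier.
    Unknown categories fall back to the cheap mid-tier default."""
    return TIER_MODEL.get(request_type, "gpt-4o-mini")
```

Keeping the map in config (rather than scattered through call sites) is what makes later migrations between tiers a one-line change.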

Step 2: Measure output quality per dollar, not just output quality

Run A/B tests comparing models on your actual production tasks. If GPT-4o and Claude 3.5 Haiku achieve the same accuracy on your document classification pipeline — which is common for well-defined classification tasks — Haiku at $1,400 per 1M requests versus GPT-4o at $3,750 per 1M requests is not a close decision.

The key metric is task pass rate × revenue per task / inference cost. A model with 95% accuracy versus 99% accuracy is only worth the premium if that 4-point gap represents a measurable revenue or churn difference.
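The metric is a one-liner; the work is in measuring pass rate on your own tasks. A sketch with illustrative numbers (the revenue and cost figures below are placeholders, not benchmarks):

```python
def value_per_dollar(pass_rate, revenue_per_task, cost_per_task):
    """Task pass rate x revenue per task / inference cost per task."""
    return pass_rate * revenue_per_task / cost_per_task

# Hypothetical $0.05-revenue classification task:
cheap = value_per_dollar(0.95, 0.05, 0.00140)      # Haiku-class model
frontier = value_per_dollar(0.99, 0.05, 0.00375)   # GPT-4o-class model
print(f"cheap: {cheap:.1f}  frontier: {frontier:.1f}")
```

In this example the cheaper model returns more than twice the value per inference dollar despite the lower pass rate, which is the common outcome for well-defined tasks.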

Step 3: Build a tiered model strategy

A practical starting point for a managed-models product at $10,000-$50,000/month in inference spend:

| Request Type | Recommended Tier | Example Models |
| --- | --- | --- |
| Complex reasoning, writing, analysis | Frontier | Claude 3.5 Sonnet, GPT-4o |
| Structured extraction, classification | Mid-tier | Claude Haiku, GPT-4o mini, Gemini Flash |
| High-volume preprocessing | Open-source inference | Together AI Llama, Groq |
| Latency-critical user-facing | Fast open-source | Groq Llama 3.3 70B |
| Batch document processing | Batch API + mid-tier | GPT-4o mini batch, Claude Haiku batch |

Step 4: Track cost per unit of business value

Pick a single metric that ties inference cost to revenue. Good options:

  • Cost per successful completion (relevant outputs, not just API calls)
  • Inference COGS as % of contract value (should stay under 15-20% for healthy margins)
  • Cost per active user per month (useful for subscription pricing)

Set a target and alert threshold. At $0.10 COGS per successful completion and a $30/month subscription price, a subscriber can consume 300 completions before inference cost alone eats the full subscription. If COGS climbs to $1.00/completion, that headroom drops to 30 and your margins compress fast. Monitor this weekly during growth phases.
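A simple version of that weekly check, using the COGS-as-share-of-revenue framing from the list above (the function name and usage figures are illustrative):

```python
def cogs_share(price_per_month, cogs_per_completion, completions_per_user):
    """Inference COGS as a fraction of subscription revenue per user."""
    monthly_cogs = cogs_per_completion * completions_per_user
    return monthly_cogs / price_per_month

# $30/month plan, $0.10 per completion, 50 completions/user/month
share = cogs_share(30.00, 0.10, 50)
assert share < 0.20, "inference COGS above the 20% margin threshold"
print(f"inference COGS is {share:.0%} of revenue")
```

Wiring the same assertion into a scheduled job against real usage data turns the 15-20% guideline into an alert instead of a quarterly surprise.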


Provider-specific notes

AWS Bedrock charges Anthropic rates plus AWS infrastructure overhead. The main reason to use Bedrock over the direct Anthropic API is compliance (SOC 2, HIPAA eligibility, data residency) and AWS ecosystem integration. If compliance is not a requirement, the direct API is simpler and often cheaper.

Azure OpenAI matches OpenAI’s public rates but offers enterprise SLAs, virtual network support, and regional data residency. Similar trade-off to Bedrock: worth the overhead if you already operate in Azure or need compliance controls.

Together AI and Fireworks are functionally similar for most open-source text workloads. Together AI has a broader model catalog; Fireworks has a reputation for better fine-tuned model hosting. Both are suitable for production use.

Groq is the right choice when you need output fast and the task is well-defined. The throughput advantage over all other providers is substantial, but Groq’s model selection is narrower and the platform has less fine-tuning support.

DeepSeek is the most cost-effective option for reasoning-intensive tasks. The API is straightforward and the R1 model produces explicit reasoning traces that are useful for debugging and evaluation. Latency is higher than Groq but acceptable for non-interactive workloads.


Practical next steps

If you are currently using a single frontier model for all requests, the highest-impact change you can make is routing high-volume, well-defined tasks to a mid-tier or open-source model. Start by identifying your top 3 request types by volume, measure accuracy on each using a smaller model, and migrate any workload where the quality difference is below your threshold.

The pricing gap between open-source and frontier inference is large enough that even a partial routing change typically pays for the engineering effort within the first month of traffic.
