GPU Cloud Pricing: A100 vs H100 vs L40S Across Providers
H100 SXM5 instances on hyperscalers run $4-12/hr per GPU — often 3-5x what you pay on Lambda Labs or CoreWeave for the same chip. L40S is the underrated option: at $0.79-1.47/hr it handles most inference workloads at a fraction of H100 cost, with one meaningful catch (no NVLink peer bandwidth for multi-GPU tensor parallelism). For managed-models businesses, the right answer is usually A100 80GB for production inference, H100 only when throughput math demands it, and L40S for cost-optimized single-GPU inference jobs.
Picking a GPU cloud provider for model inference is not primarily a hardware decision — it is a unit economics decision. The same NVIDIA A100 80GB chip costs $1.54/hr on RunPod and $4.10/hr on AWS EC2 at on-demand rates. That 2.7x gap compounds quickly: at 10,000 GPU-hours per month, you are choosing between $15,400 and $41,000 for identical hardware. The decision deserves more than a quick Google search.
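The compounding effect is simple arithmetic, sketched below with the rates quoted above (the 10,000 GPU-hours/month figure is this article's running example, not a benchmark):

```python
# Same chip (A100 80GB), two providers, using the on-demand rates cited above.
RUNPOD_RATE = 1.54    # $/GPU-hour
AWS_RATE = 4.10       # $/GPU-hour
HOURS_PER_MONTH = 10_000

runpod_monthly = RUNPOD_RATE * HOURS_PER_MONTH   # ~$15,400
aws_monthly = AWS_RATE * HOURS_PER_MONTH         # ~$41,000
annual_gap = (aws_monthly - runpod_monthly) * 12

print(f"monthly gap: ${aws_monthly - runpod_monthly:,.0f}")
print(f"annualized:  ${annual_gap:,.0f} for identical hardware")
```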
This article covers on-demand pricing for A100, H100, and L40S instances across six providers as of February 2026, what you actually get at each price point, the hidden costs that do not show up on pricing pages, and a framework for deciding which GPU belongs in your inference stack.
The three GPUs, briefly
Before the pricing tables, a quick grounding on what separates these chips:
A100 80GB is NVIDIA’s Ampere-architecture data center GPU, available in PCIe and SXM4 variants. The 80GB of HBM2e memory fits 7B-34B parameter models in FP16 on a single card, or 70B-class models quantized; a 70B model in FP16 (~140GB of weights) needs at least two cards. It remains the most widely available data center GPU and the most liquid market for on-demand instances. NVLink enables fast multi-GPU tensor parallelism for larger model deployments.
H100 80GB is NVIDIA’s Hopper-architecture successor. The SXM5 version connects via NVLink 4.0 and delivers roughly 3x the FP16 throughput of an A100 at peak. In practice, for autoregressive inference (as opposed to training), the real-world speedup is typically 1.5-2x over A100 because inference is memory-bandwidth-bound, not compute-bound. The PCIe variant is slower than SXM5 and pricing reflects that. The H100 is worth the premium when you need to serve large context windows fast or run 70B+ models at high concurrency.
L40S 48GB uses NVIDIA’s Ada Lovelace architecture. The 48GB GDDR6 memory (not HBM) limits memory bandwidth compared to A100/H100, which matters for large models but less for smaller ones. The L40S is optimized for inference workloads: it has better FP8 throughput than the A100, solid INT8 performance, and a price point that is significantly lower than H100. The main limitation for managed-models use: it does not support NVLink, so multi-GPU tensor parallelism is off the table. That rules it out for running 70B+ models efficiently across multiple GPUs.
On-demand pricing by provider
These are verified on-demand rates for a single GPU. Multi-GPU instances are priced per GPU and listed separately where the per-GPU rate differs.
A100 80GB pricing
| Provider | Instance Type | Price/hr |
|---|---|---|
| RunPod | Secure Cloud | $1.54 |
| Lambda Labs | On-demand | $1.99 |
| Vast.ai | Marketplace average | $1.10–$2.00 |
| CoreWeave | On-demand | $2.21 |
| Google Cloud (a2-highgpu-1g) | On-demand | $3.67 |
| AWS EC2 (p4d.24xlarge, 8x A100 40GB) | Per GPU, on-demand | $4.10 |
Note: AWS’s P4 instances use the A100 40GB variant. For A100 80GB on AWS, you need P4de instances at approximately $5.25/GPU/hr on-demand.
H100 80GB pricing
| Provider | Instance Type | Price/hr |
|---|---|---|
| Vast.ai | Marketplace average | $1.60–$2.50 |
| RunPod | Secure Cloud | $2.49 |
| Lambda Labs | On-demand (PCIe) | $2.49 |
| CoreWeave | PCIe on-demand | $2.81 |
| CoreWeave | SXM5 on-demand | $3.50 |
| Azure (ND H100 v5) | Per GPU, on-demand | $3.60 |
| Google Cloud (a3-highgpu-8g) | Per GPU, on-demand | $4.15 |
| AWS EC2 (p5.48xlarge, 8x H100 SXM) | Per GPU, on-demand | $12.29 |
AWS’s P5 instances are priced well above market rate for H100 compute. The $12.29/GPU on-demand figure is accurate but reflects hyperscaler overhead, compliance packaging, and the fact that AWS knows enterprise buyers will pay it. For pure compute, Lambda or CoreWeave H100 instances are a more rational option.
L40S 48GB pricing
| Provider | Instance Type | Price/hr |
|---|---|---|
| Vast.ai | Marketplace average | $0.50–$1.20 |
| RunPod | Secure Cloud | $0.79 |
| Lambda Labs | On-demand | $1.09 |
| CoreWeave | On-demand | $1.47 |
The L40S is not available on AWS or Google Cloud as a discrete on-demand SKU at this time. GCP’s closest Ada-generation offering is the smaller L4 on G2 instances.
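The three tables above can be collapsed into a quick ranking. The sketch below uses the rates as listed (marketplace entries use the low end of the quoted range), which will drift from current pricing:

```python
# On-demand $/GPU-hour from the tables above (Vast.ai = low end of range).
ON_DEMAND = {
    "A100-80GB": {"Vast.ai": 1.10, "RunPod": 1.54, "Lambda Labs": 1.99,
                  "CoreWeave": 2.21, "Google Cloud": 3.67, "AWS": 4.10},
    "H100-80GB": {"Vast.ai": 1.60, "RunPod": 2.49, "Lambda Labs": 2.49,
                  "CoreWeave (SXM5)": 3.50, "Azure": 3.60,
                  "Google Cloud": 4.15, "AWS": 12.29},
    "L40S-48GB": {"Vast.ai": 0.50, "RunPod": 0.79, "Lambda Labs": 1.09,
                  "CoreWeave": 1.47},
}

def rank_providers(gpu: str) -> list[tuple[str, float]]:
    """Providers for a GPU sorted cheapest first."""
    return sorted(ON_DEMAND[gpu].items(), key=lambda kv: kv[1])

for gpu in ON_DEMAND:
    (low_name, low), (high_name, high) = rank_providers(gpu)[0], rank_providers(gpu)[-1]
    print(f"{gpu}: {low_name} ${low:.2f}/hr vs {high_name} ${high:.2f}/hr ({high/low:.1f}x)")
```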
What you get at each price tier
Hyperscalers: AWS, Azure, Google Cloud ($3.60-$12.29/GPU/hr)
You are paying for compliance packaging, global availability, and deep ecosystem integration. SOC 2, HIPAA eligibility, VPC isolation, IAM, and native connections to managed storage (S3, GCS, Azure Blob) are included. If your managed-models business serves healthcare or finance customers with data residency requirements, this overhead has real value. If it does not, you are paying for compliance controls you do not need.
Hyperscaler GPU inventory is also more reliable at scale. If you need 64 H100s for a training run, booking that capacity on Lambda or CoreWeave may require waiting. On AWS P5 or GCP A3, you can (usually) secure that capacity within minutes via capacity reservations.
What you do not get at hyperscaler prices: better hardware. The NVIDIA H100 in an AWS P5 instance is the same chip as the one in a CoreWeave SXM5 instance at $3.50/hr.
Specialized GPU clouds: Lambda Labs, CoreWeave ($1.99-$3.50/GPU/hr)
Lambda Labs and CoreWeave are the two most established specialized GPU clouds for production workloads. Both offer:
- On-demand and reserved pricing tiers
- SSH access to bare-metal or VM instances
- Persistent storage options (though priced separately)
- Enterprise SLAs on reserved capacity
CoreWeave has a larger infrastructure footprint and more instance types, including high-memory variants and NVLink clusters for multi-GPU jobs. Lambda Labs is simpler operationally and slightly cheaper on A100 on-demand. For a managed-models team that does not have dedicated ML infrastructure engineering, Lambda’s straightforwardness is an advantage.
Neither provides the compliance tooling of AWS or Azure. Both have had occasional availability constraints on popular GPU SKUs.
Spot/marketplace: RunPod, Vast.ai ($0.50-$2.49/GPU/hr)
RunPod Secure Cloud and Vast.ai are the cheapest reliable options for GPU compute. The trade-offs are real:
RunPod Secure Cloud runs on vetted data-center hardware in third-party facilities; the cheaper Community Cloud tier runs on consumer-grade hardware from individual hosts. The pricing is significantly below Lambda or CoreWeave, and RunPod handles containerization and persistent volumes. The main limitation is uptime: RunPod does not offer an SLA comparable to Lambda or CoreWeave, and Community Cloud instances can be preempted.
Vast.ai is a true marketplace: individual data center operators list GPU capacity, and you bid or pay asking price. The floor prices are the lowest you will find outside of negotiated enterprise deals. The ceiling risk is reliability variance — some Vast.ai instances are rock-solid; others are not. For workloads that can tolerate interruptions (batch jobs, fine-tuning runs), Vast.ai often beats every other option on cost. For low-latency production serving, the reliability variance is a real concern.
Hidden costs to watch for
Storage I/O. Most GPU clouds charge separately for persistent storage. CoreWeave charges $0.10-$0.17/GB/month for block storage. Lambda Labs charges $0.20/GB/month for persistent storage. Storing large model weights (70B FP16 = ~140GB) adds $14-28/month in storage costs alone, before you count the time spent loading weights into VRAM at instance start.
Egress fees. Moving data out of CoreWeave or AWS to your application layer costs money. AWS charges $0.09/GB for outbound transfer from EC2. CoreWeave’s egress pricing is lower but not zero. For high-throughput inference services that stream responses, egress adds up.
Idle instance time. On-demand GPU instances billed by the hour accumulate cost when idle. A Lambda A100 instance left running for a full day “just in case” costs $47.76 in unproductive GPU-hours. Autoscaling to zero is operationally important for cost control, but startup latency (model loading can take 2-5 minutes for large models) means you need to balance cold start cost against idle cost.
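That trade-off can be put in numbers. A rough sketch, where the cold-start duration and restart count are assumptions you should replace with your own measurements:

```python
# Keep-warm vs. scale-to-zero, back of the envelope.
# RATE is the Lambda A100 on-demand rate cited above; COLD_START_MIN is assumed.
RATE = 1.99            # $/GPU-hour
COLD_START_MIN = 4.0   # minutes to boot + load weights (assumed, measure yours)

def idle_cost(idle_hours: float) -> float:
    """Cost of keeping the instance running while it serves nothing."""
    return RATE * idle_hours

def cold_start_cost(restarts: int) -> float:
    """GPU time burned on restarts (user-facing latency not priced in here)."""
    return restarts * (COLD_START_MIN / 60) * RATE

# Eight idle hours kept warm vs. scaling to zero and eating three cold starts:
print(f"kept warm:  ${idle_cost(8.0):.2f}")
print(f"cold starts: ${cold_start_cost(3):.2f}")
```

On pure compute cost, scale-to-zero wins easily; the real argument for keeping instances warm is the 2-5 minute latency hit on the first request, not the dollars.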
Network within clusters. If you are running multi-GPU inference across NVLink-connected GPUs, the topology matters. CoreWeave’s NVLink cluster instances have fast peer bandwidth. Running what should be a 4-GPU tensor-parallel job on 4 separate single-GPU instances with network-attached communication is dramatically slower and will not match the throughput you expect from benchmarks.
Reserved capacity pricing. One-year reserved contracts on CoreWeave typically run 30-40% below on-demand. Lambda Labs offers 1-year and 3-year reserved pricing at 20-35% discounts. If you have predictable baseline GPU demand, not using reserved pricing is leaving money on the table. The risk is capacity you cannot return if demand drops.
Cost comparison: 10,000 GPU-hours per month
A managed-models team running continuous inference workloads might use 10,000 GPU-hours per month. Here is what that looks like across GPU types and providers:
| GPU + Provider | On-demand Cost | 1-Year Reserved (est.) |
|---|---|---|
| L40S — RunPod | $7,900 | ~$5,200 |
| L40S — Lambda Labs | $10,900 | ~$7,600 |
| A100 80GB — RunPod | $15,400 | ~$10,500 |
| A100 80GB — Lambda Labs | $19,900 | ~$14,000 |
| A100 80GB — CoreWeave | $22,100 | ~$14,400 |
| H100 PCIe — Lambda Labs | $24,900 | ~$16,900 |
| H100 SXM5 — CoreWeave | $35,000 | ~$22,000 |
| A100 80GB — Google Cloud | $36,700 | ~$22,000 |
| H100 SXM5 — Google Cloud | $41,500 | ~$26,000 |
The spread between RunPod L40S and Google Cloud H100 at the same 10,000 GPU-hours is $33,600/month — $403,200/year. For most managed-models businesses, that gap funds multiple engineering hires.
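The table is reproducible from the on-demand rates alone. A sketch, assuming a flat ~30% reserved discount (the article's per-provider estimates vary around that figure):

```python
# Monthly cost = $/GPU-hour x 10,000 GPU-hours; reserved column assumes ~30% off.
HOURS = 10_000
RESERVED_DISCOUNT = 0.30  # assumed midpoint; negotiate your own

scenarios = {
    "L40S @ RunPod": 0.79,
    "A100 80GB @ Lambda Labs": 1.99,
    "H100 SXM5 @ CoreWeave": 3.50,
    "H100 SXM5 @ Google Cloud": 4.15,
}

for name, rate in scenarios.items():
    on_demand = rate * HOURS
    reserved_est = on_demand * (1 - RESERVED_DISCOUNT)
    print(f"{name}: ${on_demand:,.0f} on-demand, ~${reserved_est:,.0f} reserved")
```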
ROI framework for managed-models businesses
Choosing a GPU is a unit economics problem. The right framework evaluates three things: required throughput per dollar, latency requirements, and model size constraints.
Step 1: Determine your throughput requirement
Start with your target concurrent users or requests per second. For token-streaming inference, throughput is measured in tokens/second per GPU. Approximate benchmarks for Llama 3.1 70B FP16:
- A100 80GB: ~40-60 tokens/second (single GPU, non-batched)
- H100 80GB SXM5: ~80-120 tokens/second (single GPU, non-batched)
- L40S 48GB: ~25-40 tokens/second (single GPU, quantized weights — a 70B FP16 checkpoint does not fit in 48GB; memory bandwidth is the ceiling)
If you need 200 tokens/second to serve your user base, that is 4-5 A100s, 2-3 H100s, or 6-8 L40S instances. Calculate GPU count × hourly rate to find your monthly floor.
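The GPU-count and monthly-floor arithmetic can be sketched directly. The throughput numbers are midpoints of the ranges above and the 730 hours/month assumes always-on serving; both are assumptions to replace with your own benchmarks:

```python
import math

# Midpoints of the per-GPU tokens/sec ranges quoted above (Llama 3.1 70B class).
TOKENS_PER_SEC = {"A100": 50, "H100-SXM5": 100, "L40S": 32}
RATES = {"A100": 1.99, "H100-SXM5": 3.50, "L40S": 1.09}  # $/hr from the tables
HOURS_PER_MONTH = 730  # always-on

def monthly_floor(target_tps: float) -> dict[str, tuple[int, float]]:
    """GPU count and monthly cost floor to hit an aggregate tokens/sec target."""
    out = {}
    for gpu, tps in TOKENS_PER_SEC.items():
        n = math.ceil(target_tps / tps)
        out[gpu] = (n, n * RATES[gpu] * HOURS_PER_MONTH)
    return out

for gpu, (n, cost) in monthly_floor(200).items():
    print(f"{gpu}: {n} GPUs, ~${cost:,.0f}/month floor")
```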
Step 2: Check model fit
The L40S 48GB can run Llama 3.1 70B in INT4 quantization (roughly 35-40GB of weights). A100 80GB handles it in INT8 (roughly 70GB), and H100 80GB handles it in FP8 or INT8; BF16 or FP16 (~140GB of weights) does not fit on any single 80GB card and requires at least two GPUs with tensor parallelism. If output quality at INT4 meets your quality bar, the L40S may be the right choice. If you need BF16 or FP16 precision, you need multi-GPU A100 or H100.
For models above 70B (Llama 3.1 405B, for instance), you need multi-GPU tensor parallelism, which rules out the L40S. The cost calculation changes significantly: you are now pricing 4-8 GPU instances, and per-GPU price differences matter more.
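A quick fit check: weight footprint is roughly parameters times bytes per parameter, plus headroom for KV cache and activations. The 20% headroom factor below is a rough rule of thumb, not a guarantee:

```python
# Does a model fit in VRAM? weights_gb = params(B) x bytes/param, +20% headroom
# for KV cache and activations (rough assumption; real headroom depends on
# context length, batch size, and the serving engine).
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}
VRAM_GB = {"L40S": 48, "A100-80GB": 80, "H100-80GB": 80}

def fits(params_b: float, dtype: str, gpu: str, headroom: float = 1.2) -> bool:
    weights_gb = params_b * BYTES_PER_PARAM[dtype]
    return weights_gb * headroom <= VRAM_GB[gpu]

print(fits(70, "int4", "L40S"))       # 35 GB * 1.2 = 42 GB -> fits in 48 GB
print(fits(70, "int8", "A100-80GB"))  # 70 GB * 1.2 = 84 GB -> tight, headroom-sensitive
print(fits(70, "bf16", "H100-80GB"))  # 140 GB of weights -> needs >= 2 GPUs
```

The INT8-on-A100 case shows why the headroom factor matters: the weights fit, but whether the full serving stack does depends on context length and batch size.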
Step 3: Price out your options
Take your GPU count from Step 1, multiply by the on-demand rate for your required GPU type, and compare across two or three providers. Factor in storage and egress based on your model size and request volume.
If your monthly on-demand cost exceeds $10,000, model reserved capacity pricing. Most providers will negotiate 1-year contracts directly; the public reserved rates on their pricing pages are starting points.
Step 4: Set a cost-per-successful-output target
The useful metric for ongoing cost control is cost per successful output — not cost per GPU-hour. An output that fails validation, requires a retry, or is rejected by the user is a cost without corresponding revenue. Track:
- GPU cost per request = (GPU-hours used / total requests) × hourly rate
- Inference COGS as % of contract value — target under 15-20% for healthy margins
- Cost per request by model + GPU combination, so you can compare alternatives against actual production tasks
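These three metrics fall out of a handful of monthly counters. A minimal sketch with illustrative field names and numbers (the request counts and contract value are made up for the example):

```python
from dataclasses import dataclass

@dataclass
class MonthlyUsage:
    gpu_hours: float
    hourly_rate: float        # $/GPU-hour
    total_requests: int
    successful_requests: int  # passed validation, no retry, accepted by the user
    contract_value: float     # revenue attributable to this workload, $

    @property
    def gpu_cost(self) -> float:
        return self.gpu_hours * self.hourly_rate

    @property
    def cost_per_success(self) -> float:
        return self.gpu_cost / self.successful_requests

    @property
    def cogs_pct(self) -> float:
        return 100 * self.gpu_cost / self.contract_value

# Illustrative month: A100 at the Lambda rate cited above, 10% failure rate.
m = MonthlyUsage(gpu_hours=2_000, hourly_rate=1.99, total_requests=500_000,
                 successful_requests=450_000, contract_value=40_000)
print(f"${m.cost_per_success:.4f}/successful output, COGS {m.cogs_pct:.1f}% of contract")
```

The failed 50,000 requests still show up in `gpu_hours`, which is exactly why cost per successful output is the metric to watch rather than cost per request.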
Which GPU for which workload
If you are running a single-GPU inference server for a 7B-13B model, L40S at $0.79-$1.47/hr is the most cost-effective option and sufficient for the job. The memory bandwidth ceiling does not bite you at these model sizes.
If you are serving a 70B model with moderate to high concurrency and quality requirements that do not tolerate heavy quantization, A100 80GB gives you the best cost-to-capability ratio. Lambda Labs and CoreWeave on-demand rates are well below hyperscaler pricing for the same chip.
If you need maximum throughput on 70B+ models — serving a high-concurrency product where inference latency directly affects retention — H100 SXM5 is the correct answer, and CoreWeave is the most rational place to buy it unless you have specific compliance requirements that mandate AWS or Azure.
The hyperscaler premium is worth paying in exactly one situation: when compliance, data residency, or existing enterprise agreements make it the only viable option. Otherwise, you are paying 2-3x for the same compute.
Practical next steps
If you are currently on a hyperscaler and have not run a provider comparison in the past six months, the pricing gap has widened enough to justify the analysis. Start with a one-week trial on CoreWeave or Lambda Labs, run your standard workload, compare total cost including storage and egress, and make the decision with real data.
If you are choosing a GPU cloud for the first time, start with Lambda Labs: the operations are straightforward, the A100 pricing is reasonable, and you can migrate to CoreWeave or a reserved contract once you have volume data.