Your Inference Bill Is Going Up. Even as Costs Go Down.

The Number That Should Worry You

AWS raised GPU Capacity Block prices by 15% on a Saturday in January. No blog post. No announcement. Just a quiet update to the pricing page that said prices were “scheduled to be updated” without mentioning which direction.

The p5e.48xlarge jumped from $34.61 to $39.80 per hour. The p5en.48xlarge went from $36.18 to $41.61. If you’re running in US West (N. California), the p5e now costs $49.75 per hour. The next scheduled pricing review is April 2026. Nobody is expecting it to go down.

This happened seven months after AWS announced “up to 45% price reductions” on GPU instances. That announcement covered On-Demand and Savings Plans. The Capacity Blocks that teams actually use for ML workloads went the other way.

If you’re running production inference on AWS, this is the moment to pay attention.

The Paradox

Here’s where it gets confusing. Per-token inference costs are collapsing. Andreessen Horowitz calls it “LLMflation”: for a model of equivalent performance, inference costs are dropping 10x every year. What cost $60 per million tokens in 2021 now costs $0.06. That’s a 1,000x reduction in three years.

So why is AWS raising GPU prices? And why are total inference bills going up?

Because demand is growing faster than costs are falling. Total inference spending grew 320% even as per-token costs dropped 280-fold. Inference workloads now account for 55% of AI infrastructure spending, up from 33% in 2023. Deloitte is calling 2026 the year of “Inference Famine.”

This is the Jevons Paradox playing out in real time. Make something cheaper, people use more of it. Make inference 10x cheaper, teams deploy agents that run 24/7 instead of answering one-off queries. The per-unit cost drops. The total bill climbs.
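
To make the arithmetic concrete, here's a back-of-the-envelope version of that paradox using the figures above. The implied usage growth is just the ratio of the two reported numbers, not an independently sourced statistic.

```python
# Back-of-the-envelope Jevons arithmetic, using the figures quoted above.
# The implied token growth is derived from those two numbers, not measured.

per_token_cost_drop = 280        # per-token costs fell roughly 280-fold
total_spend_growth = 1 + 3.20    # total inference spend grew 320%, i.e. 4.2x

# If spend = tokens_served * cost_per_token, the implied growth in tokens is:
implied_token_growth = total_spend_growth * per_token_cost_drop
print(f"Implied growth in tokens served: ~{implied_token_growth:,.0f}x")
# ~1,176x more tokens served: the unit price collapsed, the meter spun faster
```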

The Supply Side Is Not Helping

The GPU shortage is real and it’s not going away soon.

NVIDIA is cutting gaming GPU production by 30-40% in early 2026 to prioritise data centre chips that generate twelve times more revenue. Even with that reallocation, demand from hyperscalers consumes entire production runs with lead times exceeding 30 weeks.

The memory situation is making things worse. Samsung and SK hynix are planning 20% HBM3E price hikes for 2026 as demand from NVIDIA’s H200 and custom ASICs outstrips supply. When your GPU’s memory costs more, the GPU costs more. Simple as that.

And the hyperscalers are pouring fuel on the fire. The big five are spending over $600 billion on infrastructure in 2026, a 36% increase over 2025. Roughly 75% of that is AI infrastructure. Amazon alone is committing $200 billion. Google is at $175-185 billion. Microsoft at $145 billion. Meta at $115-135 billion. Amazon is now looking at negative free cash flow of $17-28 billion this year, according to Morgan Stanley and Bank of America analysts.

They’re spending that money because they have to, not because they want to. Nobody wins the AI race by being the provider that can’t supply GPU capacity.

What This Means for Your AWS Bill

If you’re running inference workloads on AWS today, here’s what the next twelve months probably look like.

Capacity Block prices will keep rising. The January 15% hike was the first. The next review is April 2026. GPU demand is not declining. HBM memory costs are increasing. There is no economic signal pointing toward cheaper reserved GPU capacity.

On-demand pricing might stay flat or even drop. AWS has form for cutting on-demand instance prices while raising the prices that committed workloads actually pay. The headline number looks good. The bill doesn’t.

Spot availability will get worse. As more teams move production inference workloads to GPU instances, spot capacity gets scarcer. The discount narrows. The interruption rate climbs. If your architecture depends on spot GPUs for inference, that’s a bet that gets riskier every quarter.

The inference-to-training ratio will flip your budget. Most teams budgeted for AI assuming training was the expensive part. It was, in 2024. But inference is the recurring cost, and it scales with every user, every agent, every API call. Training is a one-off. Inference is the meter running.

The Escape Routes

Not all of this is bad news. There are real options for managing inference costs on AWS, but they require actual architectural decisions, not just hoping the price comes down.

Trainium and Inferentia are the obvious play. AWS launched Trainium3 in December 2025: 3nm, 2.52 petaflops of FP8 compute, 144 GB HBM3e. The cost savings are meaningful: 40-60% less than equivalent NVIDIA Blackwell infrastructure. Inferentia delivers a 40-50% cost reduction for compatible inference workloads. The catch: not every model runs on custom silicon without modification. The ecosystem is improving fast, but it's not plug-and-play yet.
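
If you want a feel for what "not plug-and-play" means in practice, here's a minimal sketch of the torch-neuronx tracing workflow on an Inferentia2 or Trainium instance. The model choice and shapes are illustrative, and supported operators and APIs shift between Neuron SDK releases, so treat the current AWS Neuron documentation as the source of truth.

```python
# Minimal sketch: compiling a Hugging Face model for Inferentia2 / Trainium
# via the AWS Neuron SDK's torch-neuronx tracing workflow. Illustrative only;
# supported models, operators and APIs change between Neuron releases.

import torch
import torch_neuronx  # ships with the Neuron SDK on inf2 / trn1 instances
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)
model.eval()

# Neuron compiles for fixed shapes, so trace with a representative input and
# pad or bucket real traffic to match it at serving time.
example = tokenizer("is this cheaper than a GPU?", return_tensors="pt",
                    padding="max_length", max_length=128)
neuron_model = torch_neuronx.trace(
    model, (example["input_ids"], example["attention_mask"])
)

torch.jit.save(neuron_model, "model_neuron.pt")  # reload with torch.jit.load
```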

Smaller models with comparable performance. The LLMflation trend means last year’s frontier model is this year’s commodity. If you’re running GPT-4 class inference, you might get equivalent results from a model that costs 10x less. The trick is actually testing this against your specific use case rather than assuming benchmark scores translate to production quality.
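
The only way to know is to run your own traffic through both. Here's a sketch of that comparison, with a placeholder call_model() and placeholder model names standing in for whatever provider SDK and models you actually use.

```python
# Sketch: run production-style prompts through a cheaper model and the
# frontier model you use today, then compare against your own quality bar.
# call_model(), the model names and the grading check are all placeholders.

import json

def call_model(model_id: str, prompt: str) -> str:
    raise NotImplementedError("wire this up to your provider's SDK")

def acceptable(answer: str, expected_keywords: list[str]) -> bool:
    # Crude stand-in for a real rubric or an LLM-as-judge evaluation.
    return all(k.lower() in answer.lower() for k in expected_keywords)

with open("my_real_test_cases.jsonl") as f:   # your prompts, not benchmarks
    cases = [json.loads(line) for line in f]

for candidate in ("small-cheap-model", "frontier-model"):   # hypothetical IDs
    passed = sum(
        acceptable(call_model(candidate, c["prompt"]), c["expected_keywords"])
        for c in cases
    )
    print(f"{candidate}: {passed}/{len(cases)} acceptable")
```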

Hybrid inference architectures. Route simple queries to cheap, small models. Escalate complex ones to frontier models. Cache aggressively. Batch where possible. This is the equivalent of what we’ve been doing with compute for decades: right-size the resource to the workload. It’s just that most teams haven’t applied this discipline to inference yet.
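
In code, the routing layer doesn't have to be sophisticated to start paying off. A rough sketch, with a deliberately crude complexity heuristic and hypothetical model names:

```python
# Rough shape of a hybrid inference router: cache first, small model by
# default, frontier model only when the request looks hard. The heuristic
# and model names are illustrative; a real router might use a classifier.

import hashlib

CACHE: dict[str, str] = {}

def call_model(model_id: str, prompt: str) -> str:
    raise NotImplementedError("wire this up to your provider's SDK")

def looks_complex(prompt: str) -> bool:
    # Crude heuristic: long prompts or explicit multi-step asks escalate.
    return len(prompt) > 2000 or "step by step" in prompt.lower()

def answer(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in CACHE:                        # cache aggressively
        return CACHE[key]
    model = "frontier-model" if looks_complex(prompt) else "small-cheap-model"
    result = call_model(model, prompt)
    CACHE[key] = result
    return result
```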

Multi-cloud or specialised providers. Midjourney moved from NVIDIA H100s to Google TPU v6e and cut monthly inference spend from $2.1 million to under $700,000. That’s $16.8 million in annualised savings. If your inference workload is large enough, provider-specific silicon (TPUs, Trainium) or specialised inference providers might change the maths entirely.
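
Whether a move like that pays off for you comes down to simple arithmetic: monthly savings versus one-off migration cost. Using the Midjourney figures above and a purely hypothetical migration estimate:

```python
# Break-even sketch using the Midjourney figures quoted above. The migration
# cost is a made-up placeholder; substitute your own engineering estimate.

current_monthly = 2_100_000   # $ per month on the incumbent GPU stack
target_monthly = 700_000      # $ per month after the move
migration_cost = 500_000      # hypothetical one-off engineering cost

monthly_saving = current_monthly - target_monthly
print(f"Annualised saving: ${monthly_saving * 12:,.0f}")            # ~$16.8M
print(f"Break-even after: {migration_cost / monthly_saving:.1f} months")
```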

The Honest Assessment

Let me be direct about what I think is happening.

The cost per token will keep falling. That’s almost certain. Hardware gets better, models get more efficient, competition drives prices down. The a16z data on 10x annual cost reduction is real and there’s no reason to think it stops.

But your total inference bill will probably go up anyway. Because you’ll use more inference. Your teams will deploy more agents. Your products will make more API calls. Your customers will expect AI features that didn’t exist last year. And the GPU capacity you need to serve all of that will be more expensive to reserve, because everyone else wants it too.

The teams that will do well are the ones treating inference cost as an architectural concern, not a line item to be surprised by at the end of the month. That means benchmarking smaller models against your actual use cases. It means evaluating Trainium and Inferentia seriously, not as an experiment but as a production path. It means building cost awareness into the inference pipeline the same way we built it into compute and storage years ago.
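
At its simplest, cost awareness in the pipeline means estimating the cost of every inference call and emitting it next to your normal metrics, rather than waiting for the invoice. The prices below are placeholders for whatever rate card you're actually on.

```python
# Minimal cost-awareness hook: estimate and log the cost of every inference
# call alongside latency. Prices are placeholders; in production this would
# feed CloudWatch/Prometheus with per-feature tags rather than a log line.

import logging

logging.basicConfig(level=logging.INFO)

PRICE_PER_1K_TOKENS = {          # hypothetical $ per 1K (input, output) tokens
    "small-cheap-model": (0.0002, 0.0008),
    "frontier-model": (0.003, 0.015),
}

def record_inference(model: str, input_tokens: int, output_tokens: int, latency_s: float) -> float:
    in_rate, out_rate = PRICE_PER_1K_TOKENS[model]
    cost = (input_tokens / 1000) * in_rate + (output_tokens / 1000) * out_rate
    logging.info("model=%s tokens_in=%d tokens_out=%d latency=%.2fs est_cost=$%.6f",
                 model, input_tokens, output_tokens, latency_s, cost)
    return cost

record_inference("frontier-model", input_tokens=1200, output_tokens=350, latency_s=1.8)
```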

The Bigger Picture

We’ve been through this before with cloud computing. In the early days, everyone assumed costs would just keep falling. They did, per unit. But total spend grew because usage grew faster. The organisations that managed their cloud bills well were the ones that treated cost optimisation as an engineering discipline, not an afterthought.

Inference is the same pattern, just compressed into a shorter timeframe. The window between “this is cheap enough to experiment with” and “this is a significant line item” is about eighteen months. For most teams, that window is closing right now.

The GPU Capacity Block price hike was a signal. Pay attention to it.

I hope someone else finds this useful.

Cheers
