AI Engineering 11 min readMay 8, 2026

How to Reduce LLM API Costs in Production

A practical engineering guide to cutting your LLM bill in production — caching, model routing, retrieval, prompt compression, output limits, and per-request cost tracking — without wrecking quality.

Key Takeaways

The biggest LLM cost wins come from doing less work: caching repeated calls, routing easy requests to cheaper models, and sending less context per request.
Prompt caching reuses a stable prefix (system prompt, instructions, schema) so you pay full price for it once instead of on every call, often the single fastest saving to ship.
A model cascade tries a small, cheap model first and escalates to a larger one only when a confidence or validation check fails, paying for the expensive model only when it is actually needed.
Retrieval (RAG) usually beats stuffing everything into the prompt: fetching the few relevant chunks cuts input tokens sharply while often improving answer quality.
Output tokens typically cost more than input tokens, so capping length, asking for structured output, and streaming are direct levers on both spend and latency.
You cannot optimize what you do not measure: log tokens and cost per request, attribute spend to features, and set budgets so a runaway prompt is caught before the invoice arrives.

The biggest LLM cost wins come from doing less work: caching repeated calls so you pay once, routing easy requests to cheaper models, and sending less context per request. You pay per token, so every reduction in tokens in, tokens out, or number of calls is money saved — and most of these levers can be pulled without hurting quality if you guard each change with an evaluation set.

Costs creep up quietly. A prompt grows a few examples, a context window gets stuffed with an entire document "just in case," every request defaults to the largest model, and nothing is cached. None of that shows up in a demo, but it compounds into a bill that scales linearly with usage. This guide walks through the practical engineering moves we reach for to cut that bill in production, and the quality trade-off behind each one.

Why are LLM costs so high?

According to the Stanford HAI AI Index, 2025, the cost of querying a GPT-3.5–equivalent model fell from $20 per million tokens in November 2022 to $0.07 per million tokens by October 2024 – a more than 280-fold reduction in under two years. That trajectory is real, yet costs remain high in practice because you pay for tokens, and tokens add up fast in ways that are easy to overlook. Providers bill separately for input tokens (everything you send: system prompt, instructions, context, history, and the user's message) and output tokens (everything the model generates). Output tokens are usually the more expensive of the two, and larger or more capable models cost meaningfully more per token than smaller ones.

Three habits drive most avoidable spend, in our experience:

Over-provisioning the model. Sending every request to the most capable model, including the simple ones a far cheaper model would handle perfectly well.
Over-stuffing the context. Pasting whole documents, long chat histories, or many few-shot examples into every call, so you re-pay for the same tokens again and again.
Re-doing identical work. Recomputing answers and re-sending an unchanged system prompt on every call with no caching at all.

The good news is that the same three habits map directly onto the three biggest levers — cheaper models, less context, and caching — so a focused effort usually moves the number a lot. Picking the right base model is upstream of all of this; our guide to how to choose the right LLM covers that decision in depth.

How do you cache LLM calls?

You cache LLM calls by storing work you have already paid for and reusing it instead of recomputing it. Caching is frequently the single fastest saving to ship because it changes nothing about model behavior — it just stops you from buying the same output twice. There are three layers worth knowing.

Exact-match caching is the simplest: hash the full request, and if you have seen it before, return the stored response. This is ideal for deterministic, repeated lookups and for idempotent steps in a pipeline. It only helps when inputs repeat verbatim, so its hit rate depends entirely on how repetitive your traffic is.

Prompt caching is a provider feature that reuses a stable prefix of your prompt — typically the system prompt, instructions, tool definitions, or a schema — at a steep discount instead of charging full price for it on every call. The trick is structure: put the unchanging content first and the variable user input last, so the cacheable prefix is as long as possible. For applications with a large, fixed system prompt, this alone can cut input cost substantially.

Semantic caching goes one step further. Semantic caching is the practice of matching requests that mean the same thing, even when the wording differs, by comparing embeddings of the requests and returning a stored answer when a new query is close enough to a previous one. It dramatically widens cache hit rates for things like FAQ and support, but it carries a real risk:

Caching strategies compared

Strategy	How it works	Hit-rate potential	Quality / safety risk
Exact-match caching	Hash the full request; return stored response on identical input	Low – only helps when inputs repeat verbatim	Very low – identical input always produces the same answer
Prompt caching	Provider reuses a stable prefix (system prompt, schema) at a steep discount	High – every call with that prefix benefits	Very low – only the static prefix is reused; user input is still processed fresh
Semantic caching	Match requests that mean the same thing via embedding similarity	High – catches paraphrases and near-duplicates	Higher – a loose threshold can return a confidently wrong cached answer

The quality trade-off for caching is mostly about freshness and correctness of matches. Exact and prompt caching are very safe; semantic caching needs guardrails. As a rule, cache aggressively where answers are stable and cache cautiously where they are personalized or time-sensitive.

What is model routing?

Model routing is sending each request to the cheapest model that can handle it, rather than defaulting every call to the largest one. Because capable models cost far more per token than small ones, matching model size to task difficulty is one of the highest-leverage cost moves available — and done well, it barely touches quality.

A model cascade is the most reliable pattern. A model cascade tries a small, cheap model first, then escalates to a larger model only when a check on the small model's output fails. You pay for the expensive model only on the requests that genuinely need it. Typical escalation triggers include:

The small model reports low confidence or explicitly says it cannot answer.
The output fails validation — malformed JSON, a missing required field, or a failed business rule.
A cheap classifier judges the task as hard, long, or high-stakes up front and routes it straight to the larger model.

Routing works because real traffic is rarely uniform. A large share of requests — classification, extraction, short factual answers, simple rewrites — are easy, and a small or distilled model handles them at a fraction of the price. The honest way to set the boundary is empirical: run candidate models against an evaluation set for your task and route based on where each one actually passes your quality bar. The trade-off to watch is that escalation adds a little latency and the occasional double cost when a request runs twice, so tune the trigger so escalations stay the exception, not the rule. If your workload also spans devices, the same logic extends to where the model runs — see on-device vs cloud AI for that dimension.

How does retrieval cut costs?

Retrieval cuts costs by sending the model only the small slice of context relevant to the question instead of stuffing everything into the prompt. Since you pay per input token, the difference between pasting a fifty-page document and sending the three paragraphs that actually answer the question is large — and it often improves the answer.

Retrieval-augmented generation (RAG) is the pattern here: index your documents, and at query time fetch only the most relevant chunks to include in the prompt. The cost benefits are direct:

Far fewer input tokens. You send a handful of relevant chunks rather than entire documents on every call.
No re-paying for static knowledge. Reference material lives in the index, not in the prompt, so you are not billed for it on each request.
Often better quality. A focused context can beat a giant one, because the model is not forced to find a needle in a haystack of irrelevant text.

The trade-off is that retrieval introduces a quality dependency of its own: if the retriever fetches the wrong chunks, the model answers from bad context. So the savings are real, but they ride on retrieval quality, which is worth its own evaluation. Retrieval is also frequently a cheaper and faster alternative to retraining the model on your data — we compare the two approaches directly in RAG vs fine-tuning.

How do you shrink prompts and outputs?

You shrink prompts and outputs by removing tokens that do not earn their place — tightening instructions, trimming examples, and capping how much the model is allowed to generate. Because every token is billed on every call, even small reductions compound across high volume.

On the input side:

Compress the prompt. Prompt compression is the practice of saying the same thing in fewer tokens — cutting verbose preambles, deduplicating instructions, and pruning few-shot examples once the model reliably does the task without them.
Trim conversation history. Do not resend an entire chat transcript every turn. Keep a short rolling window and, for long sessions, replace old turns with a compact running summary.
Move static rules into the cached prefix. Anything that never changes belongs in the part of the prompt that prompt caching can reuse, not re-sent fresh each time.

On the output side:

Cap maximum output length. Set a sensible ceiling so a model cannot run on for thousands of tokens when a short answer was requested.
Ask for structured output. Requesting concise JSON or a fixed schema keeps responses short, parseable, and free of filler prose — and structured output is also what makes downstream automation reliable, a theme in our guide to designing software and APIs for AI agents.
Stream and stop early. Streaming improves perceived latency and lets you halt generation as soon as you have what you need rather than waiting for, and paying for, a full completion.

The quality trade-off is real but manageable: cut too far and you can starve the model of an instruction or example it needed, so reduce deliberately and re-check against your evaluation set after each round of trimming.

When should you batch requests?

You should batch requests whenever the work does not need an immediate answer, because batching trades latency for lower cost. It is a natural fit for background and bulk jobs, and a poor fit for interactive, user-facing features where someone is waiting on the response.

There are two distinct things people mean by batching:

Provider batch mode. Some providers offer an asynchronous batch API at a discount, where you submit many requests and collect results later. This suits overnight enrichment, bulk classification, dataset labeling, and evaluation runs.
Grouping items into one call. Processing several records in a single well-structured request can amortize a cached system prompt across all of them and cut per-item overhead, as long as you keep the combined output parseable and within context limits.

The trade-off is purely latency: batched work is not instant. Used in the right place — pipelines, scheduled jobs, analytics — it is close to free savings; used on an interactive path, it will frustrate users. Decide by the response-time requirement of each workload.

How do you track and control spend?

You track and control spend by measuring cost per request and attributing it to the feature that caused it, then setting budgets and alerts on top. You cannot optimize what you do not measure — and without per-request visibility, the first signal of a runaway prompt is the invoice.

The practical setup is straightforward:

Log tokens on every call. Providers return input and output token counts in the response; capture both for each request.
Convert tokens to cost and tag the source. Multiply by current per-token prices and store the result alongside a tag for the feature, workflow, user, or tenant that triggered it.
Aggregate into cost per request and per feature. This is what reveals the expensive paths — usually a small number of workflows drive most of the bill, and those are where optimization pays off.
Set budgets and alerts. Put thresholds on spend per feature and per day so an accidental loop, a ballooning prompt, or a traffic spike trips an alert early instead of compounding silently.

This measurement loop is also what keeps your other optimizations honest. When you ship a cache, a router, or a prompt trim, the cost-per-request number tells you whether it actually worked — and pairing it with a quality evaluation tells you whether it cost you anything in accuracy. The two together are how you cut spend on purpose rather than by guesswork, and they pair naturally with the eval discipline in how to evaluate and test AI agents.

How do these tactics fit together?

Used together rather than in isolation, these tactics stack, because each one attacks a different part of the bill. In our experience a sensible order of operations looks roughly like this:

Measure first. Add cost-per-request logging so you know what to attack and can prove each change worked.
Cache the obvious wins. Turn on prompt caching and exact caching — low risk, often immediate savings.
Trim what you send. Move static knowledge into retrieval, shorten prompts, and cap outputs.
Right-size the model. Introduce routing or a cascade so easy work stops hitting the expensive model.
Batch the background. Push non-interactive jobs to a batch path for the remaining discount.

Crucially, every step rides on the same safety net: an evaluation set that confirms quality holds before and after the change. The goal is not the cheapest possible system — it is the cheapest system that still clears your quality bar. These same instincts shape how we approach building an AI agent for your business from day one.

Common mistakes when cutting LLM costs

Optimizing before measuring. Guessing at the expensive path instead of logging cost per request and following the data.
A loose semantic cache. Setting the similarity threshold too low and serving confidently wrong cached answers.
Downgrading the model blindly. Switching to a smaller model with no evaluation set, then discovering the quality drop in production.
Stuffing context out of habit. Pasting whole documents and full chat histories into every call when retrieval and a rolling window would do.
Ignoring output length. Leaving generation uncapped so the most expensive tokens — the output — run unchecked.

Cheaper by design, not by accident

Cutting an LLM bill is rarely one heroic change. It is a handful of unglamorous engineering decisions — caching, routing, retrieval, tighter prompts and outputs, batching, and honest measurement — each guarded by an evaluation so quality does not quietly slip while the cost drops. Do them together and the savings compound; the system gets cheaper as a property of how it is built, not as a one-off cleanup.

This is how Game Changer Labs builds AI in production: cost-aware from the first design decision, with caching, routing, and per-request budgets treated as part of the architecture rather than afterthoughts. If you are running LLMs at a scale where the bill has started to matter and you want it brought down without sacrificing quality, that is the kind of work we do.

Frequently Asked Questions

How do I lower my OpenAI API bill?

Start by measuring cost per request so you know where the money goes, then attack the top spenders. The highest-leverage moves are usually caching repeated work, routing simple requests to a smaller and cheaper model, trimming the context you send with retrieval, and capping output length. Each one reduces tokens or calls, which is what you actually pay for.

Does caching work for LLMs?

Yes, and it is often the fastest saving available. Two kinds help: exact caching returns a stored response when the same input repeats, and prompt caching lets the provider reuse a stable prefix such as your system prompt so you are not charged full price for it every call. Semantic caching goes further by matching requests that mean the same thing, with a relevance check to stay safe.

What is prompt caching?

Prompt caching is a provider feature that stores the processed form of a stable prompt prefix — typically your system prompt, instructions, tool definitions, or schema — so repeated calls reuse it at a steep discount instead of paying the full input price each time. You structure prompts so the unchanging part comes first and the variable user input comes last, which maximizes how much can be cached.

How can I make LLM calls cheaper without losing quality?

Match the model and the context to the difficulty of the task rather than over-provisioning every request. Route easy requests to a smaller model and reserve the large one for hard cases, send only the retrieved context that is relevant instead of everything, and cap output length. Guard each change with an evaluation set so you can confirm quality holds before and after the optimization ships.

Is a smaller model good enough to save money?

Often, yes — for classification, extraction, routing, short answers, and well-scoped tasks, a smaller or distilled model frequently matches a larger one at a fraction of the cost. The honest way to decide is to run both against an evaluation set for your specific task. If the small model passes your quality bar, the savings are real; if it fails on hard cases, use a cascade so the big model only handles those.

Does sending less context actually reduce cost?

Yes. You pay per token, so every paragraph of context you include is billed on every call. Stuffing entire documents into the prompt is expensive and can even hurt quality when the model has to find a needle in a haystack. Retrieving and sending only the few relevant chunks usually cuts input tokens substantially while keeping, or improving, answer accuracy.

How do I track cost per LLM request?

Log the input and output token counts the provider returns for every call, multiply by the current per-token prices, and store that alongside a tag for the feature, user, or workflow that triggered it. Aggregating those records shows cost per request and per feature, which surfaces the expensive paths worth optimizing and lets you set budgets and alerts before a bill surprises you.

Will batching requests save money?

It can, in two ways. Some providers offer an asynchronous batch mode at a discount for work that does not need an immediate answer, such as overnight enrichment or evaluation runs. Separately, grouping items into one well-structured call can amortize a cached system prompt across many records. Both trade latency for lower cost, so they fit background jobs better than interactive features.

Free Tools

AI Cost EstimatorA directional cost range for your AI build in five questions.AI Readiness ScorecardScore whether your team is ready to build and ship AI.

Game Changer Labs

Tell us what you're building — book a free scoping call.

Pick a time that works and walk us through your project — 30 minutes, straight to the point. You leave with a concrete plan, timeline, and cost. No sales pitch — if we're not the right fit, we'll say so.

Book a free scoping call Or send a note instead

Keep Reading

AI Engineering

How to Choose the Right LLM for Your Product

Read

Developer Tools

How to Design Software and APIs That AI Agents Can Actually Use

Read

Get new playbooks by email

Occasional, no-fluff field notes on building production AI — new guides and tools, straight to your inbox. Unsubscribe anytime.

Published: May 8, 2026Game Changer Labs