Last updated: March 2026
AI API costs are the fastest-growing line item for development teams in 2026. With Claude Opus at $15/$75 per million tokens and GPT-4o at $2.50/$10, a single agent session can cost $0.50-$5.00. This guide covers 7 proven strategies to cut those costs by 60-80% without sacrificing output quality.
If you are building with large language models in 2026, you already know the sticker shock. API pricing has improved since 2024, but the way we use models has changed dramatically. Agent sessions, multi-turn conversations, tool-heavy workflows, and long context windows mean that the total token volume per task has increased by 5-10x even as per-token prices have dropped.
The net effect: most teams are spending more on AI APIs today than they were a year ago, despite cheaper individual calls. A developer running Claude Code for a full workday can easily burn through 2-5 million tokens. A production pipeline processing customer requests through an agent loop can hit 50-100 million tokens per day.
Here is what the major providers charge as of March 2026:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 | 200K |
| Claude Sonnet 4 | $3.00 | $15.00 | 200K |
| Claude Haiku 3.5 | $0.80 | $4.00 | 200K |
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| Gemini 2.0 Pro | $1.25 | $5.00 | 1M |
| Gemini 2.0 Flash | $0.075 | $0.30 | 1M |
The spread is enormous. Claude Opus input tokens cost 200x more than Gemini Flash input tokens. Output tokens cost 250x more. The model you choose and how efficiently you use it determines whether your monthly AI bill is $50 or $5,000.
But model selection is only part of the equation. The bigger lever is how many tokens you actually send. And for most teams, the answer is: far more than necessary.
Before you can reduce AI API costs, you need to understand where your tokens are being consumed. Most developers assume the prompt they type is the main cost driver. It is not. In a typical agent session or production pipeline, the token breakdown looks like this:
The user's actual prompt accounts for only about 15% of the total tokens in a typical request. The other 85% is context that gets sent along with every API call: the system prompt, the conversation history from previous turns, and the results of tool calls like file reads, web searches, and code executions.
This distribution has critical implications for cost optimization. Compressing just your prompt text is helpful, but compressing the context window — the full payload that gets sent to the model — is where the real savings are. A 40% reduction across the entire context window saves far more than a 40% reduction on just the user prompt.
In agent sessions specifically, the compounding effect makes this even more dramatic. Every piece of context from turn 1 gets carried forward to turns 2, 3, 4, and beyond. A verbose tool result in turn 3 gets re-sent in every subsequent API call. Over a 40-turn session, that single tool result might be transmitted 37 times. Compressing it once saves tokens 37 times over.
This is why token optimization at the point of entry — before text enters the conversation history — delivers outsized returns compared to any other cost reduction strategy.
The most direct way to reduce AI API costs is to send fewer tokens. Research from LLMLingua (EMNLP 2023) demonstrated that 40-70% of tokens in a typical prompt are redundant — the model can produce identical outputs without them. Filler words, hedging phrases, politeness padding, redundant qualifiers, and verbose sentence constructions account for the bulk of this waste.
Token optimization tools compress prompts before they reach the API, stripping redundant tokens while preserving the semantic content the model needs. The savings compound across every turn in a conversation, every retry, and every agent loop iteration.
Terse implements this as a real-time, on-device optimization pipeline. It applies over 20 compression techniques — from typo correction and whitespace normalization to filler removal, phrase shortening, and telegraph-style compression — in under 1 millisecond with zero API calls. Three configurable modes (Soft, Normal, Aggressive) let you control the compression-quality tradeoff.
Typical savings: 25-70% token reduction depending on the mode and input text. On a 100K tokens/day workflow at Claude Opus rates, that translates to $225-$1,050 saved per month on input tokens alone.
Anthropic, OpenAI, and Google all offer prompt caching that dramatically reduces the cost of repeated context. Anthropic's prompt caching, for example, charges only 10% of the normal input price for cached tokens after the first request.
The key to maximizing cache hits is structuring your prompts so that the static portions (system prompts, instructions, reference documents) come first, and the dynamic portions (user query, current context) come last. This way, the expensive static content is cached and reused across requests.
Cache-aware prompting strategies include:
Typical savings: 50-90% on cached input tokens. Combined with token optimization, you can reduce the cost of repeated context to a fraction of list price.
Not every API call needs Claude Opus or GPT-4o. In a typical agent workflow, many tasks are simple enough for a cheaper model: classifying intent, extracting structured data, formatting output, summarizing content, or making binary decisions.
Model routing directs each request to the cheapest model capable of handling it. A routing layer evaluates the complexity of each task and selects the appropriate model:
In practice, 60-70% of API calls in a production pipeline can be handled by the cheapest tier. If you are routing everything through Opus or GPT-4o by default, you are likely overpaying by 10-50x on the majority of your calls.
Typical savings: 50-80% on total API spend when properly routing, depending on the task mix.
As conversations grow longer, every new API call includes the full conversation history. By turn 20, you might be sending 50,000 tokens of history with each request — most of which the model does not need for the current task.
Context window management strategies include:
Context management is especially critical in agent sessions where tool results dominate the token count. A single file read can inject 5,000-10,000 tokens into the context, and that content gets carried forward through every subsequent turn. Truncating or summarizing tool results immediately after they are used can cut total session costs by 30-50%.
AI agents frequently make redundant tool calls — reading the same file multiple times, running the same search query twice, or fetching data they already have in context. Each duplicate call costs tokens both for the request and for the result that gets injected into the context.
Terse's agent monitor tracks every tool call in a session and flags duplicates in real time. In a typical 40-turn Claude Code session, we observe 15-25% of tool calls are duplicates that could be eliminated through better context management or caching.
Strategies for reducing duplicate tool calls:
Typical savings: 15-25% reduction in tool call tokens, which translates to 6-10% of total session cost since tool results account for roughly 40% of total tokens.
Both Anthropic and OpenAI offer batch APIs with significant discounts — typically 50% off standard pricing. If your workload can tolerate latency (results returned in hours rather than seconds), batch processing cuts your per-token cost in half.
Batch processing works well for:
The key constraint is latency: batch requests are not suitable for interactive applications or real-time agent sessions. But for any workload where you can queue requests and process results later, the 50% discount is essentially free money.
Typical savings: 50% on qualifying workloads.
Output tokens are universally more expensive than input tokens — often 3-5x more. Claude Opus charges $75 per million output tokens versus $15 for input. GPT-4o charges $10 versus $2.50. Controlling output length is therefore one of the highest-leverage cost optimizations available.
Strategies for controlling output length:
A model that defaults to 500-token responses when 50 tokens would suffice is costing you 10x more on output than necessary. Across thousands of API calls, this adds up fast.
Typical savings: 30-70% on output tokens, which is the most expensive component of your API bill.
Terse is purpose-built to address three of the seven strategies described above: token optimization (strategy 1), context window management (strategy 4), and duplicate tool call elimination (strategy 5). Here is how each works in practice.
Terse's 7-stage optimization pipeline compresses prompts in real time as you type. The pipeline runs entirely on-device with zero network calls, processing a typical prompt in under 1 millisecond. This is fundamentally different from approaches like LLMLingua that require a separate language model to compute per-token importance scores.
The pipeline applies compression techniques in a carefully ordered sequence:
Three modes give you control over the aggressiveness of compression:
The critical design choice is that compression happens at the point of entry — before text enters the conversation history. This means compressed text benefits from the compounding effect described earlier: tokens saved in turn 1 remain saved across all subsequent turns that include that content as context.
Terse's agent monitor provides real-time visibility into context window usage and tool call patterns across Claude Code, Cursor, OpenClaw, and Aider sessions. It tracks input tokens, output tokens, cache reads/writes, every tool invocation, and cumulative cost per session.
For context window management (strategy 4), the monitor shows you exactly where your tokens are going on each turn. You can see when tool results are dominating the context, when conversation history is ballooning, and when you are approaching the context window limit. This visibility alone changes behavior — developers who can see the per-turn cost of their agent sessions naturally write more concise prompts and manage context more carefully.
For duplicate tool call elimination (strategy 5), the monitor tracks every tool call and flags repeats. When Claude Code reads the same file three times in a session or runs the same grep query twice, Terse surfaces that information so you can understand the waste. In production agent pipelines, this data informs tool-caching strategies that eliminate redundant calls entirely.
Let us run the numbers on a concrete scenario. Consider a developer using Claude Opus for agent-assisted coding, consuming 100,000 input tokens per day across multiple sessions.
Now scale this to a team of 10 developers. Without optimization: $825/month. With the full optimization stack: $370/month. That is $5,460 saved per year on a single team. For organizations running production agent pipelines at 10-100x this token volume, the savings are proportionally larger — easily reaching tens of thousands of dollars per month.
Add model routing (strategy 3) on top of this — redirecting 60% of simple calls to Sonnet or Haiku — and the total cost reduction reaches 70-80%. A team spending $825/month can realistically bring that down to $165-$250/month by combining all seven strategies.
Here is the same calculation for GPT-4o, which many teams use as their primary model:
GPT-4o's lower base prices mean smaller absolute savings per developer, but the percentage reduction is actually higher because caching discounts are more impactful on already-cheaper tokens. For high-volume production pipelines processing millions of tokens daily, even GPT-4o savings scale into significant budgets.
Reducing AI API costs does not require a massive infrastructure overhaul. You can start seeing savings today with a few targeted changes.
Before optimizing, you need visibility. Check your provider dashboards (Anthropic Console, OpenAI Usage, Google Cloud billing) and understand your current token volume, model mix, and cost breakdown. If you are using agent tools like Claude Code, install Terse to get per-session cost tracking and tool call analysis.
The easiest win is compressing the tokens you are already sending. Download Terse from the GitHub releases page, install it, and grant Accessibility permissions. Terse runs as a floating overlay that automatically detects when you are using ChatGPT, Claude Code, Cursor, or other supported tools. Start in Normal mode for a good balance of compression and readability.
For detailed setup instructions, see the Terse documentation.
Review your prompt structure and reorganize to maximize cache hits. Put static content first, dynamic content last. If you are using Anthropic's API directly, enable prompt caching and monitor your cache hit rate in the dashboard.
Audit your API calls and classify them by complexity. Any call that a cheaper model can handle — classification, extraction, formatting, simple Q&A — should be routed to Haiku or GPT-4o mini. Reserve expensive models for complex reasoning, creative generation, and tasks where quality meaningfully improves with a more capable model.
Set explicit max_tokens on every API call. Add conciseness instructions to your system prompts. Use structured output (JSON mode, function calling) wherever possible. Output tokens are 3-5x more expensive than input tokens — every unnecessary output token is money wasted.
Cost optimization is not a one-time project. Token volumes change as your usage patterns evolve, new models launch with different pricing, and agent workflows become more complex. Set up ongoing monitoring, review your costs weekly, and continuously tune your optimization stack.
Terse compresses prompts in real time, monitors agent sessions, and tracks every token and tool call. On-device, zero latency, no API calls. Cut 40-70% of your token costs across Claude Code, ChatGPT, and every AI tool you use.
Download TerseExplore more about token optimization and AI cost reduction: