
Selective Context Pruning — How Terse Removes Redundant Context from AI Conversations

Agent sessions accumulate massive context windows filled with repeated instructions, duplicate tool results, and redundant file reads. Selective context pruning identifies and removes this waste, cutting costs by 20-40% without losing meaningful information.

What Is Selective Context?

Selective context is the practice of identifying and removing low-information or redundant content from an AI conversation's context window. Instead of feeding every prior turn, tool result, and system message verbatim into the next request, selective context pruning analyzes the conversation history and strips out content that adds no new information.

The concept is straightforward: if the model has already seen the contents of a file, there is no reason to include the full file again when it appears in a later tool result. If the same instruction has been repeated across five consecutive turns, one instance carries the same semantic weight as five. Selective context pruning formalizes this intuition into a systematic process.

In the context of AI coding agents like Claude Code, Cursor, and Aider, this matters enormously. A typical 30-minute agent session can consume 150,000+ tokens of context, with 30-50% of that being redundant information the model has already processed.

The Academic Foundation

The research basis for selective context pruning comes from Li et al.'s 2023 paper, Compressing Context to Enhance Inference Efficiency of Large Language Models, which introduced the Selective Context method. The core finding is that not all tokens in a prompt carry equal information. Many tokens are predictable given their surrounding context and can be removed without degrading the model's ability to understand the remaining content.

Li et al. demonstrated that by computing the self-information of each token (how surprising it is given the preceding context), you can identify and remove the most predictable content while preserving the high-information parts. Their experiments showed that cutting roughly 50% of a prompt's content caused only a minor drop in performance on tasks such as summarization and question answering.
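To make the idea concrete, here is a minimal sketch of self-information filtering. It is not the paper's implementation: the token probabilities are hard-coded, hypothetical values standing in for what a language model would assign, and the function names and `keep_ratio` parameter are invented for illustration.

```python
import math

def self_information(prob: float) -> float:
    """Self-information in bits: I(x) = -log2 P(x | context)."""
    return -math.log2(prob)

def prune_low_information(tokens, probs, keep_ratio=0.5):
    """Keep the keep_ratio fraction of tokens with the highest
    self-information, preserving their original order."""
    scores = [self_information(p) for p in probs]
    keep_count = max(1, round(len(tokens) * keep_ratio))
    # Indices of the most surprising (highest-information) tokens.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:keep_count])
    return [tok for i, tok in enumerate(tokens) if i in keep]

tokens = ["the", "parser", "is", "in", "src/main.rs"]
# Hypothetical P(token | preceding context); a real system would query a model.
probs = [0.60, 0.02, 0.50, 0.40, 0.01]

kept = prune_low_information(tokens, probs, keep_ratio=0.4)
print(kept)
```

With these made-up probabilities, the two least predictable tokens survive and the filler words are dropped.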

This principle maps directly to the agent session context: tool results that repeat previously seen information carry near-zero self-information. The model already knows the content. Including it again wastes tokens and money without improving the model's understanding.

Why Context Matters in Agent Sessions

Modern AI agents operate within fixed context windows. Claude's context window is 200,000 tokens. GPT-4o supports 128,000 tokens. These are hard limits, but the practical limits are tighter than the theoretical ones.

Cost Scales Linearly with Context

Every token in the context window costs money. At Claude Sonnet's pricing of $3 per million input tokens, a fully loaded 200K context window costs $0.60 per request. Over a 50-turn agent session at that size, input costs reach $30. If 35% of those tokens are redundant, you are paying $10.50 in unnecessary input costs per session.
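The arithmetic can be reproduced with a short sketch. It assumes every request in the session carries the full 200K tokens and ignores prompt-caching discounts and output tokens; the function and constant names are illustrative.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, the Claude Sonnet figure quoted above

def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input cost in USD for a single request of the given size."""
    return tokens * price_per_million / 1_000_000

per_request = input_cost(200_000)    # one fully loaded request
session_cost = per_request * 50      # 50 turns at full context (worst case)
redundant_cost = session_cost * 0.35 # share spent on redundant tokens

print(f"per request: ${per_request:.2f}")
print(f"per session: ${session_cost:.2f}")
print(f"redundant:   ${redundant_cost:.2f}")
```

Real sessions grow toward the limit rather than starting at it, so this is an upper bound for a 50-turn session, not a typical bill.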

Repeated Instructions Accumulate

Agent frameworks often prepend system instructions, CLAUDE.md rules, and tool definitions to every turn. In a 40-turn session, this means the same 2,000-token system prompt appears 40 times, consuming 80,000 tokens of context for information the model absorbed on the first turn.

Tool Results Contain Massive Redundancy

When an agent reads a file, the full file content enters the context. When it reads the same file again three turns later (a common pattern during iterative debugging), the identical content enters the context a second time. Grep results often overlap with previously read files. Bash command outputs frequently echo information already present in the conversation.

Context Fill Degrades Performance

Research consistently shows that model performance degrades as context utilization approaches the maximum. The "lost in the middle" phenomenon means that information buried in the middle of a long context is less likely to be attended to. At 85%+ context fill, models begin to miss important details, produce less coherent responses, and make more errors. Keeping context lean is not just a cost optimization; it is a quality optimization.

How Terse Implements Selective Context

Terse's agent monitoring system tracks context composition in real time and applies multiple selective context techniques to identify waste and generate actionable insights.

Jaccard Deduplication

Terse computes Jaccard similarity between sentences across conversation turns. When two sentences have a Jaccard coefficient above 0.7, they are flagged as near-duplicates. This catches not just exact repeats but also lightly reworded repetitions that reuse most of the same tokens, a common pattern when agents restate instructions or tool results across turns.

The Jaccard index is computed as the size of the intersection of two token sets divided by the size of their union. It is lightweight enough to run on every turn without adding noticeable latency:

J(A, B) = |A ∩ B| / |A ∪ B|

// Example: two sentences from different turns
A = "Read the contents of src/main.rs to understand the structure"
B = "Let me read src/main.rs to understand the code structure"
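// Lowercased word tokens: |A ∩ B| = 6, |A ∪ B| = 11 → J = 6/11 ≈ 0.55
// Under this tokenization the pair falls below the 0.7 threshold; the score depends on the tokenizer used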
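A minimal sketch of the computation, assuming lowercased, whitespace-split word tokens. The tokenizer Terse actually uses is not specified here, so the function name and tokenization are assumptions, and a different tokenizer would give a different score.

```python
DUPLICATE_THRESHOLD = 0.7  # threshold quoted in the text

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased, whitespace-split token sets."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty sentences are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

a = "Read the contents of src/main.rs to understand the structure"
b = "Let me read src/main.rs to understand the code structure"

score = jaccard(a, b)
is_duplicate = score > DUPLICATE_THRESHOLD
print(f"J = {score:.3f}, duplicate = {is_duplicate}")
```

Under this simple tokenization the two sentences share 6 of 11 distinct tokens, so the score is 6/11 ≈ 0.55. Set operations on short sentences are cheap, which is why the check can run on every turn.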

Repeated Instruction Detection

Terse identifies when the same instructions appear across multiple turns. This is especially common with CLAUDE.md rules, system prompts, and tool-use guidelines that get prepended to every request. The monitor flags these and estimates the token waste from repetition.
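One way to sketch this kind of detection is to normalize each instruction block, count how many turns it appears in, and treat every occurrence after the first as waste. The four-characters-per-token estimate and all names below are assumptions for illustration, not Terse's actual logic.

```python
from collections import Counter

def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations match."""
    return " ".join(text.lower().split())

def repeated_instruction_waste(turns: list[list[str]]) -> dict[str, int]:
    """Map each block seen in more than one turn to its estimated wasted tokens."""
    counts = Counter()
    for blocks in turns:
        # Count each distinct block at most once per turn.
        for block in {normalize(b) for b in blocks}:
            counts[block] += 1
    return {
        block: (n - 1) * estimate_tokens(block)
        for block, n in counts.items()
        if n > 1
    }

turns = [
    ["Always run tests before committing.", "Fix the login bug."],
    ["Always run tests  before committing.", "Now update the README."],
    ["always run tests before committing.", "Ship it."],
]
waste = repeated_instruction_waste(turns)
print(waste)
```

Only the instruction that recurs in all three turns is reported; the one-off task messages are ignored.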

Context Fill Monitoring

The agent monitor tracks cumulative context usage across the session and provides real-time warnings:

60% fill — advisory warning, suggesting the agent consider summarizing context
85% fill — critical alert, indicating performance degradation is likely
95% fill — the agent will soon hit the context limit and lose earlier conversation history

These thresholds are based on empirical testing showing that model coherence drops measurably above 85% utilization, particularly for complex multi-step reasoning tasks.
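The thresholds above amount to a simple classification of the fill ratio. A minimal sketch, with hypothetical status labels:

```python
def fill_status(used_tokens: int, limit_tokens: int) -> str:
    """Classify context fill using the 60% / 85% / 95% thresholds."""
    fill = used_tokens / limit_tokens
    if fill >= 0.95:
        return "limit-imminent"  # earlier history is about to be lost
    if fill >= 0.85:
        return "critical"        # performance degradation is likely
    if fill >= 0.60:
        return "advisory"        # consider summarizing context
    return "ok"

for used in (100_000, 130_000, 180_000, 195_000):
    print(used, fill_status(used, 200_000))
```

Checking the highest threshold first keeps each branch a single comparison.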

Tool Result Compression Estimation

Different tool types produce different levels of redundancy. Terse applies compression estimates based on observed patterns across thousands of agent sessions:

Read tool results: ~60% compressible — file contents are often already partially present in the context from previous reads or grep results
Grep tool results: ~40% compressible — search results frequently overlap with files already read in full
Bash tool results: ~30% compressible — command outputs tend to be more unique, but build logs, test outputs, and status commands often repeat

These estimates feed into the overall context efficiency score that Terse displays for each agent session.
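As a rough sketch of how per-tool estimates like these could be applied, the snippet below multiplies each tool result's token count by the ratio for its tool type. The data layout and names are assumptions; only the three percentages come from the text.

```python
# Estimated compressible share per tool type, in percent (figures from the text).
COMPRESSIBLE_PERCENT = {"Read": 60, "Grep": 40, "Bash": 30}

def estimate_compressible(tool_results: list[tuple[str, int]]) -> int:
    """Sum the estimated compressible tokens over (tool name, token count) pairs.
    Unknown tool types are assumed to be incompressible."""
    total = 0
    for tool, tokens in tool_results:
        total += tokens * COMPRESSIBLE_PERCENT.get(tool, 0) // 100
    return total

results = [("Read", 1500), ("Grep", 500), ("Bash", 1000)]
compressible = estimate_compressible(results)
total_tokens = sum(tokens for _, tokens in results)
print(f"{compressible} of {total_tokens} tokens estimated compressible")
```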

Redundant File Read Detection

One of the most common sources of context waste in agent sessions is reading the same file multiple times. Terse tracks every file read across the session and flags when a file is read again without any intervening edits. In a typical Claude Code session, 15-25% of Read tool calls target files that are already fully present in the context.

The monitor distinguishes between genuinely redundant reads (file unchanged since last read) and necessary re-reads (file was edited between reads), so it only flags actual waste.
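The distinction between redundant and necessary re-reads can be sketched by tracking which files have been read since their last edit. The event format below is a hypothetical simplification of an agent's tool-call log.

```python
def find_redundant_reads(events: list[tuple[str, str]]) -> list[int]:
    """events: (action, path) pairs with action "read" or "edit".
    Returns indices of reads that repeat an earlier read of an unchanged file."""
    in_context = set()  # paths whose current contents have already been read
    redundant = []
    for i, (action, path) in enumerate(events):
        if action == "edit":
            # Contents changed, so the next read of this file is necessary.
            in_context.discard(path)
        elif action == "read":
            if path in in_context:
                redundant.append(i)
            in_context.add(path)
    return redundant

events = [
    ("read", "src/main.rs"),  # 0: first read
    ("read", "src/main.rs"),  # 1: unchanged since last read
    ("edit", "src/main.rs"),  # 2
    ("read", "src/main.rs"),  # 3: necessary, file was edited
    ("read", "src/lib.rs"),   # 4: first read
    ("read", "src/main.rs"),  # 5: unchanged since read at index 3
]
print(find_redundant_reads(events))
```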

Duplicate Tool Call Detection

Beyond file reads, Terse detects duplicate tool calls of any type: repeated grep searches with the same pattern, repeated bash commands, and repeated glob operations. These duplicates are surprisingly common in agent sessions, particularly when the agent loses track of work it has already done due to the "lost in the middle" attention problem.
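A minimal sketch of duplicate detection keys each call by its tool name plus a canonical form of its arguments. The call format is hypothetical, and the sketch deliberately ignores whether anything changed between calls.

```python
import json
from collections import Counter

def duplicate_tool_calls(calls: list[tuple[str, dict]]) -> dict[str, int]:
    """Return {call key: number of repeats beyond the first} for repeated calls."""
    counts = Counter()
    for tool, args in calls:
        # sort_keys makes the key independent of argument order.
        key = f"{tool}:{json.dumps(args, sort_keys=True)}"
        counts[key] += 1
    return {key: n - 1 for key, n in counts.items() if n > 1}

calls = [
    ("Grep", {"pattern": "TODO", "path": "src"}),
    ("Bash", {"command": "cargo test"}),
    ("Grep", {"path": "src", "pattern": "TODO"}),  # same call, different key order
    ("Bash", {"command": "cargo test"}),
    ("Bash", {"command": "cargo test"}),
]
dupes = duplicate_tool_calls(calls)
print(dupes)
```

A repeated command is not always waste: re-running a test suite after an edit is legitimate. A fuller version would reset the counts when a relevant file changes, as in the file-read case.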

CLAUDE.md Rule Generation

Selective context pruning addresses redundancy within a single session, but the same patterns tend to recur across sessions. Terse extends the selective context principle into cross-session optimization through automatic CLAUDE.md rule generation.

When Terse detects recurring patterns of context waste — such as an agent repeatedly reading the same set of files at the start of every session, or consistently issuing duplicate tool calls — it generates rules that can be added to CLAUDE.md to teach the agent to avoid these patterns in future sessions.

For example, if the monitor observes that an agent reads package.json, tsconfig.json, and src/index.ts at the start of every session, it might generate a rule like:

# Project structure is well-established. Avoid reading
# config files (package.json, tsconfig.json) unless
# specifically needed for the current task.

This transforms reactive pruning into proactive prevention. The agent learns to avoid generating redundant context in the first place, which is cheaper than pruning it after the fact.
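One way such a rule could be derived is to intersect the first few file reads of every session and emit a comment-style rule for the files common to all of them. This is a sketch under stated assumptions: the `window` parameter, the function name, and the rule wording are all hypothetical.

```python
from typing import Optional

def startup_read_rule(sessions: list[list[str]], window: int = 5) -> Optional[str]:
    """sessions: ordered file paths read in each session.
    Returns a rule covering files read within the first `window` reads
    of every session, or None if there is no such file."""
    if not sessions:
        return None
    common = set(sessions[0][:window])
    for reads in sessions[1:]:
        common &= set(reads[:window])
    if not common:
        return None
    files = ", ".join(sorted(common))
    return (
        f"# These files are read at the start of every session: {files}.\n"
        "# Avoid re-reading them unless the current task requires it."
    )

sessions = [
    ["package.json", "tsconfig.json", "src/index.ts", "src/a.ts"],
    ["tsconfig.json", "package.json", "src/index.ts"],
    ["package.json", "src/index.ts", "tsconfig.json", "README.md"],
]
rule = startup_read_rule(sessions)
print(rule)
```

Requiring a file to appear in every session keeps the rule conservative; a looser version might accept files seen in most sessions.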

Real-World Context Bloat: Examples

To illustrate how severe context bloat gets in practice, here are patterns Terse commonly detects in agent sessions:

The Debugging Loop

An agent debugging a failing test will often: read the test file, read the implementation file, run the test, see the error, read the test file again, read the implementation again, make a change, run the test again. In a 10-iteration debug loop, the same two files might be read 8-10 times each. With files averaging 200 lines (roughly 1,500 tokens each), that is 12,000-15,000 tokens of reads per file, or 24,000-30,000 tokens across both, most of it redundant.

The System Prompt Multiplier

A Claude Code session with a 3,000-token CLAUDE.md file that runs for 40 turns includes 120,000 tokens of repeated system instructions. Cache hits mitigate the cost impact, but the context window consumption is real. At turn 35, those repeated instructions are competing with the actual conversation content for the model's attention.

The Grep-Then-Read Pattern

Agents frequently grep for a pattern, receive results showing matching lines from 5 files, then proceed to read all 5 files in full. The grep results (which already contain the relevant lines and surrounding context) are now redundant — the full file reads supersede them. Yet both the grep results and the full file contents remain in the context, effectively doubling the token cost for that information.
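This pattern can be detected by checking, for each grep result, whether every file it matched was later read in full. The event format below is a hypothetical simplification, and the sketch treats a grep as superseded only when all of its matched files were read afterwards.

```python
def superseded_greps(events: list[tuple]) -> list[int]:
    """events: ("grep", [matched paths]) or ("read", path).
    Returns indices of grep results whose matched files were all read later."""
    superseded = []
    for i, (kind, payload) in enumerate(events):
        if kind != "grep":
            continue
        # Files read in full after this grep.
        later_reads = {p for k, p in events[i + 1:] if k == "read"}
        if payload and set(payload) <= later_reads:
            superseded.append(i)
    return superseded

events = [
    ("grep", ["a.ts", "b.ts"]),  # 0: both files are read in full later
    ("read", "a.ts"),
    ("grep", ["c.ts"]),          # 2: c.ts is never read in full
    ("read", "b.ts"),
]
print(superseded_greps(events))
```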

Measuring the Impact

Across sessions monitored by Terse, selective context analysis reveals consistent patterns:

Average context redundancy: 32% of tokens are duplicates or near-duplicates
Average file re-read rate: 2.3x per unique file per session
Average duplicate tool calls: 18% of all tool invocations
Potential cost savings from eliminating redundancy: 20-40% per session

These numbers translate directly into dollars. For a team running 50 agent sessions per day at an average cost of $2 per session, eliminating 30% redundancy saves $30/day or roughly $900/month.

How This Relates to Other Optimization Techniques

Selective context pruning is one layer in a comprehensive optimization strategy. It works alongside other techniques that Terse implements:

LLMLingua-style compression targets token-level redundancy within individual messages, removing predictable tokens while preserving meaning
Pattern optimization applies rule-based transformations to shorten common phrases and remove filler language
NLP analysis identifies and removes hedging, meta-language, and politeness markers that consume tokens without adding information

These techniques are stages in Terse's 7-stage optimization pipeline, which processes prompts before they are sent and monitors context composition throughout the session.

Start Monitoring Your Agent Sessions

Terse tracks context composition, detects redundancy, and generates optimization rules — all on-device, with zero cloud dependency.
