LLMLingua and Prompt Compression — The Research Behind Terse's Token Optimization

LLMLingua (EMNLP 2023) showed that LLM prompts contain substantial redundancy. Here's how Terse applies those findings to cut token costs by 40-70% with sub-millisecond latency and no API calls.

What Is LLMLingua?

LLMLingua is a prompt compression framework introduced by Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu at EMNLP 2023 in their paper "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models." The core discovery was striking: the natural language prompts we send to large language models contain a great deal of redundancy, and a significant portion of those tokens can be removed with little loss in the model's output quality.

This was not a marginal finding. The researchers reported compression ratios of up to 20x on certain tasks, with the majority of real-world prompts showing 40-70% redundancy. That means for every dollar you spend on prompt (input) tokens, somewhere between 40 and 70 cents is paying for text the model does not need to produce a correct response.

The paper arrived at a moment when the AI industry was grappling with a fundamental scaling problem. Models were getting more capable but also more expensive. GPT-4's context window was precious real estate, and every token counted against both latency and cost. LLMLingua offered a rigorous, empirically validated answer: compress the prompt before sending it.

How LLMLingua Works

The LLMLingua approach relies on a smaller, cheaper language model (such as GPT-2 or LLaMA-7B) to estimate the importance of each token in a prompt. The process works in three stages:

Budget controller: The system allocates a compression budget across different segments of the prompt (instructions, demonstrations, and the question). Instructions and the question are compressed lightly because the model is sensitive to them, while demonstrations, which tend to be the most redundant, receive more aggressive compression.
Token-level iterative compression: Within each segment, the smaller model computes per-token perplexity. Tokens with low perplexity (meaning they are highly predictable given their context) are candidates for removal. The intuition is simple: if a token is predictable, the target LLM can reconstruct its meaning from surrounding context.
Distribution alignment: Because the small model and the target LLM have different token distributions, LLMLingua includes an alignment step, instruction-tuning the small model on outputs from the target LLM, so that its perplexity estimates better match what the target model actually needs.
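
To make the token-level stage concrete, here is a minimal sketch of perplexity-style pruning. It is not LLMLingua's implementation: a toy add-one-smoothed bigram model stands in for the small language model, the pass is single-shot rather than iterative, and the names train_bigram, surprisal, and compress are hypothetical. It only illustrates the core idea of dropping the most predictable tokens while keeping the rest in order.

```python
import math
from collections import Counter

def train_bigram(corpus_tokens):
    """Count unigrams and bigrams for a toy bigram model (stand-in for a small LM)."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(set(corpus_tokens)) + 1  # +1 slot for unseen tokens
    return unigrams, bigrams, vocab_size

def surprisal(prev, tok, model):
    """Return -log2 P(tok | prev); low values mean the token is predictable."""
    unigrams, bigrams, vocab_size = model
    prob = (bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(prob)

def compress(tokens, model, keep_ratio):
    """Keep the highest-surprisal tokens, preserving their original order."""
    scores = [surprisal(prev, tok, model)
              for prev, tok in zip(["<s>"] + tokens, tokens)]
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [tok for i, tok in enumerate(tokens) if i in keep]

if __name__ == "__main__":
    model = train_bigram("could you please could you please could you please".split())
    print(compress("could you please summarize report".split(), model, keep_ratio=0.6))
```

In a real system the scores would come from a neural model's per-token log-probabilities, and the keep ratio for each segment would be set by the budget controller.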

The result is a compressed prompt that preserves the semantic content the target model requires while stripping away tokens that carry little information. On benchmarks like GSM8K, BBH, and ShareGPT, the compressed prompts produced responses close in quality to those generated with the full, uncompressed prompts.

The Follow-Up: LongLLMLingua and LLMLingua-2

The success of LLMLingua spawned two important follow-up papers. LongLLMLingua (2024) extended the framework to handle long-context scenarios, addressing the "lost in the middle" problem where models struggle to attend to information buried deep in long prompts. It introduced question-aware compression, ensuring that tokens relevant to the user's actual query are preserved even if they appear in low-perplexity regions.

LLMLingua-2 took a different approach entirely, training a dedicated compression model using data distilled from GPT-4. Rather than relying on perplexity as a proxy for importance, it used a trained classifier to make token-level retention decisions. This approach was faster and more accurate, but still required running a neural model for every compression pass.

Together, these papers established a clear research consensus: prompt compression works, the redundancy is real, and substantial token savings are achievable without degrading output quality.

Why Terse Does Not Use LLMLingua Directly

Given these results, the obvious question is: why not just run LLMLingua on every prompt? The answer comes down to three practical constraints that make model-based compression impractical for real-time, on-device use.

1. It Requires a Language Model

LLMLingua needs a smaller LLM (typically 7B parameters) running locally to compute token-level perplexity scores. Even quantized, this requires a GPU or significant CPU resources. Most developers writing prompts on a laptop do not have a spare 7B model running in the background, and spinning one up for each compression pass would add seconds of latency.

2. It Adds Latency

The perplexity computation is not instant. Even on a capable GPU, compressing a 2,000-token prompt through LLMLingua takes 200-500ms. For a tool that sits between the user and their LLM interface, that latency is unacceptable. Users expect their text to appear and update in real time, not after a half-second processing delay on every keystroke.

3. It Costs Tokens Itself

If you use an API-hosted smaller model instead of running one locally, you are spending tokens to save tokens. The economics only work if the compression ratio is high enough to offset the cost of the compression model's inference. For shorter prompts (under 500 tokens), the overhead can actually exceed the savings.
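
The break-even point can be sketched with a few lines of arithmetic. The prices, the fixed per-call overhead, and the function name net_saving_usd below are all assumptions for illustration, not figures from the paper or from any provider's price list. The sketch assumes the hosted compressor reads the full prompt plus a fixed instruction overhead, and it ignores any output tokens the compressor bills.

```python
def net_saving_usd(prompt_tokens, reduction, target_usd_per_mtok,
                   compressor_usd_per_mtok, overhead_tokens=0):
    """Dollars saved on one call after paying for the compression pass.

    reduction is the fraction of prompt tokens removed (0.4 means 40%).
    overhead_tokens models fixed per-call input to the compressor, such as
    its own instructions (a hypothetical parameter).
    """
    saved = prompt_tokens * reduction * target_usd_per_mtok / 1_000_000
    spent = (prompt_tokens + overhead_tokens) * compressor_usd_per_mtok / 1_000_000
    return saved - spent

if __name__ == "__main__":
    # Hypothetical prices: $3 per million tokens for the target model,
    # $1 per million for the compressor, 300 tokens of fixed overhead.
    for size in (400, 4000):
        print(size, net_saving_usd(size, 0.40, 3.0, 1.0, overhead_tokens=300))
```

Under these assumed numbers the 400-token prompt loses money and the 4,000-token prompt comes out ahead, which is the pattern described above: fixed overhead dominates on short prompts.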

How Terse Adapts the Core Insight

Terse takes the fundamental finding from LLMLingua — that prompts contain 40-70% redundant information — and implements compression through an entirely different mechanism: deterministic, rule-based text transformation that runs in under 1 millisecond with zero external dependencies.

The key insight Terse builds on is that prompt redundancy is not random. It follows predictable patterns that humans consistently produce when writing natural language instructions. Filler words, hedging phrases, politeness padding, redundant qualifiers, verbose sentence constructions — these patterns account for the bulk of the redundancy that LLMLingua identifies statistically.

Terse's optimization pipeline applies over 20 distinct compression techniques across three configurable modes:

Soft mode: Typo correction and whitespace normalization only. Zero semantic change, typically 5-15% reduction.
Normal mode: Removes filler words, hedging language, politeness phrases, meta-commentary, and redundant qualifiers. 25-45% reduction on typical prompts.
Aggressive mode: Everything in Normal plus abbreviations, markdown stripping, and telegraph-style compression. 40-70% reduction, matching LLMLingua's published ranges.
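
As a rough illustration of how layered, rule-based modes can work, here is a minimal sketch. The rule tables and function names are hypothetical and far smaller than a production rule set; this is not Terse's actual pipeline. It also omits typo correction and code-awareness, and its markdown handling only strips bold markers.

```python
import re

# Hypothetical rule tables for illustration only.
HEDGES = re.compile(
    r"\b(?:i was wondering if|could you|would you mind|i think that|if possible)\b,?\s*",
    re.IGNORECASE,
)
FILLER = re.compile(
    r"\b(?:please|kindly|basically|actually|really|just|very)\b\s*",
    re.IGNORECASE,
)
ABBREVIATIONS = {"function": "fn", "information": "info"}

def soft(text):
    """Soft mode (partial): collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def normal(text):
    """Normal mode: soft mode plus removal of hedging and filler phrases."""
    text = HEDGES.sub("", text)
    text = FILLER.sub("", text)
    return soft(text)

def aggressive(text):
    """Aggressive mode: normal mode plus crude markdown stripping and abbreviations."""
    text = normal(text)
    text = text.replace("**", "")  # strip bold markers only
    for long_form, short_form in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(long_form)}\b", short_form, text,
                      flags=re.IGNORECASE)
    return soft(text)

if __name__ == "__main__":
    prompt = "Could you please   just **summarize** this function for me?"
    for mode in (soft, normal, aggressive):
        print(f"{mode.__name__}: {mode(prompt)}")
```

Each mode calls the one below it, so the transformations are deterministic and a stricter mode never undoes what a gentler one did.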

Terse vs. LLMLingua: A Direct Comparison

The trade-offs between the two approaches are clear and quantifiable:

                    LLMLingua                  Terse
Mechanism           Model-based                Rule-based
Latency             200-500ms                  <1ms
Requires GPU        Yes (or API)               No
Runs on-device      Partially                  Fully
API cost            Yes, if API-hosted         Zero
Token reduction     40-70%                     40-70%
Handles code        Poorly                     Yes (code-aware)
Real-time capable   No                         Yes
Privacy             Local only if self-hosted  Data stays local

The compression ratios converge because both approaches target the same underlying phenomenon: the predictable redundancy in human-written text. LLMLingua discovers this redundancy statistically; Terse encodes it as rules derived from analyzing thousands of real-world prompts. The end result is remarkably similar, but the runtime characteristics are fundamentally different.

Where LLMLingua has an advantage is on unusual or domain-specific text where Terse's rules may not fire. A highly technical physics prompt with no filler words will see little benefit from rule-based compression but might still benefit from LLMLingua's perplexity-based approach. In practice, however, these cases are rare — most prompts written by humans contain significant natural language scaffolding that rule-based compression handles well.

The Compounding Effect in Agent Sessions

Where Terse's approach becomes especially powerful is in agent sessions — extended interactions with tools like Claude Code, Cursor, or Windsurf where the model is called dozens or hundreds of times in a single task. In these sessions, every prompt includes context from previous turns, and that context grows with each iteration.

Consider a typical Claude Code session that runs 40 turns. If each turn's prompt averages 3,000 tokens and Terse compresses by 40%, that is 1,200 tokens saved per turn, or 48,000 tokens across the session. At Claude's pricing, that translates to real dollar savings on every coding task.
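
The arithmetic in this example can be checked with a short sketch. The function name session_savings and the price used in the usage line are assumptions for illustration; substitute your provider's actual input-token price.

```python
def session_savings(turns, tokens_per_turn, reduction, usd_per_mtok):
    """Return (tokens saved per turn, tokens saved per session, dollars saved).

    Assumes every turn has the same prompt size and the same reduction.
    """
    saved_per_turn = tokens_per_turn * reduction
    saved_total = saved_per_turn * turns
    usd_saved = saved_total * usd_per_mtok / 1_000_000
    return saved_per_turn, saved_total, usd_saved

if __name__ == "__main__":
    # 40 turns, 3,000 tokens per turn, 40% reduction,
    # and a hypothetical price of $3 per million input tokens.
    print(session_savings(40, 3000, 0.40, 3.0))
```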

But the compounding goes deeper. Agent sessions often include the full conversation history in each new prompt. A filler phrase in turn 3 gets repeated in the context for turns 4 through 40. Compressing that phrase once, at the point of entry, eliminates its cost across all subsequent turns. This is the compounding effect: a token removed early is saved again on every later turn, so the savings accumulate over the life of a session.
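
A simplified model shows how resending history multiplies the cost of each token. The sketch below assumes every prompt resends the full history, each turn adds a fixed number of new tokens, and model outputs and prompt caching are ignored; the function names are hypothetical.

```python
def times_billed(entry_turn, total_turns):
    """Number of prompts that include text which entered the history at entry_turn."""
    return total_turns - entry_turn + 1

def cumulative_input_tokens(new_tokens_per_turn, total_turns):
    """Total input tokens billed when every prompt resends the full history."""
    return sum(new_tokens_per_turn * turn for turn in range(1, total_turns + 1))

if __name__ == "__main__":
    # A phrase entered in turn 3 of a 40-turn session is billed in turns 3 through 40.
    print(times_billed(3, 40))
    # 100 new tokens per turn, uncompressed vs. reduced to 60 at the point of entry.
    print(cumulative_input_tokens(100, 40), cumulative_input_tokens(60, 40))
```

Under these assumptions the total billed input grows with the square of the session length (the sum 1 + 2 + ... + N), so a fixed percentage reduction at entry saves more in absolute terms the longer the session runs.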

LLMLingua was evaluated primarily on single-shot prompts. The agent session use case — where Terse operates — amplifies the value of compression far beyond what the original paper measured.

Academic Context: Related Research

LLMLingua did not emerge in isolation. Several parallel research efforts explored prompt compression from different angles, and Terse draws insights from across this body of work:

Selective Context (Li et al., 2023) demonstrated that removing low-information sentences from prompts preserves task performance. Terse's sentence-level filtering in aggressive mode is inspired by this approach.
Gisting (Mu et al., 2023) trained models to compress prompts into shorter "gist" token sequences. While not directly applicable to rule-based systems, the finding that prompts can be represented in far fewer tokens validated the compression thesis.
AutoCompressors (Chevalier et al., 2023) adapted language models to compress long contexts into compact summary vectors, processing the context segment by segment. The key takeaway for Terse was that context compression is most valuable in multi-turn settings, which is exactly where Terse's agent monitor operates.
PCRL (Jung & Kim, 2024) applied reinforcement learning to learn compression policies, achieving compression ratios comparable to LLMLingua with lower computational cost. This confirmed that the redundancy patterns in prompts are learnable and predictable.

Collectively, this research establishes that prompt compression is not a hack or a trick. It is a well-studied technique with strong empirical support, and the redundancy it targets is a fundamental property of how humans write instructions for language models.

Practical Implications for Developers

If you are building with LLM APIs, the LLMLingua research has a direct implication for your costs and performance: you are almost certainly sending more tokens than the model needs. The question is how to address that in a way that fits into your workflow.

For batch processing or offline pipelines where latency is not critical, running LLMLingua (or its successor LLMLingua-2) as a preprocessing step is a viable option. The compression quality is excellent and the approach is well-validated.

For interactive use (writing prompts in real time, working with agent interfaces, iterating on instructions), Terse's rule-based approach delivers comparable token reductions on typical prompts without the latency penalty. It runs on every keystroke, compresses in under a millisecond, and requires no GPU, no API calls, and no model downloads. Your prompts stay on your device, compressed and optimized before they ever reach the network.

The research is clear: prompt compression works. The only question is which implementation fits your use case. For real-time, on-device, privacy-preserving compression, Terse takes the insights from LLMLingua and delivers them at the speed your workflow demands.

Compress Your Prompts Like the Research Intended

Terse applies LLMLingua's findings in under a millisecond. Rule-based, on-device, no API calls. Cut 40-70% of your token costs across Claude Code, ChatGPT, and every agent session.
