Attention and Limitations
How Attention Works
Large language models don’t read their input the way you read a paragraph. They don’t scan left to right with equal focus on every word. Instead, they use an attention mechanism that distributes a fixed computational budget across all the tokens in the context window.
Think of it like a spotlight with limited brightness. The more area you try to illuminate, the dimmer the light gets at any given point.
Short context means focused attention. When the window contains a few hundred tokens, every token gets strong attention. Predictions are sharp.
Long context means diluted attention. When the window grows to thousands of tokens, that same budget gets spread thin. The model starts missing things, not because the information isn’t there, but because it can’t attend to all of it equally.
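The dilution effect falls straight out of how attention weights are computed: a softmax turns raw scores into weights that always sum to 1, so the same total "budget" is split among however many tokens are in the window. A minimal sketch (illustrative; real models use many attention heads and learned, non-uniform scores):

```python
import math

def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# With uniform scores, each token's weight is exactly 1/n: the fixed
# budget of 1.0 gets spread over however many tokens are in context.
short = softmax([0.0] * 100)     # 100-token context
long_ = softmax([0.0] * 10_000)  # 10,000-token context

print(short[0])  # 0.01   -> each token gets 1% of the attention budget
print(long_[0])  # 0.0001 -> each token gets 0.01%
```

Grow the context 100x and, all else equal, each token's share of attention shrinks 100x. Real attention is not uniform, which is exactly what the next section is about.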
The U-Shaped Attention Curve
If attention simply diluted evenly across the window, long contexts would just be uniformly noisier. The reality is worse.
Research from Liu et al. (Lost in the Middle: How Language Models Use Long Contexts, 2023) revealed a consistent pattern in how models distribute attention across their context:
- Beginning of context: Strong attention. The model pays close attention to what comes first.
- Middle of context: Weak attention. Recall of information placed here drops to roughly 20%; it is likely to be ignored or misinterpreted.
- End of context: Attention recovers. Recent information gets strong focus again.
This creates a U-shaped attention curve. Tokens at the start and end of the window get read carefully. Tokens in the middle get skimmed or skipped entirely.
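The shape of the curve can be captured in a toy model. This is an illustrative function, not a fit to the Liu et al. data: the ~20% middle recall comes from the text above, but the linear interpolation between edge and middle is an assumption.

```python
def u_curve_recall(position, n, edge=0.9, middle=0.2):
    """Toy model of the Lost-in-the-Middle effect: recall is strongest
    at the edges of an n-token context and dips toward `middle` at the
    center. The exact curve shape is an assumption for illustration."""
    if n == 1:
        return edge
    # Distance from the nearer edge, normalized to 1.0 at the center.
    x = min(position, n - 1 - position) / ((n - 1) / 2)
    return edge - (edge - middle) * x

n = 11
print(u_curve_recall(0, n))      # first token: high recall
print(u_curve_recall(n // 2, n)) # middle token: recall bottoms out
print(u_curve_recall(n - 1, n))  # last token: high recall again
```

Plot recall against position and you get the U: high at both ends, a trough in the middle.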
Lost in the Middle: The model’s attention is not uniform. Information placed in the middle of a long context has roughly a 20% chance of being recalled accurately, compared to much higher recall for information at the beginning or end.
Why This Matters for System Design
This isn’t just an academic curiosity. The Lost in the Middle phenomenon has direct consequences for anyone building systems that rely on LLMs.
Chained operations create a middle zone. When a system executes multiple steps in sequence, the results accumulate in the context window. The first step’s output sits near the beginning (decent attention). The last step’s output lands near the end (decent attention). Everything in between falls into the dead zone. The model is more likely to hallucinate about intermediate results than about the first or last ones.
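One way to reason about the dead zone in a chained pipeline is to estimate which steps' outputs land in the middle of the accumulated context. A sketch, with hypothetical names and an assumed 25% edge boundary (not a measured threshold):

```python
def middle_zone_steps(step_token_counts, edge_fraction=0.25):
    """Given the token count of each step's output, appended in order,
    return the indices of steps whose output is centered in the middle
    of the accumulated context. The 25% edge fraction is an assumption
    for illustration, not a property of any particular model."""
    total = sum(step_token_counts)
    lo, hi = total * edge_fraction, total * (1 - edge_fraction)
    risky, pos = [], 0
    for i, n in enumerate(step_token_counts):
        midpoint = pos + n / 2  # center of this step's span in the window
        if lo <= midpoint <= hi:
            risky.append(i)
        pos += n
    return risky

# Five steps of 200 tokens each: steps 1-3 land in the low-attention
# middle; only the first and last step sit near the attended edges.
print(middle_zone_steps([200] * 5))  # [1, 2, 3]
```

A heuristic like this can flag which intermediate results are most at risk of being hallucinated about, and therefore worth restating near the end of the prompt.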
Longer descriptions don’t always help. It’s tempting to pack more detail into prompts on the assumption that more context produces better results. But if those extra tokens land in the middle of the window, the model may attend to the first and last sentences while everything in between gets lost.
Adding more context can actually hurt. There’s a point where stuffing additional information into the context window degrades performance rather than improving it, because the model simply can’t attend to everything you’ve given it.
Rule of thumb: Adding more context doesn’t help if the model can’t attend to it. More tokens equals less focus per token, and the middle of the window is where information goes to die.
Practical Implications
These constraints shape how production LLM systems need to be designed:
- Position matters. Critical information should be placed at the beginning or end of the context, not buried in the middle.
- Brevity wins. Concise, focused context outperforms verbose context. Every unnecessary token dilutes attention for the tokens that matter.
- Structured input helps. Breaking information into clearly delineated sections gives the model better signals about where to focus.
- Context management is essential. Systems that naively concatenate all available information into one long prompt will hit attention limits. Smart systems curate what goes into the window and where it’s placed.
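The positioning and curation advice above can be folded into a small prompt-assembly sketch. The function name and section labels are illustrative, not a real API:

```python
def assemble_prompt(instructions, documents, question):
    """Position-aware context assembly (illustrative sketch).
    Critical items -- the instructions and the question -- go at the
    edges of the window, where attention is strongest; supporting
    documents fill the lower-attention middle."""
    parts = [f"## Instructions\n{instructions}"]
    # Supporting material sits in the middle: clearly delineated
    # sections give the model better signals about where to focus.
    parts += [f"## Document {i + 1}\n{doc}"
              for i, doc in enumerate(documents)]
    # The question goes last, in the high-attention tail of the window.
    parts.append(f"## Question\n{question}")
    return "\n\n".join(parts)

prompt = assemble_prompt(
    instructions="Answer using only the documents below.",
    documents=["Policy text A...", "Policy text B..."],
    question="Which policy covers remote work?",
)
```

The design choice is simply ordering: the same information, placed deliberately at the edges rather than concatenated naively, sits where the attention curve is strongest.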
These aren’t theoretical concerns. They show up in real production systems as hallucinations about intermediate data, missed instructions, and inconsistent behavior that’s hard to debug because the root cause is positional, not logical.
What This Means for Building on Moveworks
These attention constraints are fundamental to how all transformer-based language models work. They’re not bugs to be fixed; they’re architectural realities to be designed around.
The Agentic Reasoning Engine is built with these constraints in mind. Rather than hoping the model will attend to everything perfectly, the reasoning engine structures how information flows through the context window, managing what the model sees and when it sees it.