The Context Window Is Not a Dump Bucket
Your RAG pipeline retrieves 40 chunks. You stuff all 40 into the prompt. The model returns a confident answer that ignores the three most relevant chunks completely.
You check the chunks. They were retrieved correctly. They contained the right information. The model just... didn't use them.
You blame the retriever. You tune your embeddings. You rewrite your prompt. You switch models. The problem persists, across every variation, because you're fixing the wrong thing.
Here's what's actually happening: the relevant chunks were buried in the middle of a 40,000-token context. And LLMs, by a well-documented and reproducible failure mode, perform significantly worse on content in the middle of long contexts than on content at the beginning or end.
The retriever was fine. The embeddings were fine. The model was fine.
Your context design failed. And no amount of prompt tuning fixes a structural assembly problem.
This is Part 3 of the Harness Engineering series. Part 1 introduced the seven-layer Harness Architecture. Part 2 covered the Normalization Layer. This article goes deep on Layer 2 - Context Orchestration - covering memory architectures, retrieval strategies, context compression, and the Lost in the Middle problem.
What Context Engineering Actually Is
Most teams treat context assembly as a retrieval problem. Find the relevant documents, put them in the prompt, done.
This misses the point entirely.
Context Engineering is the discipline of deciding - precisely and intentionally - what information the model sees at each step of execution, how that information is structured, and in what order it appears. It's not retrieval. It's curation, compression, and assembly.
The context window is the model's entire working memory. Everything it knows about the current task lives there. Every decision it makes is a function of what's in that window and how it's arranged. The quality of your context assembly is a stronger predictor of output quality than the model you chose.
What goes into the context window is what the model reasons over. Nothing more. Nothing less.
This is why I treat context orchestration as a distinct engineering discipline - not a retrieval concern, not a prompt engineering concern, but its own layer in the Harness Architecture with its own failure modes, its own tools, and its own metrics.
The Three Failure Modes of Naive Context Assembly
Before building solutions, understand what breaks. Naive context assembly - retrieve, concatenate, send - fails in three specific ways.
Failure Mode 1: Lost in the Middle
Liu et al. (2023) demonstrated what practitioners had been observing anecdotally: LLMs perform significantly worse when relevant information is placed in the middle of long contexts compared to the beginning or end. The performance degradation is not linear - it's a U-shaped curve, with the middle of the context being the dead zone.
The implication is severe. If your retriever returns 20 chunks and you concatenate them in retrieval order, the most relevant chunk might land in position 10-15 - exactly where the model's attention is weakest. You've paid for 20 retrievals and the model effectively used 4.
What breaks: RAG systems that naively concatenate retrieved chunks. Summarization pipelines that dump full documents into context. Any system where content ordering is determined by retrieval rank rather than model attention dynamics.
Failure Mode 2: Context Bloat
Every token in the context window costs money and consumes attention capacity. Boilerplate, redundant content, and low-relevance chunks don't just waste tokens - they dilute the signal-to-noise ratio for the model's attention mechanism.
A 128k context window feels enormous until your agent is 30 steps into a complex task and has accumulated tool outputs, intermediate reasoning, error messages, and retry attempts. Without active context management, the window fills with stale content and the model starts losing track of the original task.
What breaks: Long-running agentic workflows. Multi-document summarization. Any task where context accumulates over multiple turns without compression.
Failure Mode 3: Memory Amnesia
Across sessions, most LLM systems start from scratch. The user's preferences established in session 1 are unknown in session 2. The decision made three conversations ago has to be re-explained. The agent that worked on a task yesterday has no memory of what it did.
This isn't just a UX problem. For complex agentic tasks that span multiple sessions, memory amnesia means the agent re-does work, re-asks questions, and loses the accumulated context that makes it effective.
What breaks: Any agentic system designed for ongoing tasks. Customer-facing assistants. Code agents working on multi-session projects.
The Context Orchestration Stack
A production context orchestration layer has four components. Each addresses a specific failure mode.
Component 1: Retrieval with Re-ranking
Basic RAG uses embedding similarity to retrieve chunks. This is a good start and a terrible finish.
Embedding similarity retrieves semantically related content - but semantic relatedness is not the same as task relevance. A chunk that mentions the same topic isn't necessarily the chunk that helps answer the specific question. Re-ranking adds a second pass that scores retrieved chunks against the specific query for relevance, not just similarity.
```python
from typing import List


class ContextRetriever:
    def __init__(self, vector_store, reranker, top_k: int = 20, final_k: int = 5):
        self.vector_store = vector_store
        self.reranker = reranker
        self.top_k = top_k      # retrieve broadly
        self.final_k = final_k  # rerank to a precise set

    def retrieve(self, query: str) -> List[str]:
        # Stage 1: broad embedding retrieval
        candidates = self.vector_store.similarity_search(query, k=self.top_k)
        # Stage 2: rerank for task relevance
        scored = self.reranker.score(query, candidates)
        ranked = sorted(scored, key=lambda x: x.score, reverse=True)
        return [chunk.content for chunk in ranked[:self.final_k]]
```

The two-stage approach - retrieve broadly, rerank precisely - is now standard in production RAG systems. The reranker (a cross-encoder model like Cohere Rerank or bge-reranker) sees the query and document together, which produces dramatically better relevance scores than embedding similarity alone.
Without re-ranking: you retrieve the 5 most semantically similar chunks. With re-ranking: you retrieve the 5 most task-relevant chunks. The difference in downstream output quality is significant.
Without this component: you're paying retrieval cost for chunks the model won't use.
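To make the two-stage flow concrete, here is a minimal, self-contained sketch. A toy lexical reranker stands in for a real cross-encoder - the `TokenOverlapReranker` and `ScoredChunk` names are illustrative, not from any library - but the shape of the second stage (score query and candidate together, sort, cut to `final_k`) is the same:

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredChunk:
    content: str
    score: float


class TokenOverlapReranker:
    """Toy stand-in for a cross-encoder: scores chunks by query-token overlap."""

    def score(self, query: str, candidates: List[str]) -> List[ScoredChunk]:
        q_tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
        return [
            ScoredChunk(
                content=c,
                score=len(q_tokens & set(re.findall(r"[a-z0-9]+", c.lower())))
                / max(len(q_tokens), 1),
            )
            for c in candidates
        ]


def rerank(query: str, candidates: List[str], final_k: int = 5) -> List[str]:
    # Stage 2 only: in production, `candidates` comes from broad vector retrieval
    scored = TokenOverlapReranker().score(query, candidates)
    ranked = sorted(scored, key=lambda x: x.score, reverse=True)
    return [s.content for s in ranked[:final_k]]


chunks = [
    "The billing API returns invoices in JSON.",
    "Refund policy: refunds are issued within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
best = rerank("what is the refund policy", chunks, final_k=1)
```

Swapping the toy scorer for a real cross-encoder is a one-class change; the surrounding pipeline stays identical.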
Component 2: Position-Aware Assembly
Given the Lost in the Middle problem, chunk ordering is not neutral. The most important content must go at the beginning and end of the context - never the middle.
I call this the Primacy-Recency Assembly pattern: exploit the model's natural primacy and recency bias by placing the highest-value content where attention is strongest.
```python
from typing import List


def assemble_context(chunks: List[str], query: str, system_prompt: str) -> str:
    if not chunks:
        return system_prompt
    # Most relevant chunk first, second most relevant last,
    # remaining chunks fill the middle - exploiting the U-curve
    if len(chunks) <= 2:
        ordered = chunks
    else:
        ordered = [chunks[0]] + chunks[2:] + [chunks[1]]
    context_block = "\n\n---\n\n".join(ordered)
    return f"{system_prompt}\n\n<context>\n{context_block}\n</context>\n\nQuery: {query}"
```

It's a small change with measurable impact on retrieval-augmented tasks.
Without this component: you've sorted your chunks by embedding rank and handed them to the model in an order optimized for your retriever, not for the model's attention.
Component 3: Context Compression
Not all content that needs to be in the context needs to be in the context at full length. Long documents can be summarized. Conversation history can be compressed. Intermediate agent reasoning can be distilled into structured checkpoints.
```python
from typing import List


class ContextCompressor:
    def __init__(self, llm, max_tokens_per_chunk: int = 500):
        self.llm = llm
        self.max_tokens = max_tokens_per_chunk

    def compress_history(self, messages: List[dict], keep_last_n: int = 3) -> List[dict]:
        if len(messages) <= keep_last_n:
            return messages
        # Keep recent messages verbatim
        recent = messages[-keep_last_n:]
        # Compress older messages into a summary
        older = messages[:-keep_last_n]
        summary_text = self._summarize(older)
        summary_message = {
            "role": "system",
            "content": f"[Conversation summary - {len(older)} earlier messages]: {summary_text}",
        }
        return [summary_message] + recent

    def compress_document(self, document: str, focus: str) -> str:
        prompt = (
            f"Extract only the information relevant to: {focus}\n\n"
            f"Document:\n{document}\n\n"
            "Return only the relevant excerpts, verbatim. "
            "If nothing is relevant, return 'No relevant content.'"
        )
        return self.llm.call(prompt)

    def _summarize(self, messages: List[dict]) -> str:
        text = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        prompt = f"Summarize the key decisions, facts, and context from this conversation:\n\n{text}"
        return self.llm.call(prompt)
```

The key principle: compress toward task relevance, not toward brevity. A compression that removes task-relevant detail to save tokens is worse than no compression at all. The `focus` parameter in `compress_document` ensures the compressor knows what to preserve.
Without this component: context bloat accumulates silently across turns. The model loses the thread. Long-running tasks degrade in coherence proportional to their length.
Component 4: Memory Architecture
For agents that operate across sessions, you need a memory system. Memory in this context is not conversation history - it's structured, persistent, retrievable knowledge about the user, the task, and prior decisions.
Three tiers:
Working memory - the current context window. Volatile, token-bounded, task-specific. Managed by the context assembler within each session.
Episodic memory - a structured store of past interactions, decisions, and outcomes. Retrievable by semantic similarity. Used to surface relevant prior context at the start of new sessions.
```python
from datetime import datetime
from typing import List


class EpisodicMemory:
    def __init__(self, vector_store):
        self.store = vector_store

    def save(self, session_id: str, summary: str, metadata: dict):
        self.store.add(
            id=session_id,
            content=summary,
            metadata={**metadata, "timestamp": datetime.utcnow().isoformat()},
        )

    def recall(self, query: str, top_k: int = 3) -> List[str]:
        results = self.store.similarity_search(query, k=top_k)
        return [r.content for r in results]
```

Semantic memory - long-term facts about the user, project, or domain. Structured key-value store, not vector search. Things that don't change session to session: the user's preferred programming language, the project's tech stack, standing decisions already made.
```python
class SemanticMemory:
    def __init__(self, kv_store):
        self.store = kv_store

    def remember(self, key: str, value: str):
        self.store.set(key, value)

    def recall(self, key: str) -> str | None:
        return self.store.get(key)

    def inject_into_context(self, system_prompt: str) -> str:
        facts = self.store.get_all()
        if not facts:
            return system_prompt
        facts_block = "\n".join(f"- {k}: {v}" for k, v in facts.items())
        return f"{system_prompt}\n\n<known_facts>\n{facts_block}\n</known_facts>"
```

Without a memory architecture: every session starts blind. The agent re-asks questions, re-establishes context, and re-makes decisions that were already made. For users doing ongoing work, this is the single largest friction point in production agentic systems.
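Putting the tiers together, session start looks something like the sketch below: inject standing facts, then recalled episodic summaries, then hand off to retrieval. The in-memory `DictKV` store and the `bootstrap_context` helper are illustrative stand-ins; a real system would back them with a database and a vector store:

```python
class DictKV:
    """Illustrative in-memory key-value store for standing facts."""

    def __init__(self):
        self._data = {}

    def set(self, key: str, value: str):
        self._data[key] = value

    def get_all(self) -> dict:
        return dict(self._data)


def bootstrap_context(system_prompt: str, kv: DictKV, episodic_summaries: list) -> str:
    """Assemble the session-start context: standing facts first, then recalled sessions."""
    parts = [system_prompt]
    facts = kv.get_all()
    if facts:
        facts_block = "\n".join(f"- {k}: {v}" for k, v in facts.items())
        parts.append(f"<known_facts>\n{facts_block}\n</known_facts>")
    if episodic_summaries:
        recall_block = "\n".join(f"- {s}" for s in episodic_summaries)
        parts.append(f"<prior_sessions>\n{recall_block}\n</prior_sessions>")
    return "\n\n".join(parts)


kv = DictKV()
kv.set("preferred_language", "Python")
ctx = bootstrap_context(
    "You are a coding assistant.",
    kv,
    ["Migrated the billing service to Postgres; decided against an ORM."],
)
```

The ordering is deliberate: standing facts sit in the high-attention region right after the system prompt, and the query (added later by the assembler) anchors the other end.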
The Full Context Orchestration Pipeline
```mermaid
graph TD
    A[Incoming Query] --> B[Semantic Memory Inject]
    B --> C[Episodic Memory Recall]
    C --> D[Two-Stage Retrieval]
    D --> E[Context Compression]
    E --> F[Position-Aware Assembly]
    F --> G[Token Budget Check]
    G -- Over Budget --> H[Aggressive Compression]
    H --> F
    G -- Within Budget --> I[Assembled Context]
    I --> J[LLM Reasoning]
    J --> K[Save to Episodic Memory]
    style A fill:#4A90E2,color:#fff
    style B fill:#9B59B6,color:#fff
    style C fill:#9B59B6,color:#fff
    style D fill:#7B68EE,color:#fff
    style E fill:#7B68EE,color:#fff
    style F fill:#7B68EE,color:#fff
    style G fill:#6BCF7F,color:#fff
    style H fill:#E74C3C,color:#fff
    style I fill:#98D8C8,color:#000
    style J fill:#FFD93D,color:#000
    style K fill:#9B59B6,color:#fff
```
Every query goes through this pipeline. The token budget check is not optional - it's the mechanism that prevents context bloat from accumulating silently. If assembled context exceeds budget, compression runs again more aggressively before the LLM call proceeds.
Budget your context window the way you budget compute. It is a finite resource with a cost per token and a quality curve that degrades with misuse.
Token Budget Management
The context window is a budget. Spend it intentionally.
A production context budget looks like this:
| Slot | Allocation | Content |
|---|---|---|
| System prompt | 10-15% | Instructions, persona, constraints |
| Semantic memory | 5-10% | Standing facts, user preferences |
| Episodic recall | 10-15% | Relevant prior sessions |
| Retrieved context | 30-40% | RAG chunks, documents |
| Conversation history | 15-20% | Recent turns (compressed) |
| Output headroom | 15-20% | Space for model response |
These are starting percentages, not hard rules. The right allocation depends on your task. A summarization task needs more retrieved context. A conversational task needs more history. A first-turn task needs no episodic recall.
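As a sketch of how those percentages become concrete numbers, here is a small helper that converts shares into absolute per-slot token budgets. The midpoint figures are assumptions taken from the ranges in the table above, not fixed values:

```python
# Midpoints of the allocation ranges in the table above (illustrative, must sum to 1.0)
BUDGET_SHARES = {
    "system_prompt": 0.12,
    "semantic_memory": 0.08,
    "episodic_recall": 0.12,
    "retrieved_context": 0.35,
    "conversation_history": 0.16,
    "output_headroom": 0.17,
}


def allocate_budget(context_limit: int) -> dict:
    """Translate percentage shares into absolute token budgets per slot."""
    return {slot: round(context_limit * share) for slot, share in BUDGET_SHARES.items()}


budget = allocate_budget(128_000)
```

Making the allocation explicit in code, rather than implicit in whatever happens to fit, is what lets you enforce it at assembly time.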
The critical constraint: output headroom must never be sacrificed. Running out of context mid-generation produces truncated, broken outputs. Always reserve it.
```python
class TokenBudgetManager:
    def __init__(self, model_context_limit: int, output_reserve: int):
        self.limit = model_context_limit
        self.reserve = output_reserve
        self.available = model_context_limit - output_reserve

    def fits(self, text: str, tokenizer) -> bool:
        return tokenizer.count(text) <= self.available

    def trim_to_fit(self, components: dict, tokenizer, priorities: list) -> dict:
        """
        Trim components in reverse priority order until context fits.
        priorities: list of keys in order of what to cut last.
        """
        result = dict(components)
        while not self.fits(self._join(result), tokenizer):
            # Cut from lowest priority first
            for key in reversed(priorities):
                if key in result and result[key]:
                    result[key] = self._trim_half(result[key])
                    break
            else:
                break  # nothing left to trim; avoid looping forever
        return result

    def _join(self, components: dict) -> str:
        return "\n\n".join(v for v in components.values() if v)

    def _trim_half(self, text: str) -> str:
        return text[:len(text) // 2] + "...[truncated]"
```

What Observability Looks Like for This Layer
Track these metrics in production:
Context utilization rate - what fraction of your token budget is actually used on average? Consistently above 90% signals context bloat risk. Consistently below 40% signals over-allocation - you're reserving budget for content that never materializes, and could retrieve or recall more.
Retrieval hit rate - of the chunks retrieved and included in context, what fraction are actually referenced in the model's output? A low hit rate means your retriever is returning irrelevant content. Measure this by checking whether retrieved chunk IDs appear in citations or are semantically echoed in the response.
Position distribution of cited content - are citations clustering at the beginning and end of context (expected with Lost in the Middle) or distributing evenly? If they're clustering heavily at position 0, your middle content is being ignored regardless of relevance.
Memory recall precision - when episodic memory is recalled, does it actually improve output quality on the subsequent task? Track this with A/B comparison: sessions with memory recall vs. sessions without.
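Two of these metrics can be sketched directly. The snippet below assumes citations appear in the response as markers like `[chunk-3]` - a hypothetical convention, not a standard; adapt the regex to however your system cites sources:

```python
import re
from typing import List


def retrieval_hit_rate(chunk_ids: List[str], response: str) -> float:
    """Fraction of included chunks whose IDs are cited in the model's response."""
    if not chunk_ids:
        return 0.0
    cited = set(re.findall(r"\[([\w-]+)\]", response))
    return sum(1 for cid in chunk_ids if cid in cited) / len(chunk_ids)


def citation_position_histogram(chunk_ids: List[str], response: str, bins: int = 3) -> List[int]:
    """Count citations by context position (begin/middle/end thirds by default).
    Heavy clustering in bins 0 and -1 is the Lost in the Middle signature."""
    cited = set(re.findall(r"\[([\w-]+)\]", response))
    counts = [0] * bins
    for i, cid in enumerate(chunk_ids):
        if cid in cited:
            counts[min(i * bins // len(chunk_ids), bins - 1)] += 1
    return counts


ids = [f"chunk-{i}" for i in range(6)]
resp = "Per [chunk-0] and [chunk-5], the refund window is 30 days."
```

Logged per request and aggregated over time, these two numbers tell you whether your retriever and your assembly order are actually earning their keep.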
The Named Pattern: Context Gravity
I use the term Context Gravity to describe the pull that position exerts on model attention. Content at the beginning and end of the context window has high context gravity - the model is reliably drawn to it. Content in the middle has low context gravity - it exists in the prompt but often doesn't meaningfully influence the output.
Context Engineering is the practice of managing context gravity intentionally. You don't just fill the window - you place content where gravity will work in your favor.
High-gravity positions: beginning of context (right after system prompt), end of context (right before the query).
Low-gravity positions: middle of context, especially beyond token position 20k in a long context.
The engineering rule: never place your most critical content in a low-gravity position. If a chunk is important enough to retrieve, it's important enough to place where the model will actually attend to it.
What to Build First
First: Two-stage retrieval. Replace single-pass embedding search with retrieve-then-rerank. This is the highest-impact change to a RAG system and can be added in a day using Cohere Rerank, bge-reranker, or a cross-encoder from Sentence Transformers.
Second: Token budget enforcement. Add explicit token counting before every LLM call. Reject or compress contexts that exceed budget. This prevents silent context bloat from accumulating in long-running tasks.
Third: Position-aware assembly. Reorder your chunks using the primacy-recency pattern. Most-relevant first, second-most-relevant last, remainder in the middle.
Fourth: Conversation compression. For multi-turn systems, add history summarization. Keep the last 3-5 turns verbatim, summarize everything older into a structured summary message.
Fifth: Semantic memory. Add a lightweight key-value store for standing facts. User preferences, project context, standing decisions. Inject at session start.
Sixth: Episodic memory. Add vector-stored session summaries. Recall at session start based on query similarity to prior session topics.
The first three address immediate RAG quality issues. Four through six are what separate a good RAG system from a production agentic memory system.
The Principle
The context window is not a staging area. It's not a place to put everything you retrieved and hope the model sorts it out.
The context window is the model's mind. What you put in it, and where you put it, determines what the model thinks.
Context Engineering is the discipline of taking that seriously - of treating the token budget as a managed resource, of understanding how position affects attention, of building memory systems that make agents more effective over time rather than starting blind on every session.
You can't make the model smarter by giving it more tokens. You can make it dramatically more effective by giving it better-organized, better-prioritized, better-compressed context.
The model reasons over what you give it. Give it the right things, in the right order, at the right size.
What's Next in This Series
- Part 1: Harness Engineering - The Missing Layer - The full seven-layer Harness Architecture overview
- Part 2: Normalization and Input Defense - Prompt injection, input sanitization, and multi-surface consistency
- Part 4: Gated Execution - Policy engines, human-in-the-loop design, dry-run patterns, and budget guards for agentic systems with real-world side effects
- Part 5: Validation Layer Design - Schema validators, semantic checks, repair prompt patterns, and when to fail fast vs. recover
- Part 6: Retry, Fallback, and Circuit Breaking - Building resilient LLM infrastructure that survives model outages and latency spikes
- Part 7: State Management for Agentic Systems - Checkpoint-resume strategies, cross-session memory, and durable state for long-running agents
- Part 8: Deterministic Constraint Systems - Building tool registries and action manifests that prevent hallucinated actions in agentic systems
References
- Liu, N. F., Lin, K., Hewitt, J., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157-173. https://doi.org/10.1162/tacl_a_00638
- Gao, Y., Xiong, Y., Gao, X., et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997. https://arxiv.org/abs/2312.10997
- Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085
- Cohere. (2024). Rerank API Documentation. https://docs.cohere.com/reference/rerank
- Park, J. S., O'Brien, J. C., Cai, C. J., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. ACM UIST 2023. https://arxiv.org/abs/2304.03442
- Zhong, W., Guo, L., Gao, Q., et al. (2024). MemoryBank: Enhancing Large Language Models with Long-Term Memory. AAAI 2024. https://arxiv.org/abs/2305.10250
- Anthropic. (2024). Claude long context tips. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/long-context-tips
Related Articles
- Deterministic Constraint Systems: Building Tool Registries That Keep Agents in Scope
- Normalization and Input Defense: Hardening the Entry Point of Your LLM System
- State Management for Agentic Systems: How to Build Agents That Don't Start Over
- Harness Engineering: The Missing Layer Between LLMs and Production Systems