Context windows are growing, but more context does not always help. Long-form analysis and summarization benefit from comprehensive context; routing and classification are different — the model must emit a discrete decision on every call.
Past a threshold, additional context degrades routing decisions before the physical window fills.
In practice, routing quality often remains stable only within a limited fraction of the context window (around 40–50% in reported setups); beyond that share, margins tend to collapse.
The 40% cliff
Qwen2.5-7B on a 128k-scale setup is one documented case: F1 falls from 0.56 to 0.30 across a narrow range of prompt lengths — a cliff, not a slow slope.
Routing quality shifts from stable to unreliable within a small increase in context, rather than degrading gradually. That curve is one model’s; it illustrates the usable-band idea from the introduction, not a guarantee for every router.
Why routing breaks differently than generation
Open-ended generation can smooth over uncertainty with generic phrasing; quality often declines gradually. Routing still forces a discrete choice every time, so logit separation matters more than fluency. The failure is not softer text but wrong or nearly tied scores among options.
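The contrast can be made concrete with a toy argmax: when logits are nearly tied, the same small perturbation that generation would absorb flips a discrete routing decision outright (all numbers below are illustrative):

```python
def top_route(logits):
    # Discrete decision: index of the highest logit.
    return max(range(len(logits)), key=lambda i: logits[i])

# Well-separated logits: a small perturbation cannot flip the decision.
separated = [4.0, 1.0, 0.5]
# Nearly tied logits: the same perturbation flips the argmax.
tied = [2.05, 2.00, 0.5]

perturbation = [-0.1, +0.1, 0.0]  # stand-in for noise added by extra context

print(top_route(separated), top_route([a + p for a, p in zip(separated, perturbation)]))  # 0 0
print(top_route(tied), top_route([a + p for a, p in zip(tied, perturbation)]))            # 0 1
```

The separated case survives the nudge; the tied case silently picks a different route.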
Score dilution → margin collapse → routing collapse
Score dilution
Attention over the prompt is softmax-normalized: weights sum to one, so each new token spreads a fixed mass of “focus” thinner. Current architectures do not guarantee that the gap between the correct logit and distractors grows with context length. As length increases, probability on the intended class shrinks even when the routing instruction is still present — context acts as structured noise for the routing decision, not neutral background.
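A minimal sketch of the dilution effect, modeling attention as a softmax over one instruction token plus n identical distractor tokens (logit values are illustrative, not measured):

```python
import math

def attention_weight_on_instruction(instr_logit, distractor_logit, n_distractors):
    # Softmax weight on the instruction token when the window also
    # contains n distractor tokens with a common (lower) logit.
    instr = math.exp(instr_logit)
    total = instr + n_distractors * math.exp(distractor_logit)
    return instr / total

# The instruction outscores any single distractor, but the softmax
# normalizes over everything in the window, so its weight still shrinks.
for n in (10, 100, 1000, 10000):
    print(n, round(attention_weight_on_instruction(3.0, 0.0, n), 4))
```

For large n the instruction's weight falls roughly as 1/n even though its own logit never changed.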
Margin collapse
When routing instructions sit mid-context instead of at the edges, accuracy falls by over 30% relative to edge placement; adding a few hundred distractor tokens can suppress reasoning traces by up to 50% (Qwen3.5-27B). For routers, that reads as a shrinking top-1 minus top-2 margin: adjacent intents or model tiers stop separating cleanly.
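Expressed as the top-1 minus top-2 statistic a router can actually track (probabilities illustrative):

```python
def top2_margin(probs):
    # Top-1 minus top-2 probability: the separation a router relies on.
    a, b = sorted(probs, reverse=True)[:2]
    return a - b

clean = [0.70, 0.20, 0.10]    # intents separate cleanly
diluted = [0.38, 0.33, 0.29]  # same ranking, but the margin has collapsed

print(round(top2_margin(clean), 2))    # 0.5
print(round(top2_margin(diluted), 2))  # 0.05
```

Note that the argmax is unchanged in both cases; only the margin reveals that the second decision is fragile.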
Routing collapse
When the top-1 vs top-2 margin becomes too small (e.g. < 0.10), the router stops separating options and defaults to the safest path.
Hard negatives — retrieved chunks that appear relevant but imply the wrong route — act as additional distractors, further narrowing margins when placed alongside correct context.
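A toy cosine-similarity sketch of the effect, with hand-made 3-d vectors standing in for real embeddings: the hard negative scores almost as high as the correct chunk, so the retrieval-side margin nearly vanishes:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query         = [1.0, 0.9, 0.1]
correct_chunk = [1.0, 1.0, 0.0]  # supports the right route
easy_negative = [0.0, 0.1, 1.0]  # obviously irrelevant, easy to rank below
hard_negative = [1.0, 0.8, 0.3]  # looks relevant, implies the wrong route

print(round(cosine(query, correct_chunk), 3))  # high
print(round(cosine(query, easy_negative), 3))  # low: wide margin
print(round(cosine(query, hard_negative), 3))  # nearly as high: margin gone
```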
Minimal production rules
Hard cap at ~40%
Start with a conservative fraction of MODEL_CONTEXT_WINDOW and adjust it based on observed margin stability; when over budget, truncate oldest history or documents first.
```python
# For 128k model: ~50k routing budget
ROUTING_CONTEXT_CAP_FRACTION = 0.4
MAX_ROUTING_TOKENS = int(MODEL_CONTEXT_WINDOW * ROUTING_CONTEXT_CAP_FRACTION)

def prepare_routing_context(query, history, retrieved_docs):
    context = build_prompt(query, history, retrieved_docs)
    tokens = tokenizer.encode(context)
    if len(tokens) > MAX_ROUTING_TOKENS:
        # Truncate oldest history first
        context = build_prompt(
            query,
            history[-N_RECENT_TURNS:],
            retrieved_docs[:TOP_K_DOCS],
        )
    return context
```
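A self-contained sketch of the oldest-first truncation step, using a crude whitespace token count in place of the real tokenizer (the helper names and budget are illustrative):

```python
def count_tokens(text):
    # Whitespace proxy; swap in the model's real tokenizer in production.
    return len(text.split())

def truncate_history(history, budget):
    # Keep the most recent turns that fit within the budget,
    # dropping the oldest turns first.
    kept, used = [], 0
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

history = ["first old turn here", "second turn here", "most recent user turn"]
print(truncate_history(history, budget=8))  # oldest turn is dropped
```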
Confidence-aware cascading
When the margin between the top two routes is small, treat the decision as unreliable and escalate to a stronger model instead of accepting a noisy route.
```python
def route_with_cascade(query, context):
    # Assumes predict_proba returns a mapping of route -> probability.
    scores = router.predict_proba(query, context)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_route, best), (_, second_best) = ranked[:2]
    margin = best - second_best
    if margin < CONFIDENCE_THRESHOLD:  # e.g., 0.10
        return FALLBACK_MODEL  # margin too small: escalate
    return best_route
```
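The same cascade as a self-contained function, with the probability vector passed in directly (route names and threshold are illustrative):

```python
def cascade_decision(probs, routes, fallback, threshold=0.10):
    # probs: probability per route, aligned with `routes`.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    margin = probs[ranked[0]] - probs[ranked[1]]
    if margin < threshold:
        return fallback          # unreliable margin: escalate
    return routes[ranked[0]]     # confident: take the top route

routes = ["small-model", "medium-model", "code-specialist"]

print(cascade_decision([0.70, 0.20, 0.10], routes, "large-model"))  # small-model
print(cascade_decision([0.36, 0.34, 0.30], routes, "large-model"))  # large-model
```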
When context can be larger
The ~40% band (of declared capacity) applies when margin quality drives system behavior — routing, intent classification, model selection. Different tasks face different constraints:
- Document QA — When retrieval is high-precision, comprehensive context helps; the bottleneck is finding the right passage, not attention dilution alone
- Code generation — Cross-file dependencies may justify near-full windows when structure and imports matter more than a single routing decision
- Summarization — Errors are gradual, not binary; models can smooth over lost detail without catastrophic failure
The distinction: routing failures are costly and discrete (wrong model or route wastes budget or breaks the request). Generation failures are often recoverable through iteration or reranking.
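One way to encode the distinction is a per-task budget table. The fractions below are illustrative starting points, not benchmarks; only the routing figure comes from the band discussed above, and all of them should be tuned against observed behavior:

```python
# Illustrative fractions of the declared context window per task type.
CONTEXT_BUDGET_FRACTION = {
    "routing": 0.40,                # margin-driven: stay under the cliff
    "intent_classification": 0.40,  # same discrete-decision constraint
    "document_qa": 0.80,            # retrieval precision is the bottleneck
    "code_generation": 0.90,        # cross-file deps may need near-full windows
    "summarization": 0.85,          # degradation is gradual, not cliff-like
}

def max_tokens(task, window):
    # Token budget for a task given the model's declared window size.
    return int(window * CONTEXT_BUDGET_FRACTION[task])

print(max_tokens("routing", 128_000))  # 51200
```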
Summary
Routing quality collapses before context windows fill. Operate within the first ~40% of declared capacity, monitor top-1 vs top-2 margins, and escalate to stronger models when scores are unstable — not when the window is full.
| Rule | Purpose | Typical threshold |
|---|---|---|
| Hard cap at ~40% | Prevent cliff-like degradation | ~50k for 128k models |
| Confidence-aware cascading | Escalate uncertain routes | Margin below 0.10 |
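The margin-monitoring rule can be sketched as a rolling median check; window size and threshold here are illustrative:

```python
from collections import deque

class MarginMonitor:
    # Rolling window of recent top-1 minus top-2 margins; flags instability
    # when the median margin sinks below the threshold.
    def __init__(self, window=100, threshold=0.10):
        self.margins = deque(maxlen=window)
        self.threshold = threshold

    def record(self, probs):
        a, b = sorted(probs, reverse=True)[:2]
        self.margins.append(a - b)

    def unstable(self):
        if not self.margins:
            return False
        ordered = sorted(self.margins)
        median = ordered[len(ordered) // 2]
        return median < self.threshold

monitor = MarginMonitor(window=5)
for probs in ([0.7, 0.2, 0.1], [0.4, 0.35, 0.25], [0.36, 0.33, 0.31]):
    monitor.record(probs)
print(monitor.unstable())  # True
```

An unstable reading is the signal to shrink the routing budget or lean harder on the fallback model, rather than waiting for the window itself to fill.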