Few-shot prompting is a standard technique in NLP engineering. While we frequently discuss selection strategies (choosing which examples to use), the ordering of these examples often receives less attention. We typically default to random ordering or static lists. However, given the known positional biases in Large Language Models (LLMs), it makes sense to treat example ordering as an optimization variable rather than a constant.
The mechanism: primacy, recency, and the U-curve
The assumption that models process all context equally is often incorrect. Research into attention mechanisms reveals a U-shaped performance curve (often referred to as the “Lost in the Middle” phenomenon). Although newer long-context architectures are mitigating this issue, information at the boundaries (beginning and end) is still typically prioritized over the middle.
The direction of the bias depends on the architecture:
- Primacy Bias: Causal masking creates an inherent preference for early tokens. Larger, more capable models often favor the first examples presented
- Recency Bias: Mechanisms like Rotary Positional Embeddings (RoPE) and attention weight decay can cause the model to favor the most recent tokens. This is often more pronounced in smaller models or when the semantic quality of options is low
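Which bias dominates for a given model can be probed empirically: hold the example set fixed, move a single diagnostic example through every slot, and compare accuracy per slot. Below is a minimal sketch of the ordering generator; the helper name `gold_at_each_position` is illustrative, and the model-call/evaluation harness is left to the reader.

```python
def gold_at_each_position(gold, distractors):
    """Yield one ordering per slot, placing `gold` at each position
    among the distractors. Scoring the model on each ordering shows
    whether early slots (primacy) or late slots (recency) win."""
    for i in range(len(distractors) + 1):
        yield distractors[:i] + [gold] + distractors[i:]
```

If accuracy peaks when the gold example is first, the model leans primacy-biased; if it peaks when the gold example is last, it leans recency-biased.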
Experiment: the best-last strategy
For context-dependent tasks, such as intent classification or complex reasoning, we can test the hypothesis that maximizing recency helps the model.
Instead of random placement, we can order examples by semantic similarity. The logic is to place the examples most similar to the current query closest to the generation point (the end of the prompt).
```python
from sklearn.metrics.pairwise import cosine_similarity

def order_examples_best_last(query_embedding, example_embeddings, examples):
    """
    Sorts examples by similarity to the query in ascending order.
    The most relevant example appears last (closest to generation).
    """
    similarities = cosine_similarity(
        [query_embedding],
        example_embeddings
    )[0]
    # Sort ascending: least similar first, most similar last
    sorted_indices = similarities.argsort()
    return [examples[i] for i in sorted_indices]

# Usage
ordered = order_examples_best_last(
    query_emb,
    few_shot_embs,
    few_shot_examples
)
prompt = f"{instruction}\n" + "\n".join(ordered) + f"\n{query}"
```
Observations and impact
Research on intent recognition datasets shows this “best last” strategy increased accuracy by 0.2–1.7%.
It is important to note two constraints:
- Semantic Variance: The effect correlates with variance; if all examples are equidistant from the query, ordering impact diminishes
- Model Size: As noted in the mechanism section, larger models might benefit from a “best first” strategy due to primacy bias
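Both constraints can be folded into a single ordering routine. The sketch below is a variant of the earlier function, not a drop-in replacement: it computes cosine similarity with plain NumPy (avoiding the sklearn dependency), skips reordering when the similarity spread is too small to matter, and exposes a `best_last` flag so primacy-biased models can get the reverse order. The `min_std` threshold of 0.02 is illustrative, not tuned.

```python
import numpy as np

def order_examples(query_embedding, example_embeddings, examples,
                   best_last=True, min_std=0.02):
    """Order examples by cosine similarity to the query.
    If the similarity spread (std) is below `min_std`, the examples are
    roughly equidistant and ordering is unlikely to help, so the original
    order is kept. Set best_last=False for primacy-biased models."""
    q = np.asarray(query_embedding, dtype=float)
    E = np.asarray(example_embeddings, dtype=float)
    sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    if sims.std() < min_std:
        return list(examples)
    order = sims.argsort()       # ascending: most similar ends up last
    if not best_last:
        order = order[::-1]      # most similar first instead
    return [examples[i] for i in order]
```

In practice, which flag wins is an empirical question per model and task; the variance guard simply avoids paying for a reorder that cannot help.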
For tasks with limited context, this difference seems marginal. However, in production environments, prompt ordering is a zero-cost hyperparameter. It requires no model fine-tuning and no additional inference latency.
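Treating ordering as a hyperparameter can be as simple as a small dev-set sweep. The harness below is a sketch: `strategies` maps a name to an ordering function, and `evaluate` stands in for your own loop that builds prompts with that function and returns accuracy.

```python
def sweep_orderings(dev_set, strategies, evaluate):
    """Score each ordering strategy on a dev set and return the best.
    `strategies` maps a strategy name to an ordering function;
    `evaluate(fn, dev_set)` is a placeholder for a harness that builds
    prompts with that function and returns a dev-set score."""
    scores = {name: evaluate(fn, dev_set)
              for name, fn in strategies.items()}
    best = max(scores, key=scores.get)
    return best, scores
```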
Broader application
This principle extends beyond classification. Instruction-tuned models can flip preferences based solely on the order of presentation. This positional bias is particularly evident at higher temperatures (e.g., T = 1), where adherence to ambiguous instructions can fluctuate.
Conclusion
Positional bias is an architectural reality. It is not necessarily a bug to be fixed, but a behavior to be characterized.
If you observe inconsistent performance in your few-shot tasks, don’t assume the model attends equally to all examples. Audit where your strongest examples sit in the prompt. If the model seems to ignore your instructions, check the order: your best examples might be lost in the middle.