LLM-as-a-Judge evaluation pipelines often suffer from a subtle but critical flaw: position bias. In a standard pairwise comparison, where a judge model reads two responses and selects the better one, the response in one position — typically the first — is consistently favored, regardless of actual quality.

This phenomenon is one of the most pervasive and misleading issues in automated LLM evaluation.

Mechanisms driving positional preference

LLM judges display a consistent tendency to favor the answer presented in the first positional slot. Analysis indicates that judge models select the first response in 68% of comparisons, even when human annotators clearly prefer the second option. The exact mechanism is not fully understood, but commonly cited contributors include a primacy effect inherited from training data and the autoregressive tendency to anchor on content that appears earlier in the context window.

Position bias is a widespread issue, affecting a variety of LLMs, with documented bias rates in the literature typically ranging from 60% to 75%.
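The bias rate itself is easy to measure: run each pair in both orders and count how often the first slot wins both times, meaning the verdict followed position rather than content. A minimal sketch, assuming a judge callable that returns 'response_1' or 'response_2' (the names and mock judge here are illustrative):

```python
def measure_position_bias(judge, pairs):
    """Estimate first-slot preference: the fraction of pairs where the
    judge picks whichever response occupies the first position in both runs."""
    position_wins = 0
    for prompt, a, b in pairs:
        first = judge(prompt, a, b)   # a in slot 1
        second = judge(prompt, b, a)  # b in slot 1
        # If slot 1 wins both times, the verdict tracked position, not content.
        if first == 'response_1' and second == 'response_1':
            position_wins += 1
    return position_wins / len(pairs)

# A deliberately biased mock judge that always prefers the first slot:
biased_judge = lambda prompt, r1, r2: 'response_1'
print(measure_position_bias(biased_judge, [("q", "a", "b")] * 10))  # → 1.0
```

A content-sensitive judge would score near zero on this metric, so it doubles as a quick sanity check before trusting a judge model in production.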

The fix: position swapping methodology

The most effective and robust mitigation strategy involves performing the comparison twice, systematically swapping the positions of the responses between runs, and aggregating the resulting judgments conservatively.

def evaluate_with_position_swap(judge_llm, response_a, response_b, prompt):
    """Evaluate with position bias mitigation. Returns: 'A', 'B', or 'tie'"""

    # First run: A first, B second
    result_1 = judge_llm.compare(prompt, response_a, response_b)

    # Second run: B first, A second
    result_2 = judge_llm.compare(prompt, response_b, response_a)

    # Aggregation Logic: Only declare a winner if it prevails in both positions.
    if result_1 == 'response_1' and result_2 == 'response_2':
        return 'A'  # A wins in both positions
    elif result_1 == 'response_2' and result_2 == 'response_1':
        return 'B'  # B wins in both positions
    else:
        return 'tie'  # Inconsistent verdicts (e.g., the first slot wins both runs) count as a tie

A genuinely superior response should prevail irrespective of the positional slot it occupies. This two-step methodology effectively isolates and neutralizes the systematic positional preference.
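To see the aggregation rule in action, here is a self-contained sketch (the mock judges are illustrative, not real models) showing that a position-biased judge collapses to a tie while a content-sensitive judge still produces a winner:

```python
def swap_and_aggregate(judge, prompt, response_a, response_b):
    # Same conservative rule as above: a verdict counts only if it
    # survives both orderings; anything else is declared a tie.
    r1 = judge(prompt, response_a, response_b)  # A in slot 1
    r2 = judge(prompt, response_b, response_a)  # B in slot 1
    if r1 == 'response_1' and r2 == 'response_2':
        return 'A'
    if r1 == 'response_2' and r2 == 'response_1':
        return 'B'
    return 'tie'

biased = lambda p, x, y: 'response_1'  # always picks the first slot
by_length = lambda p, x, y: 'response_1' if len(x) > len(y) else 'response_2'

print(swap_and_aggregate(biased, "q", "short", "a longer answer"))     # → tie
print(swap_and_aggregate(by_length, "q", "short", "a longer answer"))  # → B
```

The biased judge's preference cancels out across the two orderings, which is exactly the isolation property the methodology relies on.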

Results

Position swapping delivered substantial improvements in the reliability of the judge's verdicts.

The primary operational constraint is the 2x increase in API calls per comparative judgment. For most use cases, this increased cost is justified by the resulting improvement in signal fidelity and reliability.

Alternative: prompt-based mitigation

For cost-sensitive scenarios, try explicit anti-bias instructions:

system_prompt = """
    You are an impartial judge.
    Evaluate both responses solely on merit.

    CRITICAL: Ignore the order in which responses appear.
    Position should have NO influence.
"""

In the same setup, this reduced the first-position preference from 68% to 58% of comparisons. That is a positive shift, but it is empirically less robust than the structural guarantee provided by position swapping.
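Wiring the instruction into a judge call is just prompt assembly. The sketch below builds a generic chat-style message list; the message dict format and the "Response 1"/"Response 2" labels are assumptions for illustration, not tied to any particular API:

```python
def build_judge_messages(system_prompt, question, response_a, response_b):
    # Assemble the anti-bias system prompt plus a user turn holding both
    # candidates, asking for a structured verdict.
    user_content = (
        f"Question:\n{question}\n\n"
        f"Response 1:\n{response_a}\n\n"
        f"Response 2:\n{response_b}\n\n"
        "Which response is better? Answer 'response_1' or 'response_2'."
    )
    return [
        {"role": "system", "content": system_prompt.strip()},
        {"role": "user", "content": user_content},
    ]

messages = build_judge_messages(
    "You are an impartial judge. Ignore the order in which responses appear.",
    "What is position bias?", "answer one", "answer two",
)
```

The same helper can feed both evaluation orders when combining this prompt with position swapping.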

Conclusion and takeaway

Position bias is a major source of error in LLM-as-a-Judge evaluations. Position swapping is the most reliable fix, transforming a biased pipeline into a trustworthy evaluation tool.