Accurately evaluating the performance of large language models (LLMs) is crucial before large-scale deployment. Human evaluation remains the gold standard, but it demands substantial time and financial resources, and it may degrade the user experience when conducted with active system users.
Synthetic evaluation, on the other hand, generates annotations using a reward model or an LLM (e.g., GPT-4). Although it reduces the need for extensive human involvement, synthetic evaluators cannot perfectly reflect human preferences and often introduce bias, undermining the reliability of the evaluation.
In this work, we propose Control Variates Evaluation to integrate human and synthetic feedback, mitigating reliance on human annotations while maintaining unbiased evaluations.
Figure 1: Control Variates Evaluation makes use of a possibly inaccurate synthetic evaluator to reduce the variance of evaluation, reducing the need for human annotations while preserving unbiasedness.
Control variates theory allows the required fraction of human annotations to be predicted without actually running the evaluation. This is demonstrated in the figure below: we shift the human evaluation curve according to the theoretical prediction, and the shifted curve overlaps with the empirical Control Variates Evaluation curve.
Figure 2: Averaged mean-square error versus number of human annotations for Skywork-8B (finetuned) on Chatbot Arena. The $x$-coordinates of the curves "Human" and "Ours" correspond to the number of human annotations. The curve "Human (shifted)" is obtained by horizontally scaling the Human Evaluation curve according to the theoretically predicted saving. The averaged mean-square error of Control Variates Evaluation (ours) converges to near 0, indicating that it has negligible bias, in contrast with the high bias of synthetic evaluation. The predicted human annotation saving ratio aligns perfectly with the actual variance relationship between Human Evaluation and Control Variates Evaluation.
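To make the prediction concrete, here is the standard control variates calculation behind the shift (a sketch; it assumes the synthetic mean \( \mu_{\hat z} \) is estimated with negligible error from synthetic annotations on the full dataset, and it ignores the estimation error in the coefficient \( \alpha^{*} = \mathrm{Cov}[z, \hat z]/\mathrm{Var}[\hat z] \) defined in the method section below):
\[
\mathrm{Var}\!\left[z - \alpha^{*}(\hat z - \mu_{\hat z})\right] = \left(1 - \rho^{2}\right)\mathrm{Var}[z], \qquad \rho = \mathrm{Corr}(z, \hat z).
\]
Hence matching the variance of \( n \) purely human annotations requires roughly \( (1 - \rho^{2})\, n \) human annotations, i.e., the predicted saving ratio is \( \rho^{2} \). This is the horizontal scaling applied to the "Human (shifted)" curve.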
On many popular LLM evaluation benchmarks, such as Chatbot Arena and MT-Bench, abundant off-the-shelf human annotations for pre-generated language model responses are available. Now suppose we have a new LLM that we want to compare with the existing ones in the benchmark. We can use these existing human annotations to finetune the synthetic evaluator before running Control Variates Evaluation.
Note that the dataset used for finetuning the synthetic evaluator contains responses generated by LLMs different from the ones we wish to evaluate, i.e., the responses in the evaluation dataset are out of distribution with respect to the finetuning data. Nevertheless, we show that the finetuned model still generalizes well in terms of its correlation with the human annotations, yielding a significant increase in human annotation saving.
Figure 3: Averaged human annotation saving ratio before and after finetuning for GRM-Gemma-2B-sftreg and Skywork-Reward-Llama-3.1-8B-v0.2 on Chatbot Arena and MT-Bench. Under all setups, we observe at least a 5% increase in the saving ratio.
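To gauge how much a finetuned evaluator helps, one can estimate \( \rho \) (and the implied saving ratio \( \rho^{2} \)) on a held-out set of response pairs that have both human and synthetic scores. A minimal NumPy sketch; the function name and the toy data are illustrative, not from the paper's codebase:

```python
import numpy as np

def predicted_saving_ratio(human_scores: np.ndarray, synthetic_scores: np.ndarray) -> float:
    """Estimate the predicted human-annotation saving ratio rho^2 from paired annotations."""
    rho = np.corrcoef(human_scores, synthetic_scores)[0, 1]
    return rho ** 2

# Illustrative comparison of a synthetic evaluator before and after finetuning
# (the scores below are synthetic toy data, not real annotations).
rng = np.random.default_rng(0)
human = rng.choice([0.0, 0.5, 1.0], size=1000)            # human preference scores per pair
synthetic_before = 0.5 * human + 0.5 * rng.random(1000)   # weakly correlated evaluator
synthetic_after = 0.8 * human + 0.2 * rng.random(1000)    # more correlated after finetuning

print(f"predicted saving before finetuning: {predicted_saving_ratio(human, synthetic_before):.2f}")
print(f"predicted saving after finetuning:  {predicted_saving_ratio(human, synthetic_after):.2f}")
```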
Our method is based on control variates, a classical variance reduction technique. Given a human preference score \( z \) and a synthetic preference score \( \hat z \) for one response pair, Control Variates Evaluation outputs the variance-reduced estimate \[ z^{\mathsf{cv}; \alpha} = z - \alpha (\hat z - \mu_{\hat z}). \] Here \( \mu_{\hat z} \) is the synthetic win rate, computed as the mean of the synthetic preference scores over all samples in the evaluation dataset. The coefficient \( \alpha \) controls the variance of the estimate; its variance-minimizing value is \[ \alpha^* = \frac{\mathrm{Cov}[z, \hat z]}{\mathrm{Var}[\hat z]}. \] In practice, \( \alpha^* \) is estimated on the response pairs that have both human and synthetic evaluations. The output win rate is the mean of \( z^{\mathsf{cv}; \alpha} \) over all human-annotated response pairs.
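A minimal NumPy sketch of the estimator described above (the function signature and array names are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def control_variates_win_rate(
    human_scores: np.ndarray,      # human preference scores z on the human-annotated subset
    synthetic_paired: np.ndarray,  # synthetic scores on the same human-annotated subset
    synthetic_all: np.ndarray,     # synthetic scores on the full evaluation dataset
) -> float:
    """Variance-reduced win-rate estimate: mean of z^{cv;alpha} over annotated pairs."""
    # Synthetic win rate mu_hat: mean synthetic score over ALL samples in the evaluation set.
    mu_hat = synthetic_all.mean()

    # Optimal coefficient alpha* = Cov[z, z_hat] / Var[z_hat], estimated on the
    # response pairs that have both human and synthetic annotations.
    cov = np.cov(human_scores, synthetic_paired, ddof=1)[0, 1]
    alpha = cov / synthetic_paired.var(ddof=1)

    # Control variates estimate per pair, averaged to obtain the win rate.
    z_cv = human_scores - alpha * (synthetic_paired - mu_hat)
    return float(z_cv.mean())
```

With \( \alpha = 0 \) this reduces to plain human evaluation; the more closely the synthetic scores track the human ones, the more variance the correction term removes.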
@article{zhou2025accelerating,
author = {Zhou, Zhaoyi and Song, Yuda and Zanette, Andrea},
title = {Accelerating Unbiased LLM Evaluation via Synthetic Feedback},
journal = {arXiv preprint},
year = {2025},
}