TL;DR: We propose Control Variates Evaluation, an unbiased LLM evaluation method that reduces the number of human annotations required.
Accurately evaluating the performance of large language models (LLMs) is crucial before large-scale deployment. Human evaluation remains the gold standard, but it demands substantial time and financial resources, and it may diminish user experience when conducted with active system users.
Synthetic evaluation, on the other hand, generates annotations using a reward model or an LLM (e.g., GPT-4). While this reduces the need for extensive human involvement, synthetic evaluators cannot perfectly reflect human preferences and often introduce bias, undermining the reliability of the evaluation.
In this work, we propose Control Variates Evaluation, which integrates human and synthetic feedback to reduce reliance on human annotations while keeping the evaluation unbiased.
Figure 1: (Left) Control Variates Evaluation makes use of a possibly inaccurate synthetic evaluator to reduce the variance of evaluation, reducing the need for human annotations while preserving unbiasedness. (Right) Averaged mean-square error vs. number of human annotations for Human Evaluation, Synthetic Evaluation, and Control Variates Evaluation using the finetuned Skywork-8B evaluator on Chatbot Arena. Synthetic Evaluation has high bias, while the biases of Human Evaluation and Control Variates Evaluation are negligible; Control Variates Evaluation reduces the variance of Human Evaluation.
Our method is based on control variates, a classical variance reduction technique. Suppose we have an evaluation dataset consisting of prompts and unannotated response pairs generated by two different LLMs, and we want to estimate the win rate between the two LLMs. We gather synthetic annotations on all samples and human annotations on a subset of the samples.
Given a human preference score \( z \) and a synthetic preference score \( \hat z \) on one response pair, Control Variates Evaluation outputs the variance-reduced estimate \[ z^{\mathsf{cv}; \alpha} = z - \alpha (\hat z - \mu_{\hat z}). \] Here $\mu_{\hat z}$ is the synthetic win rate, computed as the mean of the synthetic preference score over all samples in the evaluation dataset. The coefficient $\alpha$ controls the variance of the estimate and attains its optimum at \[ \alpha^* = \frac{\mathrm{Cov}[z, \hat z]}{\mathrm{Var}[\hat z]}. \] In practice, $\alpha^*$ is estimated on the response pairs that have both human and synthetic annotations. The reported win rate is the mean of $z^{\mathsf{cv}; \alpha}$ over all human-annotated response pairs.
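To make the estimator concrete, here is a minimal NumPy sketch of the computation described above. The function name and array layout are ours, chosen for illustration; they are not part of the paper's implementation.

```python
import numpy as np

def control_variates_win_rate(z_human, z_synth_paired, z_synth_all):
    """Control variates win-rate estimate.

    z_human        : human preference scores on the human-annotated subset
    z_synth_paired : synthetic preference scores on that same subset
    z_synth_all    : synthetic preference scores on the full evaluation set
    """
    z_human = np.asarray(z_human, dtype=float)
    z_synth_paired = np.asarray(z_synth_paired, dtype=float)
    z_synth_all = np.asarray(z_synth_all, dtype=float)

    # Synthetic win rate mu_{z_hat}: mean synthetic score over ALL samples.
    mu_synth = z_synth_all.mean()

    # Optimal coefficient alpha* = Cov[z, z_hat] / Var[z_hat],
    # estimated on the doubly-annotated subset.
    cov = np.cov(z_human, z_synth_paired, ddof=1)[0, 1]
    var = z_synth_paired.var(ddof=1)
    alpha = cov / var if var > 0 else 0.0

    # Variance-reduced per-sample scores; their mean is the reported win rate.
    z_cv = z_human - alpha * (z_synth_paired - mu_synth)
    return z_cv.mean(), alpha
```

Because the correction term $\hat z - \mu_{\hat z}$ has mean zero over the evaluation set, subtracting it leaves the estimate unbiased while cancelling the part of the human-score noise that correlates with the synthetic scores.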
Control Variates Evaluation applies to a variety of benchmarks and synthetic evaluators, ranging from small 2B reward models to large language models. Our experiments (summarized in the table below) demonstrate a reduction in human annotations of up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant. All of these savings can be achieved with an easy-to-deploy reward model at nearly no cost.
We can predict the percentage of human annotations saved by Control Variates Evaluation without actually running the evaluation. Control variates theory indicates that Human Evaluation and Control Variates Evaluation reach the same variance when their numbers of human annotations are in the ratio $1:(1-\rho^2)$, where $\rho$ is the correlation coefficient between human and synthetic preferences and can be estimated on a small subset of the evaluation data. This is demonstrated in the figure below: we shift the Human Evaluation curve according to the theoretical prediction, and the shifted curve overlaps with the empirical Control Variates Evaluation curve.
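The prediction itself is a one-liner. Below is a small sketch (function name and pilot-set framing are ours): since Control Variates Evaluation needs only a $(1-\rho^2)$ fraction of the human annotations to match the variance of pure Human Evaluation, the predicted saving is $\rho^2$.

```python
import numpy as np

def predicted_annotation_saving(z_human_pilot, z_synth_pilot):
    """Predict the human-annotation saving ratio from a small pilot set
    annotated by both humans and the synthetic evaluator."""
    rho = np.corrcoef(z_human_pilot, z_synth_pilot)[0, 1]
    # Same variance is reached with a (1 - rho^2) fraction of the human
    # annotations, so the fraction saved is rho^2.
    return rho ** 2

# Example: a correlation of 0.5 between human and synthetic preferences
# predicts a 25% reduction in human annotations at equal variance.
```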
Figure 2: Averaged mean-square error versus number of human annotations for Skywork-8B (finetuned) on Chatbot Arena. The $x$-coordinates of the curves "Human" and "Control Variates" correspond to the number of human annotations. The curve "Human (shifted)" is derived by horizontally scaling the Human Evaluation curve according to the theoretically predicted saving. The averaged mean-square error of Control Variates Evaluation converges to near 0, indicating negligible bias, and the predicted human annotation saving ratio aligns with the actual variance relationship between Human Evaluation and Control Variates Evaluation.
On many popular LLM evaluation benchmarks, such as Chatbot Arena and MT-Bench, there are abundant off-the-shelf human annotations for pre-generated language model responses. Now suppose we have a new LLM and want to compare it with the existing ones in the benchmark. We can use these existing human annotations to finetune the synthetic evaluator before running Control Variates Evaluation.
Note that the dataset used for finetuning the synthetic evaluator contains responses generated by LLMs different from the ones we wish to evaluate, i.e., the responses in the evaluation dataset are out of distribution with respect to the finetuning data. Nevertheless, we show that the finetuned model still generalizes well in terms of its correlation with human annotations, yielding a significant increase in human annotation saving.
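For readers who want a picture of what such finetuning can look like, here is a minimal PyTorch sketch of one training step using the standard Bradley-Terry pairwise loss on existing human preference pairs. This is an illustrative recipe under our own assumptions, not the paper's exact objective; in particular, we assume `reward_model(inputs)` returns one scalar reward per example.

```python
import torch
import torch.nn.functional as F

def finetune_step(reward_model, optimizer, chosen_inputs, rejected_inputs):
    """One gradient step of Bradley-Terry finetuning on human preference pairs.

    Assumes `reward_model(inputs)` returns a tensor of shape [batch] with one
    scalar reward per example; plug in your own tokenization / forward pass.
    """
    optimizer.zero_grad()
    r_chosen = reward_model(chosen_inputs)      # rewards for human-preferred responses
    r_rejected = reward_model(rejected_inputs)  # rewards for the other responses

    # Bradley-Terry pairwise loss: raise the preferred response's reward
    # above the dispreferred one's.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The goal of this step is simply to increase the reward model's correlation $\rho$ with human preferences, which by the analysis above translates directly into a larger annotation saving.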
Figure 3: Averaged human annotation saving ratio before and after finetuning for GRM-Gemma-2B-sftreg and Skywork-Reward-Llama-3.1-8B-v0.2 on Chatbot Arena and MT-Bench. Under all setups, we observe at least a 5% increase in the saving ratio.