I agree. Also, Elo ratings can artificially deflate strong performers when they are computed from too few pairwise comparisons. The Guanaco paper points that out and instead reports a 326-point lead for GPT-4 over Guanaco-65B across 10k orderings, as judged by GPT-4 itself, which corresponds to Guanaco winning about 13% of the time. But even that relies only on the Vicuna benchmark set.
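
For anyone who wants to check the 13% figure: here is a minimal sketch, assuming the standard Elo expected-score formula with a 400-point scale (I have not verified that the paper uses exactly this convention). The function name is mine, and strictly speaking the result is an expected score, where a tie counts as half a win.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the lower-rated player, given how many
    points it trails by. Assumes the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (rating_gap / scale))


# 326-point gap between GPT-4 and Guanaco-65B, as quoted above.
guanaco_win_rate = elo_expected_score(326)
print(f"Guanaco expected score: {guanaco_win_rate:.1%}")
```

With a 326-point gap this comes out to roughly 0.13, which matches the "about 13%" reading.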