
I agree. Also, Elo ratings can artificially deflate strong performers when they are computed from too few pairwise comparisons. The Guanaco paper points that out and instead reports a 326-point lead for GPT-4 over Guanaco-65B across 10k orderings, as judged by GPT-4, which corresponds to Guanaco winning about 13% of the time. But even that relies only on the Vicuna benchmark set.
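
For anyone who wants to check the 13% figure: it follows from the standard Elo expected-score formula with the usual 400-point scale. A minimal sketch, where the function name is my own and the "win rate" is really an expected score (a tie counts as half a win):

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the lower-rated player.

    rating_gap is (higher rating - lower rating) in Elo points.
    Uses the standard logistic formula with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (rating_gap / 400.0))


# Gap quoted above: GPT-4 leads Guanaco-65B by 326 points.
guanaco_expected = elo_expected_score(326)
print(f"{guanaco_expected:.3f}")  # roughly 0.13
```

A gap of 0 gives 0.5, and each extra 400 points multiplies the odds against the weaker side by 10.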


