Peter Gostev is Head of AI Capabilities at Arena (LMArena), the community-based platform, born out of research at UC Berkeley, where millions of real people vote in blind tests to rank AI models. Before Arena, Peter was Head of AI at Moonpig and built a large following sharing hands-on explorations of what the latest models can actually do. He joins Georgie Healy from London for a genuinely nerdy, insider look at how models are judged and where the frontier is heading.
In this episode, Peter explains the difference between static benchmarks and human judgment, and why a model can pass every test you write and still produce something that looks completely awful. He breaks down the current state of the leaderboards, explains why Anthropic's models are dominating and how that tracks with real-world adoption, and gives a sharp comparison of the top Western models, including why Anthropic's non-reasoning models are exceptional while OpenAI's strength lies in deep reasoning. Georgie and Peter get into why people aren't using Chinese models more despite their quality, the economics behind AI pricing and how enterprise usage is priced very differently from consumer subscriptions, why release cadence matters as much as capability, and what the wave of data centre investment means for the models arriving next. Along the way, there's a fond detour on Opus 3 as the model you could talk to for hours, and why better models can sometimes feel worse.
Tune in for a clear-eyed, hype-free guide to how AI models are really evaluated, straight from someone who watches the charts move in real time.



