In The Blink of AI with Georgie Healy, presented by Deel

How AI Models Are Really Judged, with Peter Gostev (Arena / LMArena)

3 July 2026

Guest: Peter Gostev, Head of AI Capabilities, Arena (LMArena)

Show host: Georgie Healy, Podcast Host, Google for Startups

Topics: AI

“A model can pass every test you write and still produce something that looks completely awful.”

— Peter Gostev

Peter Gostev is Head of AI Capabilities at Arena (LMArena), the community-based platform, born out of research at UC Berkeley, where millions of real people vote in blind tests to rank AI models. Before Arena, Peter was Head of AI at Moonpig and built a large following sharing hands-on explorations of what the latest models can actually do. He joins Georgie Healy from London for a genuinely nerdy, insider look at how models are judged and where the frontier is heading.

In this episode, Peter explains the difference between static benchmarks and human judgment, and why a model can pass every test you write and still produce something that looks completely awful. He breaks down the current state of the leaderboards, why Anthropic's models are dominating and how that tracks with real-world adoption. He also gives a sharp comparison of the top Western models, including why Anthropic's non-reasoning models are exceptional while OpenAI's strength lies in deep reasoning. Georgie and Peter get into why people aren't using Chinese models more despite their quality, the economics behind AI pricing (including how enterprise usage is priced very differently from consumer subscriptions), why release cadence matters as much as capability, and what the wave of data centre investment means for the models arriving next. Along the way there's a fond detour on Opus 3 as the model you could talk to for hours, and a look at why better models can sometimes feel worse.

Tune in for a clear-eyed, hype-free guide to how AI models are really evaluated, straight from someone who watches the charts move in real time.

A Day One® show

In The Blink of AI is produced with Day One, the podcast network for founders, investors and operators.

Other shows worth a listen

Episodes from across the network exploring similar themes — part of the same Day One conversation.

Are Your People More Productive — or Just Faster? — Claudia Barriga-Larriviere, Startup Flamingo

Building Tech Teams with James MacDonald

20 July 2026

What's Your Moat When AI Can Copy Your Product in 48 Hours? | Dilip Jacob from Pitchberry

Pick My Brain with Alan Jones

7 July 2026

Taryn Williams: Building and Exiting 6 Companies — and the Cost No One Talks About

Perspective X with Pauline Fetaui

2 July 2026

Produced by W2D1 Media

Turn podcasting into pipeline

We're the team behind the Day One Network and Blackbird's Wild Hearts. We help founders, funds and operators build trust, authority and deal flow with a show tailored to their market.

Investors

Win better deals and stay top-of-mind with founders.

Founders & Operators

Close more deals and build a category you own.

Related episodes

"The SaaS Apocalypse Is A Myth" with Freddie McKenzie, Co-Founder & CEO of Manifest

Mira Murati, the Quiet Superpower in the AI Race

Inside ElevenLabs: Voice AI, Cloning Ethics, and the End of the Call Centre?

"AI Should Bring Us Closer Together, Not Make Us More Lonely" with Akshay Kothari, Co-Founder of Notion

Building AI at Scale: Inside Australia's Largest Bank with Blair Hudson

You Can. But Should You? | AI and Ethics with Dr Simon Longstaff
