How AI Pen Testing Actually Works (and Where It Breaks)

Episode Summary

AI is starting to change penetration testing, but most people are asking the wrong question. In this episode of Secured, Cole Cornford sits down with Brendan Dolan-Gavitt, AI researcher at XBOW and former NYU professor, to unpack what autonomous pen testing really is, what it can reliably do today, and what still needs humans.

They explore why AI agents are great at scaling the boring parts of testing, like authenticated workflows and broad vulnerability coverage across huge attack surfaces, and why that does not automatically translate to deep, context-aware exploitation. The conversation also gets into the messy parts: AI systems overclaiming “serious” findings, business logic flaws that are hard to verify, audit expectations, and why scope control needs real guardrails, not vibes. From agent traces and validation models to cost curves and creative exfiltration tricks, this episode is a grounded look at where AI helps AppSec and where it can still cause damage if you trust it too much.

Chapters:

00:00 – Intro

03:10 – From academia to building autonomous security tools

05:00 – Human pen testers vs AI agents: what is actually different

06:40 – Where AI helps most: boring tasks and low-hanging fruit

08:30 – Scale: a thousand targets vs hiring a thousand testers

10:20 – Accessibility, economics, and Jevons paradox

12:30 – Accountability: audit evidence, traces, and “who signs off”

14:40 – Scope control: avoiding prod and preventing out-of-scope actions

16:20 – Safety checkers, overseer agents, and persuasion resistance

18:40 – The cost question: VC money, inference pricing, and efficiency

21:20 – When AI wastes money and why prioritisation matters

23:50 – Failure mode: overclaiming business “vulnerabilities”

26:10 – Validation agents and adversarial peer review

28:40 – The scary clever stuff: exfiltrating files as images

31:00 – What AI finds well: XSS, SQLi, file traversal, hard proof bugs

33:10 – What AI struggles with: business logic and contextual judgement

35:20 – Hype vs skepticism and why nobody has a crystal ball

Let's work together

We help founders scale their voice

Discover how we can help you build a media engine for your startup

Day One exists to help founders and startup operators make better business decisions more often

Subscribe for helpful content from other successful founders, operators and investors

© Copyright W2D1 Media Pty Ltd. All rights reserved. 2025
