Our QA team says AI made them eight times faster. Here's what I actually believe.

My QA team told me their test documentation and test execution throughput is running at around 800% of what it was before we gave them Claude working alongside Playwright, roughly eight times the output. That is their number, not mine. We haven't independently instrumented it, and I'm naturally suspicious of any improvement that ends in two zeros.

But I've watched what changed, and I believe the shape of it even where I hold the exact figure loosely. This post is about what actually changed, what the published research says about trusting AI-generated tests, and where I think the line currently sits between real leverage and coverage theatre.

What we actually did

Some context. Our QA function is small: a QA manager, two manual testers and one test automation engineer, inside an engineering team that runs a governed AI pilot with human approval gates and peer review on everything that moves towards production.

The bottleneck was never the testing itself. It was everything around the testing: writing test documentation, turning exploratory sessions into repeatable scripts, and keeping Playwright suites in step with a UI that changes every sprint. Our manual testers spent most of their week producing artefacts about testing rather than testing.

What we want from AI here is simple. Manual testers describe behaviour in plain language, Claude turns that into structured test documentation and draft Playwright specs, and the automation engineer reviews, corrects and owns what goes into the suite. The human stays the author of record. The model does the typing.

That last part matters more than any tooling decision. The 800% isn't Claude writing tests unsupervised. It's three people no longer hand-writing documentation that a model can draft in seconds, and an automation engineer reviewing ten drafts in the time it took to write two from scratch.

This stopped being a hack in 2025

When we started, wiring an LLM to a browser was something you assembled yourself. That era is over, and it's worth being precise about why.

Microsoft now ships an official MCP server for Playwright, @playwright/mcp. The design choice that makes it work is that it drives the browser through Playwright's accessibility tree rather than screenshots. No vision model, no pixel guessing, just structured data about what's actually on the page. That single decision removes a whole class of flakiness people associate with "AI clicking around a browser".

Then Playwright went further and made agents a first-party feature. Playwright's test agents ship as a planner, a generator and a healer. The planner explores your running app and produces a Markdown test plan. The generator turns that plan into executable tests and, this is the important bit, verifies selectors and assertions live against the app as it performs the scenarios. That design is aimed squarely at the two classic failure modes of naive LLM test generation: hallucinated assertions and made-up selectors.

Setup is one command:

npx playwright init-agents --loop=claude

When the framework vendor ships the agent workflow themselves, with named support for Claude Code and VS Code, you're no longer early-adopting. You're just adopting.

What the research says about trusting this

Here's where I want to push against my own team's number, because the published evidence is more sobering than any vendor demo.

The best study we have is Meta's TestGen-LLM paper, from deploying LLM test generation at test-a-thons for Instagram and Facebook. Their funnel: 75% of generated test cases built correctly, 57% passed reliably, and only 25% actually increased coverage. It improved 11.5% of the classes it was applied to.

Read those numbers again. At Meta, with serious engineering investment wrapped around the model, three quarters of raw generated tests did not add coverage. The system was still worth deploying: 73% of its recommendations were accepted by Meta engineers for production. But every single one passed through a human gate first, and the gate rejected roughly a quarter of what it saw.
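To make that funnel concrete, here is a minimal sketch that projects the quoted rates onto a batch of drafts. It assumes each percentage is a share of all generated tests, which is how the funnel reads above. The batch size and the names `FunnelRates`, `metaFunnel` and `projectFunnel` are mine, for illustration only.

```typescript
// Rates as quoted above from the TestGen-LLM funnel.
// Assumption: each rate is a share of ALL generated tests, not of the previous stage.
interface FunnelRates {
  builds: number;         // share of generated tests that build correctly
  passesReliably: number; // share that pass reliably
  addsCoverage: number;   // share that actually increase coverage
}

const metaFunnel: FunnelRates = { builds: 0.75, passesReliably: 0.57, addsCoverage: 0.25 };

// Project the rates onto a batch of generated drafts (the batch size is hypothetical).
function projectFunnel(generated: number, rates: FunnelRates) {
  const built = Math.round(generated * rates.builds);
  const passing = Math.round(generated * rates.passesReliably);
  const useful = Math.round(generated * rates.addsCoverage);
  // "discarded" is everything that did not end up adding coverage.
  return { generated, built, passing, useful, discarded: generated - useful };
}

// For 200 drafts: 150 build, 114 pass reliably, 50 add coverage, 150 are discarded.
const batch = projectFunnel(200, metaFunnel);
console.log(batch);
```

Run against your own volumes, the point is the last field: at these rates, most of what a model drafts is something a reviewer has to be willing to throw away.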

Academic evaluations point the same direction. The TestPilot study ran LLM unit-test generation across 25 npm packages and 1,684 API functions and reached a median statement coverage of 70.2%. Useful, genuinely. But nobody who has read these papers would let a model commit tests without review.

So when my team says 800%, the research tells me how to interpret it. The gain is real and it lives in a specific place, the drafting and documentation layer, where a wrong first attempt costs a review cycle rather than a production incident. It is not a claim that three quarters of what the model produces belongs in the suite, because at the state of the art, only about a quarter of raw output adds coverage.

The failure mode that worries me most

Everyone talks about hallucinated assertions. They're real, but they're also the failure you catch, because a wrong assertion fails loudly in review or in CI.

The one I watch for is quieter. Playwright's healer agent takes a failing test and, according to the docs, produces either a passing test or, if it believes the functionality itself is broken, a skipped test. Think about that in a team that has stopped paying attention. An agent that can mark tests as skipped is an agent that can silently shrink your active coverage while the dashboard stays green.

That's the general pattern to defend against, and I'd give it a name: coverage theatre. Tests that run, pass and verify nothing. A suite that grows impressively while the set of behaviours it would actually catch regressions in stays flat. LLMs are spectacular at producing the appearance of testing, and if your only metric is test count or coverage percentage, you will reward exactly that.

Our defences are boring and organisational rather than clever and technical. Every generated test gets a named human owner before it enters the suite. The automation engineer reviews drafts the way you'd review a junior's pull request, with the explicit question "what would this fail to catch?". And anything an agent skips or patches gets surfaced to a person, not buried in a passing run.
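Two of those defences are easy to automate as a check on top of whatever your runner reports. The sketch below assumes a simplified, hypothetical result shape (`TestResult`), not Playwright's actual reporter format, and the test ids and owner names are made up. It flags tests that were active in a baseline run and are skipped now, and tests with no named owner.

```typescript
// Hypothetical, simplified result shape. This is NOT Playwright's reporter schema;
// you would map your runner's output into it.
type TestStatus = "passed" | "failed" | "skipped";

interface TestResult {
  id: string;         // stable test identifier
  status: TestStatus;
  owner?: string;     // named human owner, if one has been assigned
}

// Tests that ran (passed or failed) in the baseline but are skipped in the current run.
function newlySkipped(baseline: TestResult[], current: TestResult[]): string[] {
  const wasActive = new Set(
    baseline.filter((t) => t.status !== "skipped").map((t) => t.id)
  );
  return current
    .filter((t) => t.status === "skipped" && wasActive.has(t.id))
    .map((t) => t.id);
}

// Tests that have no named owner (missing or blank).
function missingOwner(tests: TestResult[]): string[] {
  return tests.filter((t) => !t.owner || t.owner.trim() === "").map((t) => t.id);
}

const lastWeek: TestResult[] = [
  { id: "login-happy-path", status: "passed", owner: "sam" },
  { id: "checkout-total", status: "passed", owner: "sam" },
];
const today: TestResult[] = [
  { id: "login-happy-path", status: "passed", owner: "sam" },
  { id: "checkout-total", status: "skipped", owner: "sam" },
  { id: "refund-flow", status: "passed" },
];

console.log(newlySkipped(lastWeek, today)); // ["checkout-total"]
console.log(missingOwner(today));           // ["refund-flow"]
```

Neither list needs to fail the build. It needs to land in front of a person, which is the whole point of the defence.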

What I'd tell another engineering leader

If you run engineering and your QA team hasn't touched this yet, my honest view is that you're leaving the cheapest productivity gain of the current cycle on the table. But sequence matters.

Start where wrongness is cheap. Documentation, test plans and draft specs are the right entry point, because a bad draft costs minutes. Letting a model author your regression suite unsupervised is the wrong entry point, because the Meta funnel tells you most raw output doesn't earn its place.

Keep humans as the gate, and make the gate real. 73% acceptance at Meta means 27% rejection by people who were paying attention. If your review gate never rejects anything, it isn't a gate, it's a rubber stamp with extra steps.
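One way to notice a rubber stamp is to track the gate's own rejection rate. This is a minimal sketch under my own assumptions: the `ReviewDecision` shape is hypothetical, and the `minSample` threshold of 20 is an arbitrary illustrative default, not a recommendation from any study.

```typescript
// Hypothetical record of one review decision on a generated test.
interface ReviewDecision {
  testId: string;
  accepted: boolean;
}

// Summarise how the review gate is behaving.
// minSample is an assumed threshold: below it, zero rejections proves nothing.
function gateHealth(decisions: ReviewDecision[], minSample = 20) {
  const total = decisions.length;
  const rejected = decisions.filter((d) => !d.accepted).length;
  const rejectionRate = total === 0 ? 0 : rejected / total;
  // A gate that has seen a meaningful number of drafts and rejected none is suspect.
  const rubberStamp = total >= minSample && rejected === 0;
  return { total, rejected, rejectionRate, rubberStamp };
}

// Example: 100 reviews with 27 rejections, the same split as the Meta figure above.
const sample: ReviewDecision[] = Array.from({ length: 100 }, (_, i) => ({
  testId: `draft-${i}`,
  accepted: i >= 27,
}));
console.log(gateHealth(sample));
```

A non-zero rejection rate doesn't prove the review is good, but a long run of zero is a cheap signal that nobody is really looking.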

And measure something before you brag about anything. Our 800% is a team-reported figure and I present it as exactly that. The next step for us is instrumenting it properly: cycle time on test authoring, escaped defects, and how often generated tests are the ones that catch a real regression. When I have those numbers I'll write the follow-up, including the parts that turn out worse than the headline.

Because that's the honest state of this in mid-2026. The leverage is real, the tooling has gone first-party, and the research says the model is nowhere near trustworthy without a human holding the gate. All three of those things are true at once, and the teams getting value are the ones acting like it.