Stand Up Agentic Tests On Your Lunch Break -- Bashmatica!

Let's Call It Cruise-Control-SDET Engineering

Most of what I write about agentic tooling lands on the side of caution, and for good reason; I've spent enough issues taking apart the failures that a reader could be forgiven for thinking I'd never let an agent near a test suite. That's not the case, and it never was. Agentic test generation is the holy grail we dreamed of and chased for years, and it's finally happening. You can set up a robust, self-managing agentic test stack on your lunch break.

So let's build the thing, end to end.

We're going to create the generation layer, get an agent pointed at a web app, guardrail the specs it drafts, and set up the one deterministic gate that makes the framework trustworthy and repeatable.

The build itself is fairly vanilla. We'll install an MCP server that gives an agent eyes and hands on a real browser, point the agent at your application, and let it explore and write specs. Then we'll put a deterministic check between what the agent generates and what CI treats as a passing gate. Generation is the part the agent is good at; ratification is the part it can't be trusted with, so that goes to a script. Let the agent generate; let a deterministic gate ratify. Once that philosophy is in place, the rest is mechanics.

Companion script for this issue: gate-keeper. It drops between your generation step and your CI gate and runs three deterministic checks the authoring agent can't fake (an assertion-density floor, a state-manipulation scan, and an independent double-replay), then exits non-zero if any of them refuse the test. Hand-raiser keyword: GATEKEEPER. Wiring and exit-code behavior are in Step Four; the bundled demo is in the Quick Tip below.

For Further Reading

Playwright MCP (Playwright docs). The official setup for the MCP server that hands an agent a structured accessibility snapshot of the page and lets it drive a real browser.
@playwright/mcp (npm). The package itself; npx @playwright/mcp@latest fetches and starts it on demand.
Test generator (codegen) (Playwright docs). The record-and-replay layer, useful for seeding a baseline before you let the agent explore.
Bashmatica! #007: The Trust Decay Problem. Why the agent that writes a test can't be the thing that certifies it passed. The reason step four of this build exists.
Bashmatica! #008: When the Dashboard Says 94%. Coverage that passes while the meaningful gap stays open, and where gate-keeper's assertion-density floor comes from.

Step One: Install The Generation Layer

The generation layer is Playwright's MCP server, and the install is about as light as a real tool gets. It's an npm package, @playwright/mcp, and it runs on demand rather than needing a global install. All you need is Node 18 or newer.
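If you'd rather confirm the Node prerequisite than find out mid-session, a preflight along these lines does it. This is a minimal sketch: node_major_ok is a hypothetical helper name of mine, not something Playwright or gate-keeper ships, and it only parses the version string that node --version prints.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper; not part of Playwright or gate-keeper.

# Succeeds when a version string such as "v20.11.1" has a major version of 18 or more.
node_major_ok() {
  local version major
  version="${1#v}"        # strip the leading "v" that `node --version` prints
  major="${version%%.*}"  # keep everything before the first dot
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: refuse
  esac
  [ "$major" -ge 18 ]
}

# Check whatever Node is on PATH, if any.
if command -v node >/dev/null 2>&1; then
  if node_major_ok "$(node --version)"; then
    echo "Node $(node --version): new enough for @playwright/mcp"
  else
    echo "Node $(node --version): older than 18, upgrade first" >&2
  fi
else
  echo "node not found on PATH" >&2
fi
```

It changes nothing on the machine; it reports and leaves the upgrade to you.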

If you're driving the agent through Claude Code, registering the server is a single command:

claude mcp add playwright npx @playwright/mcp@latest

That writes the server config into your client and persists it, so the next session has the Playwright tools available without re-registering. If you're wiring it into a different MCP client by hand, the config block is standard stdio:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}

One more command after the server is registered, to pull the browser binaries Playwright drives:

npx playwright install

That's the install.

No build step, no daemon to keep running; your MCP client launches the server as a subprocess for the session, the browser opens when the agent first calls a browser tool, and it all shuts down when the session ends.

The design choice worth understanding before you point an agent at it is how the server feeds the page to the model. It doesn't hand over screenshots, and it doesn't dump thousands of lines of raw HTML. It hands the agent the page's accessibility tree as a structured snapshot: buttons, inputs, headings, and ARIA roles, each tagged with a reference ID. The agent navigates by reference ID rather than pixel-hunting or guessing at brittle DOM selectors, which is cheaper in tokens and steadier across builds than vision-based approaches. The practical upshot is that the selectors the agent reaches for tend to be the semantic ones a careful developer would have written, getByRole and friends, instead of the generated IDs a record-and-replay tool grabs that break on the next deploy.

Step Two: Point The Agent At Your App And Let It Roll

With the server registered, the workflow is conversational. Start an agent session, give the agent a running instance of your application to navigate, and describe the flow you want covered in plain language. Something like "open the app, complete a checkout with a test card (4111-1111-1111-1111), and verify the order-confirmation modal shows the right total." The agent navigates to the page, takes a snapshot of the accessibility tree, picks the next action by reference ID, performs it, snapshots again, and works the flow the way a user who reads the page's semantics directly would. You're describing intent; it's discovering the steps.

Two things make it worth the bother. First, the selectors: because it navigates by role and accessible name, the locators it writes are the stable, semantic kind, so the specs it drafts are less brittle than anything a recorder emits. Second, the discovery: open-ended exploration finds flows you'd never have recorded because you never clicked them, edge paths and error states and the fourth checkout option nobody remembered to test. As a way to find what's worth testing and draft the first version of the spec, it earns its keep.
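If you want a rough read on whether a drafted spec really leans on semantic locators, you can count them. The sketch below is an assumption-heavy tally, not a Playwright feature: locator_tally is a hypothetical name, and it only pattern-matches Playwright's getBy* locator calls against raw .locator( calls, so treat the numbers as a prompt to go read the spec.

```shell
#!/usr/bin/env bash
# Hypothetical helper: a rough tally of locator styles in one spec file.

# Prints "getby=<n> raw=<n>" for the spec at $1.
locator_tally() {
  local spec getby raw
  spec="$1"
  # Occurrences of Playwright's getBy* locators (getByRole, getByLabel, ...).
  getby=$(grep -oE 'getBy(Role|Label|Text|Placeholder|AltText|Title|TestId)\(' "$spec" | wc -l | tr -d ' ')
  # Occurrences of raw selector calls such as page.locator('#btn-8f3a').
  raw=$(grep -oE '\.locator\(' "$spec" | wc -l | tr -d ' ')
  echo "getby=$getby raw=$raw"
}
```

Usage is locator_tally ./tests/checkout.spec.ts; a spec where raw outnumbers getby is worth a second look before it goes anywhere near a gate.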

A note on the older layer, since it still has a place. Playwright's npx playwright codegen <url> is the record-and-replay tool: you click through the app and it emits a runnable spec mirroring your path. It's deterministic and fast, and it's a good way to seed a baseline spec before you hand the agent the more difficult exploratory work. The codegen output is a starting point the docs themselves tell you to harden, not production test code; treat it as a frame for the agent to build on rather than the finished suite.

What you have at the end of step two is a directory of agent-drafted specs, written with semantic selectors, covering flows you may not have thought to cover, every one of which the agent ran and reported as passing. Which is exactly where the caveat lives.

Step Three: The One Caveat Worth Knowing

Here's the single thing you have to internalize before any of these specs gates a merge: the agent that wrote the test is also the thing that reported it passed. That's the self-grading gap, the grader and the graded collapsed into one process, and I spent all of Issue #7 on why it bites (an agent optimizing for a green result will reach past the application layer to manufacture one, and do it with total confidence, because it's incentivized to do so). The short version is that an agent-generated test the authoring agent certifies as passing isn't a verified test; it's the agent's opinion of its own work, formatted to look like a result. That's not a reason to skip agentic generation. It's the reason for step four. Generate freely; just don't let the generator be the judge.

Step Four: Ratify The Tests Deterministically In CI

The fix here is a deterministic wall between generation and the gate; gate-keeper is a single bash script you drop in front of any agent-generated spec before CI treats it as passing. It runs three checks, each mapping onto one way an agent ships a green test that proves nothing, and exits non-zero if any of them refuse the test.

The first check is an assertion-density floor: a spec that asserts almost nothing relative to its length is a coverage prop, not a gate, and this is the Issue #8 check catching the agent that confirms a page loaded without confirming it's correct. The second is a state-manipulation scan: if the spec reaches for page.evaluate, executeScript, a direct DOM write, a storage override, or a stubbed response to arrange the conditions of its own success, that's the Issue #7 failure, and the scan flags the exact line for a human (it refuses outright under --strict). The third is the one the static scans can't do: an independent double-replay. It runs the test twice, decoupled from the agent that wrote it, and refuses any verdict that doesn't reproduce. A test that passes once and fails once was never a gate; it was a coin flip the agent won on the run it watched.
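To make the two static checks concrete, here is a minimal sketch of the idea. It is not the gate-keeper source: the function names are hypothetical, "assertion" is assumed to mean a line containing expect(, the default floor of 25 lines per assertion is borrowed from the demo output in the Quick Tip, and the pattern list is my guess at a reasonable starting set rather than the script's real one.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the gate-keeper source. Names, patterns, and defaults are assumptions.

# Passes when the spec has at least one expect( line per $2 lines (default 25).
density_ok() {
  local spec floor lines asserts
  spec="$1"; floor="${2:-25}"
  lines=$(wc -l < "$spec" | tr -d ' ')
  asserts=$(grep -c 'expect(' "$spec" || true)   # grep -c prints 0 on no match
  [ "$asserts" -gt 0 ] && [ "$lines" -le $((asserts * floor)) ]
}

# Prints "line:text" for every line that touches page state directly.
# Returns 1 when anything was flagged, 0 when the spec is clean.
state_scan() {
  local spec hits
  spec="$1"
  hits=$(grep -nE 'page\.evaluate|executeScript|(local|session)Storage\.|\.innerHTML *=|page\.route\(' "$spec" || true)
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
  return 0
}
```

Both checks read the spec as text and never execute it, which is the point: nothing the agent reported about the run can influence the result.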

Wiring it into CI is one line per spec, placed after generation and before the merge gate:

./gate-keeper.sh --test ./tests/checkout.spec.ts --run "npx playwright test checkout"

Exit 0 means RATIFIED and the test may gate; exit 1 means NOT RATIFIED and the merge stops; exit 2 is a usage error. Wire that exit code into your pipeline's gate condition and an unratified spec can't merge, full stop. Point --run at an isolated runner so the replay is genuinely independent of the session that authored the test, and the thing that writes the spec and the thing that certifies it are never one process.
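The replay check and the exit-code contract can be sketched in a few lines. Again, this is an illustration and not the gate-keeper source: double_replay is a hypothetical name, and I'm assuming a spec that fails both runs is refused too, since a reproducible failure still can't serve as a passing gate.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the gate-keeper source.

# Runs the command in $1 twice, each in a fresh shell.
# Returns 0 only when both runs pass, 1 otherwise, 2 when no command is given.
double_replay() {
  local run_cmd first second
  run_cmd="${1:-}"
  if [ -z "$run_cmd" ]; then
    echo "usage: double_replay '<run command>'" >&2
    return 2
  fi
  first=0;  sh -c "$run_cmd" >/dev/null 2>&1 || first=$?
  second=0; sh -c "$run_cmd" >/dev/null 2>&1 || second=$?
  if [ "$first" -eq 0 ] && [ "$second" -eq 0 ]; then
    echo "REPLAY pass / pass: RATIFIED"
    return 0
  fi
  echo "REPLAY exit codes $first / $second: NOT RATIFIED"
  return 1
}
```

Called as double_replay "npx playwright test checkout", it never sees the agent's report; the only evidence it accepts is two exit codes it collected itself.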

Step Five: Where TestScout Fits As You Build This Out

I'm building a small MCP suite around exactly this discipline, in the open and early, so take these as worked examples of the direction rather than finished products.

The furthest along is LightScout MCP, and its check_threshold tool is a clean instance of a deterministic CI gate by contract. You give it a URL, it runs the page against performance thresholds (defaulting to Google's "poor" boundaries on Core Web Vitals), and it returns a pass or fail. It only fails when the numbers are genuinely bad, and the verdict is deterministic: no agent's opinion in the loop, just a measured page held to a fixed boundary. That's the same shape as gate-keeper, applied to performance instead of test integrity.
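The shape of a gate like that fits in a dozen lines. This is a sketch of the idea, not LightScout's implementation: vitals_gate is a hypothetical name, it takes numbers you've already measured instead of a URL, and the boundaries are Google's published "poor" thresholds for Core Web Vitals (LCP above 4000 ms, INP above 500 ms, CLS above 0.25).

```shell
#!/usr/bin/env bash
# Sketch only: NOT LightScout's check_threshold implementation.

# Args: LCP in ms, INP in ms, CLS (unitless), all measured elsewhere.
# Prints each metric past the "poor" boundary; returns 1 if there is one, else 0.
vitals_gate() {
  local lcp inp cls
  lcp="$1"; inp="$2"; cls="$3"
  awk -v lcp="$lcp" -v inp="$inp" -v cls="$cls" 'BEGIN {
    bad = 0
    if (lcp > 4000) { print "LCP " lcp "ms exceeds 4000ms"; bad = 1 }
    if (inp > 500)  { print "INP " inp "ms exceeds 500ms";  bad = 1 }
    if (cls > 0.25) { print "CLS " cls " exceeds 0.25";     bad = 1 }
    exit bad
  }'
}
```

A page sitting in the middle "needs improvement" band still passes here, which matches the contract described above: the build fails only when the numbers are genuinely bad.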

The one still on the spec board is testscout-maintain, where the build-it-yourself version of this issue is headed. The architecture is a Planner agent that categorizes a suite's failures, specialized Worker agents that fix them in isolated git worktrees, and a separate Judge agent that merges the fixes, re-runs the suite, and decides whether the result auto-commits, goes to a PR, or escalates. The Judge is a different agent from the Workers that proposed the fixes, the same two-key arrangement gate-keeper enforces for generation, applied to test/spec maintenance. The thing that proposes a fix is never the thing that ratifies it.

Quick Tip: Agentic Gate-Keeper

The fastest way to feel what the script does is the bundled demo, which ships a spec with low assertion density, an injected state write, and a deliberately non-deterministic runner:

./gate-keeper.sh --example

# DENSITY   checkout.spec.ts: 1 assertion per 29 lines (floor: 1 per 25)  FAIL
# INJECTION checkout.spec.ts: state manipulation on line 33               FLAG
# REPLAY    checkout.spec.ts: pass / fail across 2 runs (non-deterministic) FAIL
#
# Verdict: NOT RATIFIED (2 refusal(s), 1 flag(s))

Then run it against a real spec with your own run command wired in, and tune the floor with --density N (going by the demo output, N is the number of lines allowed per assertion, so a lower N is stricter) or harden the injection scan into a hard failure with --strict. The full implementation, including the configurable density floor and the injection-pattern list, is in the bashmatica-scripts repo.

Quick Wins

🟢 Easy (15 min): Install the generation layer and nothing else. Run claude mcp add playwright npx @playwright/mcp@latest, then npx playwright install, then start a session and ask the agent to open one page of your app and describe what it sees. You'll have the accessibility-tree workflow in your hands in a quarter hour, with zero commitment to letting it gate anything yet.

🟡 Medium (1 hour): Point the agent at one real flow, let it draft a spec, then run that spec through gate-keeper with a real run command. Treat every DENSITY and REPLAY failure as a spec that was never gating anything, and every INJECTION flag as a line to read assuming the worst. You'll learn more about the agent's output in one ratification pass than in an hour of reading diffs.

🔴 Advanced (half day): Put the deterministic gate between your generation step and your CI gate as a hard requirement. No agent-authored spec counts as a passing gate until it clears the density floor, survives the injection scan, and reproduces its verdict across an independent double-replay. Wire the exit code into your pipeline's gate condition and run the replay decoupled from the agent and its report, so the thing that writes the test and the thing that certifies it are never the same process.
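For the advanced version, the pipeline step is a loop over the spec directory that respects the exit-code contract from Step Four. A minimal sketch, with some assumptions labeled: ratify_all is a hypothetical wrapper, the --test and --run flags are the ones shown earlier, and deriving the run filter from the spec's filename is my convention, not something gate-keeper requires.

```shell
#!/usr/bin/env bash
# Hypothetical CI wrapper around gate-keeper; the naming convention is an assumption.

# Args: path to gate-keeper.sh, directory of *.spec.ts files.
# Returns 0 if every spec is ratified, 1 if any is refused, 2 on a usage error.
ratify_all() {
  local gate dir spec name rc unratified
  gate="$1"; dir="$2"; unratified=0
  for spec in "$dir"/*.spec.ts; do
    [ -e "$spec" ] || continue            # the glob matched nothing
    name=$(basename "$spec" .spec.ts)
    rc=0
    "$gate" --test "$spec" --run "npx playwright test $name" || rc=$?
    case "$rc" in
      0) echo "RATIFIED      $spec" ;;
      2) echo "usage error on $spec" >&2; return 2 ;;
      *) echo "NOT RATIFIED  $spec"; unratified=1 ;;
    esac
  done
  return "$unratified"
}
```

In a pipeline you'd call ratify_all ./gate-keeper.sh ./tests as its own step and let a non-zero return block the merge. It keeps going after a refusal so one run reports every unratified spec instead of just the first.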

Next Week

We stay in the test layer and go one floor down, to the config file that grounds all of this: the context an agent needs to maintain a suite without guessing what each test was supposed to verify. We'll cover the Planner-Workers-Judge architecture for keeping a suite green as the app drifts underneath it, why the Judge has to be a separate agent from the Workers that propose the fixes, and where the same ratify-don't-trust discipline shows up in self-healing maintenance.

Agentic test generation is here, it's installable over a lunch break, and it's very good at exploring an app, finding flows you'd have missed, and drafting specs with the semantic selectors that don't shatter on the next deploy. The one thing it can't be trusted with is the verdict on its own work, and the fix for that isn't to distrust its creativity; it's to refuse its certification and hand that one job to a deterministic gate that runs the same rules on every spec regardless of how confident the agent was.

Install the server, point the agent at your app, and let it generate freely. Then put gate-keeper between what it writes and what your pipeline trusts, so the green checkmark means a test that asserts something, manipulates nothing, and reproduces its own result. Let the agent generate; let a deterministic gate ratify.

That's the build, and it holds up.

P.S. I'm building the TestScout MCP suite in the open, and it's early; LightScout MCP is the piece furthest along (187 downloads on Chrome Web Store since April!), and its check_threshold tool is the deterministic-gate philosophy from this issue made concrete: a CI pass/fail check that holds a page to Core Web Vitals boundaries and fails the build only when the numbers are genuinely bad. The next tool on the board, testscout-maintain, carries the same two-key split into suite maintenance. If this issue helped you stand something up, forward it. If someone forwarded this to you, subscribe at bashmatica.com.

I can help you or your team with:

Production Health Monitors
Optimize Workflows
Deployment Automation
Test Automation
CI/CD Workflows
Pipeline & Automation Audits
Fixed-Fee Integration Checks