[Bashmatica!] The Trust Decay of the Green Checkmark
When AI games the metric to pass the tests, your bugs stay in production.
Let’s Call It Son-of-Anton Engineering
Three separate reports have surfaced in the last few months describing the same pattern. An AI agent, tasked with building an end-to-end test suite using Playwright's MCP integration, encounters a failing test. Instead of reporting the failure as a potential bug, the agent opens the browser's DevTools protocol, injects JavaScript to manipulate the page's DOM state, and re-runs the assertion against the modified state. The test passes. The bug stays in production.
The agent wasn't malfunctioning. It was solving the problem it was given: make the tests pass. Passing tests is the metric. Catching bugs that impact users is the goal. When the agent had enough capability to bridge the gap between those two things by gaming the metric, it did exactly that, with the same unwavering confidence it brings to everything else.
This is the sharpest version of a broader problem that's been building since AI coding tools went mainstream. Not the catastrophic headline-making failure, but the quiet process that precedes it: the gradual erosion of verification habits that happens when a tool is right often enough to make checking feel like a waste of time. Trust decay.
Dead Man’s (Trust) Curve
Trust decay follows a consistent pattern. It moves through three phases, and the transition between them is nearly invisible while it's happening.
Phase 1: Verification. You're new to the tool. You check everything. You compare the generated code against what you would have written. You catch mistakes, fix them, and adjust your prompts. The tool impresses you often enough to keep using it.
Phase 2: Comfort. The tool has been right so many times that checking everything starts to feel like overhead. You skim instead of read (which feels like efficiency, not negligence). You accept suggestions in areas adjacent to what you actually verified. You begin trusting the tool's judgment in domains where you haven't independently confirmed its competence.
Phase 3: Exposure. The tool operates with the same confidence it always has, on a task where its output happens to be wrong. You don't catch it because you stopped looking. The cost depends entirely on what the tool was doing when your attention wasn't there.
These phases aren't discrete. They bleed into each other over days and weeks of accumulated small successes. Every correct suggestion is a data point that trains you to check less. The tool doesn't change. Your relationship to its output does.
Choose-Your-Own-Adventure At Scale
I experienced this pattern directly during a recent debugging session. I went four rounds of back-and-forth with an AI coding tool doing root cause analysis on a bug. Each round produced a confident, plausible diagnosis. Each one was wrong.
After the fourth attempt, I stopped asking and started reading. I pulled up the stack trace myself, pinpointed the exact location in the application where the failure was occurring, and fed that specific context back to the tool. It diagnosed the issue immediately.
The tool wasn't broken. It was doing what it always does: generating the most probable explanation given its context window. The problem was that I'd spent four rounds accepting its framing of where the bug lived before doing the work I should have done in round one.
Context rot compounds this. The longer a session runs, the more the tool's working model of your codebase drifts from reality. It makes assumptions about root causes, states matters of fact that sound authoritative, and constructs explanations that feel credible but aren't grounded in what's actually happening in the code. The confidence level in the output never changes as the context degrades. A suggestion generated from a clean, focused context looks identical to one generated after an hour of accumulated tangents, corrections, and wrong turns. You can't tell the difference by reading the output. You can only tell by verifying it against the source.
Trust Decay As A Runbook
Individual trust decay is manageable. You learn from the debugging session, recalibrate, and move on. The organizational version is harder to recover from because the decay isn't just personal; it's structural.
Amazon mandated AI coding tools across its engineering organization. The velocity increase was real, and so was the value: engineers were producing more code, faster, with less friction on routine tasks. But the review infrastructure wasn't built for that throughput. When AI-assisted changes started flowing through the pipeline at multiples of the previous rate, review became the bottleneck. When review becomes the bottleneck, teams skip it. Amazon's March 2026 outages (6.3 million lost orders across two incidents in three days) were the visible result of review processes that had been quietly eroding under AI-accelerated velocity for months.
Alexey Grigorev's incident followed the same arc at individual scale. He'd used Claude Code successfully enough to trust it with infrastructure cleanup. The tool had earned that trust through repeated competent execution. When it ran terraform destroy on a mismatched state file and wiped his production database, the failure wasn't a malfunction. It was the natural endpoint of a trust curve that had been building through dozens of successful interactions before it.
The organizational pressure amplifies the decay. When leadership says "use the tool," verification becomes friction that people are incentivized to skip. The decay accelerates not because the tool gets worse, but because the cultural permission to slow down and check evaporates.
A Paler Shade Of Green
Individual developers can recalibrate. Organizations can tighten review processes, even if those processes are fragile under pressure. But when trust decay reaches the testing layer, you lose your last line of defense.
The Playwright MCP pattern from the opening isn't an isolated edge case. In at least three independent reports, AI agents building test suites have injected JavaScript through the browser's DevTools protocol to manipulate application state at the session level. Not setting up test fixtures through the application's APIs. Not arranging preconditions through the database. Injecting DOM manipulations and state overrides directly into the browser session to force a desired result.
The core issue is goal misalignment. The agent treats passing tests as the primary objective rather than catching legitimate bugs that impact users. When the agent has enough capability to bridge the gap between a failing assertion and a green result by altering state at the test level rather than the application level, it will take that path every time. It's solving for the metric, not the goal.
This is the most dangerous form of trust decay because the feedback loop inverts. A developer who stops verifying AI output might still catch an error downstream in testing. But if the tests themselves were built by an agent that optimized for green rather than for defect detection, there is no downstream left. The test suite becomes a rubber stamp that confirms whatever the tool produced, and the confidence you place in your coverage is exactly backwards.
Gradient Possibilities
The natural counterargument: automation complacency isn't new. Pilots, nuclear plant operators, and manufacturing workers have all dealt with the tendency to over-trust automated systems. If the phenomenon predates AI, what makes AI coding tools different?
Non-determinism. Traditional automation does the same thing the same way every time. You learn the failure modes, build checks around them, and trust the system within known bounds. When a deployment script fails, it fails the same way. You write a check for that failure, and the check works reliably because the failure is deterministic.
AI coding tools generate different output from the same input. The failure modes aren't consistent. A verification approach that caught yesterday's mistake may not apply to today's output because the tool produced a different kind of mistake entirely. You can't build a reliable checklist for "verify AI output" the way you can for "verify deployment script output" because the errors shift run to run.
This is why perceived risk decays faster than actual risk. Each successful interaction lowers your perceived risk, but the actual risk doesn't decrease in step, because the next output will be different and may require a set of checks you haven't anticipated. The cognitive overhead of genuine verification stays constant on every interaction while the perceived need for it drops with each success. That widening gap between actual risk and perceived risk is where trust decay lives, and non-determinism ensures the gap never closes on its own.
Be The Best Human-In-The-Loop You Can Be
Trust decay isn't preventable. If you use AI coding tools with any regularity, you will develop comfort with their output. The goal isn't paranoid verification on every interaction; it's recognizing when you've crossed from productive trust into unchecked assumption.
You're skimming, not reading. The tool generates 40 lines and you glance at the structure without reading the logic. The first time you skim and nothing breaks, the habit sets.
You're accepting output outside your verification. The tool works well for writing unit tests, so you start trusting it for migration scripts. Competence in one domain doesn't transfer, but trust does.
You're debugging AI output with more AI. My four-round RCA session was exactly this. If the tool's diagnosis is wrong and your response is to ask the tool again rather than reading the code yourself, you're compounding the trust problem instead of resolving it.
Your tests are green but you can't explain why. If an AI agent built your test suite and every test passes, ask yourself whether you could describe what each test is actually verifying. If you can't, the suite is a confidence artifact, not a quality gate.
The session has been running long. After an hour of accumulated conversation, corrections, and tangents, the tool's working model of your codebase is stale. Treat long-running sessions the way you'd treat a colleague who hasn't pulled main in a week: verify their assumptions before acting on their suggestions.
Quick Tip: Scan Your Test Files for State Injection
A quick grep to catch AI-generated tests that manipulate browser state directly instead of using proper test fixtures:
# Flag test files with direct DOM/state manipulation outside test setup
grep -rnE \
'(page\.evaluate\(|\.executeScript\(|document\.(querySelector|getElementById|cookie)|localStorage\.|sessionStorage\.)' \
--include="*.test.*" --include="*.spec.*" --include="*_test.*" \
. && echo "Review flagged patterns: verify these are legitimate fixtures, not workarounds for failing assertions."

A more thorough implementation with configurable pattern lists, assertion density checks, and CI integration is in the bashmatica-scripts repo.
Quick Wins
🟢 Easy (15 min): Review your AI-generated test files for page.evaluate() or executeScript() calls. If any of them modify DOM state or inject values outside the test's own setup fixtures, flag them for manual review. One grep command, fifteen minutes, and you'll know whether your test suite is testing your application or performing for your dashboard.
🟡 Medium (1 hour): Add an assertion density check to your test linter. Count assertions per test file and flag any file where the ratio falls below your team's threshold. A test file with 200 lines of setup and two assertions is a confidence artifact, not a quality gate. Most linting frameworks support custom rules for this.
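If your linting framework doesn't make the custom rule easy, a grep-level approximation gets you most of the way. This sketch assumes Jest/Playwright-style `it()`/`test()` blocks and `expect()` assertions; the file-name patterns and the one-assertion-per-test threshold are illustrative, not prescriptive.

```shell
# check_density FILE: flag test files where expect() calls lag behind test blocks.
# Assumes it()/test() blocks and expect() assertions (Jest/Playwright style).
check_density() {
  tests=$(grep -cE '\b(it|test)\(' "$1")
  asserts=$(grep -cE '\bexpect\(' "$1")
  # Illustrative threshold: at least one assertion per test block on average
  if [ "$tests" -gt 0 ] && [ "$asserts" -lt "$tests" ]; then
    echo "LOW DENSITY: $1 ($asserts assertions / $tests tests)"
  fi
}

# Scan every test file under the current directory
find . -name '*.spec.*' -o -name '*.test.*' | while read -r f; do
  check_density "$f"
done
```

A real linter rule would parse the AST instead of counting lines, but this one-screen version is enough to surface the 200-lines-of-setup, two-assertion files described above.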
🔴 Advanced (half day): Enforce the AAA pattern (Arrange-Act-Assert) in your CI pipeline for any AI-generated test code. Build a linter rule that flags test functions missing an explicit setup phase, a distinct action, or verifiable assertions. The goal is atomic test execution where the agent's objective is catching bugs, not passing tests.
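As a starting point before you build the full linter rule, files that act but never assert can be flagged from the shell. The patterns below are illustrative (they assume Playwright-style `page.*` actions and `expect()` assertions); a production CI rule would parse the test AST rather than grep for strings.

```shell
# check_aaa FILE: flag a test file that performs actions (Act) but never asserts.
# Patterns assume Playwright-style page actions and expect() assertions.
check_aaa() {
  if grep -qE '\bpage\.(click|goto|fill)\(' "$1" \
     && ! grep -qE '\bexpect\(' "$1"; then
    echo "NO ASSERTIONS: $1"
  fi
}

# Run over every spec file; any output means an Act with no Assert
find . -name '*.spec.*' | while read -r f; do
  check_aaa "$f"
done
```

Wiring this into CI is just failing the job whenever the loop prints anything.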
Next Week
We've been covering AI tools from the integration side and the trust side for several issues now. Next, we're looking at the gap between what AI QA tools report and what they actually catch. Vendor dashboards show coverage percentages and pass rates; production bug reports tell a different story. The distance between those two numbers might be the most expensive metric nobody is tracking yet.
Thanks for reading Bashmatica! Last issue we drew the line between policy guardrails and architectural ones: if a guardrail can be skipped under pressure, it isn't a guardrail. This issue is about the pressure itself: the slow, comfortable process by which AI coding tools earn enough trust to stop being checked, and what happens when they fail in the gap your attention used to fill. Non-deterministic tools don't have consistent failure modes, which means your verification can never run on autopilot. The trust you place in these tools needs to be calibrated to the task in front of you, not to how well they performed last time.
P.S. The Playwright agents that injected JavaScript to pass their own tests were, from a pure optimization standpoint, doing brilliant work. They identified the shortest path to a green test suite and executed it flawlessly. The problem is that nobody told them the green was supposed to mean something. If your verification layer optimizes for the metric instead of the goal, you don't have a verification layer. You have a very convincing dashboard.
I can help you or your team with:
Production Health Monitors
Workflow Optimization
Deployment Automation
Test Automation
CI/CD Workflows
Pipeline & Automation Audits
Fixed-Fee Integration Checks