[Bashmatica!] Amazon Lost 6.3M Orders. AI Was The Catalyst, Not The Culprit.

AI didn’t introduce a new risk. It overwhelmed controls that were already too weak.

Your Docs Deserve Better Than You

Hate writing docs? Same.

Under the hood, AI agents study your codebase before writing a single word. They scrape your README, pull brand colors, analyze your API surface, and build a structural plan first. The result? Docs that actually make sense, not the rambling, contradictory mess most AI generators spit out.

Parallel subagents then write each section simultaneously, slashing generation time nearly in half. A final validation sweep catches broken links and loose ends before you ever see it.

What used to take weeks of painful blank-page staring is now a few minutes of editing something that already exists.

Try it on any open-source project you love. You might be surprised how close to ready it already is.

Let’s Call It Terminal Velocity Engineering

On March 2, Amazon's AI coding assistant Q pushed a production change to its e-commerce platform. The change triggered 1.6 million website errors and 120,000 lost customer orders. Three days later, on March 5, a second AI-assisted deployment hit the same platform harder: a six-hour outage that knocked out checkout, login, and product pricing across North American marketplaces. 6.3 million orders, gone. Internal memos from Amazon's SVP of e-commerce, Dave Treadwell, described the incidents as having "high blast radius" and linked them to "Gen-AI assisted changes."

Amazon's response was immediate. Treadwell announced a 90-day safety reset across 335 critical Tier-1 systems. The new rules: mandatory two-person review for all production deployments, formal documentation and approval processes enforced with audit trails, and a requirement that junior and mid-level engineers get senior sign-off before deploying any AI-assisted code changes. His diagnosis was direct: the root cause was "a production change, deployed without the mandatory, formal documentation and approval process."

That diagnosis is correct, and it reveals something more important than the outages themselves. The review process that would have caught those changes wasn't missing because of AI. It was already inadequate for the velocity at which code was being produced. AI just removed the last remaining friction that had been masking the gap.

Burndown Rate as a Success Metric

Amazon's internal memos traced the pattern back to Q3 2025, six months before the March outages. A "trend of incidents" with high blast radius, linked to AI-assisted changes, had been accumulating for months before the system broke visibly enough to force an all-hands meeting.

The pattern is straightforward. When a coding assistant generates changes at ten times the speed of manual development, existing review processes become the bottleneck. And when review becomes the bottleneck, teams skip it. Not maliciously, not even consciously in most cases; the velocity of production creates its own pressure, and the path of least resistance is to trust the output that looks correct and ship it.

I've seen this exact dynamic even without AI in the loop. Early in my career, a web platform release crashed on deployment because management deemed stress and load testing "not necessary" roadblocks. The application code looked fine and had passed QA and staging after multiple rounds of bug fixes, and management was fatigued by a process that, from the outside, looked like it had already caught everything. But not all bugs live at the application level; infrastructure needs its time under the microscope too, and no one tested whether ours could handle the traffic patterns the new application would produce. The review gates that would have caught it were scoped, planned, then skipped as too slow. The blast radius was a frantic rollback and a weeks-long deep-dive that an afternoon of load testing and a day of reconfiguration would have prevented.

AI didn't invent the "skip the verification step" failure mode. It just increased the volume of changes flowing through a verification process that was already running at capacity. If your review infrastructure breaks under 10x the throughput, the problem was the infrastructure, not the 10x.

The Policy Band-Aid On The Hemorrhaging Architecture

Amazon's immediate response was a policy fix: require senior sign-off, mandate two-person review, enforce documentation. Those are the right reflexes. They are also insufficient.

Policy guardrails depend on human compliance under pressure. They work when the team is calm, well-rested, and not staring at a deployment queue that's backed up because every change now needs two reviewers and a senior engineer's signature. They fail when the pressure builds, when the deadline hits, when someone decides that this particular change is low-risk enough to expedite. Policy is a rule that says "always review destructive commands." Architecture is a system that makes destructive commands impossible without explicit unlocking.

Last issue covered Alexey Grigorev's DataTalksClub incident: Claude Code wiped his entire production stack by executing terraform destroy on a mismatched state file. The model never hesitated, never flagged the risk, never said "this will delete everything." It constructed a plausible plan and executed it with full confidence.

Grigorev's post-incident response wasn't a policy change. He implemented six guardrails, and every one of them was architectural:

  1. Deletion protection at both Terraform and AWS levels

  2. S3-based Terraform state with versioning enabled

  3. Automated daily restore testing via Lambda: restore from backup, verify with read queries, every night

  4. S3 backup versioning outside the Terraform lifecycle

  5. Separate dev and prod AWS accounts for complete environment isolation

  6. Manual review gates for destructive commands (the only policy-level guardrail in the set)

Five out of six are infrastructure-level constraints. They don't depend on anyone remembering to follow a process. They make the failure mode physically harder to reach, regardless of how confident or fast the tool operating within the environment happens to be.
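As a rough sketch, guardrails 1, 2, and 4 come down to a few AWS CLI calls; the bucket and instance names below are placeholders, not details from Grigorev's actual setup. (The Terraform-level half of guardrail 1 is a `lifecycle { prevent_destroy = true }` block on the resource itself.)

```shell
# Sketch only -- resource names are hypothetical placeholders.
enable_state_guardrails() {
  # Guardrails 2 & 4: version the Terraform state and backup buckets,
  # so an errant delete becomes a recoverable object version.
  aws s3api put-bucket-versioning \
    --bucket example-terraform-state \
    --versioning-configuration Status=Enabled
  aws s3api put-bucket-versioning \
    --bucket example-db-backups \
    --versioning-configuration Status=Enabled

  # Guardrail 1 (AWS level): the API itself refuses to delete the database
  # until a human flips this flag back off.
  aws rds modify-db-instance \
    --db-instance-identifier example-prod-db \
    --deletion-protection \
    --apply-immediately
}
```

Note the shape: each call changes what the platform will permit, not what a person promises to do.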

The contrast with Amazon's response is worth sitting with. Amazon's 90-day reset is a policy overlay: more reviewers, more documentation, more sign-offs. Grigorev's reset was an architecture change: deletion protection, state locking, environment isolation, automated verification. Amazon's approach slows down the entire pipeline. Grigorev's approach makes the dangerous paths harder to reach without slowing down the safe ones.

A Fail For Every Scale

The Amazon outages and the Grigorev incident are the high-profile examples, but the same pattern has been playing out quietly across individual teams for months. A growing collection of incidents in Claude Code's GitHub repository tells the same story at developer scale:

Issue #27063: Claude Code ran drizzle-kit push --force and wiped 60+ database tables. The --force flag bypasses all confirmation prompts. The tool had the permissions, the flag was available, and no architectural gate existed to prevent it.

Issue #14411: Claude Code executed prisma db push --accept-data-loss, destroying production data. The flag name literally describes what it does, and the tool used it without hesitation.

Issue #26913: Claude Code ran alembic downgrade base, rolling back every migration and dropping 21 tables. The command was available, the permissions were sufficient, and the model treated it as a reasonable step in its plan.

Issue #23913: Claude Code deleted 2,229 files in a single operation. No confirmation gate, no dry-run step, no architectural constraint that would have limited the blast radius.

Every one of these incidents shares the same root cause: the tool had unrestricted write access to a system where destructive operations were available without an explicit unlock step. The model didn't malfunction. It operated within the permissions it was given, with the confidence it always has, on a path that happened to be catastrophic. The guardrail that would have prevented each one isn't "tell the model to be more careful." It's removing the --force flag from the tool's available options, requiring a dry-run before any schema migration, and scoping permissions to exclude destructive operations by default.
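One concrete way to "remove the --force flag from the tool's available options" is a thin wrapper placed ahead of the real binary on the agent's PATH. This is a sketch, not a hardened implementation, and the flag list is illustrative; extend it for your stack.

```shell
# Sketch: reject known-destructive flags before they ever reach the real tool.
block_destructive_flags() {
  for arg in "$@"; do
    case "$arg" in
      --force|--hard|--accept-data-loss)
        echo "BLOCKED: '$arg' requires a human-initiated unlock" >&2
        return 1
        ;;
    esac
  done
  return 0
}

# Example wrapper (save as the first `drizzle-kit` on the agent's PATH):
#   block_destructive_flags "$@" && exec /usr/local/bin/drizzle-kit "$@"
```

The model can still propose the destructive command; the environment simply refuses to run it.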

Procedural Lane-Assistance

The guardrail patterns that survive real-world pressure share three characteristics: they are automated, they are default-on, and they constrain the blast radius before the tool reaches the destructive path.

Scoped permissions: The tool should not have access to operations it doesn't need. If your AI coding assistant is generating application code, it doesn't need DROP TABLE permissions on the production database. If it's managing Terraform, it doesn't need destroy capabilities outside a sandboxed environment. Principle of least privilege isn't new; it's just newly urgent.
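In Postgres terms, that looks something like the sketch below. The role and database names are hypothetical; the point is that a non-owner, non-superuser role cannot DROP or TRUNCATE tables no matter what a generated migration asks for.

```shell
# Sketch: a least-privilege database role for an AI coding tool.
create_scoped_ai_role() {
  psql -d example_prod <<'SQL'
CREATE ROLE ai_tool LOGIN;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO ai_tool;
-- No ownership, no superuser, no TRUNCATE grant: Postgres itself rejects
-- DROP TABLE and TRUNCATE from ai_tool, regardless of what gets generated.
SQL
}
```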

Dry-run defaults: Any operation that modifies state should require an explicit opt-in to execute. Terraform plans before applies. Migration previews before pushes. Diff reviews before commits. The default should be "show me what you're going to do," not "do it."
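For Terraform, the pattern is a saved plan file plus an explicit confirmation step; this is a minimal sketch of that workflow, not a full pipeline.

```shell
# Sketch: "show me what you're going to do" is the default; execution is opt-in.
preview_then_apply() {
  terraform plan -out=tfplan            # write the proposed changes to a plan file
  read -r -p "Apply this exact plan? (yes/no) " answer
  # Applying the saved file guarantees only the reviewed changes run.
  [ "$answer" = "yes" ] && terraform apply tfplan
}
```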

Destructive-operation gates: Separate the ability to propose a destructive action from the ability to execute it. A tool can suggest terraform destroy all day; the gate is that execution requires a separate, human-initiated confirmation step outside the automated workflow.
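One way to build that gate is a short-lived unlock file that only a human creates; the file path and timeout below are illustrative choices, not a standard.

```shell
# Sketch: destructive execution requires a fresh, human-created unlock file.
gated_destroy() {
  local unlock=".destroy-unlock"
  # Gate: a human must have run `touch .destroy-unlock` in the last 10 minutes.
  if [ ! -f "$unlock" ] || [ -n "$(find "$unlock" -mmin +10)" ]; then
    echo "BLOCKED: create $unlock manually to confirm, then re-run." >&2
    return 1
  fi
  rm -f "$unlock"                       # single use: the unlock does not persist
  terraform destroy "$@"
}
```

An automated workflow can call `gated_destroy` all day; without the manual `touch`, it never gets past the gate.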

Environment isolation: Dev, staging, and production should be separate accounts, separate credentials, separate blast radii. If a tool wipes dev, you lose an afternoon. If a tool wipes production because dev and prod share an account, you lose 6.3 million orders.
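With AWS, the minimum version of this is two profiles backed by two accounts, where the agent's shell only ever sees the dev one. A sketch, with placeholder values:

```shell
# Sketch: separate profiles, separate credentials, separate blast radii.
# The AI tool's environment exports AWS_PROFILE=dev and nothing else.
cat > aws-config-example <<'EOF'
[profile dev]
region = us-east-1
# credentials scoped to the dev account only

[profile prod]
region = us-east-1
# separate account, separate credentials, human-only access
EOF
```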

Automated verification: Grigorev's nightly restore test is the gold standard. Don't trust that your backups work; verify it automatically, on a schedule, with a check that fails loudly if the restoration doesn't produce a valid state. The backup that hasn't been tested is the backup that doesn't exist.
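A nightly restore check can be sketched in a few lines of shell; this is modeled on the Lambda-based test described above, but the database names, dump file, and read query are hypothetical stand-ins.

```shell
# Sketch of a nightly restore check. Names and the query are hypothetical.
verify_latest_backup() {
  local scratch_db="restore_check_$(date +%Y%m%d)"
  createdb "$scratch_db"
  # Restore the newest dump; --exit-on-error makes a partial restore fail loudly.
  pg_restore --dbname="$scratch_db" --exit-on-error latest.dump || return 1
  # A read query proves the restored state is actually usable, not just present.
  local count
  count=$(psql -d "$scratch_db" -tAc "SELECT count(*) FROM orders;")
  [ "$count" -gt 0 ] || { echo "RESTORE CHECK FAILED" >&2; return 1; }
  dropdb "$scratch_db"
}
```

Wire the non-zero exit into your alerting and the silent-bad-backup failure mode disappears.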

None of these patterns are AI-specific. Every one of them is standard infrastructure hygiene that should exist regardless of whether an LLM, a junior engineer, or a senior engineer is making the changes. AI didn't create the need for these guardrails. It revealed that the need was always there, masked by the slower velocity of manual development.


Quick Tip: Scan Your Diffs for Destructive Patterns

Before AI-generated changes reach a commit, scan the staged diff for the exact flags and commands behind the incidents cited above:

# Check staged changes for destructive patterns. Pre-commit friendly:
# exits non-zero only when a pattern is actually found.
if git diff --cached --unified=0 | grep -iE \
  '(--force|--hard|--accept-data-loss|terraform destroy|DROP TABLE|DELETE FROM|rm -rf|downgrade base|push --force)'
then
  echo "BLOCKED: Destructive pattern detected in staged changes. Review before committing."
  exit 1
fi

A full implementation with configurable pattern lists, override flags, and CI integration is in the bashmatica-scripts repo. It works as a Git pre-commit hook or a standalone CI check, and it catches every destructive pattern from the incidents cited in this issue.

Quick Wins

🟢 Easy (15 min): Audit the permissions your AI coding tools have in your development environment. List every database, cloud service, and system they can reach. If the list is "everything I can reach," that's your first guardrail to install.

🟡 Medium (1 hour): Add a pre-commit hook that scans for destructive patterns in staged diffs (see Quick Tip above). Configure it with patterns specific to your stack: your ORM's force flags, your migration tool's destructive commands, your infrastructure tool's destroy operations.

🔴 Advanced (half-day): Implement environment isolation between your AI tool's working context and production systems. Separate credentials, separate accounts if your cloud provider supports it, and a promotion pipeline that requires human approval to move changes from dev to prod.
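For the easy win, the audit can start from the cloud side; the user name below is a hypothetical stand-in for whatever identity your AI tool runs as.

```shell
# Sketch: enumerate what the AI tool's credentials can actually reach.
# "ai-tool" is a placeholder -- substitute the identity your tool runs as.
audit_ai_tool_permissions() {
  aws sts get-caller-identity                                # who is the tool, really?
  aws iam list-attached-user-policies --user-name ai-tool    # managed policies
  aws iam list-user-policies --user-name ai-tool             # inline policies
}
```

If the output amounts to "administrator access," that is the first guardrail to install.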

Next Week

Five issues of Integration Strategies in six weeks. Time to shift gears. Next time we'll look at what AI coding tools actually do well, where the promises fall apart, and where the gap sits between the vendor demo and everyday use. Amazon mandated AI coding tools for its engineers and got 6.3 million lost orders. Grigorev trusted Claude Code with infrastructure cleanup and lost a production database. Both also found real value in these tools before and after the guardrails failed. The trick is knowing where the value stops and the risks begin in a self-churning hype machine that doesn't blink for a second.

Thanks for reading Bashmatica! Last week we defined the trust boundary: summarize, trust it; diagnose, verify it. Today we defined the operational boundary: if your guardrails depend on someone remembering to follow a process, they will fail at the exact moment the pressure to skip them is highest. Architecture over policy. Constraints over compliance. The blast radius of AI-assisted code isn't a new problem. It's an old problem at a new velocity.

P.S. Dave Treadwell's quote about the root cause is worth reading twice: "a production change, deployed without the mandatory, formal documentation and approval process." The mandatory process existed. The formal documentation existed. The approval requirement existed. None of it prevented a change that moved faster than the process could absorb it. If a guardrail can be skipped under pressure, it isn't a guardrail. It's just a suggestion.

I can help you or your team with:

  • Production Health Monitors

  • Workflow Optimization

  • Deployment Automation

  • Test Automation

  • CI/CD Workflows

  • Pipeline & Automation Audits

  • Fixed-Fee Integration Checks