
Your Criteria Are Already Tests

The Honest Tradeoff

In Lesson 2, you learned criteria-based review: walking through each acceptance criterion, checking the actual output, making a clear pass/fail call. That discipline works. But you also heard the honest tradeoff: manual review is slow, and it doesn't catch regressions.

Now feel it concretely. Every feature reviewed by hand. Every acceptance criterion walked through individually. And the question that keeps growing: when you add a new feature, did it break something you already verified? You won't know unless you re-check everything. And re-checking everything doesn't scale.

The Two-Week Cliff

Here's what happens when manual review is your only safety net:

Week 1: You build the OFAC sanctions screening feature. Every vessel in the traffic display gets matched against the sanctions list by MMSI. Sanctioned vessels are flagged immediately. The analyst can click any flagged vessel and see the full OFAC entry. It works. Your team demos it. Everyone is happy.

Week 2: You add gap event history so the analyst can see whether a vessel has gone dark in the past. But the gap event data uses vessel identity records (with names and flags over time), not just MMSIs. Your AI coding assistant builds it in a different data structure than the sanctions feature used. Now you have two features referencing vessel data differently, and neither one knows about the other.

The cliff: A teammate asks, "Can the analyst see both sanctions status and gap history for the same vessel?" You try it. The sanctions flags stop showing up when gap data loads. The two features collide because they make different assumptions about how vessel data is organized. Fixing it means restructuring both features, but you have no tests to tell you whether the sanctions screening still works after the restructuring. You are debugging two features at once with no safety net.

Had you written tests for the sanctions screening feature before building the gap event feature, you would know immediately whether your restructuring broke anything. The tests are not extra work. They are the thing that lets you add complexity without losing what already works.
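To make that concrete, here is a minimal sketch of a regression test that would have caught the collision. Everything in it is an assumption for illustration: the function names `flag_sanctioned` and `attach_gap_history`, the dictionary shape of a vessel record, and the placeholder MMSI 111111111 are hypothetical, not taken from your project. Only MMSI 636021459 comes from this lesson.

```python
# Hypothetical sketch: these names and data shapes are assumptions,
# not the actual structure of your application.
OFAC_MMSIS = {636021459}  # MMSI this lesson cites from the OFAC data file


def flag_sanctioned(vessels, ofac_mmsis):
    """Return a copy of each vessel record with a 'sanctioned' flag set by MMSI."""
    return [{**v, "sanctioned": v["mmsi"] in ofac_mmsis} for v in vessels]


def attach_gap_history(vessels, gap_events):
    """Add a 'gaps' list to each vessel record, keeping its existing fields."""
    return [{**v, "gaps": gap_events.get(v["mmsi"], [])} for v in vessels]


def test_sanctions_flag_survives_gap_history():
    # Given: one sanctioned vessel and one placeholder vessel not on the list
    vessels = [{"mmsi": 636021459}, {"mmsi": 111111111}]
    flagged = flag_sanctioned(vessels, OFAC_MMSIS)

    # When: gap event history is loaded on top of the screened vessels
    merged = attach_gap_history(flagged, {636021459: ["placeholder gap event"]})

    # Then: the sanctions flags are unchanged and the gap data is present
    by_mmsi = {v["mmsi"]: v for v in merged}
    assert by_mmsi[636021459]["sanctioned"] is True
    assert by_mmsi[111111111]["sanctioned"] is False
    assert by_mmsi[636021459]["gaps"] == ["placeholder gap event"]
```

If a later restructuring made `attach_gap_history` rebuild the records and drop the `sanctioned` field, this test would fail on the first run after the change, instead of waiting for a teammate to notice.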

This is the two-week cliff: rapid progress followed by a collapse when changes silently break things that used to work. As one practitioner put it: "You can make insane progress in about a week. I have yet to see that code function beyond the two-week mark." The pattern is consistent: AI can build features fast, but without automated checks, one change can undo a week of verified work.

Manual verification is a point-in-time check. It tells you "this worked when I looked at it." It doesn't tell you "this still works after the last three changes." That gap between what you checked and what's still true is the validation gap, and it grows with every change you make.

[Figure: The safety net comparison, with and without automated tests]

The Insight You Already Have

Here's the good news: you already know how to write test specifications. You've been writing them since Lesson 1.

Look at an acceptance criterion from Lesson 2:

Given the traffic display is loaded, when vessels with MMSIs matching the OFAC list are present, then those vessels are visually distinct from unsanctioned traffic.

This is already almost a test. But "visually distinct" is ambiguous. Does it mean a different color? An icon? A separate section? The acceptance criterion tells you the intent; the test makes it verifiable.

Here is a concrete test derived from that criterion:

Test: Sanctioned vessels are visually distinct
Given: The traffic display is loaded with vessel data that includes
       MMSI 636021459 (which appears in the OFAC sanctions list)
When:  The display renders
Then:  The vessel with MMSI 636021459 has a visual indicator that
       vessels not in the OFAC list do not have

The MMSI is real: 636021459 is in your OFAC data file. The test uses specific data your application actually processes. And "a visual indicator that vessels not in the OFAC list do not have" is verifiable without dictating the design choice: the test passes whether the indicator is a color, an icon, or a separate section.
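The Given/When/Then test above could be expressed in code roughly as follows. This is a sketch under stated assumptions, not the way your application has to work: `render_traffic_display`, the `indicator` field, and the placeholder MMSI 222222222 are all hypothetical stand-ins for whatever your display actually exposes.

```python
# Assumed for illustration; the lesson does not define these names or shapes.
SANCTIONED_MMSI = 636021459  # from the lesson's OFAC example
OTHER_MMSI = 222222222       # placeholder for a vessel not on the OFAC list


def render_traffic_display(vessels, ofac_mmsis):
    """Stand-in renderer: returns one display row per vessel."""
    return [
        {
            "mmsi": v["mmsi"],
            # Any non-None value counts as a visual indicator.
            "indicator": "sanctioned" if v["mmsi"] in ofac_mmsis else None,
        }
        for v in vessels
    ]


def test_sanctioned_vessels_are_visually_distinct():
    # Given: the display is loaded with data that includes the sanctioned MMSI
    vessels = [{"mmsi": SANCTIONED_MMSI}, {"mmsi": OTHER_MMSI}]

    # When: the display renders
    rows = render_traffic_display(vessels, ofac_mmsis={SANCTIONED_MMSI})

    # Then: the sanctioned vessel has an indicator the other vessel lacks
    indicators = {row["mmsi"]: row["indicator"] for row in rows}
    assert indicators[SANCTIONED_MMSI] is not None
    assert indicators[OTHER_MMSI] is None
```

The assertions check only that an indicator exists for the sanctioned vessel and is absent for the other one. They say nothing about what the indicator looks like, so the design stays open.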

This is the "double duty" concept from Lesson 2 taken one step further. In Lesson 2, your criteria served as spec (telling AI what to build) and manual test (telling you what to check). Now they serve as spec and automated test: a check that code runs for you after every change, instead of one you repeat by hand.

The Shift

	Manual Review (Lesson 2)	Automated Tests (Lesson 3)
Who checks	You, walking through each AC	Code that runs through them automatically
When it checks	When you remember to	Every time anything changes
What it catches	What you look at right now	Regressions across the entire project
How it scales	It doesn't: more features = more manual work	It does: more tests = more coverage, same effort to run

You're not replacing your judgment. You're extending it. You still write the acceptance criteria. You still decide what "done" means. But instead of being the only one who can verify, you teach the machine to verify for you.

Your Biggest Fear

Team Discussion | ~2 minutes total

Think back to Challenge 2. You built features, verified them manually, and moved on.

Discuss: What's the feature you're most worried about breaking when you add something new in Challenge 3? Why? What would it take to feel confident that it still works after every change, without re-checking it by hand?

Key Insight

Your acceptance criteria are already test specifications. The Given/When/Then format you learned in Lesson 1 and practiced in Lesson 2 maps directly to automated test structure: setup, action, check. The only difference is who runs the check: you (manual review) or code (automated tests). In the next section, you'll hand your criteria to AI and get that code back.