June 15, 2026
How to Test AI Agents That Write or Update Test Code Without Shipping Broken Assertions
A practical workflow for validating AI agents that generate or update test code, with checks for broken assertions, unstable locators, and QA code review gates.
AI agents that write or update test code can save time, but they can also ship the most annoying kind of failure: code that looks correct, runs green once, and then silently checks the wrong thing. In test automation, that is worse than a hard failure. A broken assertion can keep CI green while coverage erodes, which lets a regression survive until production.
If your team is evaluating how to test AI agents that write test code, the goal is not to trust the agent less but to put the right controls around it. Treat generated tests as untrusted code until they pass a layered validation pipeline. The same applies whether the agent is creating new coverage, refactoring old Selenium tests, or updating locators after a UI change.
This article lays out a practical workflow for validating autonomous test code updates before they land in your suite. It focuses on the failure modes SDETs and QA automation owners actually see: broken assertions, unstable selectors, overly broad waits, and tests that pass for the wrong reason.
What can go wrong when an agent writes test code
Code-writing agents fail in predictable ways, and most of them are not syntax errors.
1. They assert the wrong thing
A generated test might verify that a button exists, but not that the action completed. It might assert a text label on the page, while the real user-facing guarantee lives in an API response, local storage value, or audit log.
The most dangerous generated test is the one that passes while checking the wrong business behavior.
2. They overfit to current DOM structure
Agents often choose selectors that are easy to find right now, like .btn.primary or a nested XPath. Those selectors may break the next time the frontend team renames a class or rearranges markup.
3. They rely on brittle waits
If an agent generates waitForTimeout(5000) or a similar sleep-based pattern, the test may pass locally and still become flaky under CI load. Good test code waits for a condition, not a guessed duration.
4. They rewrite tests into something less meaningful
When asked to update a test, a code agent may preserve the mechanics but lose the intent. For example, it may keep a login flow but stop validating the post-login state that originally mattered.
5. They subtly weaken assertions
This is the hardest failure to catch. A strict equality check becomes a contains, a visibility check becomes a presence check, or a success criterion gets replaced with a generic page load.
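A weakened matcher is easy to miss by eye but fairly easy to flag mechanically. The following is a minimal sketch, assuming you can pair each changed assertion line with its previous version. The `WEAKENING_PAIRS` list and the helper names are hypothetical, and the list is illustrative rather than complete.

```typescript
// Pairs of [stricter matcher, weaker matcher]. Illustrative and not exhaustive;
// the pairing itself is an assumption of this sketch.
const WEAKENING_PAIRS: ReadonlyArray<readonly [string, string]> = [
  ["toHaveText", "toContainText"], // exact text -> substring
  ["toBeVisible", "toBeAttached"], // visible -> merely present in the DOM
  ["toBe", "toContain"],           // equality -> containment
  ["toEqual", "toContain"],
];

// Return the first matcher name found on an assertion line, e.g. "toHaveText".
function matcherOf(line: string): string | null {
  const match = line.match(/\.(to[A-Z]\w*)\s*\(/);
  return match?.[1] ?? null;
}

// True when the updated line swaps a stricter matcher for a known weaker one.
function isWeakened(beforeLine: string, afterLine: string): boolean {
  const before = matcherOf(beforeLine);
  const after = matcherOf(afterLine);
  if (before === null || after === null) return false;
  return WEAKENING_PAIRS.some(
    ([strict, weak]) => before === strict && after === weak,
  );
}
```

A check like this only catches matcher swaps. It does not notice a changed expected value or a changed target element, so it complements human review rather than replacing it.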
Start with a review model, not a trust model
Before you let an agent edit anything in the repository, define what level of autonomy it has.
Recommended autonomy levels
- Suggest only: the agent proposes a patch or test plan, and a human applies it.
- Draft and review: the agent opens a pull request but cannot merge.
- Controlled write access: the agent can update test files in a branch, but CI and review gates must pass before merge.
- Full autopilot: only appropriate for low-risk, heavily constrained test maintenance, and still rare.
For most teams, draft and review is the sweet spot. It gives you speed without giving up authorship. It also creates a natural place to inspect the agent’s reasoning, the selectors it chose, and the assertions it changed.
If you want a lower-maintenance path for agent-created coverage, tools like the AI Test Creation Agent from Endtest, an agentic AI test automation platform, are worth a look because they generate editable, platform-native test steps rather than dropping opaque code into your repo. That is not a replacement for review, but it can reduce framework upkeep for teams that want coverage without constant locator babysitting.
Build a validation pipeline for generated test code
Think of validation in layers. Each layer catches a different class of failure.
Layer 1: syntax and linting
This is the cheapest gate. It catches malformed code, unused variables, accidental imports, and broken formatting. It will not catch semantic mistakes, but it should block obvious noise.
For a Playwright suite, that usually means running type checks and lint rules in CI.
name: test-code-validation
on:
  pull_request:
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck
Layer 2: structural review
This is where you validate the shape of the test before executing it.
Check for:
- selectors that match stable user-facing semantics, not implementation details
- assertions that describe user outcomes, not just DOM presence
- unnecessary sleeps or retry loops
- tests that became longer but less specific after the update
- accidental broadening, such as replacing a single targeted check with a generic page title assertion
A good agent workflow should make this review easy. If the change is hard to understand, the agent did too much.
Layer 3: dry run with controlled fixtures
Run the generated or updated test against a known-good environment with predictable data. If the test depends on email, payment, feature flags, or asynchronous jobs, stub or fixture those dependencies.
The point is to answer two questions:
- Does the test run at all?
- Does it validate the intended behavior, not just the happy path shell around it?
Layer 4: mutation-style negative checks
If possible, run a small negative test where the underlying behavior is intentionally wrong, and confirm the assertion fails. This is one of the best ways to catch broken assertions.
For example, if the test checks that a discount was applied, run it once with a fixture that omits the discount and confirm the failure is explicit.
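The discount example can be sketched without a browser at all. The snippet below is a minimal illustration, assuming the assertion can be expressed as a function that throws on failure. The `Cart` shape, the fixtures, and the helper names are all hypothetical, and amounts are in cents to avoid floating-point noise.

```typescript
type Cart = { subtotal: number; discount: number; total: number };

// Assertion under review: throws unless a discount was applied to the total.
function assertDiscountApplied(cart: Cart): void {
  if (cart.discount <= 0) {
    throw new Error(`expected a discount, got ${cart.discount}`);
  }
  if (cart.total !== cart.subtotal - cart.discount) {
    throw new Error(`total ${cart.total} does not reflect the discount`);
  }
}

// A deliberately weak assertion: it passes for any non-negative total.
function assertTotalPresent(cart: Cart): void {
  if (cart.total < 0) throw new Error("total is negative");
}

// Run a check and report whether it threw.
function throwsOn<T>(check: (input: T) => void, input: T): boolean {
  try {
    check(input);
    return false;
  } catch {
    return true;
  }
}

// An assertion is falsifiable here if it passes on the good fixture
// and fails on the fixture where the behavior is intentionally wrong.
function isFalsifiable<T>(check: (input: T) => void, good: T, broken: T): boolean {
  return !throwsOn(check, good) && throwsOn(check, broken);
}

const goodCart: Cart = { subtotal: 5000, discount: 500, total: 4500 };
const brokenCart: Cart = { subtotal: 5000, discount: 0, total: 5000 };

console.log(isFalsifiable(assertDiscountApplied, goodCart, brokenCart));
console.log(isFalsifiable(assertTotalPresent, goodCart, brokenCart));
```

The weak assertion passes on both fixtures, which is exactly the signal a negative check is meant to surface: it cannot tell the correct behavior from the broken one.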
Ask the agent for intent, not just code
A useful pattern is to require the agent to produce a short explanation alongside the patch:
- what behavior it thinks the test is protecting
- which selectors or signals it chose and why
- what it could not verify automatically
- what assumptions it made about the app state
That extra context gives reviewers something to challenge.
For example, if the agent updates a checkout test, its explanation should say something like:
- it verifies the order confirmation page, not just the cart page
- it uses role-based locators for buttons and headings
- it validates the confirmation message and order total
- it assumes a seeded test user and a known product catalog
If that explanation sounds vague, the code probably is too.
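One way to make the explanation a hard requirement is to give it a shape and reject patches whose note is incomplete. This is a minimal sketch: the `AgentChangeNote` fields mirror the four items listed above, while the field names and the 20-character threshold are assumptions chosen for illustration.

```typescript
// Hypothetical structure for the explanation an agent attaches to a patch.
interface AgentChangeNote {
  protectedBehavior: string; // what behavior the test is protecting
  chosenSignals: string[];   // selectors or signals used, with reasons
  unverified: string[];      // what could not be verified automatically
  assumptions: string[];     // assumptions about app state
}

// Return a list of reasons the note is too thin to review. Empty means OK.
function missingContext(note: AgentChangeNote): string[] {
  const problems: string[] = [];
  // Arbitrary minimum length: a one- or two-word summary is not reviewable.
  if (note.protectedBehavior.trim().length < 20) {
    problems.push("protected behavior is missing or too short");
  }
  if (note.chosenSignals.length === 0) {
    problems.push("no selectors or signals explained");
  }
  if (note.assumptions.length === 0) {
    problems.push("no app-state assumptions listed");
  }
  return problems;
}

const checkoutNote: AgentChangeNote = {
  protectedBehavior: "Order confirmation page shows the message and order total",
  chosenSignals: ["role-based locators for buttons and headings"],
  unverified: [],
  assumptions: ["seeded test user", "known product catalog"],
};

console.log(missingContext(checkoutNote).length);
```

An empty `unverified` list is allowed here because an agent may legitimately have verified everything. A stricter team could require an explicit statement instead.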
Review AI-generated assertions as a separate risk surface
Assertions deserve more scrutiny than navigation steps. A generated click path can still be useful even if it is slightly imperfect, but a weak assertion can turn a test into an expensive smoke screen.
When reviewing AI-generated assertions, focus on three questions:
1. Is the assertion specific enough?
Bad:
await expect(page.locator('body')).toContainText('Success');
Better:
await expect(page.getByRole('heading', { name: 'Order confirmed' })).toBeVisible();
await expect(page.getByTestId('confirmation-number')).toHaveText(/^[A-Z0-9-]+$/);
2. Is the assertion tied to the right signal?
A UI message may be incidental. The actual proof might live in a network response, a cookie, or a persisted value.
3. Does the assertion fail for the right reason?
If an assertion fails, the debugging path should be obvious. “Text not found” is too vague when the real issue was that the request never completed.
A strong assertion tells you what broke. A weak assertion just tells you the test stopped agreeing with itself.
Prefer behavior checks over implementation checks
Agents often overuse selectors because selectors are the easiest thing to generate. Human reviewers should push back toward behavior-oriented checks.
Good signals include:
- accessible roles and names
- visible text that users actually see
- URL changes that represent navigation state
- server responses for critical flows
- stored state for app-specific outcomes
Less desirable signals include:
- deep CSS paths
- brittle nth-child selectors
- generic class names tied to styling
- arbitrary sleeps after actions
A practical rule: if the frontend team can refactor markup without changing the user experience, your test should probably survive too.
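The two lists above can be turned into a rough first-pass classifier for locator expressions in a diff. This is a sketch under stated assumptions: it works on source text rather than a parsed AST, the regular expressions are heuristics of my own, and anything it cannot classify is sent to a human as `"review"`. The `getBy*` method names are Playwright's built-in locators.

```typescript
type LocatorVerdict = "stable" | "brittle" | "review";

// Playwright's user-facing locator methods.
const USER_FACING = /\.getBy(Role|Label|Text|TestId|Placeholder|AltText|Title)\(/;

// Heuristic patterns for selectors tied to markup or styling.
const BRITTLE_SELECTORS: RegExp[] = [
  /^(\/\/|xpath=)/,         // raw XPath
  /:nth-(child|of-type)\(/, // position-based selection
  /^(\.[\w-]+)+$/,          // only styling classes, e.g. .btn.primary
  /(>[^>]+){3,}/,           // three or more child combinators
];

function isBrittleSelector(selector: string): boolean {
  return BRITTLE_SELECTORS.some((pattern) => pattern.test(selector));
}

// Classify a locator expression such as "page.locator('.btn.primary')".
function classifyLocator(expression: string): LocatorVerdict {
  if (USER_FACING.test(expression)) return "stable";
  const selector = expression.match(/\.locator\(\s*['"`](.+?)['"`]\s*\)/)?.[1];
  if (selector !== undefined && isBrittleSelector(selector)) return "brittle";
  return "review";
}

console.log(classifyLocator("page.getByRole('button', { name: 'Place order' })"));
console.log(classifyLocator("page.locator('.btn.primary')"));
console.log(classifyLocator("page.locator('#checkout-form')"));
```

The `"review"` bucket is deliberate. An ID selector may be perfectly stable or may be auto-generated, and a text heuristic cannot tell the difference.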
Use a test review checklist for agent output
A repeatable review checklist keeps the process from becoming subjective.
Code and selector checks
- Does the test use stable locators?
- Are there any raw XPath expressions that could be replaced with roles or test IDs?
- Did the agent introduce duplicated waits?
- Are there condition checks instead of fixed delays?
Assertion checks
- Does every assertion prove something meaningful?
- Are there any checks on superficial text that could mask failure?
- Did any assertion become less strict without justification?
- Are success and failure states both considered where appropriate?
Data and environment checks
- Is the test independent of local machine state?
- Are fixture assumptions documented?
- Does the test require seeded data, feature flags, or auth state?
- Is cleanup handled if the test creates records?
Maintainability checks
- Would a human understand this test six months from now?
- Is the generated code aligned with the suite’s conventions?
- Are helper functions reused instead of duplicated?
- Is the test narrow enough to debug quickly?
A lightweight pattern for validating changed tests in Playwright
If you use Playwright, the most practical safety net is a combination of type checking, targeted test execution, and failure inspection.
import { test, expect } from '@playwright/test';

test('order confirmation is shown after checkout', async ({ page }) => {
  // Walk the checkout flow using role-based locators.
  await page.goto('/shop');
  await page.getByRole('button', { name: 'Add to cart' }).click();
  await page.getByRole('link', { name: 'Checkout' }).click();
  await page.getByRole('button', { name: 'Place order' }).click();

  // Core claim of the test: the confirmation heading and an order total are shown.
  await expect(page.getByRole('heading', { name: 'Order confirmed' })).toBeVisible();
  await expect(page.getByTestId('order-total')).toContainText('$');
});
When an agent updates this file, review whether it preserved the test’s core claim. If it changed the heading assertion to a generic text check, that is a regression in test quality even if the code still runs.
For agent-generated updates, I also recommend a diff-first habit:
- inspect the changed assertion lines first
- then inspect selectors
- then inspect helper reuse and waits
- only then run the test
That order catches the highest-risk issues before you spend time on execution.
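If the review happens in a tool you control, the same ordering can be applied to the changed lines before a reviewer sees them. The sketch below assumes the changed lines are already extracted from the diff as plain strings. The priority patterns are rough heuristics and the function names are hypothetical.

```typescript
// Lower number = review earlier.
function reviewPriority(line: string): number {
  if (/\bexpect\s*\(/.test(line)) return 0;            // assertions first
  if (/\.(locator|getBy\w+)\(/.test(line)) return 1;   // then selectors
  if (/waitFor\w*\(|\bimport\b/.test(line)) return 2;  // then waits and helper imports
  return 3;                                            // everything else last
}

// Return a copy of the changed lines in review order.
// Array.prototype.sort is stable, so lines within a group keep their diff order.
function orderForReview(changedLines: string[]): string[] {
  return [...changedLines].sort(
    (a, b) => reviewPriority(a) - reviewPriority(b),
  );
}
```

A line that contains both an assertion and a locator is treated as an assertion, which matches the intent of looking at the highest-risk change first.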
Add a fail-fast gate for suspicious changes
Some changes deserve automatic escalation to human review, even if the agent produced valid code.
Trigger a manual review when the agent:
- removes an assertion without replacing it
- changes a strict assertion into a weaker one
- introduces new sleep-based waits
- rewrites a stable locator into a brittle XPath
- modifies both the test and the helper it depends on in the same patch
- touches authentication, payment, or destructive flows
In practice, this can be a simple policy in your PR automation. The point is to make weakening changes visible.
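As a rough illustration of such a policy, the function below inspects a summarized patch and returns the reasons it should be escalated. It is a sketch, not a complete gate: the `PatchSummary` shape, the `.spec.ts` and `helpers/` naming conventions, and the sensitive-path keywords are all assumptions, and detecting a strict-to-weak matcher swap would need a pairwise comparison that is left out here.

```typescript
// Hypothetical summary of a patch, e.g. produced by parsing a unified diff.
interface PatchSummary {
  changedFiles: string[];
  addedLines: string[];
  removedLines: string[];
}

function countMatches(lines: string[], pattern: RegExp): number {
  return lines.filter((line) => pattern.test(line)).length;
}

// Return human-readable reasons for manual review. Empty means no escalation.
function escalationReasons(patch: PatchSummary): string[] {
  const reasons: string[] = [];
  const assertion = /\bexpect\s*\(/;

  if (countMatches(patch.removedLines, assertion) > countMatches(patch.addedLines, assertion)) {
    reasons.push("assertion removed without replacement");
  }
  if (countMatches(patch.addedLines, /waitForTimeout\s*\(/) > 0) {
    reasons.push("new sleep-based wait");
  }
  if (countMatches(patch.addedLines, /locator\(\s*['"`](\/\/|xpath=)/) > 0) {
    reasons.push("XPath locator introduced");
  }
  // Assumed conventions: tests end in .spec.ts, helpers live under helper(s)/.
  const touchesTest = patch.changedFiles.some((file) => /\.spec\.ts$/.test(file));
  const touchesHelper = patch.changedFiles.some((file) => /helpers?\//.test(file));
  if (touchesTest && touchesHelper) {
    reasons.push("test and helper changed in the same patch");
  }
  // Assumed keywords for authentication, payment, or destructive flows.
  if (patch.changedFiles.some((file) => /(auth|payment|delete)/i.test(file))) {
    reasons.push("touches a sensitive flow");
  }
  return reasons;
}
```

In a PR workflow, a non-empty result could add a label or request a specific reviewer rather than fail the build, so the agent's patch stays visible while the risk is flagged.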
Separate creation from maintenance
There are two distinct agent problems:
- creating coverage from scratch
- maintaining existing coverage as the app changes
Creation is easier to supervise because the human reviewer can ask, “does this test cover a real user journey?” Maintenance is harder because the test may already be embedded in CI, and a small change can accidentally reduce coverage.
For maintenance workflows, require the agent to answer:
- what changed in the app
- which locator or assertion failed
- why the proposed update is the minimal safe change
- whether the change affects test meaning, not just syntax
If the answer says “I replaced the old selector with a new one,” that is not enough. A good maintenance agent should explain whether the semantics stayed the same.
Where autonomous browser test creation fits
Not every team wants to manage code-generation prompts, framework conventions, and locator strategies in the repo itself. Some teams prefer a platform that can generate browser tests in a more controlled surface and keep them editable there. Endtest’s AI Test Creation Agent is one example of that model, and its documentation shows how it creates web tests from natural-language instructions.
That approach can be useful when the main problem is framework upkeep, not just test generation. If your team spends too much time repairing selectors and browser-driver glue, a lower-maintenance authoring model can reduce the amount of custom code you need to review. It is still important to validate the behavior, but the operational burden is different.
Consider AI-assisted assertions for fragile UI checks
Sometimes the brittle part is not the test flow but the assertion itself. This is where natural-language checks can help, especially for visual or semantic conditions that are annoying to express with a single DOM selector.
For example, a standard assertion might ask whether a specific element contains a string. A more robust human review question is whether the page shows the expected state, language, banner color, or error condition. Endtest’s AI Assertions are built around that idea, and the docs describe validating complex conditions in natural language, with scope over page content, cookies, variables, or logs. If your suite has a few checks that are consistently fragile, this kind of abstraction can reduce maintenance without removing review discipline.
The important caveat is that flexibility can hide ambiguity. If a natural-language assertion is too loose, it can accept the wrong UI state. Keep strictness aligned with risk: critical validations should be explicit, while lower-risk visual checks can be more tolerant.
A practical release gate for agent-written tests
A good production gate for autonomous test updates usually looks like this:
- agent produces a patch or a test draft
- static checks run automatically
- reviewer inspects assertions first, then locators
- test runs against controlled fixtures
- negative check confirms the assertion fails when behavior is wrong
- PR merges only after the reviewer signs off on intent and risk
This pipeline is not especially glamorous, but it is durable. It assumes the agent is helpful, not infallible.
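The ordering matters as much as the gates themselves, so it can help to encode it. Below is a minimal sketch of a fail-fast gate runner. The `Gate` type and function names are hypothetical, and each gate is reduced to a function returning a boolean, where a real pipeline would call CI jobs or wait for a reviewer's sign-off.

```typescript
type Gate = { name: string; run: () => boolean };
type GateResult = { gate: string; passed: boolean };

// Run gates in order and stop at the first failure, so later (more expensive)
// gates are never executed for a patch that already failed.
function runGates(gates: Gate[]): GateResult[] {
  const results: GateResult[] = [];
  for (const gate of gates) {
    const passed = gate.run();
    results.push({ gate: gate.name, passed });
    if (!passed) break;
  }
  return results;
}

// A patch may merge only if every gate ran and every gate passed.
function mayMerge(gates: Gate[]): boolean {
  const results = runGates(gates);
  return results.length === gates.length && results.every((r) => r.passed);
}
```

A usage example would list the gates in the same order as the steps above: static checks, assertion review, fixture run, negative check, and final sign-off.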
Decision criteria for your team
If you are deciding how much autonomy to allow, use these questions:
- How expensive is a false green in this suite?
- How often do UI changes invalidate locators?
- Do reviewers have time to verify assertion intent?
- Is the app behavior easy to seed and control in test environments?
- Would a lower-code or platform-native authoring model reduce maintenance burden?
If your answers point mostly to framework upkeep as the main cost, then an agentic platform can be a good fit. If they point to a need for very precise, code-level control, then your workflow should emphasize review gates, diff inspection, and negative testing.
The core rule: trust the agent to draft, not to decide
The best way to test AI agents that write test code is to make their output inspectable, executable, and falsifiable. Do not ask whether the code looks smart. Ask whether it still proves the user behavior you care about.
If a generated test can be changed by the agent without anyone noticing that the assertion became weaker, your process is too loose. If every update requires a full human rewrite, your process is too rigid. The right middle ground is a reviewable pipeline where the agent handles the boilerplate and the team owns the meaning.
That is what keeps autonomous test code updates useful, instead of quietly turning your suite into a collection of confident lies.