How to Test AI Agents That Write or Update Test Code Without Shipping Broken Assertions

AI agents that write or update test code can save time, but they can also ship the most annoying kind of failure: code that looks correct, runs green once, and then silently checks the wrong thing. In test automation, that is worse than a hard failure. A broken assertion can keep CI green while coverage erodes, which lets a regression survive until production.

If your team is evaluating how to test AI agents that write test code, the goal is not to trust the agent less but to put the right controls around it. Treat generated tests as untrusted code until they pass a layered validation pipeline. The same applies whether the agent is creating new coverage, refactoring old Selenium tests, or updating locators after a UI change.

This article lays out a practical workflow for validating autonomous test code updates before they land in your suite. It focuses on the failure modes SDETs and QA automation owners actually see: broken assertions, unstable selectors, overly broad waits, and tests that pass for the wrong reason.

What can go wrong when an agent writes test code

Code-writing agents fail in predictable ways, and most of them are not syntax errors.

1. They assert the wrong thing

A generated test might verify that a button exists, but not that the action completed. It might assert a text label on the page, while the real user-facing guarantee lives in an API response, local storage value, or audit log.

The most dangerous generated test is the one that passes while checking the wrong business behavior.

2. They overfit to current DOM structure

Agents often choose selectors that are easy to find right now, like .btn.primary or a nested XPath. Those selectors may break the next time the frontend team renames a class or rearranges markup.

3. They rely on brittle waits

If an agent generates waitForTimeout(5000) or a similar sleep-based pattern, the test may pass locally and still become flaky under CI load. Good test code waits for a condition, not a guessed duration.
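One way to enforce this mechanically is to scan generated test source for fixed-duration waits before a human ever looks at it. The following is a minimal sketch rather than a real linter: `findFixedSleeps` is a hypothetical helper introduced here, and its regular expression only recognizes Playwright's `waitForTimeout` and an assumed project-level `sleep(...)` helper called with a numeric literal.

```typescript
// Hypothetical helper: report 1-based line numbers that contain a
// fixed-duration wait such as page.waitForTimeout(5000) or sleep(2000).
// This is a text-level heuristic, not a parser.
function findFixedSleeps(source: string): number[] {
  const fixedWait = /\b(waitForTimeout|sleep)\s*\(\s*\d+\s*\)/;
  const hits: number[] = [];
  source.split('\n').forEach((line, index) => {
    if (fixedWait.test(line)) {
      hits.push(index + 1);
    }
  });
  return hits;
}

// Example input: the kind of test body an agent might generate.
// The last line is the condition-based alternative, since Playwright's
// web-first assertions retry until the condition holds or times out.
const generatedBody = [
  "await page.getByRole('button', { name: 'Save' }).click();",
  'await page.waitForTimeout(5000);',
  "await expect(page.getByRole('status')).toHaveText('Saved');",
].join('\n');

console.log(findFixedSleeps(generatedBody)); // [ 2 ]
```

A check like this will miss sleeps hidden behind variables or wrappers, so it complements review rather than replacing it.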

4. They rewrite tests into something less meaningful

When asked to update a test, a code agent may preserve the mechanics but lose the intent. For example, it may keep a login flow but stop validating the post-login state that originally mattered.

5. They subtly weaken assertions

This is the hardest failure to catch. A strict equality check becomes a contains, a visibility check becomes a presence check, or a success criterion gets replaced with a generic page load.
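Because weakening usually shows up as a matcher swap on an otherwise unchanged line, a small before/after comparison can surface it. The sketch below is illustrative only: the matcher names are real Playwright or `expect` matchers, but the `weakeningPairs` table, `matcherOf`, and `isWeakened` are hypothetical names, and the pairing is an assumption of this example rather than an exhaustive rule set.

```typescript
// Illustrative (not exhaustive) pairs where the second matcher proves
// less than the first.
const weakeningPairs: Array<[string, string]> = [
  ['toHaveText', 'toContainText'], // full text -> substring
  ['toBeVisible', 'toBeAttached'], // visible -> merely present in the DOM
  ['toEqual', 'toContain'],        // equality -> membership
  ['toBe', 'toBeTruthy'],          // specific value -> any truthy value
];

// Extract the first matcher call on a line, e.g. "toHaveText".
// Negation via .not is ignored in this sketch.
function matcherOf(line: string): string | null {
  const match = /\.(to[A-Z]\w*)\s*\(/.exec(line);
  return match ? match[1] : null;
}

// True when the "after" line uses a known weaker matcher than "before".
function isWeakened(before: string, after: string): boolean {
  const from = matcherOf(before);
  const to = matcherOf(after);
  return weakeningPairs.some(pair => pair[0] === from && pair[1] === to);
}

console.log(
  isWeakened(
    "await expect(total).toHaveText('$90.00');",
    "await expect(total).toContainText('$');",
  ),
); // true
```

A `true` result does not mean the change is wrong, only that someone should ask why the stricter check was dropped.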

Start with a review model, not a trust model

Before you let an agent edit anything in the repository, define what level of autonomy it has.

Recommended autonomy levels

Suggest only: the agent proposes a patch or test plan, and a human applies it.
Draft and review: the agent opens a pull request, but cannot merge.
Controlled write access: the agent can update test files in a branch, but CI and review gates must pass before merge.
Full autopilot: only appropriate for low-risk, heavily constrained test maintenance, and still rare.

For most teams, draft and review is the sweet spot. It gives you speed without giving up authorship. It also creates a natural place to inspect the agent’s reasoning, the selectors it chose, and the assertions it changed.

If you want a lower-maintenance path for agent-created coverage, tools like the AI Test Creation Agent from Endtest, an agentic AI test automation platform, are worth a look because they generate editable, platform-native test steps rather than dropping opaque code into your repo. That is not a replacement for review, but it can reduce framework upkeep for teams that want coverage without constant locator babysitting.

Build a validation pipeline for generated test code

Think of validation in layers. Each layer catches a different class of failure.

Layer 1, syntax and linting

This is the cheapest gate. It catches malformed code, unused variables, accidental imports, and broken formatting. It will not catch semantic mistakes, but it should block obvious noise.

For a Playwright suite, that usually means running type checks and lint rules in CI.

name: test-code-validation

on:
  pull_request:

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck

Layer 2, structural review

This is where you validate the shape of the test before executing it.

Check for:

selectors that match stable user-facing semantics, not implementation details
assertions that describe user outcomes, not just DOM presence
unnecessary sleeps or retry loops
tests that became longer but less specific after the update
accidental broadening, such as replacing a single targeted check with a generic page title assertion

A good agent workflow should make this review easy. If the change is hard to understand, the agent did too much.
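Part of the structural review can be automated by classifying the selector strings an agent passes to `page.locator(...)`. The following is a rough sketch under stated assumptions: `classifySelector` and its pattern list are hypothetical, the patterns cover only a few obvious shapes, and anything not flagged still needs a human look.

```typescript
type SelectorVerdict = 'brittle' | 'unflagged';

// Hypothetical heuristic: flag selector strings that depend on markup
// structure or styling rather than user-facing semantics.
function classifySelector(selector: string): SelectorVerdict {
  const brittlePatterns: RegExp[] = [
    /^\/\//,            // raw XPath such as //div[2]/span
    /^xpath=/,          // explicit XPath engine prefix
    /:nth-child\(/,     // positional matching
    /(\s*>\s*\S+){3,}/, // deep child chains with three or more ">" steps
    /^(\.[\w-]+){2,}$/, // stacked styling classes such as .btn.primary
  ];
  return brittlePatterns.some(pattern => pattern.test(selector))
    ? 'brittle'
    : 'unflagged';
}

console.log(classifySelector('.btn.primary'));                // brittle
console.log(classifySelector('[data-testid="order-total"]')); // unflagged
```

The verdict is deliberately `unflagged` rather than `stable`: passing these patterns says nothing about whether the selector targets the right element.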

Layer 3, dry run with controlled fixtures

Run the generated or updated test against a known-good environment with predictable data. If the test depends on email, payment, feature flags, or asynchronous jobs, stub or fixture those dependencies.

The point is to answer two questions:

Does the test run at all?
Does it validate the intended behavior, not just the happy path shell around it?

Layer 4, mutation-style negative checks

If possible, run a small negative test where the underlying behavior is intentionally wrong, and confirm the assertion fails. This is one of the best ways to catch broken assertions.

For example, if the test checks that a discount was applied, run it once with a fixture that omits the discount and confirm the failure is explicit.
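The discount example can be sketched without a browser to show the shape of a negative check. Everything here is hypothetical and framework-independent: the `Order` type, `assertDiscountApplied`, and `detectsBrokenBehavior` are names invented for illustration, and a real suite would run the actual test against a fixture instead of a plain object.

```typescript
interface Order {
  subtotal: number;
  discount: number;
  total: number;
}

// The assertion under review: throws when the discount is not reflected.
function assertDiscountApplied(order: Order): void {
  if (order.discount <= 0) {
    throw new Error('expected a positive discount, got ' + order.discount);
  }
  if (order.total !== order.subtotal - order.discount) {
    throw new Error('total ' + order.total + ' does not reflect the discount');
  }
}

// Returns true only if the assertion passes on the good fixture
// and throws on the deliberately broken one.
function detectsBrokenBehavior<T>(
  assertion: (input: T) => void,
  good: T,
  broken: T,
): boolean {
  try {
    assertion(good);
  } catch (error) {
    return false; // fails even when behavior is correct
  }
  try {
    assertion(broken);
  } catch (error) {
    return true; // fails when behavior is wrong, as intended
  }
  return false; // passed on broken behavior: the assertion proves nothing
}

const withDiscount: Order = { subtotal: 100, discount: 10, total: 90 };
const withoutDiscount: Order = { subtotal: 100, discount: 0, total: 100 };

// A weakened assertion that only checks the total is positive.
const weakAssertion = (order: Order): void => {
  if (order.total <= 0) throw new Error('total should be positive');
};

console.log(detectsBrokenBehavior(assertDiscountApplied, withDiscount, withoutDiscount)); // true
console.log(detectsBrokenBehavior(weakAssertion, withDiscount, withoutDiscount));         // false
```

The second result is the interesting one: the weak assertion is green on both fixtures, which is exactly the false confidence this layer exists to expose.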

Ask the agent for intent, not just code

A useful pattern is to require the agent to produce a short explanation alongside the patch:

what behavior it thinks the test is protecting
which selectors or signals it chose and why
what it could not verify automatically
what assumptions it made about the app state

That extra context gives reviewers something to challenge.

For example, if the agent updates a checkout test, its explanation should say something like:

it verifies the order confirmation page, not just the cart page
it uses role-based locators for buttons and headings
it validates the confirmation message and order total
it assumes a seeded test user and a known product catalog

If that explanation sounds vague, the code probably is too.
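If the explanation is required as structured data instead of free text, a PR check can reject patches that skip it. This is one possible shape, not a standard: the `AgentIntent` interface, its field names, and the 20-character minimum are all assumptions made for this sketch.

```typescript
// Hypothetical structure for the explanation an agent attaches to a patch.
interface AgentIntent {
  protectedBehavior: string; // what the test is meant to protect
  signalsChosen: string[];   // selectors or signals, with reasons
  unverified: string[];      // what could not be verified (may be empty)
  assumptions: string[];     // assumed app state, fixtures, seeded data
}

// Return the names of fields that are missing or too thin to review.
// The 20-character threshold is arbitrary and only filters out
// placeholder answers such as "checks checkout".
function missingIntentFields(intent: AgentIntent): string[] {
  const missing: string[] = [];
  if (intent.protectedBehavior.trim().length < 20) {
    missing.push('protectedBehavior');
  }
  if (intent.signalsChosen.length === 0) {
    missing.push('signalsChosen');
  }
  if (intent.assumptions.length === 0) {
    missing.push('assumptions');
  }
  return missing;
}

const checkoutIntent: AgentIntent = {
  protectedBehavior: 'Order confirmation page shows the confirmation message and order total',
  signalsChosen: ['role-based locators for buttons and headings'],
  unverified: [],
  assumptions: ['seeded test user', 'known product catalog'],
};

console.log(missingIntentFields(checkoutIntent)); // []
```

A length check cannot judge whether the stated behavior is correct. It only guarantees the reviewer has something concrete to challenge.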

Review AI-generated assertions as a separate risk surface

Assertions deserve more scrutiny than navigation steps. A generated click path can still be useful even if it is slightly imperfect, but a weak assertion can turn a test into an expensive smoke screen.

When reviewing AI-generated assertions, focus on three questions:

1. Is the assertion specific enough?

Bad:

await expect(page.locator('body')).toContainText('Success');

Better:

await expect(page.getByRole('heading', { name: 'Order confirmed' })).toBeVisible();
await expect(page.getByTestId('confirmation-number')).toHaveText(/^[A-Z0-9-]+$/);

2. Is the assertion tied to the right signal?

A UI message may be incidental. The actual proof might live in a network response, a cookie, or a persisted value.

3. Does the assertion fail for the right reason?

If an assertion fails, the debugging path should be obvious. “Text not found” is too vague when the real issue was that the request never completed.

A strong assertion tells you what broke. A weak assertion just tells you the test stopped agreeing with itself.

Prefer behavior checks over implementation checks

Agents often overuse selectors because selectors are the easiest thing to generate. Human reviewers should push back toward behavior-oriented checks.

Good signals include:

accessible roles and names
visible text that users actually see
URL changes that represent navigation state
server responses for critical flows
stored state for app-specific outcomes

Less desirable signals include:

deep CSS paths
brittle nth-child selectors
generic class names tied to styling
arbitrary sleeps after actions

A practical rule: if the frontend team can refactor markup without changing the user experience, your test should probably survive too.

Use a test review checklist for agent output

A repeatable review checklist keeps the process from becoming subjective.

Code and selector checks

Does the test use stable locators?
Are there any raw XPath expressions that could be replaced with roles or test IDs?
Did the agent introduce duplicated waits?
Are there condition checks instead of fixed delays?

Assertion checks

Does every assertion prove something meaningful?
Are there any checks on superficial text that could mask failure?
Did any assertion become less strict without justification?
Are success and failure states both considered where appropriate?

Data and environment checks

Is the test independent of local machine state?
Are fixture assumptions documented?
Does the test require seeded data, feature flags, or auth state?
Is cleanup handled if the test creates records?

Maintainability checks

Would a human understand this test six months from now?
Is the generated code aligned with the suite’s conventions?
Are helper functions reused instead of duplicated?
Is the test narrow enough to debug quickly?

A lightweight pattern for validating changed tests in Playwright

If you use Playwright, the most practical safety net is a combination of type checking, targeted test execution, and failure inspection.

import { test, expect } from '@playwright/test';

test('order confirmation is shown after checkout', async ({ page }) => {
  await page.goto('/shop');
  await page.getByRole('button', { name: 'Add to cart' }).click();
  await page.getByRole('link', { name: 'Checkout' }).click();
  await page.getByRole('button', { name: 'Place order' }).click();

  await expect(page.getByRole('heading', { name: 'Order confirmed' })).toBeVisible();
  await expect(page.getByTestId('order-total')).toContainText('$');
});

When an agent updates this file, review whether it preserved the test’s core claim. If it changed the heading assertion to a generic text check, that is a regression in test quality even if the code still runs.

For agent-generated updates, I also recommend a diff-first habit:

inspect the changed assertion lines first
then inspect selectors
then inspect helper reuse and waits
only then run the test

That order catches the highest-risk issues before you spend time on execution.
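If your review tooling can list the changed lines of a patch, the same ordering can be applied automatically. The sketch below assumes the changed lines arrive as plain strings. `kindOf` and `sortForReview` are hypothetical names, and the regular expressions are rough heuristics keyed to Playwright-style calls.

```typescript
type ChangeKind = 'assertion' | 'selector' | 'wait' | 'other';

// Classify a changed line by the riskiest thing it contains.
function kindOf(line: string): ChangeKind {
  if (/\bexpect\s*\(/.test(line)) return 'assertion';
  if (/\bwaitFor\w*\s*\(/.test(line)) return 'wait';
  if (/\b(getBy\w+|locator)\s*\(/.test(line)) return 'selector';
  return 'other';
}

// Review order from the list above: assertions, then selectors,
// then waits and helpers, then everything else.
const reviewOrder: ChangeKind[] = ['assertion', 'selector', 'wait', 'other'];

// Sort changed lines into review order, keeping the original order
// within each kind.
function sortForReview(changedLines: string[]): string[] {
  return changedLines
    .map((line, index) => ({
      line,
      index,
      rank: reviewOrder.indexOf(kindOf(line)),
    }))
    .sort((a, b) => a.rank - b.rank || a.index - b.index)
    .map(entry => entry.line);
}

const changed = [
  'await page.waitForTimeout(500);',
  "await page.getByRole('button', { name: 'Pay' }).click();",
  "await expect(page.getByTestId('order-total')).toContainText('$');",
];

// Prints the expect line first, then the getByRole line, then the wait.
console.log(sortForReview(changed));
```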

Add a fail-fast gate for suspicious changes

Some changes deserve automatic escalation to human review, even if the agent produced valid code.

Trigger a manual review when the agent:

removes an assertion without replacing it
changes a strict assertion into a weaker one
introduces new sleep-based waits
rewrites a stable locator into a brittle XPath
modifies both the test and the helper it depends on in the same patch
touches authentication, payment, or destructive flows

In practice, this can be a simple policy in your PR automation. The point is to make weakening changes visible.
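As a rough illustration of such a policy, the function below turns a patch summary into a list of escalation reasons. It is a sketch under several assumptions: the `PatchSummary` shape is hypothetical, test files are assumed to end in `.spec.ts`, helpers are assumed to live under a `helpers/` directory, and sensitive flows are guessed from keywords in file paths.

```typescript
// Hypothetical summary produced by earlier diff analysis.
interface PatchSummary {
  changedFiles: string[];
  removedAssertions: number;
  addedAssertions: number;
  addsFixedSleep: boolean;
}

// Return human-readable reasons to require manual review.
// An empty array means no rule fired, not that the patch is safe.
function escalationReasons(patch: PatchSummary): string[] {
  const reasons: string[] = [];

  if (patch.removedAssertions > patch.addedAssertions) {
    reasons.push('assertion removed without replacement');
  }
  if (patch.addsFixedSleep) {
    reasons.push('new sleep-based wait');
  }

  // Assumed repo conventions: *.spec.ts for tests, helpers/ for helpers.
  const touchesTest = patch.changedFiles.some(file => /\.spec\.ts$/.test(file));
  const touchesHelper = patch.changedFiles.some(file => /(^|\/)helpers\//.test(file));
  if (touchesTest && touchesHelper) {
    reasons.push('test and helper changed together');
  }

  // Keyword match on paths is a crude stand-in for real flow ownership data.
  const sensitive = /(auth|login|payment|checkout|delete)/i;
  if (patch.changedFiles.some(file => sensitive.test(file))) {
    reasons.push('sensitive flow touched');
  }

  return reasons;
}

console.log(
  escalationReasons({
    changedFiles: ['tests/checkout.spec.ts', 'tests/helpers/cart.ts'],
    removedAssertions: 1,
    addedAssertions: 0,
    addsFixedSleep: false,
  }),
);
```

Returning reasons instead of a boolean keeps the escalation explainable in the PR comment, which matters when the goal is visibility.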

Separate creation from maintenance

There are two distinct agent problems:

creating coverage from scratch
maintaining existing coverage as the app changes

Creation is easier to supervise because the human reviewer can ask, “does this test cover a real user journey?” Maintenance is harder because the test may already be embedded in CI, and a small change can accidentally reduce coverage.

For maintenance workflows, require the agent to answer:

what changed in the app
which locator or assertion failed
why the proposed update is the minimal safe change
whether the change affects test meaning, not just syntax

If the answer says “I replaced the old selector with a new one,” that is not enough. A good maintenance agent should explain whether the semantics stayed the same.

Where autonomous browser test creation fits

Not every team wants to manage code-generation prompts, framework conventions, and locator strategies in the repo itself. Some teams prefer a platform that can generate browser tests in a more controlled surface and keep them editable there. Endtest’s AI Test Creation Agent is one example of that model, and its documentation shows how it creates web tests from natural-language instructions.

That approach can be useful when the main problem is framework upkeep, not just test generation. If your team spends too much time repairing selectors and browser-driver glue, a lower-maintenance authoring model can reduce the amount of custom code you need to review. It is still important to validate the behavior, but the operational burden is different.

Consider AI-assisted assertions for fragile UI checks

Sometimes the brittle part is not the test flow but the assertion itself. This is where natural-language checks can help, especially for visual or semantic conditions that are annoying to express with a single DOM selector.

For example, a standard assertion might ask whether a specific element contains a string. A more robust human review question is whether the page shows the expected state, language, banner color, or error condition. Endtest’s AI Assertions are built around that idea, and the docs describe validating complex conditions in natural language, with scope over page content, cookies, variables, or logs. If your suite has a few checks that are consistently fragile, this kind of abstraction can reduce maintenance without removing review discipline.

The important caveat is that flexibility can hide ambiguity. If a natural-language assertion is too loose, it can accept the wrong UI state. Keep strictness aligned with risk: critical validations should be explicit, while lower-risk visual checks can be more tolerant.

A practical release gate for agent-written tests

A good production gate for autonomous test updates usually looks like this:

agent produces a patch or a test draft
static checks run automatically
reviewer inspects assertions first, then locators
test runs against controlled fixtures
negative check confirms the assertion fails when behavior is wrong
PR merges only after the reviewer signs off on intent and risk

This pipeline is not especially glamorous, but it is durable. It assumes the agent is helpful, not infallible.
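The ordering matters as much as the individual steps, so here is a minimal sketch of a gate runner that stops at the first failure. The `Gate` interface and `runGates` are hypothetical, and the gates are synchronous stubs for brevity. In a real pipeline each step would be an asynchronous CI job or a human approval.

```typescript
interface Gate {
  name: string;
  run: () => boolean; // true means the gate passed
}

interface GateResult {
  passed: boolean;
  failedAt: string | null;
  executed: string[];
}

// Run gates in order and stop at the first failure, so cheap checks
// block a patch before expensive ones are attempted.
function runGates(gates: Gate[]): GateResult {
  const executed: string[] = [];
  for (const gate of gates) {
    executed.push(gate.name);
    if (!gate.run()) {
      return { passed: false, failedAt: gate.name, executed };
    }
  }
  return { passed: true, failedAt: null, executed };
}

// Stubbed example: the negative check fails, so sign-off never runs.
const result = runGates([
  { name: 'static checks', run: () => true },
  { name: 'assertion and locator review', run: () => true },
  { name: 'fixture run', run: () => true },
  { name: 'negative check', run: () => false },
  { name: 'reviewer sign-off', run: () => true },
]);

console.log(result.failedAt); // negative check
```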

Decision criteria for your team

If you are deciding how much autonomy to allow, use these questions:

How expensive is a false green in this suite?
How often do UI changes invalidate locators?
Do reviewers have time to verify assertion intent?
Is the app behavior easy to seed and control in test environments?
Would a lower-code or platform-native authoring model reduce maintenance burden?

If the answer to most of those questions is “we need less framework upkeep,” then an agentic platform can be a good fit. If the answer is “we need very precise, code-level control,” then your workflow should emphasize review gates, diff inspection, and negative testing.

The core rule: trust the agent to draft, not to decide

The best way to test AI agents that write test code is to make their output inspectable, executable, and falsifiable. Do not ask whether the code looks smart. Ask whether it still proves the user behavior you care about.

If a generated test can be changed by the agent without anyone noticing that the assertion became weaker, your process is too loose. If every update requires a full human rewrite, your process is too rigid. The right middle ground is a reviewable pipeline where the agent handles the boilerplate and the team owns the meaning.

That is what keeps autonomous test code updates useful, instead of quietly turning your suite into a collection of confident lies.