May 28, 2026
How to Test AI Agents for Tool Use, Memory, and Recovery Paths
A practical framework for testing AI agents for tool use, memory retention, retries, and recovery paths, with concrete strategies for QA and engineering teams.
AI agents fail in ways that are very different from classic software. A web app usually fails when a button breaks, an API changes, or a timeout is too short. An agent can fail before it even reaches the tool, by choosing the wrong action, forgetting a critical instruction, hallucinating a tool result, or looping through retries until it burns through budget and confidence.
If you are responsible for shipping agentic systems, the hardest part is not proving that the model can answer a question. It is proving that the agent can do the right thing repeatedly under messy real-world conditions. That means validating tool use, memory retention, fallback logic, and recovery paths as first-class behaviors, not as incidental side effects of a demo prompt.
This article lays out a practical framework for teams that need to test AI agents for tool use, memory, and failure recovery. It is designed for QA engineers, ML engineers, platform teams, and engineering leads who need tests that hold up in production, not just in notebooks.
What makes agent testing different
Traditional software testing assumes that the system under test follows a predictable code path: the same input produces the same output. Agentic systems are more dynamic. A single user request may go through prompt interpretation, tool selection, tool invocation, result interpretation, memory retrieval, and retry logic, and the model can branch differently on each run.
That creates a few testing problems:
- The same prompt may produce different intermediate decisions.
- Tool calls can be malformed, missing, duplicated, or issued in the wrong order.
- Memory can be stale, irrelevant, over-applied, or ignored.
- Recovery logic often depends on the specific failure mode, not just whether an error happened.
- Side effects matter, because a wrong tool call can send an email, mutate data, or open a ticket.
In practice, agent testing is closer to validating a distributed workflow than a pure model response. It helps to think in terms of observable behaviors, contracts, and fault tolerance. That framing is consistent with broader software testing and continuous integration, but the assertions need to extend beyond response text.
A good agent test does not ask, “Did the model sound right?” It asks, “Did the system take the safe, correct, and recoverable action for this state?”
Start with the agent contract
Before writing tests, define the agent contract. This is the set of behaviors the system must satisfy regardless of model version, prompt changes, or backend implementation details.
A useful contract usually includes:
- Allowed tools, and when each can be used
- Required tool arguments and schema constraints
- Disallowed actions, such as deleting records without confirmation
- Memory sources, retention windows, and overwrite rules
- Retry policy, including max attempts and backoff rules
- Escalation behavior when the agent is uncertain or blocked
- Logging and traceability requirements
This is important because many agent bugs are not model bugs; they are contract bugs. For example, if a support agent should only issue a refund after verifying eligibility, then the test is not “did it mention refund policy?” The test is “did it collect the right facts, call the eligibility tool, and refuse to proceed when the tool returned a negative result?”
Treat these requirements like product rules, not prompt advice. Once they are explicit, they can be tested, monitored, and versioned.
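One way to make the contract explicit is to encode it as data that tests can query. The sketch below is a minimal illustration, not a prescribed format: `ToolRule`, `AgentContract`, the tool names, and the `confirmed` flag are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRule:
    """Contract for a single tool (field names are illustrative)."""
    name: str
    requires_confirmation: bool = False
    required_args: tuple = ()


@dataclass(frozen=True)
class AgentContract:
    tools: dict                 # tool name -> ToolRule
    max_retries: int = 2        # hypothetical retry budget

    def violations(self, tool_name, args):
        """Return the contract violations for one proposed tool call."""
        rule = self.tools.get(tool_name)
        if rule is None:
            return [f"tool not allowed: {tool_name}"]
        problems = [
            f"missing argument: {arg}"
            for arg in rule.required_args
            if arg not in args
        ]
        if rule.requires_confirmation and args.get("confirmed") is not True:
            problems.append("destructive action without confirmation")
        return problems


# Hypothetical support-agent contract for the refund example above.
CONTRACT = AgentContract(tools={
    "check_refund_eligibility": ToolRule(
        "check_refund_eligibility", required_args=("order_id",)),
    "issue_refund": ToolRule(
        "issue_refund", requires_confirmation=True,
        required_args=("order_id", "amount_cents")),
})
```

Because the contract lives outside the prompt, the same `violations` check can run in unit tests, in the orchestration layer, and in monitoring.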
Build your test matrix around failure modes
The most effective agent test suites are organized by failure mode, not by page or feature. For each tool and each memory source, identify what can go wrong.
A practical matrix might look like this:
Tool use failure modes
- Wrong tool selected
- Correct tool selected with wrong parameters
- Tool called at the wrong time
- Tool called twice when once is enough
- Tool result ignored or misread
- Tool output hallucinated when the call failed
- Tool dependency order violated
Memory failure modes
- Important memory not retrieved
- Irrelevant memory retrieved and overused
- Memory applied to the wrong user, tenant, or conversation
- Old state persisted after user correction
- Memory conflicts with fresh tool output
- Summarized memory loses critical detail
Recovery failure modes
- Transient failure not retried
- Permanent failure retried too many times
- Retry uses the same bad input again
- Fallback path is too broad and unsafe
- Partial success not handled correctly
- User is not informed when escalation is required
A test suite organized this way tends to be more stable than one organized by surface area, because it maps directly to the kinds of incidents you want to prevent.
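The matrix can also be kept as data so that coverage gaps are visible. The following is a rough sketch under the assumption that each test carries a tag of the form `category:mode`; the mode identifiers are shorthand invented here for the bullets above.

```python
# Failure-mode catalog mirroring the three lists above (identifiers are illustrative).
FAILURE_MODES = {
    "tool": ["wrong_tool", "wrong_params", "wrong_time", "duplicate_call",
             "result_ignored", "hallucinated_result", "order_violated"],
    "memory": ["not_retrieved", "irrelevant_retrieved", "wrong_scope",
               "stale_after_correction", "conflicts_with_tool", "summary_loss"],
    "recovery": ["transient_not_retried", "permanent_over_retried",
                 "same_bad_input", "unsafe_fallback", "partial_success",
                 "silent_escalation"],
}


def uncovered_modes(test_tags):
    """Return failure modes that no test tag claims to cover, by category."""
    covered = set(test_tags)
    gaps = {}
    for category, modes in FAILURE_MODES.items():
        missing = [m for m in modes if f"{category}:{m}" not in covered]
        if missing:
            gaps[category] = missing
    return gaps
```

Running `uncovered_modes` over the tags collected from a test run gives a quick list of the incidents the suite cannot yet catch.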
Testing tool use: verify the decision, the call, and the outcome
When teams say they want to test tool use, they often focus only on whether the tool was invoked. That is necessary, but not sufficient. You need to verify three layers:
- The agent chose the correct tool.
- It passed the correct arguments.
- It handled the result correctly.
1) Assert on tool selection
Test that the agent chooses the right tool for the task. If a user asks for order status, the agent should not use a billing tool. If the request is ambiguous, the agent should ask a clarifying question instead of guessing.
A useful test pattern is to mock the tools and inspect the call trace. In a Playwright-based harness, for example, you might validate the agent output while capturing tool invocation events from your orchestration layer.
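    import { test, expect } from '@playwright/test';
    import { runAgent } from './agent-harness'; // hypothetical helper that returns a structured trace

    test('chooses the order status tool for shipment questions', async () => {
      const trace = await runAgent({
        message: 'Where is my order 12345?'
      });

      // Assert on the structured trace, not on the response text.
      expect(trace.toolCalls[0].name).toBe('get_order_status');
      expect(trace.toolCalls[0].args.orderId).toBe('12345');
    });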
If you do not have a central trace object, add one. Without structured traces, tool testing becomes brittle text parsing.
2) Validate tool arguments strictly
Argument validation is where many agent errors become expensive. The tool may be right, but the inputs may be incomplete or malformed. Tests should verify schema adherence and business rule adherence.
For example:
- Required IDs are present
- Dates use the expected timezone or format
- Numeric ranges are valid
- Tenant identifiers match the active session
- Sensitive operations require explicit confirmation flags
When tool arguments are derived from natural language, edge cases matter. Ask questions like:
- What happens if the user provides two candidate order IDs?
- Does the agent normalize dates from the user locale?
- If the user says “next Friday,” is the date resolved relative to the current conversation time or system time?
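The checks above can be sketched as two small functions: one that refuses to guess when the message contains more than one candidate ID, and one that applies business rules to the arguments. This is a minimal sketch; the five-digit order ID pattern and the `order_id`, `tenant_id`, and `ship_by` argument names are assumptions for illustration.

```python
import re
from datetime import datetime

ORDER_ID = re.compile(r"\b\d{5}\b")  # assumed ID format for this example


def extract_order_id(message):
    """Return (order_id, needs_clarification) for a user message."""
    candidates = sorted(set(ORDER_ID.findall(message)))
    if len(candidates) == 1:
        return candidates[0], False
    # Zero or several candidates: ask the user instead of guessing.
    return None, True


def validate_args(args, session_tenant):
    """Return business-rule violations for a proposed tool call."""
    errors = []
    if not args.get("order_id"):
        errors.append("order_id is required")
    if args.get("tenant_id") != session_tenant:
        errors.append("tenant_id does not match the active session")
    try:
        ship_by = datetime.fromisoformat(args.get("ship_by", ""))
        if ship_by.tzinfo is None:
            errors.append("ship_by must include a timezone offset")
    except (TypeError, ValueError):
        errors.append("ship_by is not an ISO 8601 timestamp")
    return errors
```

Requiring an explicit timezone offset forces the agent to resolve relative dates against a known clock before the call is made, which makes the "next Friday" question testable.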
3) Validate the post-tool behavior
Even when the tool call succeeds, the agent may respond incorrectly. It may summarize stale data, ignore a warning from the tool, or present a partial result as final.
Test that the agent:
- Uses the tool output, not an invented answer
- Preserves critical constraints from the tool response
- Escalates when the tool reports uncertainty or failure
- Does not transform a failure into a success message
A common mistake is to accept a fluent natural-language response as evidence of correctness. That is risky. A response can read well and still encode the wrong entity, wrong status, or wrong action.
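A crude but useful guard is to compare the final response against the tool result instead of judging it by fluency. The sketch below assumes a tool result shaped like `{"ok": ..., "order_id": ..., "status": ...}` and a small list of success phrases; both are illustrative, and substring matching is only a starting point.

```python
# Phrases that should never appear when the tool call failed (illustrative list).
SUCCESS_PHRASES = ("has shipped", "was delivered", "is complete")


def grounding_problems(tool_result, response):
    """Flag responses that ignore the tool output or mask a failure."""
    text = response.lower()
    problems = []
    if tool_result["ok"]:
        # Every critical field from the tool must appear in the answer.
        for key in ("order_id", "status"):
            if str(tool_result[key]).lower() not in text:
                problems.append(f"response omits tool field: {key}")
    elif any(phrase in text for phrase in SUCCESS_PHRASES):
        problems.append("failure reported as success")
    return problems
```

In practice a team might replace the substring checks with structured output fields or a stricter comparison, but even this level of checking catches a fluent answer about the wrong entity.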
Use deterministic tool stubs before full integration tests
For agent testing, there is a strong case for a layered approach. Start with deterministic stubs for tool behavior, then move to integration tests against real services.
Deterministic stubs help you test:
- Correct tool selection
- Argument formation
- Retry logic under known error codes
- Fallback branches
- Memory interaction without external noise
A stubbed test can simulate a 429, a timeout, a schema validation error, or a partial response. That gives you control over the failure mode and makes the test repeatable.
Then add integration tests for the risky parts that stubs cannot represent well:
- Real auth and permission failures
- Rate limiting behavior under actual service constraints
- Changes in upstream schemas
- Unexpected payloads or response shapes
- Multi-step workflows across systems
Stubs tell you whether the agent logic is plausible. Integration tests tell you whether it survives contact with the real system.
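A deterministic stub of the kind described above can be as simple as a callable that replays a scripted list of outcomes. The names `StubTool`, `RateLimited`, and `call_with_retry` are hypothetical, and the retry loop omits backoff to keep the example fast.

```python
class RateLimited(Exception):
    """Stands in for an HTTP 429 from the tool backend."""


class StubTool:
    """Replays a scripted sequence of outcomes, one per call."""

    def __init__(self, script):
        self._script = list(script)
        self.calls = []  # recorded arguments, for assertions

    def __call__(self, **args):
        self.calls.append(args)
        if not self._script:
            raise AssertionError("stub called more times than scripted")
        outcome = self._script.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def call_with_retry(tool, args, max_attempts=3):
    """Retry only on failures that are treated as transient here."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(**args)
        except (RateLimited, TimeoutError):
            if attempt == max_attempts:
                raise


# Two transient failures, then success: the same sequence on every run.
stub = StubTool([RateLimited(), TimeoutError(), {"status": "shipped"}])
result = call_with_retry(stub, {"order_id": "12345"})
```

Because the script is fixed, a test can assert both the final result and the exact number of attempts.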
Agent memory testing needs both relevance and isolation
Memory is one of the easiest agent features to overestimate. A memory store that “works” in a demo can become a source of subtle errors in production if it retrieves stale, overbroad, or cross-user context.
Test relevance
The first question is whether memory retrieval returns the right context. If the agent remembers that the user prefers English and a 24-hour clock, that is useful. If it retrieves a preference from another project, that is dangerous.
Create tests for:
- User profile memory
- Conversation summary memory
- Task-specific working memory
- Long-term preference memory
- Tenant-scoped or workspace-scoped memory
For each memory type, verify that the agent retrieves the correct item and ignores near-matches that should not apply.
Test isolation
Memory leaks across users and tenants are high-severity bugs. Write tests that simulate two users with overlapping names, similar requests, or shared history patterns.
You want to prove that the agent does not:
- Pull preferences from the wrong account
- Reuse old ticket context after a new conversation starts
- Merge state across sessions without authorization
- Let one user’s correction affect another user’s workflow
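An isolation test is easiest to write when every memory read and write is keyed by scope. The store below is a toy in-memory sketch, assuming tenant and user identifiers are available on every call; `ScopedMemory` is not a real library class.

```python
class ScopedMemory:
    """Toy memory store keyed by (tenant, user, key)."""

    def __init__(self):
        self._items = {}

    def remember(self, tenant_id, user_id, key, value):
        self._items[(tenant_id, user_id, key)] = value

    def recall(self, tenant_id, user_id, key, default=None):
        # A lookup never falls back to another tenant or user.
        return self._items.get((tenant_id, user_id, key), default)


# Two users with the same username in different tenants.
memory = ScopedMemory()
memory.remember("acme", "alex.kim", "language", "English")
memory.remember("globex", "alex.kim", "language", "German")
```

The assertions that matter are the negative ones: a lookup under the wrong tenant or user must return nothing, not a near match.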
Test memory overwrites and corrections
Users correct themselves. The agent must update its internal state when the user changes a date, product, address, or goal.
Test cases should include:
- “Use my office address” followed by “Actually, use my home address”
- “The meeting is on Tuesday” followed by “No, next Tuesday”
- “I’m asking about order A” followed by “Sorry, I meant order B”
The test should verify that old memory is not incorrectly preferred over the correction.
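A correction test needs a state model where the latest value wins and the superseded value is still visible for auditing. A minimal sketch, with `WorkingMemory` and the slot names as assumptions:

```python
class WorkingMemory:
    """Keeps every value set for a slot; the most recent one is current."""

    def __init__(self):
        self._history = {}

    def set(self, slot, value):
        self._history.setdefault(slot, []).append(value)

    def current(self, slot):
        values = self._history.get(slot)
        return values[-1] if values else None

    def superseded(self, slot):
        """Earlier values that a later correction replaced."""
        return self._history.get(slot, [])[:-1]


wm = WorkingMemory()
wm.set("shipping_address", "office")  # "Use my office address"
wm.set("shipping_address", "home")    # "Actually, use my home address"
```

A test can then assert that the tool call was built from `current("shipping_address")` and that nothing from `superseded` leaked into the arguments.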
Test memory summarization loss
If your agent compresses conversation history into a summary, you should test whether critical facts survive summarization. Summaries tend to lose exact numbers, negations, and exceptions.
This matters for things like:
- Refund eligibility
- Compliance constraints
- Escalation status
- User confirmations
- Temporary overrides
A good memory test suite includes “summary fidelity” cases, where the latest state must remain accurate after several turns.
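A summary fidelity check can start as a list of facts that must survive compression. The helper below is a deliberately simple sketch that uses case-insensitive substring matching; the example facts and summaries are invented, and a real suite would feed in the output of its own summarizer.

```python
def missing_facts(summary, required_facts):
    """Return the required facts that do not appear in the summary."""
    text = summary.lower()
    return [fact for fact in required_facts if fact.lower() not in text]


# Facts that must survive: a negation, an exact amount, and a status.
facts = ["not eligible for a refund", "$49.99", "escalated"]

faithful = "Customer is not eligible for a refund of $49.99; case escalated to billing."
lossy = "Customer asked about a refund of about $50; case is with billing."
```

Here `missing_facts(lossy, facts)` reports all three facts as lost, which is exactly the kind of drift (dropped negation, rounded number, missing status) that summaries tend to introduce.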
Recovery path testing is about policy, not just retries
Recovery paths are where agent systems become production-grade. A retry policy that sounds reasonable in design docs can still fail in practice if it repeats the same bad action, escalates too late, or hides the failure from the user.
Classify failures first
Do not treat every failure the same way. Recovery depends on the failure class:
- Transient tool timeout: retry may help
- Authentication failure: retry usually will not help
- Validation error: agent should correct inputs
- Authorization failure: escalate or stop
- Empty result: maybe ask the user for more context
- Partial result: may require a follow-up tool call
Each class should have a defined response. Tests should assert the response policy, not just the presence of a retry.
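The failure classes above can be expressed as a policy table, which gives tests something concrete to assert against. This is a sketch under stated assumptions: the class names, the `Action` values, and the choice to escalate on authentication failures and unknown classes are illustrative, not a recommended universal policy.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FIX_INPUT = "fix_input"
    ESCALATE = "escalate"
    ASK_USER = "ask_user"
    FOLLOW_UP = "follow_up"


# One defined response per failure class (mapping is an assumption).
POLICY = {
    "timeout": Action.RETRY,
    "auth_failed": Action.ESCALATE,
    "validation_error": Action.FIX_INPUT,
    "forbidden": Action.ESCALATE,
    "empty_result": Action.ASK_USER,
    "partial_result": Action.FOLLOW_UP,
}


def next_action(failure_class, attempts, max_attempts=3):
    """Pick the recovery action for a failure, bounded by the retry budget."""
    action = POLICY.get(failure_class, Action.ESCALATE)  # unknown: be safe
    if action is Action.RETRY and attempts >= max_attempts:
        return Action.ESCALATE
    return action
```

A test then asserts the chosen action per failure class, for example that a third timeout escalates instead of retrying again.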
Test safe retries
A retry should not duplicate side effects unless the tool is explicitly idempotent. That means you need to test whether the agent recognizes idempotent and non-idempotent actions.
For example:
- Safe to retry: fetch status, search, read-only lookup
- Risky to retry: submit form, create ticket, issue refund, send notification
If your system includes retry logic in orchestration, make sure it tracks whether the previous attempt may have succeeded after a timeout. Otherwise, the agent can accidentally duplicate work.
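The timeout-after-success case can be reproduced with a fake backend that performs the write and then drops the response. The sketch below assumes the backend deduplicates on an idempotency key; `TicketService` and `create_ticket_safely` are hypothetical names for this example.

```python
import uuid


class TicketService:
    """Fake side-effecting backend that deduplicates on an idempotency key."""

    def __init__(self, drop_first_response=True):
        self.tickets = {}
        self._drop_next_response = drop_first_response

    def create_ticket(self, idempotency_key, subject):
        # The write happens even when the response is lost.
        ticket = self.tickets.setdefault(idempotency_key, {"subject": subject})
        if self._drop_next_response:
            self._drop_next_response = False
            raise TimeoutError("response lost after write")
        return ticket


def create_ticket_safely(service, subject, max_attempts=2):
    """Retry with one key for all attempts of the same logical action."""
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return service.create_ticket(key, subject)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise


service = TicketService()
ticket = create_ticket_safely(service, "Refund request")
```

The key assertion is on the backend state, not the return value: after the retry there should still be exactly one ticket.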
Test fallback paths
Fallback logic is often under-tested because it is not part of the happy path. But it is exactly what users experience when a dependency fails.
Your tests should answer:
- Does the agent ask the user for missing information when tools are unavailable?
- Does it degrade to a safer mode, such as read-only assistance?
- Does it hand off to a human when confidence is too low?
- Does it preserve context for escalation so a human can continue the task?
A strong fallback does not simply say “Something went wrong.” It explains what happened at the right level of detail and offers the next safe action.
Measure behavior with traces, not just final answers
Final response assertions are useful, but they miss most of the interesting behavior. For agent testing, traces are usually more valuable than end output because they expose the decision sequence.
A useful trace can include:
- User input
- Retrieved memories
- Model plan or intent classification
- Tool call names and arguments
- Tool responses and error codes
- Retry attempts
- Final response
- Escalation or guardrail triggers
Once you have traces, you can write targeted assertions.
For example, check that:
- The agent did not call a destructive tool before confirmation
- A memory lookup occurred before composing the final answer
- A failed tool call led to a specific recovery branch
- A retry count did not exceed policy
- The final response references the actual tool result
Trace-based testing is especially important when debugging “almost correct” runs. Those are the hardest failures to catch from output alone.
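The checks listed above can be sketched as a single pass over the trace events. The event shape used here (`type`, `name`, `destructive`) is an assumption for illustration; adapt it to whatever your orchestration layer actually records.

```python
def trace_problems(trace, max_retries=2):
    """Return policy violations found in an ordered list of trace events."""
    problems = []
    confirmed = False
    memory_seen = False
    retries = 0
    for event in trace["events"]:
        kind = event["type"]
        if kind == "user_confirmation":
            confirmed = True
        elif kind == "memory_lookup":
            memory_seen = True
        elif kind == "tool_call" and event.get("destructive") and not confirmed:
            problems.append(
                f"destructive call before confirmation: {event['name']}")
        elif kind == "retry":
            retries += 1
        elif kind == "final_response" and not memory_seen:
            problems.append("final response composed without a memory lookup")
    if retries > max_retries:
        problems.append(f"retry count {retries} exceeds policy {max_retries}")
    return problems


bad_trace = {"events": [
    {"type": "tool_call", "name": "delete_record", "destructive": True},
    {"type": "retry"}, {"type": "retry"}, {"type": "retry"},
    {"type": "final_response"},
]}
```

Because the function returns a list of named problems, a failing test can print exactly which decision point broke.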
Use contract tests for tool schemas and guardrails
Tool-calling validation should include contract tests at the boundary between the agent and the tool layer. These tests are cheap to run and catch a lot of production regressions.
Examples include:
- Schema validation for required fields
- Enum validation for allowed values
- String length limits
- Type coercion rules
- Authorization prerequisites
- Confirmation flags for destructive actions
If the agent framework produces structured tool calls, validate both the raw call and the serialized payload that the downstream service receives.
A minimal Python example for a schema check might look like this:
    from pydantic import BaseModel, ValidationError

    # Schema the refund tool expects from the agent.
    class RefundRequest(BaseModel):
        order_id: str
        amount_cents: int
        confirmed: bool

    payload = {"order_id": "A-100", "amount_cents": 5000, "confirmed": True}

    try:
        request = RefundRequest(**payload)
        assert request.confirmed is True
    except ValidationError as e:
        raise AssertionError(f"Invalid tool payload: {e}")
The key idea is that agent tests should not trust the model to obey the interface. They should verify it.
Design test cases from incidents you expect, not from prompts you like
Many agent test suites start by covering sample prompts. That is a useful beginning, but it is not enough. The best cases usually come from failure analysis:
- Ambiguous user requests
- Repeated tool failures
- Incorrectly cached context
- Timeout during side-effecting operations
- User correction after the agent has already planned
- Conflicting signals from memory and live tool output
- Permission boundaries between users or teams
If your system has not shipped yet, derive these from design reviews and threat modeling. Ask each team what they fear the agent will get wrong.
A good test catalog often includes:
- Happy path with tool call
- Ambiguous request requiring clarification
- Invalid parameter correction
- Tool timeout with safe retry
- Tool timeout with idempotency risk
- Memory hit with correct preference
- Memory hit with stale preference
- Memory miss with fallback to user question
- Escalation after repeated failure
- Recovery after user correction
Automate agent testing in CI, but keep the suite layered
Agentic systems are not suited to one giant end-to-end suite. You want layers, because the failure surface is too broad and the runtimes are too expensive.
A practical CI stack may include:
Fast checks
- Prompt and tool schema validation
- Contract tests for tool payloads
- Deterministic stub-based behavior tests
- Memory isolation checks
Medium checks
- Multi-step workflow tests
- Recovery path tests with controlled failures
- A/B checks across prompt versions or agent configs
Slow checks
- Real integration tests against live dependencies
- Cross-service workflows
- Human-in-the-loop escalation paths
Here is a simple GitHub Actions pattern that keeps fast agent tests in the main pipeline and leaves the slower ones for scheduled runs or a separate job:
    name: agent-tests

    on:
      pull_request:
      schedule:
        - cron: '0 3 * * 1'

    jobs:
      fast:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: npm ci
          - run: npm test -- --grep "agent-fast"

      slow:
        if: github.event_name == 'schedule'
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: npm ci
          - run: npm test -- --grep "agent-slow"
This division matters because agent tests often involve stochastic components, external services, and long-running workflows. If everything runs on every commit, the suite becomes noisy and teams start ignoring failures.
Make failures explainable
A failing agent test should tell you more than “expected false, got true.” The artifact should point to the decision point that broke.
Useful debug signals include:
- Which memory items were retrieved
- Which tool candidates were considered
- Why a tool was selected or skipped
- Which retry branch fired
- Which guardrail blocked execution
- Whether the final response was generated from tool output or fallback text
If you store traces, include a stable test ID and the seed, if your setup supports seeding. Even when the model itself is not fully deterministic, trace metadata can help you compare runs and isolate regressions.
Common mistakes teams make
A few anti-patterns show up repeatedly in agent test programs:
Testing only the final text
This misses wrong tool calls, hidden retries, and memory misuse.
Using only happy-path prompts
The agent will look good until it meets a real user with ambiguity, corrections, or partial data.
Letting memory become a black box
If you cannot inspect what was retrieved and why, you cannot test it well.
Skipping negative tool tests
You need to know what happens when tools fail, return partial data, or reject inputs.
Treating retries as harmless
Retries can amplify side effects if the workflow is not idempotent.
Overfitting tests to a single model version
Agent behavior can change with prompt edits, routing changes, or model updates. Focus on contracts and traceable behavior, not brittle wording.
A practical checklist for agent test coverage
Use this checklist when reviewing an agent testing plan:
- Does the suite cover each critical tool, including failures?
- Are tool arguments validated against schemas and business rules?
- Are memory retrieval, memory updates, and memory isolation tested?
- Do tests cover stale memory, corrections, and cross-user leakage?
- Are retries bounded and differentiated by failure type?
- Are fallback and escalation paths tested under real failure conditions?
- Are side-effecting actions protected by confirmation and idempotency checks?
- Do tests capture traces, not just final answers?
- Are fast deterministic tests separated from slower integration tests?
- Can a failure be diagnosed from the test artifact alone?
If the answer to most of these is no, the suite is probably evaluating prompt quality more than agent reliability.
Closing thought
The goal of agent testing is not to make the model look intelligent; it is to make the system dependable when the model is uncertain. Tool use, memory, and recovery paths are where that dependability is won or lost.
If you treat those behaviors as contracts, test them at the trace level, and force the system through realistic failure modes, you will catch the issues that matter most before users do. That is the difference between a demo agent and a production one.