The problem: it is too hard to understand and improve GenAI quality, and yet organizations are moving ahead regardless. For AI engineers it’s hard to:
- Increase accuracy, because testing is neither repeatable nor representative
- Understand reliability: know how, why, or when an agent will fail
This leads to poor reliability and accuracy, which:
- Increases operational costs and can cause reputational damage
- Erodes user trust, reduces customer engagement, and increases churn
- Reduces business confidence, slowing down AI adoption
In this talk I will discuss the limitations of how we are currently testing AI agents, and why these limitations mean we are not adequately ensuring the safety of agentic AI systems. With non-deterministic systems like generative and agentic AI, we need to simulate a large number of inputs (millions) and measure the outputs using judge agents to find the statistical success rate. This process is more similar to traditional load testing than to the simple functional testing we’re using with AI right now.
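To make "statistical success rate" concrete, here is a minimal sketch in Python. It assumes each judged output reduces to a pass/fail verdict, and it reports a Wilson score interval alongside the raw rate so the uncertainty is visible. The `fake_agent` and `fake_judge` functions are hypothetical stand-ins for a real AI endpoint and a real judge agent; the 90% pass probability is invented purely for the simulation.

```python
import math
import random


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


def fake_agent(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for a real, non-deterministic AI endpoint.
    return "good answer" if rng.random() < 0.9 else "bad answer"


def fake_judge(prompt: str, output: str) -> bool:
    # Hypothetical stand-in for a judge agent returning a pass/fail verdict.
    return output == "good answer"


def estimate_success_rate(prompts: list[str], rng: random.Random) -> tuple[float, float, float]:
    """Run every prompt through the agent, judge each output, and summarise."""
    passes = sum(fake_judge(p, fake_agent(p, rng)) for p in prompts)
    low, high = wilson_interval(passes, len(prompts))
    return passes / len(prompts), low, high


rng = random.Random(0)  # seeded so the simulation is repeatable
prompts = [f"simulated input {i}" for i in range(10_000)]
rate, low, high = estimate_success_rate(prompts, rng)
print(f"success rate {rate:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of simulated inputs grows, which is one reason a handful of functional tests says little about a non-deterministic system.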
I will explain how you can use tools like AgentCore to create orchestration agents that build other types of agents, making this new kind of non-deterministic testing possible. This approach will be for GenAI what traditional automated tests are for deterministic code:
- Auto-generate representative testing material
- Orchestrate tests against real AI endpoints
- Judge outputs (minimum standards, accuracy quantification)
- Improve accuracy and reliability
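The four steps above can be sketched as a small loop. This is an illustration in plain Python, not AgentCore code: `run_pipeline`, `Verdict`, and the three `toy_*` functions are hypothetical names, and in a real setup the input generator and judge would be agents and the endpoint would be a deployed AI service.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Verdict:
    prompt: str
    output: str
    meets_minimum: bool  # minimum-standards check
    score: float         # accuracy score in [0, 1]


def run_pipeline(
    generate_inputs: Callable[[int], Iterable[str]],
    call_endpoint: Callable[[str], str],
    judge: Callable[[str, str], tuple[bool, float]],
    n_cases: int,
) -> dict:
    """Generate test inputs, call the endpoint, judge each output, aggregate."""
    verdicts = []
    for prompt in generate_inputs(n_cases):
        output = call_endpoint(prompt)
        ok, score = judge(prompt, output)
        verdicts.append(Verdict(prompt, output, ok, score))
    if not verdicts:
        raise ValueError("no test cases were generated")
    total = len(verdicts)
    return {
        "cases": total,
        "pass_rate": sum(v.meets_minimum for v in verdicts) / total,
        "mean_score": sum(v.score for v in verdicts) / total,
        # Failing cases are kept so they can drive the improvement step.
        "failures": [v for v in verdicts if not v.meets_minimum],
    }


# Toy stand-ins (hypothetical) so the sketch runs without any AI service.
def toy_inputs(n: int) -> Iterable[str]:
    return (f"What is {i} + {i}?" for i in range(n))


def toy_endpoint(prompt: str) -> str:
    i = int(prompt.split()[2])
    # Deliberately wrong on every fifth case so the report has failures.
    return str(2 * i if i % 5 else 2 * i + 1)


def toy_judge(prompt: str, output: str) -> tuple[bool, float]:
    i = int(prompt.split()[2])
    correct = output == str(2 * i)
    return correct, 1.0 if correct else 0.0


report = run_pipeline(toy_inputs, toy_endpoint, toy_judge, n_cases=20)
print(report["cases"], report["pass_rate"], len(report["failures"]))
```

Passing the three roles in as callables keeps the orchestration separate from whichever agents fill them, and the `failures` list is what feeds the final "improve" step.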
Key takeaways:
- Current functional testing techniques are inadequate for testing agentic/generative AI systems
- What does it mean to use LLM-as-a-judge agents? What are input agents?
- How can you create an AI testing orchestration pipeline for testing AI agents?