The problem: it is too hard to understand and improve GenAI quality, and yet organizations are moving ahead regardless.

For AI engineers it’s hard to:
- Increase accuracy, due to a lack of repeatable and representative testing
- Understand reliability: know how, why, or when an agent will fail

This leads to poor reliability and accuracy, which:
- Increases operational costs and can increase reputational damage
- Erodes user trust, reduces customer engagement, and increases churn
- Reduces business confidence, slowing down AI adoption

In this talk I will discuss the limitations of how we are currently testing AI agents, and why this means we are not adequately ensuring the safety of agentic AI systems. With non-deterministic systems like generative/agentic AI, we need to simulate a large number of inputs (millions) and measure the outputs using judge agents to find the statistical success rate. This is a process that is more similar to traditional load testing than to the simple functional testing we’re using with AI right now. I will explain how you can instead use tools like AgentCore to create orchestration agents that build other types of agent to make this new type of non-deterministic testing possible.

This approach will be for GenAI what traditional automated tests are for deterministic code:
- Auto-generate representative testing material
- Orchestrate tests against real AI endpoints
- Judge outputs (minimum standards, accuracy quantification)
- Improve accuracy and reliability

Key takeaways:
- Current functional testing techniques are inadequate for testing agentic/generative AI systems
- What does it mean to use LLM-as-a-Judge agents? What are input agents?
- How can you create an orchestration pipeline for testing AI agents?
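
To make the idea of a statistical success rate concrete, here is a minimal sketch of the generate, run, judge loop described in the abstract. It is an illustration under stated assumptions, not the pipeline presented in the talk: `run_suite`, `wilson_interval`, `SuccessEstimate`, and `demo` are hypothetical names, and the input agent, the agent under test, and the judge are local stand-in functions rather than real LLM or AgentCore calls. The Wilson score interval is one common way to put error bars on a pass rate; the talk may well use a different statistic.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class SuccessEstimate:
    trials: int
    passes: int
    rate: float
    ci_low: float
    ci_high: float


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def run_suite(
    generate_input: Callable[[], str],
    agent: Callable[[str], str],
    judge: Callable[[str, str], bool],
    trials: int,
) -> SuccessEstimate:
    """Run many simulated inputs through the agent and count judged passes."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = 0
    for _ in range(trials):
        prompt = generate_input()   # input agent: produce a representative input
        output = agent(prompt)      # system under test (a real endpoint in practice)
        if judge(prompt, output):   # judge agent: does the output meet the standard?
            passes += 1
    low, high = wilson_interval(passes, trials)
    return SuccessEstimate(trials, passes, passes / trials, low, high)


def demo(seed: int = 0, trials: int = 2000) -> SuccessEstimate:
    """Stand-in agents (assumptions): a toy arithmetic task with a ~10% failure rate."""
    rng = random.Random(seed)

    def generate_input() -> str:
        return f"{rng.randint(1, 99)}+{rng.randint(1, 99)}"

    def flaky_agent(prompt: str) -> str:
        a, b = (int(x) for x in prompt.split("+"))
        # Simulate non-determinism: answer wrongly about 10% of the time.
        return str(a + b if rng.random() < 0.9 else a + b + 1)

    def judge(prompt: str, output: str) -> bool:
        a, b = (int(x) for x in prompt.split("+"))
        return output == str(a + b)

    return run_suite(generate_input, flaky_agent, judge, trials)
```

Calling `demo()` returns the observed pass rate together with a confidence interval. The interval narrows as `trials` grows, which is why this style of testing looks more like load testing than a single pass/fail functional check.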