Traditional software testing assumes deterministic behaviour: predictable inputs produce expected outputs. Agentic AI systems shatter this assumption. These autonomous agents make independent decisions, learn from interactions, and exhibit emergent behaviours that render traditional unit and integration testing insufficient.

This talk examines critical testing challenges through three real-world case studies:

- Voice AI Agent: Deployed across 20+ corporate environments, this system processes natural speech, maintains conversational context, and autonomously decides what additional information to provide. Traditional testing covered individual components but missed integration issues. For example, the agent would correctly understand a request for "Q3 sales figures" but then autonomously add irrelevant market trend analysis.
- Phone Caller Agent: Handles 5,000+ patient interactions for healthcare appointment scheduling and reminders. Standard integration tests passed, but the agent failed in production when it encountered background noise, elderly patients who needed slower conversations, or unexpected human responses that were not in the test scenarios.
- Chat Agent: Processes 100+ daily customer service conversations with multi-session context retention. While individual NLP components performed well, the integrated agent exhibited unexpected behaviours during complex, multi-issue conversations that spanned several sessions.

These case studies reveal five critical testing gaps:

- Non-deterministic behaviour validation: the same inputs can produce different valid outputs
- Contextual decision testing: validating autonomous choices about escalation, information depth, and communication style
- Multi-modal integration complexity: components work individually but fail in integrated agent workflows
- Continuous learning validation: ensuring agent improvements don't introduce biases or degrade existing capabilities
- Real-world variability simulation: testing across acoustic environments, human communication patterns, and infrastructure variations

The presentation introduces a practical testing framework specifically designed for agentic systems: Behavioural Goal Testing (testing goal achievement rather than exact outputs), Probabilistic Validation (acceptable outcome ranges rather than exact matches), Adversarial Scenario Generation (systematic edge case creation), and Contextual Journey Simulation (multi-session user interactions).

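To make Behavioural Goal Testing and Probabilistic Validation more concrete, here is a minimal Python sketch. It is not the framework presented in the talk. The stub `fake_agent`, its canned replies (including the made-up sales figure), the `goal_met` predicate, and the 0.8 threshold are all assumptions chosen for illustration. The idea is to run the same prompt many times, judge each reply against a goal instead of an exact string, and assert on the pass rate.

```python
import random
from typing import Callable


def fake_agent(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic agent call."""
    on_topic = [
        "Q3 sales were 4.2M, up on Q2.",
        "Sales for Q3 came to 4.2M.",
    ]
    off_topic = "Q3 sales were 4.2M. Here is a market trend analysis as well."
    # Assumed behaviour: roughly 1 reply in 10 drifts off topic.
    if rng.random() < 0.1:
        return off_topic
    return rng.choice(on_topic)


def goal_met(response: str) -> bool:
    """Behavioural goal: state the figure and do not add unrequested analysis."""
    text = response.lower()
    return "4.2m" in text and "market trend" not in text


def pass_rate(
    agent: Callable[[str, random.Random], str],
    prompt: str,
    goal: Callable[[str], bool],
    runs: int,
    seed: int,
) -> float:
    """Run the agent repeatedly and return the fraction of runs meeting the goal."""
    rng = random.Random(seed)  # seeded so the test itself is reproducible
    passed = sum(1 for _ in range(runs) if goal(agent(prompt, rng)))
    return passed / runs


rate = pass_rate(fake_agent, "What were the Q3 sales figures?", goal_met, runs=500, seed=0)
# Probabilistic validation: accept a range of outcomes, not one exact output.
assert rate >= 0.8, f"goal achieved in only {rate:.0%} of runs"
```

In a real suite the stub would be replaced by a call to the deployed agent, and the threshold would be set from observed baseline behaviour instead of being picked by hand.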
Key takeaways:

- How to test non-deterministic AI systems with confidence: Participants will learn how to move beyond exact assertions and design test oracles based on intent, semantics, and properties, enabling reliable validation of probabilistic LLM and agent outputs.
- Practical frameworks for validating LLMs and multi-agent architectures: Attendees will gain hands-on experience testing AI systems across layers, including orchestration, inference, and inter-agent communication, using structured frameworks and real-world scenarios.
- Actionable tools to operationalize AI quality in production: The workshop equips participants with Python-based evaluators, red teaming techniques, and automated quality metrics that can be integrated into CI/CD pipelines and governance strategies immediately.
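
As an illustration of a property-based test oracle of the kind mentioned in the takeaways, the following Python sketch checks an appointment-scheduling result against properties that any valid outcome should satisfy, instead of comparing it with one expected reply. The `Booking` structure, the field names, and the 09:00 to 17:00 opening hours are hypothetical and are not taken from the systems described above.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Booking:
    """Hypothetical structured result extracted from an agent conversation."""
    patient_id: str
    start: datetime
    confirmed_by_patient: bool


def check_properties(booking: Booking, now: datetime, requested_patient_id: str) -> list[str]:
    """Return the list of violated properties; an empty list means the oracle passes."""
    violations = []
    if booking.patient_id != requested_patient_id:
        violations.append("booked for a different patient")
    if booking.start <= now:
        violations.append("appointment is not in the future")
    # Assumed opening hours: 09:00 (inclusive) to 17:00 (exclusive).
    if not time(9, 0) <= booking.start.time() < time(17, 0):
        violations.append("outside assumed opening hours")
    if not booking.confirmed_by_patient:
        violations.append("patient never confirmed the slot")
    return violations


now = datetime(2025, 1, 6, 10, 0)
good = Booking("p-17", datetime(2025, 1, 8, 14, 30), True)
bad = Booking("p-99", datetime(2025, 1, 5, 18, 0), False)

assert check_properties(good, now, "p-17") == []
print(check_properties(bad, now, "p-17"))  # lists all four violations
```

Because the oracle returns violations as data, the same function could back a pytest assertion in a CI job or feed an aggregate quality metric; either use is a design choice for the team adopting it.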