Abstract

Evaluating complex conversational agents is challenging: static benchmarks and manual testing often lack the scale and behavioral diversity needed to detect certain failure modes, and agentic workloads add internal non-determinism and multi-step complexity. A system is described for the automated generation of synthetic, persona-driven dialogues. A multi-agentic process can generate detailed user personas, which can then condition a large language model (LLM) to synthesize multi-turn conversations. A search algorithm can explore conversational paths, and a separate LLM acting as a judge can evaluate the dialogues for realism and persona adherence. This process can produce a large-scale, curated corpus of diverse synthetic test data. The corpus can serve as a testbed to quantitatively assess the performance and stability of conversational agents, potentially facilitating the identification and measurement of complex failure modes.
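
As context, the sketch below illustrates one way such a pipeline could be wired together: persona generation, persona-conditioned dialogue synthesis against an agent under test, and LLM-as-judge curation. All function names, prompts, and thresholds are hypothetical assumptions rather than the disclosed implementation, and the search over conversational paths is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Illustrative sketch only: the disclosure names no APIs. `llm` and `agent`
# stand in for any text-in/text-out model call and the agent under test.
LLM = Callable[[str], str]

@dataclass
class Dialogue:
    persona: str                                        # free-text persona description
    turns: List[Dict[str, str]] = field(default_factory=list)

def generate_persona(llm: LLM) -> str:
    # Multi-agentic persona generation, collapsed to a single call for brevity.
    return llm("Invent a detailed user persona: background, goals, and speaking style.")

def synthesize_dialogue(llm: LLM, agent: LLM, persona: str, max_turns: int = 6) -> Dialogue:
    # Condition a user-simulator LLM on the persona and alternate with the agent under test.
    d = Dialogue(persona)
    for _ in range(max_turns):
        user_msg = llm(f"Persona: {persona}\nConversation so far: {d.turns}\n"
                       "Write the user's next message, staying in persona.")
        d.turns.append({"role": "user", "content": user_msg})
        d.turns.append({"role": "assistant", "content": agent(str(d.turns))})
    return d

def judge_dialogue(llm: LLM, d: Dialogue) -> float:
    # LLM-as-judge: score realism and persona adherence on a 0-1 scale.
    verdict = llm(f"Persona: {d.persona}\nDialogue: {d.turns}\n"
                  "Rate realism and persona adherence from 0 to 1. Reply with the number only.")
    try:
        return float(verdict.strip())
    except ValueError:
        return 0.0

def build_corpus(llm: LLM, agent: LLM, n: int = 100, threshold: float = 0.7) -> List[Dialogue]:
    # Generate many persona-driven dialogues and keep those the judge rates highly,
    # yielding a curated synthetic corpus for testing the agent.
    corpus = []
    for _ in range(n):
        persona = generate_persona(llm)
        d = synthesize_dialogue(llm, agent, persona)
        if judge_dialogue(llm, d) >= threshold:
            corpus.append(d)
    return corpus
```

In practice the two callables would wrap a chat-completion backend and the conversational agent being evaluated; the resulting corpus can then be replayed or scored to measure the agent's performance and stability.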

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
