Abstract
Presented herein is a system for automated generation, deployment, and evaluation of test cases for artificial intelligence (AI)-powered agents performing network operations. The system evaluates long-running agent trajectories rather than only single-turn outputs: it validates whether required tool calls were executed, assesses whether a reasonable diagnostic process was followed, and checks whether conclusions were accurate and helpful. Procedural documentation, such as skills, is used as a proxy for coverage so that key use cases can be prioritized and the test suite can scale as the domain grows. The system also generates and pre-validates synthetic responses to anticipated tool calls, so that evaluation requires neither a physical network nor a full-stack network clone. By combining skill-driven test generation, lightweight mock network data, semantic model-based judging, trace collection, and reflective improvement, the system lowers the barrier to comprehensive validation of deep network agents.
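As a rough illustration of two ideas in the abstract (serving pre-generated synthetic responses to tool calls, and checking that a trajectory contains the required tool calls), the following is a minimal sketch in Python. It is not the disclosed implementation: the names `SkillTestCase`, `Trajectory`, `run_mock_tool`, and `missing_required_tools`, the tool names, and the mock outputs are all hypothetical, and the model-based judging of diagnostic process and conclusions is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SkillTestCase:
    """Hypothetical test case derived from one skill document."""
    name: str
    required_tools: list[str]        # tool calls the agent is expected to make
    mock_responses: dict[str, str]   # synthetic output keyed by tool name


@dataclass
class Trajectory:
    """Record of what the agent did during one evaluation run."""
    tool_calls: list[str] = field(default_factory=list)
    conclusion: str = ""


def run_mock_tool(case: SkillTestCase, tool: str, trajectory: Trajectory) -> str:
    """Log the call and return the synthetic response instead of touching a network."""
    trajectory.tool_calls.append(tool)
    return case.mock_responses.get(tool, f"ERROR: no mock response for {tool}")


def missing_required_tools(case: SkillTestCase, trajectory: Trajectory) -> list[str]:
    """Return required tools that never appear in the trajectory, in declared order."""
    called = set(trajectory.tool_calls)
    return [tool for tool in case.required_tools if tool not in called]


# Illustrative test case; the tool names and outputs are made up for this sketch.
case = SkillTestCase(
    name="interface-down triage",
    required_tools=["show_interface", "show_logs"],
    mock_responses={
        "show_interface": "eth0: link down",
        "show_logs": "eth0 changed state to down",
    },
)
```

In this sketch, an agent under test would be wired to `run_mock_tool` in place of its real tool layer, and `missing_required_tools` would be applied to the finished trajectory. An empty result only shows that the required calls were made; whether the diagnostic process was reasonable and the conclusion accurate would still need the separate semantic judging step the abstract describes.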
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Holland, Ryan; Ganesan, Elango; Salman, Samer; and Zhao, Yao, "AI AGENT EVALUATION WITH SKILL-DRIVEN TEST CASES AND LIGHTWEIGHT NETWORK SIMULATION", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10420