Abstract
In data-driven software testing, there is no universally accepted best practice for generating test data. The task is challenging because critical user journeys involve high-dimensional, interacting features and parameters. This disclosure describes methods that leverage natural language processing (NLP) techniques to downsample a high-dimensional data space while preserving test coverage of important usage patterns and parameter interactions, even when they constitute edge cases. A horizontally scalable, NLP-inspired dataflow recognizes multidimensional patterns in structured logs and then samples the logs to cover those patterns. The pattern recognition and sampling stages can be preceded by an optional sessionization stage, which groups related log entries into sessions. Test data sampling is framed as an optimization problem constrained by a snippet coverage requirement, where each snippet represents a pattern that a machine learning model identifies as worthy of testing. An information-theoretic score measures test coverage. Although they originate in natural language processing, the described techniques apply to software testing and, more generally, to situations where behavioral and usage patterns can be mined from structured logs to improve software reliability and guide business intelligence.
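The sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation, and it rests on several assumptions that the abstract does not state: each log entry is treated as a bag of "field=value" words, a snippet is taken to be any pair of words that co-occur in one entry (simple counting stands in for the machine learning model), sampling uses a greedy set-cover heuristic, and the coverage score weights each snippet by its surprisal. All function names and the sample log fields are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def snippets(entry, size=2):
    """Bag-of-words view of one log entry: every unordered combination of
    `size` 'field=value' words. (Assumed definition of a snippet.)"""
    words = sorted(f"{k}={v}" for k, v in entry.items())
    return set(combinations(words, size))


def mine_snippets(logs, size=2, min_count=1):
    """Count snippets across all entries and keep those seen at least
    `min_count` times. Plain counting replaces the ML model here."""
    counts = Counter(s for entry in logs for s in snippets(entry, size))
    return {s: c for s, c in counts.items() if c >= min_count}


def greedy_sample(logs, targets, size=2):
    """Greedy set cover: repeatedly pick the entry that covers the most
    still-uncovered target snippets. Returns indices into `logs`."""
    uncovered = set(targets)
    per_entry = [snippets(entry, size) & uncovered for entry in logs]
    chosen = []
    while uncovered:
        best = max(range(len(logs)), key=lambda i: len(per_entry[i] & uncovered))
        gain = per_entry[best] & uncovered
        if not gain:  # remaining targets cannot be covered by any entry
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


def surprisal_coverage(logs, chosen, targets, size=2):
    """Assumed information-theoretic score: fraction of total snippet
    surprisal covered by the sample, so rare patterns weigh more."""
    n = len(logs)
    # n + 1 keeps every weight strictly positive, even for ubiquitous snippets.
    weight = {s: -log2(c / (n + 1)) for s, c in targets.items()}
    covered = set()
    for i in chosen:
        covered |= snippets(logs[i], size) & set(targets)
    return sum(weight[s] for s in covered) / sum(weight.values())


# Hypothetical structured logs: three common entries and two rarer ones.
logs = [
    {"os": "android", "locale": "en", "action": "search"},
    {"os": "android", "locale": "en", "action": "search"},
    {"os": "android", "locale": "en", "action": "search"},
    {"os": "ios", "locale": "en", "action": "search"},
    {"os": "ios", "locale": "ja", "action": "purchase"},
]
targets = mine_snippets(logs)
chosen = greedy_sample(logs, targets)
score = surprisal_coverage(logs, chosen, targets)
print(chosen, round(score, 3))
```

In this toy input the duplicate entries are dropped while the single "ios / ja / purchase" entry is retained, which mirrors the stated goal of covering edge-case parameter interactions with a smaller sample. Greedy set cover is only an approximation to the constrained optimization described above; an exact solver or a different scoring function could be substituted.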
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Gao, Yifan; Yang, Dan; Liu, Yongtai; Zhou, Haotian; Bernstein, Alon; and Qian, Zhenzhi, "Obtaining Test Data Using a Bag-of-Words Model on Structured Logs", Technical Disclosure Commons, (October 07, 2024)
https://www.tdcommons.org/dpubs_series/7411