The performance of database management systems is often tested using synthetic data and workload generators such as TPC-H and TPC-C. However, most synthetic benchmarks do not fully match customer database configurations. Customer data sets are typically hard to obtain because they are sensitive and prohibitively large. As a result, database management systems are often not thoroughly tested, and performance-related bugs are commonly discovered after deployment, where the cost of fixing them is very high. We propose XGen, a scalable data generator that produces data sets from customer metadata, including integrity constraints and histogram statistics. Handling multiple referential integrity constraints is a hard problem. We address it with a novel approach that indirectly encodes the column dependencies, so that the data for each column can still be generated independently, which keeps data generation scalable.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.