Abstract
Evaluating conversational artificial intelligence systems can depend on diverse datasets, but raw user data may introduce privacy risks, and some anonymization techniques, such as redaction, can degrade data utility. A system for generating synthetic data can address this by transforming original text prompts. For example, a process can tokenize a prompt, generate a vector embedding for each token, and run a semantic search for each embedding against a vector database. This database may contain embeddings for terms from a controlled, privacy-safe lexicon. Each original token can then be replaced with its semantically closest counterpart from this lexicon, and the replacements assembled into a new, synthetic prompt. This technique can produce privacy-preserving datasets that approximate the semantic intent of the original interactions. Because the synthetic prompts draw on a pre-approved, non-sensitive vocabulary, the resulting datasets may be suitable for development and evaluation environments.
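The process described above (tokenize, embed, search the lexicon, replace, reassemble) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and is not taken from the disclosure: the `TOY_EMBEDDINGS` table stands in for a trained embedding model, a brute-force cosine-similarity scan over `SAFE_INDEX` stands in for the vector database, and the three-term `SAFE_LEXICON` stands in for the controlled lexicon. The sketch also assumes that tokens with no embedding pass through unchanged, a simplification that a real privacy pipeline would likely need to handle differently.

```python
import math
import re

# Hypothetical hand-written 3-d embeddings; a real system would obtain
# vectors from a trained embedding model instead of a lookup table.
TOY_EMBEDDINGS = {
    "alice": (0.9, 0.1, 0.0),
    "seattle": (0.1, 0.9, 0.0),
    "flight": (0.0, 0.1, 0.9),
    "person": (1.0, 0.0, 0.0),
    "city": (0.0, 1.0, 0.0),
    "trip": (0.1, 0.2, 0.9),
}

# Hypothetical controlled, privacy-safe lexicon.
SAFE_LEXICON = ["person", "city", "trip"]

# Stand-in for the vector database: lexicon term -> embedding.
SAFE_INDEX = {term: TOY_EMBEDDINGS[term] for term in SAFE_LEXICON}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_safe_term(vector):
    """Return the lexicon term whose embedding is most similar to `vector`."""
    return max(SAFE_INDEX, key=lambda term: cosine(vector, SAFE_INDEX[term]))


def synthesize(prompt):
    """Replace each embeddable token with its nearest privacy-safe term."""
    tokens = re.findall(r"[a-z0-9']+", prompt.lower())
    replaced = []
    for token in tokens:
        vector = TOY_EMBEDDINGS.get(token)
        # Assumption: tokens without an embedding are kept as-is.
        replaced.append(nearest_safe_term(vector) if vector is not None else token)
    return " ".join(replaced)


if __name__ == "__main__":
    print(synthesize("Alice booked a flight to Seattle"))
```

With these toy vectors, the name and the place name map to the generic lexicon terms nearest to them, while the overall shape of the request is preserved. A production variant would presumably swap the linear scan for an approximate nearest-neighbor index and decide explicitly how to treat tokens that fall outside the embedding vocabulary.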
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Manickavelu, Paramesh; Rutowski, Tomek; Mirza, Saeed; Hsu, Paul; and Malesevic, Stevan, "Generation of Synthetic Data via Embedding-Based Semantic Replacement From a Controlled Lexicon", Technical Disclosure Commons, (April 06, 2026)
https://www.tdcommons.org/dpubs_series/9710