Abstract

Evaluation of conversational artificial intelligence systems can depend on diverse datasets, but using raw user data may introduce privacy risks, and some anonymization techniques, such as redaction, can degrade data utility. A system for generating synthetic data can operate by transforming original text prompts. For example, a process can involve tokenizing a prompt, generating vector embeddings for the tokens, and performing a semantic search against a vector database. This database may contain embeddings for terms from a controlled, privacy-safe lexicon. Original tokens can be replaced with semantically similar counterparts from this lexicon to assemble a new, synthetic prompt. This technique can produce privacy-preserving datasets that approximate the semantic intent of the original interactions by using a pre-approved, non-sensitive vocabulary, which may be suitable for development and evaluation environments.
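The pipeline described above can be sketched in Python. The lexicon, the character-frequency embedding, and the in-memory "vector database" below are illustrative stand-ins, not part of the original disclosure; a production system would use a learned text-embedding model and a real vector store.

```python
import math

# Hypothetical privacy-safe lexicon: a controlled, pre-approved vocabulary.
SAFE_LEXICON = ["city", "person", "company", "date", "product"]

def embed(token):
    # Toy stand-in for an embedding model: a 26-dimensional
    # character-frequency vector. Real systems would use learned embeddings.
    vec = [0.0] * 26
    for ch in token.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity used for the semantic nearest-neighbor search.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in "vector database": precomputed embeddings for the safe lexicon.
LEXICON_INDEX = [(term, embed(term)) for term in SAFE_LEXICON]

def synthesize(prompt):
    # 1. Tokenize the prompt, 2. embed each token, 3. find the most
    # similar term in the safe lexicon, 4. assemble the synthetic prompt.
    out = []
    for token in prompt.split():
        qv = embed(token)
        best_term, _ = max(LEXICON_INDEX, key=lambda item: cosine(qv, item[1]))
        out.append(best_term)
    return " ".join(out)
```

Because every output token is drawn from the pre-approved lexicon, the synthetic prompt cannot leak the original sensitive terms, while token-level similarity search preserves an approximation of the original semantic intent.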

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.