Abstract

Recent years have seen a proliferation of systems designed to respond to voice, e.g., smart speakers, smartphones, and other voice-activated devices. Such systems are convenient and intuitive to use. However, they also open the possibility for malicious actors to gain control of sensitive information through fraudulent use of speech commands. For example, a malicious actor can spoof the voice recognition system by playing a recording of the user’s voice or by using a computer to synthesize speech that mimics the user. The techniques of this disclosure thwart such attacks by providing a random phrase for the user to repeat and verifying both the resulting audio and its transcription for authenticity. Spoofed audio sounds choppy and exhibits sharp transitions, while authentic audio is smooth and lacks such unnatural artifacts. A trained machine learning model is used to distinguish the two.
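The following is a minimal sketch of the challenge-and-verify flow described above. The word list, the `transcribe()` speech-to-text function, and the `spoof_score()` classifier are all hypothetical placeholders; the disclosure does not specify a concrete implementation.

```python
import secrets

# Hypothetical word list for building random challenge phrases.
WORDS = ["apple", "river", "orange", "candle", "mountain", "violet"]


def make_challenge(n_words: int = 4) -> str:
    """Pick a random phrase for the user to repeat aloud."""
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))


def transcribe(audio: bytes) -> str:
    """Placeholder for a speech-to-text system (assumed, not specified)."""
    raise NotImplementedError


def spoof_score(audio: bytes) -> float:
    """Placeholder for a trained ML model that scores audio for spoofing
    artifacts (choppiness, sharp transitions); higher means more likely
    spoofed. Assumed, not specified in the disclosure."""
    raise NotImplementedError


def verify_response(challenge: str, audio: bytes,
                    spoof_threshold: float = 0.5) -> bool:
    """Accept only if the transcription matches the challenge phrase and
    the classifier finds no spoofing artifacts in the audio."""
    if transcribe(audio).strip().lower() != challenge.lower():
        return False  # Wrong phrase: a replayed recording fails here.
    # Synthesized or stitched-together audio fails the artifact check.
    return spoof_score(audio) < spoof_threshold
```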

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
