Detecting the language of a query is an important task for applications such as search engines and virtual assistants. Correctly determining the language of short text is difficult, and it becomes harder still when the text includes references to entities, brands, or other words that are common across multiple languages. Language classifiers can be used both to identify the language of a query and to determine the language in which the user intends to receive results. This disclosure describes techniques to define and identify language-agnostic tokens and phrases in text. The techniques can be used to identify language-agnostic text (e.g., in a user query) and to select the language model that works best for such text. They can help reduce systematic bias towards English-language content for multilingual users, and they improve language detection for a given text by reducing the noise that language-agnostic queries introduce.
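As a rough illustration of the idea, the following Python sketch separates language-agnostic tokens from the rest of a query before scoring the language. It is not the method of the disclosure: the token list `AGNOSTIC_TOKENS`, the cue-word table `LANGUAGE_CUES`, the `Detection` type, and the fallback to a `user_language` preference are all hypothetical stand-ins, and the cue-word count is a placeholder for a real language classifier.

```python
from dataclasses import dataclass

# Hypothetical vocabulary of language-agnostic tokens (brands, entity names).
AGNOSTIC_TOKENS = {"youtube", "netflix", "ipl", "whatsapp"}

# Hypothetical cue words per language; a stand-in for a trained classifier.
LANGUAGE_CUES = {
    "en": {"the", "how", "to", "near", "me"},
    "es": {"el", "la", "como", "cerca", "de"},
}


@dataclass
class Detection:
    language: str   # language code chosen for the query
    agnostic: bool  # True if every token was language agnostic


def split_tokens(query):
    """Split a query into (agnostic tokens, remaining tokens)."""
    tokens = query.lower().split()
    agnostic = [t for t in tokens if t in AGNOSTIC_TOKENS]
    rest = [t for t in tokens if t not in AGNOSTIC_TOKENS]
    return agnostic, rest


def detect_language(query, user_language):
    """Score only the non-agnostic tokens; fall back to the user's language."""
    _, rest = split_tokens(query)
    if not rest:
        # The whole query is language agnostic, so the text carries no signal.
        return Detection(user_language, True)
    scores = {
        lang: sum(token in cues for token in rest)
        for lang, cues in LANGUAGE_CUES.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # No cue word matched; assume the user's preferred language.
        return Detection(user_language, False)
    return Detection(best, False)
```

Under these assumptions, a query such as `netflix` is flagged as language agnostic and resolved to the user's preferred language rather than defaulting to English, while `como ver netflix` is scored on `como ver` alone, so the brand name does not influence the result.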
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Sareen, Shubhi; Garg, Vijay; Khandelwal, Abhinav; Gupta, Nidhi; Varanasi, Srinivas; and Singh, Aditya, "Language Identification for Language Agnostic Text", Technical Disclosure Commons, (March 24, 2023)