This disclosure describes techniques that improve video search by accepting visual input from a user in addition to a text query. The text input and the visual input are encoded separately using a dual encoder model, and the resulting encodings are combined into a query fingerprint vector. The dual encoder model is trained with a contrastive loss so that, in the resulting feature space, a simple L2 distance suffices to identify videos that match the input query. The fingerprint vector lies in the same feature space as the pre-computed vectors for the available videos, which enables fast comparison and ranking. The matching videos are ranked, and the ranked list is displayed in response to the user query.
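
The query-time flow described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the encoders are replaced by hand-written toy embeddings, and the fusion step (a weighted average of the normalized text and visual embeddings, followed by renormalization) is an assumption, since the text does not specify how the two encodings are combined. The names `combine_embeddings`, `rank_videos`, `text_weight`, and `video_index` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def combine_embeddings(text_vec, visual_vec, text_weight=0.5):
    """Fuse text and visual embeddings into one query fingerprint.

    Assumption: a weighted average of the normalized embeddings,
    renormalized afterwards. The source does not specify the fusion method.
    """
    if len(text_vec) != len(visual_vec):
        raise ValueError("embeddings must have the same dimension")
    t = l2_normalize(text_vec)
    v = l2_normalize(visual_vec)
    mixed = [text_weight * a + (1.0 - text_weight) * b for a, b in zip(t, v)]
    return l2_normalize(mixed)


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_videos(query_fp, video_index, top_k=3):
    """Return the top_k (video_id, distance) pairs, closest first."""
    scored = [(vid, l2_distance(query_fp, vec)) for vid, vec in video_index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
text_embedding = [1.0, 0.0, 0.0]
visual_embedding = [0.0, 1.0, 0.0]

# Pre-computed fingerprints for the available videos, in the same feature space.
video_index = {
    "video_a": l2_normalize([1.0, 1.0, 0.0]),
    "video_b": l2_normalize([1.0, 0.0, 0.0]),
    "video_c": l2_normalize([0.0, 0.0, 1.0]),
}

query_fp = combine_embeddings(text_embedding, visual_embedding)
ranking = rank_videos(query_fp, video_index)
for video_id, distance in ranking:
    print(f"{video_id}: {distance:.4f}")
```

In this toy index, `video_a` lies closest to the fused query because it matches both the text and the visual component, while `video_b` matches only the text component and `video_c` matches neither. A production system would likely replace the linear scan in `rank_videos` with a nearest-neighbor index, which the shared feature space makes possible.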

This work is licensed under a Creative Commons Attribution 4.0 License.