Abstract
Users can search their photo libraries based on the visual content of their photos, with techniques such as image-to-text encoder models and object recognition used to index photos. However, search techniques that do not utilize user-provided captions or other text associated with photos may fail to return matching photos when the visual content alone does not match the search query. This disclosure describes the use of a visual-language model (VLM) encoder to encode photos and their associated captions (when available) into the same embedding space as the search query. The VLM encoder is trained using a contrastive loss. Photos that match a query can then be identified by comparing the metric distance between photo embeddings and the query embedding. The described techniques can identify photos that match a search query based on the visual content of the photo, the text caption associated with the photo, or both.
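The sketch below illustrates the search flow at a high level. It is not the disclosure's implementation: as a simplifying assumption, a pretrained CLIP-style model (loaded via the sentence-transformers library) stands in for the contrastive-trained VLM encoder, and the photo and caption are embedded separately and averaged rather than jointly encoded. The model name, the fusion step, and the helper functions are assumptions for illustration only.

```python
# Minimal sketch of joint photo + caption semantic search.
# Assumptions: a CLIP-style encoder stands in for the disclosure's VLM encoder,
# and averaging the image and caption embeddings approximates joint encoding.

import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # assumed stand-in encoder


def embed_photo(image_path: str, caption: str | None = None) -> np.ndarray:
    """Embed a photo and, when available, its caption into a shared space."""
    image_emb = model.encode(Image.open(image_path), normalize_embeddings=True)
    if caption:
        caption_emb = model.encode(caption, normalize_embeddings=True)
        joint = (image_emb + caption_emb) / 2.0  # assumed fusion step
        return joint / np.linalg.norm(joint)     # re-normalize to unit length
    return image_emb


def search(query: str, library: list[dict], top_k: int = 5) -> list[dict]:
    """Rank indexed photos by cosine similarity to the query embedding."""
    query_emb = model.encode(query, normalize_embeddings=True)
    scored = [(float(np.dot(query_emb, p["embedding"])), p) for p in library]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [p for _, p in scored[:top_k]]


# Example usage: index two photos (one captioned) and run a text query.
library = [
    {"path": "beach.jpg",
     "embedding": embed_photo("beach.jpg", "sunset at the beach")},
    {"path": "dog.jpg", "embedding": embed_photo("dog.jpg")},
]
results = search("golden hour by the ocean", library)
```

Because all embeddings are normalized, the dot product equals cosine similarity, so ranking by it is equivalent to ranking by metric distance in the shared embedding space; a photo can match through its visual content, its caption, or both.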
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, D., "Joint Photo and Caption Semantic Search Using a Visual-Language Model Encoder", Technical Disclosure Commons, (February 12, 2024)
https://www.tdcommons.org/dpubs_series/6680