Inventor(s)

D Shin

Abstract

Users can search their photo libraries based on the visual content of their photos, using techniques such as image-to-text encoder models and object recognition to index photos. However, search techniques that do not utilize user-provided captions or other text associated with photos may fail to return relevant photos when the visual content alone does not match the query. This disclosure describes the use of a visual-language model (VLM) encoder to encode photos and their associated captions (when available) into the same embedding space as the search query. The VLM encoder is trained with a contrastive loss. Photos that match a query can then be identified by comparing photo embeddings and query embeddings based on metric distance. The described techniques can identify photos that match a search query based on either or both the visual content of the photo and the text caption associated with the photo.
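
The following is a minimal sketch of the idea, not the disclosure's implementation: it assumes an off-the-shelf contrastively trained VLM (CLIP via Hugging Face transformers), uses averaging as an assumed fusion of image and caption embeddings, and uses cosine similarity as a stand-in for the metric distance; the model name, fusion strategy, and helper functions are illustrative choices.

```python
# Illustrative sketch only: the disclosure does not specify a particular VLM
# or fusion strategy. CLIP is used here as an example of a contrastively
# trained encoder; averaging image and caption embeddings is an assumed
# fusion choice, and cosine similarity stands in for metric distance.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def _normalize(x: torch.Tensor) -> torch.Tensor:
    return x / x.norm(dim=-1, keepdim=True)

def embed_photo(image: Image.Image, caption: str | None = None) -> torch.Tensor:
    """Encode a photo (and its caption, when available) into the shared space."""
    with torch.no_grad():
        img = _normalize(model.get_image_features(
            **processor(images=image, return_tensors="pt")))
        if caption is None:
            return img
        txt = _normalize(model.get_text_features(
            **processor(text=[caption], return_tensors="pt",
                        padding=True, truncation=True)))
    # Assumed fusion: average the two unit vectors, then re-normalize.
    return _normalize((img + txt) / 2)

def search(query: str, photo_embeddings: torch.Tensor, top_k: int = 5):
    """Rank indexed photos by cosine similarity to the text query embedding."""
    with torch.no_grad():
        q = _normalize(model.get_text_features(
            **processor(text=[query], return_tensors="pt", padding=True)))
    scores = (photo_embeddings @ q.T).squeeze(-1)  # cosine similarity
    return scores.topk(min(top_k, scores.numel()))
```

In this sketch, `photo_embeddings` would be a precomputed matrix stacking the per-photo embeddings produced at indexing time, so query-time matching reduces to a nearest-neighbor comparison in the shared embedding space.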

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
