Abstract

Many online platforms allow entities such as merchants to post multimedia content, e.g., text accompanied by images. For example, merchants on online stores, digital maps, etc. can add posts to their account to highlight their merchandise and special offers. Matching user text queries against only the text portion of such multimedia content ignores the visual portion (image) of the merchant post. This disclosure describes the use of dual encoders to match a user query embedding against an embedding of the textual content and an embedding of the visual content of the same multimedia post, producing respective text and image relevance scores. Incorporating information from both modalities improves search quality. A multimodal ranker then ranks merchant posts based on the text and image relevance scores together with post metadata such as freshness, user reviews for the post, etc. The dual encoders can be trained using human-labeled as well as LLM-generated data.
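
The following Python sketch illustrates how dual-encoder relevance scores and metadata signals could be combined by a multimodal ranker. The encoder functions (encode_text, encode_image) and the ranker weights are hypothetical placeholders introduced for illustration only; an actual system would use pretrained text and image encoders that map into a shared embedding space and learned ranking weights, neither of which is specified here.

```python
# Minimal sketch of dual-encoder scoring for multimedia merchant posts.
# The encoders below are hypothetical stand-ins for pretrained text and image
# encoders that embed inputs into a shared space (e.g., a CLIP-style model).

import numpy as np

EMBED_DIM = 64  # assumed embedding dimensionality for this sketch


def encode_text(text: str) -> np.ndarray:
    """Hypothetical text encoder: returns a unit-norm embedding."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)


def encode_image(image_id: str) -> np.ndarray:
    """Hypothetical image encoder: returns a unit-norm embedding."""
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)


def score_post(query: str, post: dict) -> float:
    """Combine text relevance, image relevance, and post metadata into one score."""
    q = encode_text(query)
    text_relevance = float(q @ encode_text(post["text"]))        # cosine similarity
    image_relevance = float(q @ encode_image(post["image_id"]))  # cosine similarity

    # Illustrative multimodal ranker: a weighted sum of the two relevance
    # scores plus metadata signals (freshness, review rating). The weights
    # are placeholders, not values from the disclosure.
    return (0.5 * text_relevance
            + 0.3 * image_relevance
            + 0.1 * post["freshness"]
            + 0.1 * post["review_score"])


if __name__ == "__main__":
    posts = [
        {"text": "Weekend sale on running shoes", "image_id": "img_001",
         "freshness": 0.9, "review_score": 0.8},
        {"text": "New espresso machines in stock", "image_id": "img_002",
         "freshness": 0.4, "review_score": 0.95},
    ]
    query = "running shoes discount"
    for post in sorted(posts, key=lambda p: score_post(query, p), reverse=True):
        print(round(score_post(query, post), 3), post["text"])
```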

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
