Abstract
Photo search technologies can identify the content of an image but cannot answer queries about the spatial relationships between entities in the image. This disclosure describes computational architectures and techniques for spatial, compositional photo search and retrieval. Subjective, spatially rich natural-language queries are translated into strict geometric and relational constraints, which are evaluated efficiently against a large corpus of pre-computed spatial indices. Upon ingestion, an image is processed to generate a highly compressed, pre-computed spatial scene graph G = (V, E, C), where V is a set of multimodal entities and bounding boxes, E is a set of directed spatial relational edges, and C is a set of semantic background and compositional global features. When a user submits a natural-language query, a large language model (LLM) parser translates the unstructured text into a structured, machine-readable spatial layout specification. A retrieval engine then evaluates candidate spatial scene graphs against the generated layout specification.
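The pipeline above can be sketched in miniature: a scene graph holds labeled entities with bounding boxes, spatial edges are derived from box geometry, and a retrieval check tests one (subject, relation, object) constraint from a parsed layout specification. This is a minimal, assumed illustration; the class and function names, the box format, and the simple horizontal-center "left_of" rule are hypothetical, not the published design.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """A node in V: an entity label plus a bounding box (x0, y0, x1, y1)."""
    label: str
    box: tuple

def relation(a: Entity, b: Entity) -> str:
    """Derive a coarse directed spatial edge (an element of E) from two
    bounding boxes, using horizontal centers (an illustrative assumption)."""
    a_cx = (a.box[0] + a.box[2]) / 2
    b_cx = (b.box[0] + b.box[2]) / 2
    return "left_of" if a_cx < b_cx else "right_of"

def matches(graph: list, constraint: tuple) -> bool:
    """Check whether a scene graph satisfies one (subject, relation, object)
    constraint from the spatial layout specification."""
    subj, rel, obj = constraint
    return any(
        relation(a, b) == rel
        for a in graph if a.label == subj
        for b in graph if b.label == obj
    )

# A toy scene: a dog to the left of a person.
scene = [Entity("dog", (10, 40, 60, 90)), Entity("person", (70, 20, 120, 90))]
print(matches(scene, ("dog", "left_of", "person")))   # → True
print(matches(scene, ("person", "left_of", "dog")))   # → False
```

In a full system, an LLM parser would emit a set of such constraints from the user's query, and the retrieval engine would evaluate them against the pre-computed scene graphs of the corpus rather than a single in-memory scene.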
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Yakar, Tamar and Labzovsky, Ilia, "Semantic Photo Search with Compositional Spatial Reasoning", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10133