Abstract
This document describes techniques that can improve the efficiency and privacy of multimodal queries by using an intelligent on-device context buffer (also known as a local cache, edge computing architecture, or edge storage) for frame selection and collation. When a user interacts with a voice assistant or artificial intelligence (AI) agent using a natural language query, the local computing system can determine whether the query calls for visual context. Rather than continuously streaming raw video to a remote server, the system can maintain a privacy-aware, zero-retention buffer, operating as a rolling in-memory buffer or rolling video loop, that temporarily stores recently captured video frames alongside associated semantic metadata and image embeddings. If the system determines that visual context is applicable, a vehicle agent orchestrator can map the user's semantic intent to the cached semantic metadata or image embeddings to identify and select relevant frames. The selected frames can then be bundled and sent to a backend server, for example one running a Vision Language Model or Visual Language Model (VLM) or a Large Vision Model (LVM), to fulfill the user's request. By selecting and transmitting only a targeted subset of frames based on user intent, this approach can reduce network bandwidth usage, lower query latency, and minimize retention of ambient visual data to protect user privacy.
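The buffering and selection steps described above can be illustrated with a minimal sketch. It is not the disclosed implementation. It assumes that frame and query embeddings come from some on-device encoder (not shown), that relevance is scored by cosine similarity, and that the buffer is bounded by both frame count and age. The names `Frame`, `ContextBuffer`, `max_frames`, `max_age_s`, `top_k`, and `min_score` are hypothetical and chosen only for illustration.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float   # capture time in seconds
    pixels: bytes      # stand-in for raw image data
    embedding: list    # image embedding from a hypothetical on-device encoder
    tags: tuple = ()   # semantic metadata, e.g. detected object labels


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ContextBuffer:
    """Rolling in-memory buffer: old frames fall out by count and by age."""

    def __init__(self, max_frames=8, max_age_s=30.0):
        # deque(maxlen=...) silently drops the oldest frame when full.
        self._frames = deque(maxlen=max_frames)
        self._max_age_s = max_age_s

    def __len__(self):
        return len(self._frames)

    def add(self, frame):
        self._frames.append(frame)

    def _evict_stale(self, now):
        # Frames are appended in capture order, so the oldest is at the left.
        while self._frames and now - self._frames[0].timestamp > self._max_age_s:
            self._frames.popleft()

    def select(self, query_embedding, now, top_k=2, min_score=0.5):
        """Return up to top_k frames most similar to the query, in capture order."""
        self._evict_stale(now)
        scored = [(cosine(query_embedding, f.embedding), f) for f in self._frames]
        relevant = [sf for sf in scored if sf[0] >= min_score]
        relevant.sort(key=lambda sf: sf[0], reverse=True)
        chosen = [f for _, f in relevant[:top_k]]
        return sorted(chosen, key=lambda f: f.timestamp)
```

In this sketch, only the frames returned by `select` would be bundled and sent to the backend; everything else stays in memory on the device and is discarded as the buffer rolls over or the frames age out.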
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Ayalasomayajula, Shishir and Arora, Ankit, "ON-DEVICE CONTEXT BUFFERING AND FRAME SELECTION FOR MULTIMODAL QUERIES", Technical Disclosure Commons, (June 16, 2026)
https://www.tdcommons.org/dpubs_series/10468