Abstract
This disclosure describes a system that helps artificial intelligence (AI) models, such as foundation models, generative AI models, and large language models (LLMs), process continuous video streams or non-stop broadcasts without exceeding their context window or losing track of time. AI models have a limited context window (working memory), and they can suffer from temporal hallucination: when large portions of an uninterrupted video are skipped, the model becomes confused about when events occurred. The system addresses temporal hallucination in two steps. First, a dynamic saliency transformer acts as a filter that removes low-saliency or repetitive parts of the video, judged against an auto-calibrating reference point. Second, to keep the timeline accurate, the system inserts special time-gap tokens into the data stream. These tokens act as temporal offsets, similar to digital bookmarks, that tell the neural network exactly how much time passed during each skipped section. Time-gap tokens allow the model to analyze long video streams efficiently while retaining a precise understanding of the temporal distance between events and the total duration of the event.
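The pruning and gap-token steps described in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The names `Frame`, `TimeGapToken`, and `prune_with_gap_tokens` are hypothetical, the per-frame saliency scores are assumed to be supplied by some upstream model, and the auto-calibrating reference point is approximated here by an exponential moving average of recent saliency, which is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Frame:
    timestamp: float  # seconds since the start of the stream
    saliency: float   # hypothetical saliency score in [0, 1]


@dataclass
class TimeGapToken:
    seconds: float  # time elapsed between the two kept frames around a skip


def prune_with_gap_tokens(
    frames: List[Frame], alpha: float = 0.5
) -> List[Union[Frame, TimeGapToken]]:
    """Drop low-saliency frames and mark each skipped span with a gap token.

    Assumption: the reference point is an exponential moving average (EMA)
    of saliency with smoothing factor `alpha`; the first frame is always kept.
    """
    output: List[Union[Frame, TimeGapToken]] = []
    reference: Optional[float] = None
    last_kept_ts: Optional[float] = None
    skipped = False

    for frame in frames:
        if reference is None:
            keep = True
            reference = frame.saliency
        else:
            # Compare against the reference before updating it.
            keep = frame.saliency > reference
            reference = (1 - alpha) * reference + alpha * frame.saliency

        if keep:
            if skipped and last_kept_ts is not None:
                # Record how much time passed across the skipped section.
                output.append(TimeGapToken(frame.timestamp - last_kept_ts))
            output.append(frame)
            last_kept_ts = frame.timestamp
            skipped = False
        else:
            skipped = True

    return output


# Six frames, one per second; frames at t=1..3 are low-saliency.
stream = [
    Frame(0.0, 0.9), Frame(1.0, 0.1), Frame(2.0, 0.1),
    Frame(3.0, 0.1), Frame(4.0, 0.8), Frame(5.0, 0.9),
]
pruned = prune_with_gap_tokens(stream)
```

In this example the three low-saliency frames are dropped and a single gap token carrying 4.0 seconds is placed between the frames at t=0 and t=4, so a downstream model can still recover the elapsed time even though it never sees the skipped frames.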
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Yakar, Tamar and Labzovsky, Ilia, "LATENT TEMPORAL REASONING IN MULTIMODAL LLMS USING SPECIALIZED TIME-GAP TOKENS AND SPATIOTEMPORAL SALIENCY PRUNING", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10516