Abstract
A multimodal annotator is proposed herein that augments ordinary text meeting transcripts with multimodal signals, including slide presentations, facial expressions, speech tones, sentiment analysis, and join/leave metadata. The multimodal annotator may enable real-time multimodal Large Language Model (LLM) applications and can be implemented as a drop-in replacement for the text-transcript generation used by existing LLM applications. Thus, the multimodal annotator as proposed herein can enable more advanced, multimodal artificial intelligence (AI) assistant features across a variety of use cases, such as video conferences, call centers, presentation video recordings, or the like.
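To make the "drop-in replacement" notion concrete, the following is a minimal sketch, not the disclosed implementation, of how a multimodal annotator could merge a transcript stream with multimodal signal streams while still emitting the plain-text lines an existing transcript-based LLM application expects. All names (`AnnotatedUtterance`, `annotate_stream`, the signal lookups) are hypothetical illustrations, not from the disclosure.

```python
# Hypothetical sketch: enrich transcript lines with multimodal tags while
# preserving a plain-text serialization for existing LLM pipelines.
from dataclasses import dataclass, field


@dataclass
class AnnotatedUtterance:
    """One transcript line plus optional multimodal annotations."""
    timestamp: float          # seconds from meeting start
    speaker: str
    text: str                 # ordinary transcript text
    annotations: dict = field(default_factory=dict)

    def to_prompt_line(self) -> str:
        """Serialize to a plain-text line an LLM prompt can consume,
        so existing transcript-based applications keep working."""
        tags = " ".join(f"[{k}={v}]" for k, v in self.annotations.items())
        return f"{self.speaker}: {self.text} {tags}".strip()


def annotate_stream(transcript_events, signal_sources):
    """Merge a transcript stream with multimodal signal streams.

    `transcript_events` yields (timestamp, speaker, text) tuples;
    `signal_sources` maps a signal name (e.g. 'sentiment', 'tone',
    'slide', 'presence') to a timestamp -> value lookup. Both inputs
    are stand-ins for real capture pipelines.
    """
    for ts, speaker, text in transcript_events:
        annotations = {
            name: lookup(ts)
            for name, lookup in signal_sources.items()
            if lookup(ts) is not None
        }
        yield AnnotatedUtterance(ts, speaker, text, annotations)


if __name__ == "__main__":
    events = [(3.2, "Alice", "Let's move to the roadmap slide.")]
    signals = {
        "sentiment": lambda ts: "positive",
        "slide": lambda ts: "Q3 Roadmap",
        "presence": lambda ts: "Bob joined",
    }
    for utt in annotate_stream(events, signals):
        # Prints: Alice: Let's move to the roadmap slide.
        #         [sentiment=positive] [slide=Q3 Roadmap] [presence=Bob joined]
        print(utt.to_prompt_line())
```

Because the serialized output is still a line of text per utterance, a downstream LLM application that previously consumed raw transcripts could ingest the annotated stream unchanged, which is the sense in which the annotator acts as a drop-in replacement.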
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Hsiao, Yu-Chung, "A UNIFIED MULTIMODAL VIDEO CONFERENCE ANNOTATOR FOR REAL-TIME LLM APPLICATIONS", Technical Disclosure Commons, (November 06, 2024)
https://www.tdcommons.org/dpubs_series/7507