Abstract

Providing real-time contextual action suggestions in communication environments can involve computational cost and latency associated with running large-scale artificial intelligence models. A system can address these considerations using a cascaded, multi-layered modeling architecture. The system may first process contextual data, such as live meeting transcripts, through a lightweight, computationally inexpensive classification model. If this model returns a prediction with a confidence score below a predetermined threshold, the request can be escalated to a more resource-intensive model, such as a large language model, for a more nuanced analysis. This gating mechanism can balance predictive accuracy with computational efficiency, which may reduce the operational cost and latency of deploying proactive assistance features at scale by selectively using the more resource-intensive model for ambiguous cases.
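The gating mechanism described above can be sketched as a simple cascade: run the lightweight model first, accept its prediction when its confidence clears the threshold, and escalate to the expensive model only otherwise. This is a minimal illustration, not the system's implementation; the function names (`classify_fast`, `classify_llm`, `suggest_action`), the keyword-based stand-in classifier, and the 0.8 threshold are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float


def classify_fast(transcript: str) -> Prediction:
    # Stand-in for the lightweight, inexpensive classifier: a keyword
    # match that reports high confidence only on a clear-cut signal.
    # (Hypothetical logic for illustration.)
    if "schedule" in transcript.lower():
        return Prediction("suggest_calendar_event", 0.92)
    return Prediction("no_action", 0.40)


def classify_llm(transcript: str) -> Prediction:
    # Stand-in for the resource-intensive model (e.g., a large language
    # model), invoked only when the cheap model is uncertain.
    return Prediction("suggest_follow_up_task", 0.75)


def suggest_action(transcript: str, threshold: float = 0.8) -> Prediction:
    """Cascaded gating: keep the cheap prediction when confident enough,
    otherwise escalate the ambiguous case to the expensive model."""
    fast = classify_fast(transcript)
    if fast.confidence >= threshold:
        return fast
    return classify_llm(transcript)


# A confident case stays with the cheap model; an ambiguous one escalates.
print(suggest_action("Let's schedule a sync tomorrow").label)
print(suggest_action("Hmm, not sure what to do next").label)
```

Because only low-confidence requests reach the expensive model, the average per-request cost tracks the cheap model's cost, while accuracy on ambiguous inputs tracks the expensive model's. Raising the threshold trades cost for accuracy by escalating more often.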

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
