Abstract

Organizations in regulated or disconnected environments often cannot rely on remote, high-performance generative models because of data residency or connectivity constraints, while local models can exhibit a performance gap on complex tasks. A hybrid architecture can address this by augmenting a local student generative model with a dynamic distillation cache. The system can capture both final answers and the underlying reasoning patterns produced by a remote teacher model. When a new query arrives, the system can use semantic similarity to retrieve relevant cached reasoning and supply it as context to the local student model. This form of in-context distillation may allow the local model's performance to improve over time based on live usage. The approach can help bridge capability gaps while supporting operational resilience and data sovereignty requirements.
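
The Python sketch below illustrates one possible shape of the cache-and-retrieve loop described above. It is a minimal sketch, not the disclosed implementation: the sentence-transformers embedding model, the similarity threshold, and the prompt format are illustrative assumptions, and the teacher/student model calls are left to the caller.

    import numpy as np
    from sentence_transformers import SentenceTransformer  # assumed embedding backend

    class DistillationCache:
        """Stores teacher reasoning traces and retrieves them by semantic similarity."""

        def __init__(self, model_name: str = "all-MiniLM-L6-v2", threshold: float = 0.75):
            self.encoder = SentenceTransformer(model_name)
            # Each entry: (query embedding, query text, teacher reasoning, teacher answer)
            self.entries: list[tuple[np.ndarray, str, str, str]] = []
            self.threshold = threshold  # illustrative cutoff for a "relevant" hit

        def add(self, query: str, reasoning: str, answer: str) -> None:
            # Normalized embeddings make the dot product equal to cosine similarity.
            emb = self.encoder.encode(query, normalize_embeddings=True)
            self.entries.append((emb, query, reasoning, answer))

        def retrieve(self, query: str, k: int = 3) -> list[tuple[str, str, str]]:
            # Rank cached traces by cosine similarity to the incoming query.
            q = self.encoder.encode(query, normalize_embeddings=True)
            scored = [(float(np.dot(q, emb)), qry, r, a) for emb, qry, r, a in self.entries]
            scored.sort(key=lambda t: t[0], reverse=True)
            return [(qry, r, a) for s, qry, r, a in scored[:k] if s >= self.threshold]

    def build_student_prompt(query: str, examples: list[tuple[str, str, str]]) -> str:
        # Retrieved teacher traces become few-shot context for the local student model.
        blocks = [f"Question: {q}\nReasoning: {r}\nAnswer: {a}" for q, r, a in examples]
        return "\n\n".join(blocks + [f"Question: {query}\nReasoning:"])

In use, a cache hit keeps the query entirely local; on a miss, the query could fall through to the remote teacher when connectivity and policy permit, with the resulting reasoning trace added back to the cache so later similar queries are served locally.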

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
