Abstract

A workload-parameterized performance prediction framework is described for triaging embedding cache optimizations across a fleet of machine-learning models. Workload parameters including a distribution parameter (alpha), reuse, table size N, batch size, and embedding dimension are input to a closed-form causal-chain model that predicts unique indices, cache misses across multiple cache levels, persistent-store reads, and storage I/O operations. A predicted latency is computed as a weighted combination of predicted access counts and latency terms, and a bottleneck class is identified. Each optimization is associated with a targeted segment of the causal chain and a maximum efficiency, enabling estimation of predicted speedups subject to applicability conditions. Fleet-wide prioritization computes a fleet impact score by weighting predicted speedup by resource footprint (e.g., GPU count) and outputs a ranked list. Selective benchmarks provide actual speedups used to compute a median-based multiplicative correction factor for calibration of subsequent predictions.
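The latency, calibration, and ranking steps named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: every function name, dictionary key, and the exact form of each formula (latency as a sum of count times per-access latency, fleet impact as calibrated speedup times GPU count) is hypothetical and chosen only to match the abstract's wording. The closed-form model that derives access counts from alpha, reuse, N, batch size, and embedding dimension is not given in the abstract, so it is left out.

```python
from statistics import median


def predicted_latency(access_counts, latency_per_access):
    """Weighted combination: sum of predicted access count times per-access latency.

    Both arguments are dicts keyed by a hypothetical level name.
    """
    return sum(access_counts[k] * latency_per_access[k] for k in access_counts)


def bottleneck_class(access_counts, latency_per_access):
    """Return the level whose latency term is largest (assumed definition)."""
    return max(access_counts, key=lambda k: access_counts[k] * latency_per_access[k])


def correction_factor(predicted_speedups, actual_speedups):
    """Median of actual/predicted ratios from the selectively benchmarked models."""
    ratios = [a / p for p, a in zip(predicted_speedups, actual_speedups)]
    return median(ratios)


def rank_fleet(models, factor=1.0):
    """Rank models by fleet impact = calibrated predicted speedup * GPU count."""
    scored = [(m["name"], m["predicted_speedup"] * factor * m["gpus"]) for m in models]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only; none come from the disclosure.
    counts = {"cache_l1_miss": 1000, "cache_l2_miss": 200, "store_read": 10}
    latency = {"cache_l1_miss": 1.0, "cache_l2_miss": 5.0, "store_read": 500.0}
    print(predicted_latency(counts, latency), bottleneck_class(counts, latency))

    factor = correction_factor([2.0, 1.5, 4.0], [1.8, 1.5, 3.0])
    fleet = [
        {"name": "model_a", "predicted_speedup": 2.0, "gpus": 8},
        {"name": "model_b", "predicted_speedup": 1.2, "gpus": 64},
    ]
    print(rank_fleet(fleet, factor))
```

In this sketch a model with a modest predicted speedup but a large GPU footprint can outrank a model with a larger speedup on few GPUs, which is the trade-off that footprint weighting is meant to expose. Using the median rather than the mean for the correction factor keeps a single outlying benchmark from dominating the calibration.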

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Anonymous, "Workload-Parameterized Performance Prediction Framework for Embedding Cache Optimization Triage", Technical Disclosure Commons, (June 30, 2026)
https://www.tdcommons.org/dpubs_series/10705

Technical Disclosure Commons

Defensive Publications Series

Workload-Parameterized Performance Prediction Framework for Embedding Cache Optimization Triage
