Abstract
A workload-parameterized performance prediction framework is described for triaging embedding cache optimizations across a fleet of machine-learning models. Workload parameters including a distribution parameter (alpha), reuse, table size N, batch size, and embedding dimension are input to a closed-form causal-chain model that predicts unique indices, cache misses across multiple cache levels, persistent-store reads, and storage I/O operations. A predicted latency is computed as a weighted combination of predicted access counts and latency terms, and a bottleneck class is identified. Each optimization is associated with a targeted segment of the causal chain and a maximum efficiency, enabling estimation of predicted speedups subject to applicability conditions. Fleet-wide prioritization computes a fleet impact score by weighting predicted speedup by resource footprint (e.g., GPU count) and outputs a ranked list. Selective benchmarks provide actual speedups used to compute a median-based multiplicative correction factor for calibration of subsequent predictions.
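The prioritization and calibration steps named in the abstract can be illustrated with a minimal sketch. It assumes the simplest reading of the text: the fleet impact score is the calibrated predicted speedup multiplied by GPU count, and the correction factor is the median of actual-to-predicted speedup ratios from the benchmarked cases. The class `Candidate`, the function names, and all numbers are hypothetical and do not come from the disclosure; the closed-form causal-chain model itself is not reproduced here.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Candidate:
    """One (model, optimization) pair; all field names are hypothetical."""
    model: str
    optimization: str
    predicted_speedup: float  # 1.25 means the model predicts a 1.25x speedup
    gpu_count: int            # resource footprint used as the fleet weight


def correction_factor(benchmarks):
    """Median of actual/predicted speedup ratios over benchmarked cases.

    `benchmarks` is a list of (predicted_speedup, actual_speedup) pairs.
    """
    return median(actual / predicted for predicted, actual in benchmarks)


def fleet_impact(candidate, factor=1.0):
    """Assumed score: calibrated predicted speedup times GPU count."""
    return candidate.predicted_speedup * factor * candidate.gpu_count


def rank_candidates(candidates, factor=1.0):
    """Return candidates sorted by descending fleet impact score."""
    return sorted(candidates, key=lambda c: fleet_impact(c, factor), reverse=True)


# Illustrative numbers only; none come from the disclosure.
benchmarks = [(1.5, 1.2), (2.0, 1.8), (1.2, 1.2)]
factor = correction_factor(benchmarks)

candidates = [
    Candidate("model_a", "opt_x", 2.0, 100),
    Candidate("model_b", "opt_y", 1.1, 1000),
    Candidate("model_c", "opt_x", 1.5, 10),
]
for c in rank_candidates(candidates, factor):
    print(c.model, c.optimization, round(fleet_impact(c, factor), 1))
```

A median is used for the correction factor because a single outlying benchmark then shifts the calibration less than it would with a mean. A variant could weight the speedup gain (speedup minus one) rather than the raw speedup, so that an optimization with no predicted benefit scores zero; the abstract does not state which form is intended.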
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Workload-Parameterized Performance Prediction Framework for Embedding Cache Optimization Triage", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10705