Abstract

Generative models that employ a sequential internal reasoning process with a fixed computational budget can incur unnecessary latency on simple prompts or produce lower-quality outputs on complex ones. A system for parallel speculative decoding with variable budgets can analyze an input prompt to predict a distribution of potential computational budgets and then initiate multiple decoding streams in parallel, each conditioned on a different budget drawn from that distribution. An early-termination mechanism can monitor the parallel streams and select a response from a stream that finishes generating, without waiting for more computationally intensive streams to complete. This technique manages the trade-off between response latency and output quality by concurrently exploring multiple reasoning paths of varying computational cost and dynamically allocating resources according to task complexity.
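
As a rough illustration of the idea, the Python sketch below runs several budget-conditioned decoding streams concurrently and returns the first one to finish. The functions `predict_budget_distribution` and `generate_with_budget` are hypothetical placeholders standing in for the budget predictor and the model call; they are not part of the original disclosure.

```python
# Minimal sketch of parallel speculative decoding with variable budgets.
# The budget predictor and model call below are hypothetical stand-ins.
import concurrent.futures
import time


def predict_budget_distribution(prompt: str) -> list[int]:
    """Hypothetical predictor: map a prompt to a few candidate
    reasoning budgets (small, medium, large)."""
    base = min(len(prompt.split()) * 8, 256)
    return [base, base * 2, base * 4]


def generate_with_budget(prompt: str, budget: int) -> str:
    """Stand-in for a model call that reasons for up to `budget` internal
    steps; here it just sleeps so larger budgets take longer."""
    time.sleep(budget / 1000.0)
    return f"answer(prompt={prompt!r}, budget={budget})"


def speculative_decode(prompt: str) -> str:
    budgets = predict_budget_distribution(prompt)
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(budgets))
    futures = [pool.submit(generate_with_budget, prompt, b) for b in budgets]
    # Early termination: take the first stream that completes and do not
    # block on the more computationally intensive streams.
    done, _pending = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED
    )
    result = next(iter(done)).result()
    # Cancel streams that have not started and shut down without waiting
    # for slower streams (already-running threads finish in the background).
    pool.shutdown(wait=False, cancel_futures=True)
    return result


if __name__ == "__main__":
    print(speculative_decode("Summarize the trade-off between latency and quality."))
```

A production system would presumably stream tokens and apply a quality or acceptance check before committing to the earliest response, rather than returning it unconditionally as this sketch does.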

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
