Abstract

Generative models that employ a sequential internal reasoning process with a fixed computational budget can incur unnecessary latency on simple prompts or produce lower-quality outputs on complex ones. A system for parallel speculative decoding with variable budgets can analyze an input prompt to predict a distribution of potential computational budgets and then initiate multiple decoding streams in parallel, each conditioned on a different budget drawn from that distribution. An early-termination mechanism can monitor the parallel streams and select a response from a stream that finishes generating, without waiting for more computationally intensive streams to complete. This technique manages the trade-off between response latency and output quality by concurrently exploring multiple reasoning paths of varying computational cost and dynamically allocating resources according to task complexity.
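
As a rough illustration of the idea, the Python sketch below runs several budget-conditioned decoding streams concurrently and returns the first one to finish. The functions `predict_budget_distribution` and `generate_with_budget` are hypothetical placeholders standing in for the budget predictor and the model call; they are not part of the original disclosure.

```python
# Minimal sketch of parallel speculative decoding with variable budgets.
# The budget predictor and model call below are hypothetical stand-ins.
import concurrent.futures
import time


def predict_budget_distribution(prompt: str) -> list[int]:
    """Hypothetical predictor: map a prompt to a few candidate
    reasoning budgets (small, medium, large)."""
    base = min(len(prompt.split()) * 8, 256)
    return [base, base * 2, base * 4]


def generate_with_budget(prompt: str, budget: int) -> str:
    """Stand-in for a model call that reasons for up to `budget` internal
    steps; here it just sleeps so larger budgets take longer."""
    time.sleep(budget / 1000.0)
    return f"answer(prompt={prompt!r}, budget={budget})"


def speculative_decode(prompt: str) -> str:
    budgets = predict_budget_distribution(prompt)
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(budgets))
    futures = [pool.submit(generate_with_budget, prompt, b) for b in budgets]
    # Early termination: take the first stream that completes and do not
    # block on the more computationally intensive streams.
    done, _pending = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED
    )
    result = next(iter(done)).result()
    # Cancel streams that have not started and shut down without waiting
    # for slower streams (already-running threads finish in the background).
    pool.shutdown(wait=False, cancel_futures=True)
    return result


if __name__ == "__main__":
    print(speculative_decode("Summarize the trade-off between latency and quality."))
```

A production system would presumably stream tokens and apply a quality or acceptance check before committing to the earliest response, rather than returning it unconditionally as this sketch does.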

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
