Abstract
Large language models (LLMs) generate output tokens auto-regressively, which limits the rate of token generation and can lead to underutilized hardware and higher response latency. Speculative decoding techniques guess a sequence of candidate future tokens to improve hardware utilization and reduce overall latency. However, when speculative tokens are generated by a drafter model that is a lightweight approximation of the primary LLM, the drafter itself can become a computational bottleneck, consuming non-trivial time and/or underutilizing hardware. This disclosure describes a multi-tier chain of smaller drafters in which one or more sub-drafters can be heuristic drafters. This can improve hardware utilization and reduce latency.
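To make the tiered-drafter idea concrete, the following is a minimal, self-contained Python sketch (not taken from the disclosure): a near-free heuristic n-gram drafter (tier 0) proposes tokens first, a stand-in "small model" drafter (tier 1) fills any remaining draft slots, and the primary model verifies the chained draft. Greedy (argmax) decoding is assumed so that verification reduces to an exact-match prefix check; all names (NGramDrafter, small_model_draft, primary_next, speculative_step) and the toy "models" are hypothetical placeholders.

```python
# Hypothetical sketch of a two-tier chained drafter for speculative decoding.
# Assumes greedy decoding: a drafted token is accepted iff it matches the
# primary model's own next-token choice. The toy "models" are stand-ins.

class NGramDrafter:
    """Tier 0: heuristic drafter that replays n-gram continuations seen so far."""
    def __init__(self, n=2):
        self.n = n
        self.table = {}

    def observe(self, tokens):
        # Record each n-gram and the token that followed it.
        for i in range(len(tokens) - self.n):
            self.table[tuple(tokens[i:i + self.n])] = tokens[i + self.n]

    def draft(self, context, k):
        # Greedily chain lookups; stop as soon as the context is unseen.
        out, drafted = list(context), []
        for _ in range(k):
            nxt = self.table.get(tuple(out[-self.n:]))
            if nxt is None:
                break
            drafted.append(nxt)
            out.append(nxt)
        return drafted

def small_model_draft(context, k):
    """Tier 1: stand-in for a lightweight neural drafter (toy +1 rule)."""
    out = list(context)
    for _ in range(k):
        out.append(out[-1] + 1)
    return out[len(context):]

def primary_next(context):
    """Stand-in for one greedy decoding step of the primary LLM."""
    return (context[-1] + 1) % 5  # toy deterministic "model"

def speculative_step(context, k, ngram):
    # Tier 0 drafts as many tokens as it can at near-zero cost...
    drafted = ngram.draft(context, k)
    # ...and tier 1 (the small model) fills the remaining slots.
    if len(drafted) < k:
        drafted += small_model_draft(context + drafted, k - len(drafted))
    # The primary model verifies the chained draft: accept the longest
    # prefix that matches its own greedy choices.
    accepted = []
    for tok in drafted:
        if primary_next(context + accepted) != tok:
            break
        accepted.append(tok)
    # Always emit at least one token from the primary model itself.
    accepted.append(primary_next(context + accepted))
    return accepted

if __name__ == "__main__":
    drafter = NGramDrafter(n=2)
    context = [1, 2, 3, 4]
    drafter.observe(context)
    for _ in range(3):
        context += speculative_step(context, k=4, ngram=drafter)
        drafter.observe(context)
    print(context)
```

In this toy run, the heuristic tier initially has no matching n-grams, so the small-model tier supplies the draft; once the context becomes repetitive, the n-gram table takes over and drafts full blocks without invoking the small model at all, illustrating how the cheaper tier can offload the more expensive drafter.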
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Ro, Jae Hun; Aharoni, Asaf; Suresh, Ananda Theertha; and Butler, Michael, "Improving LLM Speculative Decoding Drafting Using Multi-Tier Chained Drafter," Technical Disclosure Commons (November 13, 2025).
https://www.tdcommons.org/dpubs_series/8864