Abstract
Large language models (LLMs) generate output tokens auto-regressively, which limits the rate of token generation and can lead to underutilized hardware and higher response latency. Speculative decoding techniques guess a sequence of candidate future tokens to improve hardware utilization and reduce overall latency. However, when speculative tokens are generated by a drafter model that is a lightweight approximation of the primary LLM, the drafter itself can become a computational bottleneck, consuming non-trivial time and/or underutilizing hardware. This disclosure describes a multi-tier chain of smaller drafters in which one or more sub-drafters can be heuristic drafters. This can improve hardware utilization and reduce latency.
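To make the tiered-drafter idea concrete, the following is a minimal, self-contained Python sketch (not taken from the disclosure): a near-free heuristic n-gram drafter (tier 0) proposes tokens first, a stand-in "small model" drafter (tier 1) fills any remaining draft slots, and the primary model verifies the chained draft. Greedy (argmax) decoding is assumed so that verification reduces to an exact-match prefix check; all names (NGramDrafter, small_model_draft, primary_next, speculative_step) and the toy "models" are hypothetical placeholders.

```python
# Hypothetical sketch of a two-tier chained drafter for speculative decoding.
# Assumes greedy decoding: a drafted token is accepted iff it matches the
# primary model's own next-token choice. The toy "models" are stand-ins.

class NGramDrafter:
    """Tier 0: heuristic drafter that replays n-gram continuations seen so far."""
    def __init__(self, n=2):
        self.n = n
        self.table = {}

    def observe(self, tokens):
        # Record each n-gram and the token that followed it.
        for i in range(len(tokens) - self.n):
            self.table[tuple(tokens[i:i + self.n])] = tokens[i + self.n]

    def draft(self, context, k):
        # Greedily chain lookups; stop as soon as the context is unseen.
        out, drafted = list(context), []
        for _ in range(k):
            nxt = self.table.get(tuple(out[-self.n:]))
            if nxt is None:
                break
            drafted.append(nxt)
            out.append(nxt)
        return drafted

def small_model_draft(context, k):
    """Tier 1: stand-in for a lightweight neural drafter (toy +1 rule)."""
    out = list(context)
    for _ in range(k):
        out.append(out[-1] + 1)
    return out[len(context):]

def primary_next(context):
    """Stand-in for one greedy decoding step of the primary LLM."""
    return (context[-1] + 1) % 5  # toy deterministic "model"

def speculative_step(context, k, ngram):
    # Tier 0 drafts as many tokens as it can at near-zero cost...
    drafted = ngram.draft(context, k)
    # ...and tier 1 (the small model) fills the remaining slots.
    if len(drafted) < k:
        drafted += small_model_draft(context + drafted, k - len(drafted))
    # The primary model verifies the chained draft: accept the longest
    # prefix that matches its own greedy choices.
    accepted = []
    for tok in drafted:
        if primary_next(context + accepted) != tok:
            break
        accepted.append(tok)
    # Always emit at least one token from the primary model itself.
    accepted.append(primary_next(context + accepted))
    return accepted

if __name__ == "__main__":
    drafter = NGramDrafter(n=2)
    context = [1, 2, 3, 4]
    drafter.observe(context)
    for _ in range(3):
        context += speculative_step(context, k=4, ngram=drafter)
        drafter.observe(context)
    print(context)
```

In this toy run, the heuristic tier initially has no matching n-grams, so the small-model tier supplies the draft; once the context becomes repetitive, the n-gram table takes over and drafts full blocks without invoking the small model at all, illustrating how the cheaper tier can offload the more expensive drafter.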
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Ro, Jae Hun; Aharoni, Asaf; Suresh, Ananda Theertha; and Butler, Michael, "Improving LLM Speculative Decoding Drafting Using Multi-Tier Chained Drafter," Technical Disclosure Commons (November 13, 2025).
https://www.tdcommons.org/dpubs_series/8864