Improving LLM Speculative Decoding Drafting Using a Multi-Tier Chained Drafter

Jae Hun Ro
Asaf Aharoni
Ananda Theertha Suresh
Michael Butler

Abstract

Large language models (LLMs) generate output tokens auto-regressively, which constrains the rate of token generation and can lead to underutilized hardware and higher response latency. Speculative decoding techniques guess a sequence of candidate future tokens that the primary model can verify in parallel, improving hardware utilization and reducing overall latency. However, when the speculative tokens are generated by a drafter model that is a lightweight approximation of the primary LLM, the drafter can itself become a computational bottleneck, consuming non-trivial time and/or underutilizing hardware. This disclosure describes the use of tiered, smaller drafters, where one or more sub-drafters can be heuristic drafters. This can improve hardware utilization and reduce latency.
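To make the idea concrete, the sketch below chains a cheap heuristic drafter (a bigram lookup table) with a lightweight drafter-model fallback inside a greedy speculative decoding loop. All names, the bigram table, and the toy deterministic "models" are illustrative assumptions, not part of the disclosure; a real system would use an actual small LM and probabilistic verification.

```python
def target_model(context):
    # Toy stand-in for the primary LLM's greedy next-token choice.
    return context[-1] + 1

def small_drafter(context):
    # Toy stand-in for a lightweight drafter model; deliberately wrong
    # whenever the last token is a multiple of 5, to exercise rejection.
    last = context[-1]
    return 0 if last % 5 == 0 else last + 1

# Tier 1: a heuristic (model-free) sub-drafter backed by a tiny bigram table.
NGRAM_TABLE = {(1, 2): 3, (2, 3): 4, (3, 4): 5}

def heuristic_draft(context, k):
    """Draft up to k tokens by chained bigram lookup; None on an initial miss."""
    drafted = []
    ctx = (context[-2], context[-1])
    while len(drafted) < k:
        nxt = NGRAM_TABLE.get(ctx)
        if nxt is None:
            break
        drafted.append(nxt)
        ctx = (ctx[1], nxt)
    return drafted or None

def model_draft(context, k):
    """Tier 2 fallback: run the lightweight drafter model auto-regressively."""
    ctx = list(context)
    for _ in range(k):
        ctx.append(small_drafter(ctx))
    return ctx[len(context):]

def tiered_draft(context, k):
    """Chain the tiers: try the cheap heuristic first, drafter model on a miss."""
    return heuristic_draft(context, k) or model_draft(context, k)

def speculative_decode(prompt, k=4, max_tokens=8):
    """Greedy speculative decoding driven by the tiered drafter above."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        draft = tiered_draft(out, k)
        accepted = 0
        for tok in draft:
            if target_model(out) == tok:  # verification against the primary model
                out.append(tok)
                accepted += 1
            else:
                break
        if accepted < len(draft):
            # On rejection, emit the target model's own token so every
            # iteration of the loop still makes progress.
            out.append(target_model(out))
    return out[len(prompt):len(prompt) + max_tokens]
```

Because the first tier is just a table lookup, it costs almost nothing when it hits, and the drafter model runs only on a miss; this is one way the chained arrangement can reduce the drafting bottleneck the abstract describes.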