Abstract
Tool calls in large language models are traditionally issued only after all arguments have been fully decoded. This sequential process increases latency, particularly for tools with multiple parameters where some arguments are not needed for the initial execution steps. This disclosure describes a method for the asynchronous execution of tool calls via partial argument decoding. Tools are decomposed into sub-tasks to determine the order and timing at which each argument is required. The model is trained to generate arguments in this optimized sequence. During inference, tool execution is initiated as soon as the first required arguments are decoded, and subsequent arguments are decoded in parallel while the tool operation proceeds. This approach reduces overall latency by overlapping argument decoding with tool execution.
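The overlap described above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the tool (`fetch_and_filter`), its sub-tasks (`fetch`, `filter_text`), the argument names (`url`, `keyword`), and the simulated decoding delays are all hypothetical. The key idea shown is that the first sub-task starts as soon as its argument is decoded, while the remaining argument is still being decoded.

```python
import asyncio

async def decode_arguments():
    """Simulates the LLM decoding tool-call arguments one at a time,
    in the order the sub-tasks require them (hypothetical timings)."""
    for name, value, delay in [("url", "https://example.com/doc", 0.01),
                               ("keyword", "latency", 0.03)]:
        await asyncio.sleep(delay)  # simulated per-argument decoding time
        yield name, value

async def fetch(url):
    """First sub-task: needs only `url`; simulated network call."""
    await asyncio.sleep(0.02)
    return f"contents of {url} mentioning latency"

def filter_text(text, keyword):
    """Second sub-task: needs `keyword`, runs after the fetch completes."""
    return [line for line in text.split("\n") if keyword in line]

async def run_tool_call():
    """Starts the fetch sub-task as soon as `url` is decoded, overlapping
    it with the decoding of the remaining `keyword` argument."""
    args = {}
    fetch_task = None
    async for name, value in decode_arguments():
        args[name] = value
        if name == "url":
            # Begin tool execution before all arguments are available.
            fetch_task = asyncio.create_task(fetch(value))
    text = await fetch_task  # fetch overlapped with decoding of `keyword`
    return filter_text(text, args["keyword"])

result = asyncio.run(run_tool_call())
print(result)
```

In this sketch the fetch (0.02 s) completes while the second argument is still being decoded (0.03 s), so the total wall time approaches the decoding time alone rather than decoding plus execution.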
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Hartmann, Florian and Carbune, Victor, "Asynchronous Execution of Large Language Model Tool Calls via Partial Argument Decoding", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10000