LLMCompiler - Towards parallel function calling
a huge step forward for using LLMs in practical applications
Read the original paper here.
Large Language Models (LLMs) have several inherent limitations, such as knowledge cutoffs, poor arithmetic skills, and lack of access to private data.
To overcome these limitations, we use techniques like Retrieval Augmented Generation (RAG), which queries a vector or structured database to add contextually relevant results to the prompt, and prompt engineering techniques such as zero-shot or few-shot prompting and Chain/Tree-of-Thought to enhance the LLM's reasoning capability.
Innovations like Toolformer and ReAct have improved LLMs by letting them use external functions for complex problem-solving. The ability of LLMs to integrate various tools and function calls could enable a fundamental shift in how we develop LLM-based software. There are several challenges with the current approach (specifically ReAct):
Accuracy: Concatenating intermediate observations can affect the execution flow of the LLM, potentially reducing accuracy.
Serial Execution: It is not possible to run multiple tools in parallel.
Reliability: Intermediate results can affect the LLM's ability to keep track of the task.
Testability: It is hard to create unit tests for specific paths of the code.
Long-Term Planning: Current LLMs are not good at long-term planning.
Debugging: It requires manually reading intermediate thoughts/observations and reasoning about why the LLM produced a wrong result.
Fault Tolerance: It is hard to recover from incorrect LLM decisions (there is no replanning).
LLMs do, however, excel at simple function calling when only a few calls are involved.
💡 What if we could divide the problem into more manageable function calls?
This mirrors what skilled programmers do with large-scale code:
Break the code into smaller, manageable pieces for easier reasoning, debugging, and testing.
Write controller logic that correctly calls these pieces, incorporating try/except for error handling.
This is exactly the approach proposed by the authors of a new framework called LLMCompiler.
LLMCompiler is the first framework to optimize the orchestration of LLM function calling. It improves not only latency and cost but also accuracy, by minimizing interference from the outputs of intermediate function calls and by optimizing the parallel function-calling performance of LLMs.
At a high level, this is achieved by introducing three key components:
an LLM Planner that identifies an execution flow;
a Task Fetching Unit that dispatches the function calls in parallel;
an Executor that executes the dispatched tasks using the associated functions.
Overview of LLMCompiler
LLM Planner
The LLM Planner generates a sequence of tasks and their dependencies, forming a directed acyclic graph (DAG). It identifies each task, its inputs, and its inter-dependencies; if a task depends on the output of a previous one, it uses a placeholder variable that is later replaced with the actual output of that task.
The Planner leverages the LLM's reasoning capability to decompose a natural-language request into this dependency graph, guided by a pre-defined prompt that explains how to construct the graph and ensures correct syntax.
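To make this concrete, here is a minimal, hypothetical sketch of the kind of dependency graph the Planner could produce for a question comparing Microsoft's and Apple's market caps. The dictionary format and tool names below are illustrative assumptions, not LLMCompiler's actual planner syntax:

```python
# Hypothetical plan for: "How much larger is Microsoft's market cap than Apple's?"
# Tasks 1 and 2 have no dependencies and can run in parallel; task 3 depends on both.
plan = [
    {"id": 1, "tool": "search", "args": ["Microsoft market cap"], "deps": []},
    {"id": 2, "tool": "search", "args": ["Apple market cap"], "deps": []},
    # "$1" and "$2" are placeholders for the outputs of tasks 1 and 2.
    {"id": 3, "tool": "math", "args": ["$1 - $2"], "deps": [1, 2]},
]
```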
Task Fetching Unit
The Task Fetching Unit dispatches tasks to the Executor as soon as they are ready for (parallel) execution, based on a greedy policy. It replaces the placeholder variables set by the Planner with the actual outputs of preceding tasks. In the example above, the variables $1 and $2 in task 3 would be replaced with the actual market caps of Microsoft and Apple once the search tasks complete.
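A rough sketch of how the greedy readiness check and the placeholder substitution might look, continuing the hypothetical plan format above (the function names are my own, not the framework's):

```python
import re

def ready_tasks(plan, results):
    """Return tasks that have not run yet and whose dependencies have all completed."""
    return [t for t in plan
            if t["id"] not in results and all(d in results for d in t["deps"])]

def resolve_args(args, results):
    """Replace $N placeholders with the actual outputs of completed tasks."""
    return [re.sub(r"\$(\d+)", lambda m: str(results[int(m.group(1))]), a)
            for a in args]
```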
Executor
The Executor asynchronously and concurrently executes tasks fetched from the Task Fetching Unit. The Executor is equipped with user-provided tools, and it delegates each task to the associated tool. These tools can be simple functions like a calculator, Wikipedia search, or API calls, or they can even be LLM agents tailored for a specific task. Each task has dedicated memory to store its intermediate outcomes, similar to how typical sequential frameworks aggregate observations into a single prompt. When a task completes, its results are forwarded as input to the tasks that depend on it.
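Continuing the same sketch, the Executor can be modelled with asyncio: every ready task is dispatched concurrently, each task writes its outcome into its own slot, and those outcomes feed the tasks that depend on them. The tool interface below (async callables keyed by name) is an assumption for illustration:

```python
import asyncio

async def run_task(task, tools, results):
    """Resolve placeholders, delegate to the associated user-provided tool, store the outcome."""
    tool = tools[task["tool"]]
    args = resolve_args(task["args"], results)   # from the Task Fetching Unit sketch above
    results[task["id"]] = await tool(*args)      # dedicated memory slot for this task

async def execute(plan, tools):
    """Run the whole plan, dispatching every ready task in parallel until all are done."""
    results = {}
    while len(results) < len(plan):              # the plan is assumed to be acyclic
        batch = ready_tasks(plan, results)       # greedy policy: everything ready runs now
        await asyncio.gather(*(run_task(t, tools, results) for t in batch))
    return results
```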
Dynamic Replanning
The execution graph may need to adapt based on intermediate results that cannot be known in advance. A similar analogy in programming is branching, where the path of execution is determined only at runtime, depending on which branch conditions are satisfied. Such dynamic execution patterns can also appear with LLM function calling. When replanning, the Executor sends the intermediate results back to the LLM Planner, which produces a new set of tasks with their associated dependencies and dispatches them to the Task Fetching Unit and then to the Executor. This process repeats until the final result is achieved.
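A hypothetical sketch of that replanning loop, reusing the execute helper from the Executor sketch above; planner here stands in for the LLM Planner and is assumed to return a plan plus a flag indicating whether another planning round is needed:

```python
async def solve(question, planner, tools, max_rounds=5):
    """Plan, execute, and replan until the Planner signals that the result is final."""
    context = question
    for _ in range(max_rounds):
        plan, needs_replan = planner(context)      # LLM Planner call (assumed interface)
        results = await execute(plan, tools)       # Executor sketch from above
        if not needs_replan:
            return results
        # Feed intermediate results back to the Planner for the next round.
        context = f"{question}\nIntermediate results so far: {results}"
    return results
```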
How to use it?
In LLMCompiler, users only need to supply:
Tool Definitions: Names of the tools, their types, and arguments, similar to other frameworks (ReAct, OpenAI function calling, etc.).
In-context examples for the Planner: Examples of how the Planner should behave, which help the LLM Planner generate the appropriate dependency graph in the correct format for incoming inputs.
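A hypothetical sketch of these two user-supplied pieces; the structure and field names are assumptions for illustration and do not mirror LLMCompiler's actual API:

```python
# Tool definitions: a name, a description, and an argument signature for each tool.
tools = {
    "search": {"description": "Search the web and return a short factual answer.",
               "args": {"query": "str"}},
    "math":   {"description": "Evaluate an arithmetic expression.",
               "args": {"expression": "str"}},
}

# In-context example showing the Planner the expected dependency-graph format.
planner_example = """
Question: How much larger is Microsoft's market cap than Apple's?
1. search("Microsoft market cap")
2. search("Apple market cap")
3. math("$1 - $2")
"""
```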
Performance Results
LLMCompiler Latency improvements
LLMCompiler reduces latency significantly by avoiding sequential reasoning and function calls, recording speedups of up to 1.8x and 3.7x over ReAct, depending on the dataset.
Remarkably, LLMCompiler also outpaces OpenAI's parallel function calling by up to 35%. This improvement may stem from reduced overhead in validating function names and arguments behind the scenes.
LLMCompiler Accuracy improvements
LLMCompiler outperforms ReAct in accuracy. ReAct's drawbacks, such as redundant function calls and premature termination, are mitigated in LLMCompiler through pre-planned execution and by minimizing the disruption from intermediate results during each reasoning-action cycle.
LLMCompiler matches the accuracy of OpenAI's parallel function calling.
Conclusion
The LLMCompiler framework represents a significant advance in leveraging Large Language Models (LLMs) for effective problem-solving, addressing inherent challenges like knowledge cutoffs, arithmetic limitations, and restricted data access. By optimizing LLM function calls for improved latency, cost, and accuracy, LLMCompiler mirrors skilled programming practice: it breaks complex tasks into manageable units and employs dynamic replanning. It not only significantly enhances the practical utility of LLMs, but also opens up new horizons for the future of AI and machine-learning systems, promising software that is more powerful, adaptable, and capable of addressing complex challenges.