From Function Calling to Agentic Reasoning: Evaluating Tool Use in Modern LLMs
Introduction
Tool calling has emerged as a foundational capability for enabling agentic behavior in Large Language Models (LLMs). By allowing models to interface with external tools and APIs, this functionality transforms LLMs from static knowledge engines into dynamic agents capable of reasoning, acting, and interacting with their environments. Building on our prior work in model reasoning, we explored whether recent reinforcement learning techniques - specifically GRPO (Group Relative Policy Optimisation) - could further enhance a model's ability to generate accurate and well-structured tool calls. To that end, we conducted a comparative study of the base Qwen3-0.6B model and its GRPO-trained counterpart.
Our results demonstrate a significant uplift in tool-calling performance, with a ~10% increase in accurate tool calling on our internal evaluation dataset. More critically, this places our Qwen3-0.6B-ToolGRPO model as the top performer within the 1B-parameter space on the Berkeley Function Calling Leaderboard (BFCL)'s Live subset. This achievement stands out when contrasted with the Hammer2.1-0.5B (FC) model, the previous SOTA in this segment. Hammer is a native function-calling (FC) model trained on an extensive corpus of dedicated function-calling datasets, which generally affords such models an inherent advantage. In contrast, our model is prompt-based, trained on roughly 500 synthetic examples where the reward mechanism incentivised correct and well-structured tool selection. Despite these differing training paradigms and data scales, our 0.6B model outperformed Hammer2.1-0.5B across nearly every comparable metric on public benchmarks. This outcome underscores the remarkable effectiveness of RL-based approaches in agentic frameworks.
Experimental Setup
We explored whether reinforcement learning (RL) can improve a model’s ability to select and correctly format tool calls in response to natural language prompts—a foundational skill for tool-augmented reasoning and agentic systems. Rather than aiming for full tool execution, we focused specifically on tool identification and call formatting. To isolate this capability, we designed a small, intuitive toolset of three domain-agnostic functions: extractNumbers, filterByKeyword, and sortByLength. By keeping the tool library simple and interpretable, we aimed to test whether a small model could confidently route inputs to the right tool—mirroring real-world agentic settings where the first step is often identifying the correct callable interface.
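To make the setup concrete, here is a minimal sketch of what these three tools might look like in Python. Only the tool names come from our experiment; the signatures and bodies below are illustrative assumptions.

```python
import re

def extractNumbers(text: str) -> list[int]:
    """Return every integer that appears in a free-form string."""
    return [int(n) for n in re.findall(r"-?\d+", text)]

def filterByKeyword(data: list[str], keyword: str) -> list[str]:
    """Keep only the items that contain the keyword (case-insensitive)."""
    return [item for item in data if keyword.lower() in item.lower()]

def sortByLength(data: list[str]) -> list[str]:
    """Sort strings from shortest to longest."""
    return sorted(data, key=len)
```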
Dataset
We constructed a synthetic dataset tailored to this problem, available on Hugging Face. It contains 600 examples, each of which consists of:
- A natural language prompt
- An answer column containing the correct tool call wrapped in <tool>[..]</tool> tags, following the reasoning generated by the model in <think>..</think> tags.
Out of the 600 examples, we reserved 120 prompts for evaluation, ensuring zero leakage between training and testing.
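For illustration, a single record might look like the hypothetical example below. The column layout and tag conventions follow the dataset description, while the specific prompt, reasoning, and call are invented for this sketch.

```python
example = {
    "prompt": "Pull out every number mentioned in: 'We sold 42 units in March and 57 in April.'",
    "answer": (
        "<think>The user wants the numeric values from the sentence, "
        "so extractNumbers is the right tool.</think>"
        "<tool>[extractNumbers(text=\"We sold 42 units in March and 57 in April.\")]</tool>"
    ),
}
```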
Reward Functions
Since correctness here spans both structure and semantic alignment, we built reward functions around format, intent, and tool identity:
- formatReward: 1.0 if both <think> and <tool> tags are present.
- accuracyReward: 1.0 if the predicted tool name and arguments match the ground truth.
- partialReward: 0.5 if the function name is correct but the structure or arguments are off.
These reward signals were combined and used as the reward for each generated completion in the GRPO loop.
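A minimal sketch of how these signals could be implemented and combined is shown below. The regular expressions, the exact matching rules, and the simple summation are our assumptions rather than the precise reward code we used.

```python
import re

# Patterns for the two required sections of a completion.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
TOOL_RE = re.compile(r"<tool>\[(.*?)\]</tool>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if both a <think> block and a <tool>[...] block are present."""
    return 1.0 if THINK_RE.search(completion) and TOOL_RE.search(completion) else 0.0

def accuracy_reward(completion: str, reference_call: str) -> float:
    """1.0 if the predicted call matches the ground-truth call exactly
    (after whitespace normalisation)."""
    match = TOOL_RE.search(completion)
    if match is None:
        return 0.0
    predicted = "".join(match.group(1).split())
    expected = "".join(reference_call.split())
    return 1.0 if predicted == expected else 0.0

def partial_reward(completion: str, reference_call: str) -> float:
    """0.5 if the function name is right but the full call does not match."""
    match = TOOL_RE.search(completion)
    if match is None:
        return 0.0
    predicted_name = match.group(1).split("(")[0].strip()
    expected_name = reference_call.split("(")[0].strip()
    if predicted_name == expected_name and accuracy_reward(completion, reference_call) == 0.0:
        return 0.5
    return 0.0

def combined_reward(completion: str, reference_call: str) -> float:
    """Sum of the individual signals; used as the per-completion reward in GRPO."""
    return (format_reward(completion)
            + accuracy_reward(completion, reference_call)
            + partial_reward(completion, reference_call))
```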
Training Details
We trained the base Qwen3-0.6B model as our policy using Group Relative Policy Optimisation (GRPO), which compares sampled completions within a prompt group and reinforces the relatively better ones.
Training was performed for a single epoch on a single T4 GPU; the run lasted about 5 hours and cost a little over $6, using the excellent library provided by the folks at Unsloth.
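For readers who want to reproduce the run, the overall shape of the training script with Unsloth and TRL's GRPOTrainer looks roughly like the sketch below. The hyperparameters, the data file name, and the reward adapter (which reuses TOOL_RE and combined_reward from the previous sketch) are illustrative assumptions, not our exact configuration.

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Load the 0.6B base model through Unsloth and attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-0.6B",
    max_seq_length=1024,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Hypothetical local copy of the synthetic training split.
dataset = load_dataset("json", data_files="tool_calls_train.json", split="train")

# Adapt the combined reward to TRL's interface: extra dataset columns
# (here, "answer") are passed to reward functions as keyword arguments.
def grpo_reward(completions, answer, **kwargs):
    refs = [TOOL_RE.search(a).group(1) for a in answer]
    return [combined_reward(c, ref) for c, ref in zip(completions, refs)]

training_args = GRPOConfig(
    output_dir="qwen3-0.6b-toolgrpo",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    num_generations=4,          # completions sampled per prompt group
    max_prompt_length=512,
    max_completion_length=256,
    learning_rate=5e-6,
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[grpo_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

Here num_generations sets how many completions form each prompt group that GRPO ranks against one another.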
Results And Observations
After training the Qwen3-0.6B model with GRPO for a single epoch on our synthetic tool-calling dataset, we evaluated both the base model and the GRPO-trained variant on the held-out 120-example test set.
Evaluation Criteria
The evaluation focused solely on the model’s ability to produce a correctly formatted function call, given a natural language instruction. In particular, the model had to:
- Identify the correct tool and pass the right arguments.
- Wrap the tool call within <tool>..</tool> tags.
In essence, the model was required to parse intent, map it to the appropriate tool, and invoke it with the correct arguments.
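Concretely, the per-example check could be implemented roughly as below. It reuses the hypothetical TOOL_RE pattern from the reward sketch and treats a whitespace-normalised exact match of the <tool> block as correct, which is our assumption about the scoring rule.

```python
def _normalise(s: str) -> str:
    return "".join(s.split())

def is_correct_tool_call(completion: str, reference_answer: str) -> bool:
    """True only if the completion wraps a <tool> block whose call matches
    the ground-truth call extracted from the reference answer."""
    predicted = TOOL_RE.search(completion)
    expected = TOOL_RE.search(reference_answer)
    if predicted is None or expected is None:
        return False
    return _normalise(predicted.group(1)) == _normalise(expected.group(1))

# Accuracy over the 120-example held-out set (preds/answers are hypothetical lists):
# accuracy = sum(is_correct_tool_call(p, a) for p, a in zip(preds, answers)) / len(answers)
```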
Performance Summary
[Table: tool-calling accuracy on the held-out test set for the base Qwen3-0.6B model and its GRPO-trained counterpart]
Despite the compact model size and a rather small dataset, GRPO yielded a ~10 point improvement - underscoring the potential of reinforcement learning in aligning model outputs with downstream protocols.
Common Failure Modes (Base Model)
While the base model often “understood” the prompt, it frequently failed to comply with structured tool-calling constraints. We observed two dominant failure modes:
- Missing <tool> tags
The model correctly selected the tool and its arguments, but failed to wrap the call in the required tags, for example:
- filterByKeyword(data=[...], keyword='apple') (missing <tool> tags)
- Truncated tool calls
A non-trivial number of completions were cut off midway, for example:
- <tool>extractNumbers(text = "There are 15 cats, 8) (incomplete completion)
These errors are especially interesting because they reflect execution fragility rather than a misunderstanding of the task. This is where GRPO stepped in: by rewarding structure-aware completions, it improved output robustness.
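As a rough illustration of how such failures can be flagged automatically (assuming the same tag format as above), a completion can be bucketed with a simple heuristic; the categories and checks here are ours, not part of the original analysis.

```python
def classify_failure(completion: str) -> str:
    """Bucket a completion into the failure modes described above."""
    if "<tool>" in completion and "</tool>" not in completion:
        return "truncated tool call"      # tag opened but never closed
    if "<tool>" not in completion:
        return "missing <tool> tags"      # bare call without the wrapper
    return "well-formed"
```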
Reasoning Traces
All evaluations were run with thinking enabled, i.e., the models were expected to generate a <think>...</think> section before emitting a tool call.
- The base model’s thoughts were often logically sound, but inconsistent.
- The GRPO-trained model produced much more structured reasoning.
Validating On a Public Benchmark
While these results on our custom dataset are promising, the real test of an agentic model's robustness lies in its performance on diverse, unseen, real-world challenges. To validate our findings, we benchmarked our GRPO-trained model against the Live subset of the Berkeley Function Calling Leaderboard (BFCL) - a benchmark comprising real-world, crowd-sourced user questions that often involve complex, multi-step reasoning.
Our model demonstrated superior performance against the Hammer model across nearly every category of the BFCL Live test. This confirms that the improvements from GRPO generalise well beyond our synthetic data, which had no overlap with this evaluation set in terms of complexity or criteria.
[Chart: per-category results on the BFCL Live subset]
As the chart above shows, the GRPO-enhanced model's scores relative to the prompt-based Qwen3 base model were:
- +0.84% Overall (on the Live subset, an unweighted average of all the different categories)
- +3.49% on Simple tasks (a single tool call)
- -7.97% on Multiple tasks (selecting the right tool from several options)
- +6.25% on Parallel tasks (calling the same tool multiple times)
- +4.17% on Parallel Multiple tasks (the most complex scenario, requiring a mix of tool selection and multiple calls)
As the data illustrates, our GRPO-enhanced model outperformed Hammer2.1-0.5B (FC) in overall accuracy, simple tasks, parallel tasks, and the highly complex parallel multiple tasks. While Hammer showed stronger performance in the 'multiple' category, our model's broad superiority across the majority of critical benchmarks - especially given its distinct training methodology (prompt-based with limited synthetic data versus native FC with a massive corpus) - provides compelling evidence that the gains observed on our synthetic dataset are not isolated successes. This firmly supports our hypothesis that GRPO genuinely enhances a model's core ability to reason about tool selection, a foundational skill for any advanced agentic system.
Road Ahead
What’s especially exciting is that this experiment naturally connects to several active research directions across academia. For instance: Given a homogeneous toolset, can models learn to prioritize tools based on past success rates or task complexity? In a large and diverse tool zoo, can they intelligently prune their options rather than relying on excessive external calls – a behavior known as cognitive offloading? And as workflows grow more complex, how can future reinforcement learning algorithms optimise for multi-turn interactions where reasoning unfolds across a sequence of interleaved tool invocations?
These questions barely scratch the surface of the rapidly evolving landscape, and we’re genuinely excited to see where this momentum leads.
This model is available on Hugging Face and will soon be available on our Developer Platform.
Follow PhroneticAI on LinkedIn for more such blogs.