APIGen-MT: Structured Multi-Turn Data via Simulation

Intro: Why this matters
Picture this: You’re building an AI agent to help users book travel, troubleshoot software, or manage finances. One task leads to another. The user changes their mind halfway. New context unfolds mid-conversation. How do you train your AI to navigate all of that without losing its mind—or the thread?
That’s the challenge APIGen-MT tackles: creating rich, realistic multi-turn training data that prepares AI agents for real-world, back-and-forth dialogue.
The problem with current approaches
Today’s AI agents are trained on data that’s too clean, too shallow, and too rigid. Most training sets rely on static, one-shot prompts with canned responses. They don’t reflect how humans actually interact—messy, changing, and full of context.
Even worse, many of these datasets aren’t verifiable. There’s no clear ground truth to judge whether an agent’s action was "correct." That’s a problem if you want your agent to do more than just guess and bluff.
What is APIGen-MT?
APIGen-MT is a two-phase framework for synthesizing realistic, multi-turn agent training data. Think of it as a production line where every task starts with a solid blueprint and then evolves into a full dialogue between a simulated human and AI agent.
But it’s not just about generating chat logs. APIGen-MT ensures:
- Every task is grounded in executable APIs
- Every agent action is verifiable
- Every conversation aligns with a clear, testable user goal
The result? AI training data that’s as rich as real conversations, but structured enough to build better models.
How it works (breakdown + example)
Phase 1: Task blueprinting
It starts with setting the stage: gathering APIs, domain policies, and any context the agent might need. Then, an LLM proposes a user task—something like, "Find me a flight under $500 that lands before noon."
Behind the scenes, APIGen-MT builds a structured plan:
- The user intent (q)
- The ground-truth sequence of API calls that solves it (a_gt)
- The final response the agent should produce (o_gt)
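For concreteness, here is a minimal sketch of what one of these blueprints could look like as a data structure. The class and field names (`APICall`, `TaskBlueprint`, and so on) are illustrative assumptions for this post, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class APICall:
    """One ground-truth tool invocation in the blueprint."""
    name: str        # e.g. "search_flights"
    arguments: dict  # e.g. {"max_price": 500, "arrive_before": "12:00"}

@dataclass
class TaskBlueprint:
    """Illustrative container for one validated Phase-1 task."""
    intent: str                                                       # q: the user's goal in plain language
    groundtruth_actions: list[APICall] = field(default_factory=list)  # a_gt: ideal API-call sequence
    expected_output: str = ""                                         # o_gt: the final response to the user

blueprint = TaskBlueprint(
    intent="Find me a flight under $500 that lands before noon.",
    groundtruth_actions=[
        APICall("search_flights", {"max_price": 500, "arrive_before": "12:00"})
    ],
    expected_output="Here are the flights under $500 that land before noon: ...",
)
```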
Before this plan is approved, it's stress-tested. Format checks verify API syntax. Execution checks run the proposed calls in a simulated environment to confirm they actually work. Policy tests, written as Python unit tests, enforce domain rules.
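A rough sketch of how that validation pass could be wired together, reusing the `TaskBlueprint` sketch above and assuming a hypothetical simulated-environment object (`env`) and a list of policy checks, might look like this:

```python
# Hypothetical validation pass over a blueprint; the `env` interface
# (available_apis, execute, state) and `policy_tests` are illustrative, not the paper's API.
def validate_blueprint(blueprint: TaskBlueprint, env, policy_tests) -> tuple[bool, str]:
    # 1. Format check: every proposed call must refer to a known API.
    for call in blueprint.groundtruth_actions:
        if call.name not in env.available_apis:
            return False, f"Unknown API: {call.name}"

    # 2. Execution check: run the calls in the simulated environment.
    try:
        for call in blueprint.groundtruth_actions:
            env.execute(call.name, **call.arguments)
    except Exception as exc:
        return False, f"Execution failed: {exc}"

    # 3. Policy checks: domain rules expressed as unit-test-style assertions.
    for test in policy_tests:
        ok, reason = test(env.state, blueprint)
        if not ok:
            return False, f"Policy violation: {reason}"

    return True, "Blueprint passed all checks"
```

If any stage fails, the error message becomes feedback for the next proposal round, which is the refinement loop described next.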
Then comes the review panel—a group of LLMs evaluating the blueprint’s coherence and completeness. If a task fails at any point, feedback is generated and the blueprint is refined in a new round.
And for more complex tasks? APIGen-MT supports "Reverse Task Recombination," stitching together multiple validated tasks into one seamless challenge.
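As a rough illustration of that recombination step (the `llm_merge_intents` helper and the concatenation logic here are assumptions, not the paper's actual procedure):

```python
def recombine(blueprints: list[TaskBlueprint], llm_merge_intents) -> TaskBlueprint:
    """Stitch several validated blueprints into one longer, multi-step task.

    `llm_merge_intents` stands in for an LLM call that rewrites the individual
    intents into a single coherent user goal.
    """
    merged_intent = llm_merge_intents([b.intent for b in blueprints])
    merged_actions = [call for b in blueprints for call in b.groundtruth_actions]
    # Simplification: in practice the final response would be re-generated and
    # the combined task re-validated end to end.
    merged_output = " ".join(b.expected_output for b in blueprints)
    return TaskBlueprint(merged_intent, merged_actions, merged_output)
```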

Phase 2: Simulated dialogue
Once a blueprint is approved, it’s turned into a realistic multi-turn conversation.
An LLM plays the human. It gradually reveals subgoals and context. The AI agent interprets the intent and uses API calls to respond. Crucially, the simulated human doesn’t know what tools the agent has—mimicking real-world constraints.
Each interaction is validated: Did the final state match expectations? Did the conversation make sense? Only successful dialogues are added to the training dataset.
To stabilize things, APIGen-MT samples multiple human responses (Best-of-N) and uses self-critique to pick the most consistent ones.
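Putting Phase 2 together, the trajectory-collection loop can be sketched roughly as below. The `agent`, `simulated_user`, and `env` interfaces are assumptions for illustration, not the paper's actual code:

```python
def collect_trajectory(blueprint: TaskBlueprint, env, agent, simulated_user, n_samples=4):
    """Roll out one simulated human-agent dialogue and keep it only if it succeeds."""
    history = []
    user_msg = simulated_user.open(blueprint.intent)   # reveals the goal piece by piece
    while user_msg is not None:                        # None signals the user is done
        agent_msg = agent.act(history + [("user", user_msg)], env)  # may call APIs via env
        history += [("user", user_msg), ("assistant", agent_msg)]

        # Best-of-N: sample several candidate user replies, then self-critique to keep
        # the one most consistent with the (hidden) intent and the dialogue so far.
        candidates = [simulated_user.reply(history) for _ in range(n_samples)]
        user_msg = simulated_user.critique_and_pick(candidates, history)

    # Keep the dialogue only if the environment ended up in the expected final state.
    return history if env.state_matches(blueprint) else None
```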

Why it stands out
What makes APIGen-MT such a strong foundation for agentic AI?
First, it separates planning from execution. By designing blueprints first, it avoids tangled logic later.
Second, it grounds everything in execution. API calls aren’t just text—they’re run in a simulated environment, making each task testable and reproducible.
Third, it treats validation seriously—with multi-stage checks, LLM committees, and feedback loops that refine tasks until they’re solid.
And finally, it models agentic interaction as a POMDP—a mathematical framework for decision-making under uncertainty. This helps frame agent behavior in terms of actions, observations, and latent states, just like real-life conversations.
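For readers who like the formal view, a POMDP is conventionally written as a tuple; one reasonable mapping onto this setting (an assumed reading for this post, not the paper's exact notation) is:

```latex
% Standard POMDP tuple; the mapping to APIGen-MT's setting is illustrative.
\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R)
```

Here S is the latent state (the user's full intent plus the environment state), A the agent's actions (API calls or messages to the user), O what the agent actually observes (user utterances and API results), T the transition dynamics, and R a success signal such as whether the final state and response match the blueprint.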
Comparison with similar tools
Feature | APIGen-MT | MultiChallenge | ToolSandbox | InterCode | AgentInstruct |
---|---|---|---|---|---|
Multi-turn generation | ✅ | ✅ | ✅ | ✅ | ✅ |
Groundtruth verification | ✅ | ❌ | ✅ | ✅ | ❌ |
Executable environments | ✅ | ❌ | ✅ | ✅ | ❌ |
Open-sourced pipeline/data | ✅ | ✅ | ✅ | ✅ | ✅ |
Real-world use cases
Whether you’re building a travel assistant or a compliance bot, APIGen-MT makes it easier to:
- Train agents that adapt to evolving user goals
- Simulate complex workflows with many moving parts
- Test agents in controlled, repeatable environments
- Benchmark long-horizon reasoning with confidence
Benchmarks or results
The BFCL v3 benchmark (Berkeley Function-Calling Leaderboard) is a leading evaluator of function-calling performance in LLMs, spanning both single-turn and multi-turn scenarios. APIGen-MT-trained models dominate here:
- The xLAM-2-70b-fc-r model achieves a staggering 78.19% accuracy, the highest on the leaderboard—outpacing frontier models like GPT-4o (72.08%) and Claude 3.5.
- Even smaller variants like xLAM-2-32b-fc-r (75.83%) and xLAM-2-8b-fc-r (72.83%) stand tall among giants.
- The smallest xLAM-2-1b-fc-r achieves 43.12%—surpassing GPT-4o’s 41% in function-calling mode.

τ-Bench pushes agents into real-world-like conversations with policy-constrained APIs in domains like retail and airlines.
- xLAM-2-70b-fc-r scores 56.2% overall success—leaping over GPT-4o (52.9%) and approaching Claude 3.5 Sonnet (60.1%).
- Smaller models still punch above their weight: xLAM-2-32b-fc-r: 54.6%, xLAM-2-8b-fc-r: 46.7%.
- Even xLAM-2-3b-fc-r and 1b-fc-r hold competitive ground, showing that high-quality multi-turn training data can reduce the need for massive scale.

The consistent success of smaller xLAM-2 models trained with APIGen-MT data underlines a paradigm shift:
You don’t need a bigger model—you need better data.
By simulating rich agent-human interplay and embedding verifiable actions, APIGen-MT ensures that models aren't just talking—they’re reasoning, executing, and adapting.
The bigger picture
The future of AI isn’t prompt engineering—it’s agent design.
APIGen-MT shifts the paradigm from statically scripted interactions to dynamic, goal-driven simulations. It lets researchers build, validate, and share datasets that reflect how people actually talk and solve problems.
And it’s open-source. The data. The models. The pipeline. If you’re working on tool-using agents, this is the benchmark to beat—and the starting point to build from.
References
- Prabhakar, A., Liu, Z., et al. (2025). APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay. arXiv preprint arXiv:2504.03601. https://arxiv.org/abs/2504.03601
- Berkeley AI Research. (n.d.). BFCL Leaderboard (Berkeley Function-Calling Leaderboard). Retrieved April 11, 2025, from https://gorilla.cs.berkeley.edu/leaderboard.html
- Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045