Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback

1The University of Hong Kong, 2Xiaohongshu Inc., 3University of Electronic Science and Technology of China, 4Harbin Institute of Technology

An overview of Agent2World as a three-stage pipeline:
The Deep Researcher synthesizes external knowledge via web search to fill gaps in underspecified task specifications. The Model Developer implements an executable symbolic world model (e.g., a PDDL domain or a runnable simulator). The Testing Team performs adaptive unit testing and simulation-based validation, providing execution-grounded feedback for iterative repair.

Agent2World overall pipeline

Abstract

Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors arising from interactive execution.
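To make the distinction concrete, here is a minimal illustrative sketch (the function names and the toy "cargo" invariant are hypothetical, not taken from the paper) of how a generated world model can pass a static check yet only reveal its error when executed:

```python
# Hypothetical illustration: a generated world model can pass static checks
# (it compiles and exposes the expected interface) yet still be behaviorally
# wrong once it is actually executed.

def generated_step(state: dict, action: str) -> dict:
    """A (buggy) generated transition: 'unload' forgets to check cargo > 0."""
    next_state = dict(state)
    if action == "unload":
        next_state["cargo"] -= 1          # behavior-level bug: cargo can go negative
    elif action == "load":
        next_state["cargo"] += 1
    return next_state

def static_validate(fn) -> bool:
    """Static check: the code loaded and the transition function is callable."""
    return callable(fn)

def simulation_validate(fn, horizon: int = 3) -> bool:
    """Behavioral check: roll the model out and test an invariant (cargo >= 0)."""
    state = {"cargo": 0}
    for _ in range(horizon):
        state = fn(state, "unload")
        if state["cargo"] < 0:
            return False                   # caught only by interactive execution
    return True

print(static_validate(generated_step))      # True  -> static check passes
print(simulation_validate(generated_step))  # False -> behavior-level error found
```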

In this paper, we propose Agent2World, a tool-augmented multi-agent framework that grounds generation in multi-agent feedback, achieving strong inference-time world-model generation while also serving as a data engine for supervised fine-tuning. Agent2World follows a three-stage pipeline: (i) a Deep Researcher agent performs knowledge synthesis via web search to address specification gaps; (ii) a Model Developer agent implements executable world models; and (iii) a specialized Testing Team conducts adaptive unit testing and simulation-based validation.
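A rough sketch of how the three stages could fit together as a repair loop; the class names, method signatures, and repair budget below are illustrative placeholders, not Agent2World's actual interfaces:

```python
# Illustrative sketch of the three-stage pipeline; all names below are
# hypothetical placeholders rather than the framework's real API.

class DeepResearcher:
    def synthesize(self, task_spec: str) -> str:
        """Stage (i): resolve gaps in the task specification, e.g. via web search."""
        return task_spec + "\n; retrieved domain knowledge goes here"

class ModelDeveloper:
    def implement(self, spec: str, feedback: str | None = None) -> str:
        """Stage (ii): produce an executable world model (PDDL domain or simulator)."""
        draft = f"(define (domain sketch)) ; built from a spec of {len(spec)} chars"
        return draft if feedback is None else draft + f" ; repaired after: {feedback}"

class TestingTeam:
    def validate(self, world_model: str) -> tuple[bool, str]:
        """Stage (iii): adaptive unit tests + simulation rollouts, returning feedback."""
        ok = "repaired after:" in world_model  # stand-in for real execution checks
        return ok, "" if ok else "unit test U3 failed: precondition never satisfiable"

def agent2world(task_spec: str, max_repairs: int = 3) -> str:
    researcher, developer, testers = DeepResearcher(), ModelDeveloper(), TestingTeam()
    spec = researcher.synthesize(task_spec)
    model = developer.implement(spec)
    for _ in range(max_repairs):
        ok, feedback = testers.validate(model)
        if ok:
            break
        model = developer.implement(spec, feedback)  # execution-grounded repair
    return model

print(agent2world("blocks-world logistics with trucks and drones"))
```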

Agent2World achieves consistent state-of-the-art inference-time performance across three benchmarks spanning both PDDL and executable-code representations. Beyond inference, the Testing Team serves as an interactive environment for the Model Developer, providing behavior-aware adaptive feedback that yields multi-turn training trajectories. Fine-tuning on these trajectories substantially improves world-model generation, yielding an average relative gain of 30.95% over the same model before training.
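One plausible way to package such multi-turn repair episodes as supervised fine-tuning data; the record layout and helper below are assumptions for illustration, not the paper's released format:

```python
# Hypothetical sketch of turning multi-turn repair episodes into SFT records;
# the message layout is an assumption, not the paper's released data format.

import json

def episode_to_sft(task_spec: str, turns: list[dict]) -> list[dict]:
    """Each developer turn becomes one training example whose context contains
    the task spec plus all execution-grounded feedback seen so far."""
    examples, history = [], [{"role": "user", "content": task_spec}]
    for turn in turns:
        examples.append({"messages": history + [{"role": "assistant", "content": turn["model"]}]})
        history = history + [
            {"role": "assistant", "content": turn["model"]},
            {"role": "user", "content": turn["feedback"]},  # from the Testing Team
        ]
    return examples

turns = [
    {"model": "(define (domain v1) ...)", "feedback": "simulation: goal unreachable"},
    {"model": "(define (domain v2) ...)", "feedback": "all tests passed"},
]
for example in episode_to_sft("model a warehouse logistics domain", turns):
    print(json.dumps(example, indent=2))
```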

BibTeX


The BibTeX entry will be added once the paper is released.