
My RL-trained multi-agent coding model, Orca-Agent-v0.1-14B, achieved a 167% relative improvement over its base model on Stanford's TerminalBench. I've open-sourced everything.

*What I did:*

- I trained a 14B orchestrator model to better coordinate explorer & coder subagents (the subagents are tool calls for the orchestrator; see the sketch below)
- Scaled to 32x H100s that were pushed to their limits across 4 bare-metal nodes
- Scaled to 256 Docker environments rolling out simultaneously, automatically distributed across the cluster
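To make the first point concrete: the orchestrator's action space is tool calls that dispatch to the explorer and coder subagents, which do the environment work and report condensed context back. Below is a minimal sketch of that loop; every name in it (SubagentResult, run_explorer, run_coder, llm_step, ...) is illustrative, not the actual API of the released repo.

```python
# Minimal sketch of the orchestrator/subagent loop described above.
# All names here are illustrative, not the released repo's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubagentResult:
    summary: str   # condensed findings/changes reported back to the orchestrator
    success: bool

def run_explorer(instruction: str) -> SubagentResult:
    # Read-only exploration of the repo/terminal (placeholder body).
    return SubagentResult(summary=f"explored: {instruction}", success=True)

def run_coder(instruction: str) -> SubagentResult:
    # Applies edits / runs commands inside the task's sandbox (placeholder body).
    return SubagentResult(summary=f"edited: {instruction}", success=True)

# The orchestrator never touches the environment directly; its action space
# is tool calls that dispatch to the explorer and coder subagents.
TOOLS: dict[str, Callable[[str], SubagentResult]] = {
    "explorer": run_explorer,
    "coder": run_coder,
}

def orchestrate(task: str, llm_step: Callable[[list[str]], dict], max_turns: int = 20) -> list[str]:
    """One rollout: the orchestrator plans, delegates to subagents, and stops
    when it decides the task is done (unit tests decide the reward)."""
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        action = llm_step(history)          # e.g. {"tool": "coder", "instruction": "..."}
        if action["tool"] == "finish":
            break
        result = TOOLS[action["tool"]](action["instruction"])
        history.append(f"{action['tool']} -> {result.summary}")
    return history
```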

*Key results:*

- Qwen3-14B jumped from *7% → 18.25%* on TerminalBench after training
- The model is now within striking distance of Qwen3-Coder-480B (19.7%)
- Training was stable, with a smooth entropy decrease and healthy gradient norms

*Training approach:*

Reward design (and my biggest learning): I kept it simple - *just unit tests*. Every "smart" reward signal I tried to craft led to policy collapse.
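In practice that means the reward for a rollout is simply whether the task's unit tests pass in that rollout's container. A hedged sketch of such a binary reward (the docker exec invocation and test command are assumptions, not the repo's code):

```python
# Sketch of a unit-test-only reward: binary, no shaping, no partial credit.
# The container/test-command details are assumptions, not the repo's code.
import subprocess

def unit_test_reward(container_id: str, test_cmd: str) -> float:
    """Return 1.0 if the task's unit tests pass inside the rollout's Docker
    container, else 0.0."""
    try:
        proc = subprocess.run(
            ["docker", "exec", container_id, "bash", "-lc", test_cmd],
            capture_output=True,
            timeout=600,   # kill runaway test suites
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if proc.returncode == 0 else 0.0
```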

Curriculum learning:
- Stage 1: tasks where the base model succeeded 1-2 times out of 3 (41 tasks)
- Stage 2: tasks where the Stage-1 model succeeded 1-4 times out of 5
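The curriculum amounts to a pass-rate filter over tasks: keep the ones the current policy sometimes solves but hasn't mastered. A small sketch under that reading, where the pass-count dictionaries are placeholders rather than real results:

```python
# Sketch of the pass-rate filter behind the two curriculum stages.
# pass_counts maps task id -> number of successful attempts; the example
# dictionaries are placeholders, not real evaluation results.
def select_curriculum(pass_counts: dict[str, int], min_passes: int, max_passes: int) -> list[str]:
    """Keep tasks the current policy sometimes solves but has not mastered."""
    return [
        task for task, passes in pass_counts.items()
        if min_passes <= passes <= max_passes
    ]

base_model_passes = {"task-a": 0, "task-b": 1, "task-c": 2, "task-d": 3}   # 3 attempts each
stage1_model_passes = {"task-b": 2, "task-c": 5, "task-e": 4}              # 5 attempts each

# Stage 1: tasks the base model solved 1-2 times out of 3.
stage1_tasks = select_curriculum(base_model_passes, min_passes=1, max_passes=2)
# Stage 2: tasks the Stage-1 model solved 1-4 times out of 5.
stage2_tasks = select_curriculum(stage1_model_passes, min_passes=1, max_passes=4)
```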

Dataset: synthetically generated RL environments and unit tests.

*More details:*

I've added many more details to the repo linked in this submission, including the training code, model weights, and datasets.

Huge thanks to:
- Tara for providing the compute
- The Prime Intellect team for building prime-rl and dealing with my endless questions
- Alex Dimakis for the conversation that sparked training the orchestrator model

Thanks for reading!

Dan

(Evaluated on the excellent TerminalBench benchmark by Stanford & Laude Institute)