What if LMs could collectively train, slashing RL post-training costs?
Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
Training language models with reinforcement learning can enhance their reasoning abilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. But there's a problem: scaling RL training requires massive centralized infrastructure that creates bottlenecks and drives up costs. Current approaches demand synchronized GPU clusters and carefully engineered systems that are expensive and fragile.
The Gensyn AI Team tackles this challenge with a new approach called Swarm Sampling Policy Optimization (SAPO). Instead of requiring centralized coordination, SAPO enables decentralized networks of diverse compute nodes to share experiences and train collaboratively. Each node manages its own model while sharing its rollouts with the rest of the network, eliminating the need for synchronized weights or homogeneous hardware.
The Landscape of RL and Multi-Agent Training
Current RL fine-tuning methods like RLHF and RLVR rely on centralized policy updates using algorithms like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). These approaches work, but they create scaling bottlenecks because they require coordinated infrastructure.
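As a rough illustration of the "group relative" idea in GRPO, here is a minimal sketch (not the paper's implementation; the function name is illustrative) of how a response's advantage can be computed by normalizing its reward against the other responses sampled for the same prompt, which removes the need for a learned value critic:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its group's mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantages
```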
Multi-agent methods have emerged as an alternative, organized around three concepts: debate (where multiple models refine responses through dialogue), specialization (assigning specific roles), and self-improvement (bootstrapped reasoning). These approaches show promise for RL training but often still require orchestrated coordination.
SAPO bridges these approaches differently. It uses reward-driven trial-and-error like traditional RL but doesn't require synchronized policies or centralized rollout generation. By sharing experiences across a decentralized network, it captures benefits of multi-agent methods while avoiding their coordination overhead.
How Swarms Work: The SAPO Framework
SAPO operates on a decentralized network of nodes that generate and share rollouts across discrete time steps. Each node maintains its own dataset of tasks and its own policy model, and generates responses independently. The key requirement is that tasks must be verifiable: their answers can be checked algorithmically for correctness.
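To make the flow concrete, here is a hedged sketch of what one round on a single node could look like. The `policy`, `swarm`, and `verifier` interfaces are hypothetical stand-ins, not the paper's code; they only illustrate the loop described above: generate locally, share rollouts, mix in experiences from other nodes, verify, and update.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rollout:
    task: str       # the prompt / verifiable task
    response: str   # a generated answer
    reward: float   # output of the algorithmic verifier (e.g., 1.0 if correct)

def sapo_round(policy, local_tasks, verifier: Callable[[str, str], float],
               swarm, n_local: int = 8, n_external: int = 8):
    # 1. Generate rollouts on the node's own tasks with its own policy.
    local_rollouts = []
    for task in local_tasks[:n_local]:
        response = policy.generate(task)  # assumed policy API
        local_rollouts.append(Rollout(task, response, verifier(task, response)))

    # 2. Share decoded rollouts with the swarm; no weights or gradients leave the node.
    swarm.broadcast(local_rollouts)

    # 3. Sample experiences produced by other nodes' (possibly different) models.
    external_rollouts = swarm.sample(n_external)

    # 4. Re-check the shared answers locally, since verification is algorithmic.
    for r in external_rollouts:
        r.reward = verifier(r.task, r.response)

    # 5. Update the local policy on the mixed batch of local and shared experiences
    #    using whatever RL objective the node runs (e.g., a GRPO-style update).
    policy.update(local_rollouts + external_rollouts)
```

Because only decoded rollouts travel over the network, nodes never have to agree on model weights, architectures, or hardware.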