SCORP Project Page

Abstract

Cooperative driving is inherently a safety- and efficiency-critical task that involves coordinating diverse, interaction-realistic multi-agent trajectories. Existing diffusion-based methods can capture multimodal behaviors from demonstrations, but they often suffer from weak scene consistency and poor alignment with closed-loop cooperative objectives. As a result, post-training becomes necessary, yet stable online post-training in reactive multi-agent environments remains challenging. In this paper, we present SCORP, a scene-consistent multi-agent diffusion planner with stable online reinforcement learning (RL) post-training for cooperative driving. For pre-training, we develop a scene-conditioned multi-agent denoising architecture that couples inter-agent self-attention with a dual-path conditioning mechanism: cross-attention provides direct scene-information injection, while AdaLN-Zero enables additional flexible and stable conditional modulation, thereby improving the scene consistency and road adherence of joint trajectories. For post-training, we formulate a two-layer Markov decision process (MDP) that explicitly couples the reverse denoising chain with policy-environment interaction, and we co-design dense, well-shaped planning rewards and variance-gated group-relative optimization (VG-GRPO) to mitigate advantage collapse and gradient instability during closed-loop training. Extensive experiments show that SCORP outperforms strong open-source baselines on WOMD, with 10.47%–28.26% and 1.70%–7.22% improvements in core safety and efficiency metrics, respectively. Moreover, relative to alternative post-training methods, SCORP delivers significant and consistent gains in driving safety and traffic efficiency, highlighting stable and sustained advances in closed-loop cooperative driving.

Method

Overall pipeline: condition-enhanced diffusion pre-training learns scene-consistent multimodal joint trajectories; stable online RL post-training then aligns the pre-trained planner with closed-loop safety and efficiency objectives.

Pre-training: Multi-Agent Diffusion Model

Dual-path scene conditioning combines cross-attention and AdaLN-Zero for scene consistency in dense interactions.

Architecture of the multi-agent diffusion planner. A scene encoder captures relational context in a symmetric manner, and a denoising decoder generates joint multi-agent plans via inter-agent attention and scene-conditioned generation, augmented by AdaLN-Zero for stable conditional modulation and constraint enforcement.
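
To make the AdaLN-Zero path more concrete, here is a minimal pure-Python sketch of the general AdaLN-Zero pattern known from diffusion transformers. It is not SCORP's decoder: the class name `AdaLNZeroBlock`, the `tanh` stand-in for the attention/MLP sublayer, and the vector sizes are illustrative assumptions. The sketch only shows how a condition vector can produce a shift, a scale, and a gate, and why a zero-initialized projection makes the block start as an identity mapping.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, c):
    """Matrix-vector product plus bias; `weight` is a list of rows."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weight, bias)]


class AdaLNZeroBlock:
    """Hypothetical residual block modulated by a scene-condition vector."""

    def __init__(self, dim, cond_dim):
        self.dim = dim
        # Zero-initialized projection: condition -> (shift, scale, gate).
        self.weight = [[0.0] * cond_dim for _ in range(3 * dim)]
        self.bias = [0.0] * (3 * dim)

    def sublayer(self, h):
        # Stand-in for an attention or MLP sublayer (assumption for illustration).
        return [math.tanh(v) for v in h]

    def __call__(self, x, cond):
        params = linear(self.weight, self.bias, cond)
        d = self.dim
        shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
        # Modulate the normalized features with the condition-dependent scale and shift.
        h = [hv * (1.0 + s) + b for hv, s, b in zip(layer_norm(x), scale, shift)]
        h = self.sublayer(h)
        # Gated residual: with gate == 0 the block returns x unchanged.
        return [xv + g * hv for xv, g, hv in zip(x, gate, h)]


block = AdaLNZeroBlock(dim=4, cond_dim=3)
x = [0.5, -1.0, 2.0, 0.0]
scene = [1.0, 2.0, 3.0]
identity_out = block(x, scene)       # equals x at initialization

block.bias[2 * 4:] = [1.0] * 4       # pretend training has opened the gate
modulated_out = block(x, scene)      # now differs from x
```

Because the projection starts at zero, the gate is zero and the block initially passes its input through unchanged; conditioning strength is then learned gradually, which is the usual argument for the stability of this form of modulation.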

Post-training: Online RL with VG-GRPO

Dense rewards + VG-GRPO stabilize training and improve closed-loop safety-efficiency cooperation.

The stable online RL post-training framework has three core components: closed-loop policy rollouts, dense reward evaluation, and our proposed variance-gated group relative policy optimization (VG-GRPO).
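
The idea of gating a group-relative update by reward variance can be sketched as follows. This is an assumption-heavy illustration, not the VG-GRPO algorithm from the paper: the page does not define the gate, so the linear ramp between two thresholds, the function names, and the use of the population standard deviation are all hypothetical. The default thresholds `0.03` and `0.06` are borrowed from the "Gating std1/std2" column of the ablation table purely as example values; their interpretation as lower and upper bounds on the group reward standard deviation is also an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard group-relative advantages: center and scale rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards], std


def variance_gate(std, std_low=0.03, std_high=0.06):
    """Hypothetical gate: 0 below std_low, 1 above std_high, linear ramp in between."""
    if std_high <= std_low:
        raise ValueError("std_high must be greater than std_low")
    if std <= std_low:
        return 0.0
    if std >= std_high:
        return 1.0
    return (std - std_low) / (std_high - std_low)


def gated_advantages(rewards, std_low=0.03, std_high=0.06):
    """Scale group-relative advantages by the variance gate (illustrative only)."""
    advantages, std = group_relative_advantages(rewards)
    gate = variance_gate(std, std_low, std_high)
    return [gate * a for a in advantages]


# A group whose rollouts all earn the same reward carries no learning signal,
# so the gate suppresses the update instead of amplifying numerical noise.
flat_group = gated_advantages([0.5, 0.5, 0.5, 0.5])

# A group with clearly separated rewards passes through with full weight.
mixed_group = gated_advantages([0.0, 1.0, 0.0, 1.0])
```

The motivation for a gate of this kind is that dividing by a near-zero group standard deviation inflates tiny reward differences into large advantages; suppressing low-variance groups is one plausible way to avoid that failure mode.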

Video Results

Local videos cover demonstrations, pre-training vs post-training comparisons, and ablations.

Featured Reel

SCORP Showcase Reel

Comparison Group

Pre-training vs post-training. Pre-training can show undesirable behavior in some scenarios due to imitation-learning limits. Post-training substantially improves closed-loop planning, safety, and efficiency in challenging scenarios.

Pre-training: before online post-training. Failed interactions: Agents 27–28 and 12–14.

Post-training (1M steps): safety improved, but behavior is still conservative.

Post-training (10M steps): safer and more efficient cooperative planning.

Case A

Base Model

Pre-training: paired case A.

Incremental Gain

Post-training: paired case A. Online RL adds the cooperative improvement.

Case B

Base Model

Pre-training: paired case B.

Incremental Gain

Post-training: paired case B. Online RL adds the cooperative improvement.

Case C

Base Model

Pre-training: paired case C.

Incremental Gain

Post-training: paired case C. Online RL adds the cooperative improvement.

Additional Case

Base Model

Pre-training: additional paired scenario.

Incremental Gain

Post-training: additional paired scenario. Online RL adds the cooperative improvement.

Ablation Group

AdaLN-Zero ablations: direct comparisons without vs. with AdaLN-Zero.

Case A

Without

w/o AdaLN-Zero: ablation case A.

With

With AdaLN-Zero: ablation case A.

Case B

Without

w/o AdaLN-Zero: ablation case B.

With

With AdaLN-Zero: ablation case B.

Visualization Group

Diverse cooperative behaviors under closed-loop execution.

Diverse Behaviors

Scenario 1

Interaction pattern 1: Vehicle 0 and Vehicle 1 interact; Vehicle 1 goes first.

Scenario 2

Interaction pattern 2: Vehicle 0 and Vehicle 1 interact; Vehicle 1 yields.

Closed-loop Results

Primary objectives: collision rate (CR), off-road rate (OR), average speed (AS). Secondary diagnostics: ADE, Kin.

WOMD Testing Interactive Split

Method	CR ↓	OR ↓	AS ↑	ADE ↓	Kin ↓
TrafficBotsV1.5	2.74 ±0.21	1.79 ±0.14	8.03 ±0.48	1.68 ±0.09	0.26 ±0.02
SMART-large	2.22 ±0.09	1.58 ±0.10	8.34 ±0.30	1.30 ±0.01	0.21 ±0.01
VBD	2.46 ±0.14	1.92 ±0.18	8.08 ±0.52	1.41 ±0.02	0.24 ±0.01
SMART-tiny-CLSFT	2.10 ±0.10	1.53 ±0.12	8.47 ±0.44	1.23 ±0.03	0.25 ±0.02
SCORP	1.89 ±0.12	1.36 ±0.08	8.61 ±0.46	1.36 ±0.04	0.32 ±0.03

Values are mean ± std over repeated closed-loop evaluations in the same simulator configuration.
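
The relative improvements quoted in the abstract can be approximately checked against the mean values in this table. The sketch below rests on one assumption that this page does not state: the "core safety" figure is taken to be the reduction in the pooled CR + OR value. Under that assumption, the table means reproduce 10.47% (against SMART-tiny-CLSFT) and 28.26% (against TrafficBotsV1.5); the AS gain against TrafficBotsV1.5 gives 7.22%. The helper names are illustrative.

```python
# Mean values copied from the WOMD table above (standard deviations omitted).
RESULTS = {
    "TrafficBotsV1.5": {"CR": 2.74, "OR": 1.79, "AS": 8.03},
    "SMART-large": {"CR": 2.22, "OR": 1.58, "AS": 8.34},
    "VBD": {"CR": 2.46, "OR": 1.92, "AS": 8.08},
    "SMART-tiny-CLSFT": {"CR": 2.10, "OR": 1.53, "AS": 8.47},
    "SCORP": {"CR": 1.89, "OR": 1.36, "AS": 8.61},
}


def relative_reduction(baseline, ours):
    """Percentage reduction of a lower-is-better metric."""
    return 100.0 * (baseline - ours) / baseline


def relative_gain(baseline, ours):
    """Percentage gain of a higher-is-better metric."""
    return 100.0 * (ours - baseline) / baseline


def compare(baseline_name, ours_name="SCORP"):
    """Return (safety improvement %, speed improvement %) versus one baseline.

    Assumption: safety is measured on the pooled CR + OR value.
    """
    b, o = RESULTS[baseline_name], RESULTS[ours_name]
    safety = relative_reduction(b["CR"] + b["OR"], o["CR"] + o["OR"])
    speed = relative_gain(b["AS"], o["AS"])
    return round(safety, 2), round(speed, 2)


strongest = compare("SMART-tiny-CLSFT")   # (10.47, 1.65)
weakest = compare("TrafficBotsV1.5")      # (28.26, 7.22)
```

The smallest AS gain computed from the rounded table means is about 1.65%, close to but not identical to the 1.70% quoted in the abstract; the table entries are rounded to two decimals, so small differences of this size are expected.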

Effect of Post-training Strategy

Method	CR ↓	OR ↓	AS ↑	ADE ↓	Kin ↓
Pre-trained only	2.04 ±0.11	1.68 ±0.10	8.36 ±0.42	1.28 ±0.02	0.25 ±0.02
SFT	2.01 ±0.07	1.64 ±0.06	8.37 ±0.36	1.15 ±0.015	0.25 ±0.01
DPO	1.97 ±0.13	1.58 ±0.09	8.15 ±0.39	1.33 ±0.04	0.27 ±0.01
Offline RL	2.18 ±0.09	1.82 ±0.14	8.98 ±0.68	1.37 ±0.05	0.26 ±0.02
SCORP (Online RL)	1.89 ±0.12	1.36 ±0.08	8.61 ±0.46	1.36 ±0.04	0.32 ±0.03

Qualitative: Pre-training vs. Post-training

Qualitative comparison between pre-training and post-training.

Example rollout: online post-training improves conflict resolution (safety) and overall traffic progress (efficiency) with longer training.

Stability: Road Consistency (AdaLN-Zero)

Without AdaLN-Zero: off-road drift in a sharp right turn.

With AdaLN-Zero: on-road stable turn.

AdaLN-Zero scene-conditioned modulation improves boundary adherence in dense interactions, reducing off-road events.

Ablations

Compact summaries. Full details and protocols are in the paper PDF.

AdaLN-Zero

Setting	CR ↓	OR ↓	AS ↑	ADE ↓	Kin ↓
w/o AdaLN-Zero	2.11	2.05	8.40	1.30	0.26
with AdaLN-Zero	2.04	1.68	8.36	1.28	0.25

VG-GRPO

Gating std1/std2	Collapse step	CR ↓	OR ↓	AS ↑	ADE ↓	Kin ↓
w/o gating	≈ 0.5M	2.15	2.03	8.05	1.74	0.40
0.03/0.06	-	1.89	1.36	8.61	1.36	0.32
0.00/0.06	≈ 2.0M	2.01	1.57	8.42	1.42	0.30
0.03/0.09	-	1.96	1.50	8.49	1.30	0.29

Ablation on Post-training Data Distribution

Dataset Type	CR ↓	OR ↓	AS ↑	ADE ↓	Kin ↓
High-score	1.95	1.34	8.53	1.30	0.31
Low-score	2.18	1.99	8.49	1.40	0.37
Full	1.89	1.36	8.61	1.36	0.32

CR: collision rate (%); OR: off-road rate (%); AS: average speed (m/s); ADE: average displacement error (m); Kin: kinematic infeasibility rate (%).
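
As a rough illustration of how such table cells could be produced, the sketch below computes CR, OR, and AS for one evaluation run and then formats a "mean ±std" cell across repeated runs. Everything here is an assumption for illustration: the per-agent record layout, the per-agent averaging, the use of the sample standard deviation, and the toy numbers are not taken from the paper, whose evaluation protocol may differ.

```python
import statistics


def run_metrics(agents):
    """Aggregate one evaluation run.

    `agents` is a list of dicts with hypothetical keys:
    'collided' (bool), 'off_road' (bool), 'mean_speed' (m/s).
    """
    n = len(agents)
    return {
        "CR": 100.0 * sum(a["collided"] for a in agents) / n,   # collision rate (%)
        "OR": 100.0 * sum(a["off_road"] for a in agents) / n,   # off-road rate (%)
        "AS": statistics.fmean([a["mean_speed"] for a in agents]),  # average speed (m/s)
    }


def format_cell(values):
    """Format repeated-run values as 'mean ±std' using the sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return f"{mean:.2f} ±{std:.2f}"


# Toy data: four agents in a single run (made-up values).
toy_run = [
    {"collided": True, "off_road": False, "mean_speed": 8.0},
    {"collided": False, "off_road": False, "mean_speed": 9.0},
    {"collided": False, "off_road": False, "mean_speed": 10.0},
    {"collided": False, "off_road": False, "mean_speed": 9.0},
]
toy_metrics = run_metrics(toy_run)

# Toy collision rates from three repeated runs (made-up values).
toy_cell = format_cell([2.0, 2.2, 1.8])
```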

SCORP: Scene-Consistent Multi-Agent Diffusion Planning With Stable Online Reinforcement Post-Training for Cooperative Driving
