SCORP project page

SCORP: Scene-Consistent Multi-Agent Diffusion Planning With Stable Online Reinforcement Post-Training for Cooperative Driving

Scene condition-enhanced diffusion pre-training with stable online RL post-training for safer, more efficient cooperative driving.

PDF Paper
Code Coming Soon

Abstract

Cooperative driving is inherently a safety- and efficiency-critical task that involves coordinating diverse, interaction-realistic multi-agent trajectories. Existing diffusion-based methods can capture multimodal behaviors from demonstrations, but they often suffer from weak scene consistency and poor alignment with closed-loop cooperative objectives. As a result, post-training becomes necessary, yet stable online post-training in reactive multi-agent environments remains challenging. In this paper, we present SCORP, a scene-consistent multi-agent diffusion planner with stable online reinforcement learning (RL) post-training for cooperative driving. For pre-training, we develop a scene-conditioned multi-agent denoising architecture that couples inter-agent self-attention with a dual-path conditioning mechanism: cross-attention provides direct scene-information injection, while AdaLN-Zero enables additional flexible and stable conditional modulation, thereby improving the scene consistency and road adherence of joint trajectories. For post-training, we formulate a two-layer Markov decision process (MDP) that explicitly couples the reverse denoising chain with policy--environment interaction, and we co-design dense, well-shaped planning rewards and variance-gated group-relative optimization (VG-GRPO) to mitigate advantage collapse and gradient instability during closed-loop training. Extensive experiments show that SCORP outperforms strong open-source baselines on WOMD, with 10.47\%--28.26\% and 1.70\%--7.22\% improvements in core safety and efficiency metrics, respectively. Moreover, relative to alternative post-training methods, SCORP delivers significant and consistent gains in driving safety and traffic efficiency, highlighting stable and sustained advances in closed-loop cooperative driving.

Method

Overall pipeline: condition-enhanced diffusion pre-training learns scene-consistent multimodal joint trajectories; stable online RL post-training aligns it with closed-loop safety and efficiency.

Pre-training: Multi-Agent Diffusion Model

Dual-path scene conditioning combines cross-attention and AdaLN-Zero for scene consistency in dense interactions.

Architecture of the multi-agent diffusion planner.

Architecture of the multi-agent diffusion planner. A scene encoder captures relational context in a symmetric manner, and a denoising decoder generates joint multi-agent plans via inter-agent attention and scene-conditioned generation, augmented by AdaLN-Zero for stable conditional modulation and constraint enforcement.

Post-training: Online RL with VG-GRPO

Dense rewards + VG-GRPO stabilize training and improve closed-loop safety-efficiency cooperation.

Online RL post-training framework.

The stable online RL post-training framework has three core components: closed-loop policy rollouts, dense reward evaluation, and our proposed variance-gated group relative policy optimization (VG-GRPO).

Video Results

Local videos cover demonstrations, pre-training vs post-training comparisons, and ablations.

Featured Reel

SCORP Showcase Reel

Comparison Group

Pre-training vs post-training. Pre-training can show undesirable behavior in some scenarios due to imitation-learning limits. Post-training substantially improves closed-loop planning, safety, and efficiency in challenging scenarios.

Pre-training Before online post-training. Failed interactions: Agents 27–28 and 12–14.
Post-training 1M steps Safety improved but still conservative.
Post-training 10M steps Safer and more efficient cooperative planning.
Case A
Base Model
Pre-training Paired case A.
Incremental Gain
Post-training Paired case A. Online RL adds the cooperative improvement.
Case B
Base Model
Pre-training Paired case B.
Incremental Gain
Post-training Paired case B. Online RL adds the cooperative improvement.
Case C
Base Model
Pre-training Paired case C.
Incremental Gain
Post-training Paired case C. Online RL adds the cooperative improvement.
Additional Case
Base Model
Pre-training Additional paired scenario.
Incremental Gain
Post-training Additional paired scenario. Online RL adds the cooperative improvement.

Ablation Group

AdaLN-Zero ablations with direct w/o vs with comparisons.

Case A
Without
w/o AdaLN-Zero Ablation case A.
With
with AdaLN-Zero Ablation case A.
Case B
Without
w/o AdaLN-Zero Ablation case B.
With
with AdaLN-Zero Ablation case B.

Visualization Group

Diverse cooperative behaviors under closed-loop execution.

Diverse Behaviors
Scenario 1
Interaction Pattern 1 Vehicle 0 and Vehicle 1, Vehicle 1 goes first.
Scenario 2
Interaction Pattern 2 Vehicle 0 and Vehicle 1, Vehicle 1 yields.

Closed-loop Results

Primary objectives: collision rate (CR), off-road rate (OR), average speed (AS). Secondary diagnostics: ADE, Kin.

WOMD Testing Interactive Split

Method CR ↓ OR ↓ AS ↑ ADE ↓ Kin ↓
TrafficBotsV1.5 2.74 ±0.21 1.79 ±0.14 8.03 ±0.48 1.68 ±0.09 0.26 ±0.02
SMART-large 2.22 ±0.09 1.58 ±0.10 8.34 ±0.30 1.30 ±0.01 0.21 ±0.01
VBD 2.46 ±0.14 1.92 ±0.18 8.08 ±0.52 1.41 ±0.02 0.24 ±0.01
SMART-tiny-CLSFT 2.10 ±0.10 1.53 ±0.12 8.47 ±0.44 1.23 ±0.03 0.25 ±0.02
SCORP 1.89 ±0.12 1.36 ±0.08 8.61 ±0.46 1.36 ±0.04 0.32 ±0.03

Values are mean ± std over repeated closed-loop evaluations in the same simulator configuration.

Effect of Post-training Strategy

Method CR ↓ OR ↓ AS ↑ ADE ↓ Kin ↓
Pre-trained only 2.04 ±0.11 1.68 ±0.10 8.36 ±0.42 1.28 ±0.02 0.25 ±0.02
SFT 2.01 ±0.07 1.64 ±0.06 8.37 ±0.36 1.15 ±0.015 0.25 ±0.01
DPO 1.97 ±0.13 1.58 ±0.09 8.15 ±0.39 1.33 ±0.04 0.27 ±0.01
Offline RL 2.18 ±0.09 1.82 ±0.14 8.98 ±0.68 1.37 ±0.05 0.26 ±0.02
SCORP (Online RL) 1.89 ±0.12 1.36 ±0.08 8.61 ±0.46 1.36 ±0.04 0.32 ±0.03

Qualitative: Pre-training vs. Post-training

Qualitative comparison between pre-training and post-training.

Example rollout: online post-training improves conflict resolution (safety) and overall traffic progress (efficiency) with longer training.

Stability: Road Consistency (AdaLN-Zero)

Without AdaLN-Zero: off-road drift in a sharp right turn.
Without AdaLN-Zero
With AdaLN-Zero: on-road stable turn.
With AdaLN-Zero

AdaLN-Zero scene-conditioned modulation improves boundary adherence in dense interactions, reducing off-road events.

Ablations

Compact summaries. Full details and protocols are in the paper PDF.

AdaLN-Zero

Setting CR ↓ OR ↓ AS ↑ ADE ↓ Kin ↓
w/o AdaLN-Zero 2.11 2.05 8.40 1.30 0.26
with AdaLN-Zero 2.04 1.68 8.36 1.28 0.25

VG-GRPO

Gating std1/std2 Collapse step CR ↓ OR ↓ AS ↑ ADE ↓ Kin ↓
w/o gating ≈ 0.5M 2.15 2.03 8.05 1.74 0.40
0.03/0.06 - 1.89 1.36 8.61 1.36 0.32
0.00/0.06 ≈ 2.0M 2.01 1.57 8.42 1.42 0.30
0.03/0.09 - 1.96 1.50 8.49 1.30 0.29

Ablation on Post-training Data Distribution

Dataset Type CR ↓ OR ↓ AS ↑ ADE ↓ Kin ↓
High-score 1.95 1.34 8.53 1.30 0.31
Low-score 2.18 1.99 8.49 1.40 0.37
Full 1.89 1.36 8.61 1.36 0.32

CR: collision rate (%); OR: off-road rate (%); AS: average speed (m/s); ADE: average displacement error (m); Kin: kinematic infeasibility rate (%).