
Flow-OPD: On-Policy Distillation for Flow Matching Models

To the best of our knowledge, we present the first integration of On-Policy Distillation into Flow Matching models, achieving strong multi-task performance across diverse generative domains.

Zhen Fang1* Wenxuan Huang*†‡ Yu Zeng1 Yiming Zhao1 Shuang Chen2 Kaituo Feng3 Yunlong Lin3 Jie Liu3 Lin Chen1 Zehui Chen1 Shaosheng Cao4‡ Feng Zhao1

1University of Science and Technology of China (USTC)  ·  2UCLA  ·  3CUHK  ·  4Xiaohongshu

*Equal Contribution  ·  †Project Leader  ·  ‡Corresponding Author

Teaser figure. Key results: +18pt average improvement over the base model; +8pt over GRPO-Mix (the strongest baseline); GenEval 0.92 (base: 0.63); OCR accuracy 0.94 (base: 0.59).

Abstract

We identify two critical bottlenecks in multi-task Flow Matching training: reward sparsity and gradient interference. Standard GRPO works in single-task settings but degrades catastrophically in multi-task settings, because compressing the high-dimensional image space into a single scalar reward yields sparse signals and divergent, interfering gradients across tasks. Flow-OPD integrates On-Policy Distillation into the Flow Matching pipeline, replacing sparse scalar rewards with dense, trajectory-level, multi-teacher vector-field supervision. Evaluated on SD-3.5-Medium, Flow-OPD achieves a +18pt average improvement over vanilla GRPO and surpasses the individual teacher models on OCR and DeQA.

Method

Flow-OPD decouples expertise acquisition from model unification through a two-stage process: Cold Start initialization followed by Multi-Teacher On-Policy Distillation.

Pipeline overview: Stage 1 (Cold Start) initializes the student from the SD-3.5-M base model via SFT or model merging; Stage 2 distills the student on-policy from task-specific teachers (GenEval, OCR, DeQA, PickScore) to produce a single unified student.

Cold Start Initialization

SFT-based: fine-tune the student on teacher trajectories to obtain a stable starting point. Model Merging: superpose anisotropic teacher priors at no training cost; merging consistently outperforms SFT across all OOD benchmarks.
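The merge-based cold start can be sketched as a simple weight superposition. The snippet below is a minimal illustration, assuming every teacher is a full SD-3.5-M checkpoint with identical keys and shapes; `merge_teachers` and the uniform coefficients are hypothetical, not the released code.

```python
import torch

def merge_teachers(teacher_state_dicts, weights=None):
    """Cold-start the student by superposing teacher checkpoints.

    Assumes every teacher shares the SD-3.5-M architecture, so all state
    dicts have identical keys and shapes. Uniform weights are illustrative.
    """
    if weights is None:
        weights = [1.0 / len(teacher_state_dicts)] * len(teacher_state_dicts)
    merged = {}
    for key in teacher_state_dicts[0]:
        merged[key] = sum(w * sd[key].to(torch.float32)
                          for w, sd in zip(weights, teacher_state_dicts))
    return merged

# Usage (hypothetical handles): initialize the student without any training.
# student.load_state_dict(merge_teachers([geneval_sd, ocr_sd, deqa_sd, pick_sd]))
```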

On-Policy Sampling (SDE)

Convert the deterministic ODE into a Stochastic Differential Equation (SDE) for stochastic exploration. Sample G trajectories per prompt to generate on-policy marginal distributions.

Multi-Teacher Dense Labeling

Each teacher acts as a Generative Reward Model (GRM), returning a full vector field v_{φ_k} rather than a scalar score. Dynamic routing weights α_k select the domain expert among the GenEval, OCR, DeQA, and PickScore teachers.

MAR: Manifold Anchor Regularization

A task-agnostic aesthetic teacher prevents background mode collapse and semantic redundancy. KL regularization from the aesthetic teacher acts as a continuous elastic anchor, decoupling functional alignment from stylistic preservation.

Core Formulations

On-Policy Sampling (SDE)
$$\mathrm{d}x_t = \left[ v_\theta(x_t, t) + \frac{\sigma_t^2}{2t}\big(x_t + (1-t)\,v_\theta(x_t, t)\big) \right]\mathrm{d}t + \sigma_t\,\mathrm{d}w$$

Converting the deterministic ODE into an SDE enables stochastic on-policy exploration. Sampling G trajectories per prompt yields the on-policy marginal distribution $x_t \sim \rho_t^{\theta}(\cdot \mid c)$.
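As a concrete illustration, one Euler-Maruyama step of this SDE might look like the sketch below; the time direction, step-size handling, and σ_t schedule are assumptions for exposition rather than the paper's implementation.

```python
import math
import torch

def sde_step(v_theta, x_t, t, dt, sigma_t, cond):
    """One Euler-Maruyama step of the on-policy sampling SDE (sketch).

    Implements dx_t = [v + (sigma_t^2 / (2t)) * (x_t + (1 - t) * v)] dt + sigma_t dW
    with v = v_theta(x_t, t, cond); conventions here are assumed, not official.
    """
    v = v_theta(x_t, t, cond)                                       # student vector field
    drift = v + (sigma_t ** 2 / (2.0 * t)) * (x_t + (1.0 - t) * v)
    diffusion = sigma_t * math.sqrt(abs(dt)) * torch.randn_like(x_t)
    return x_t + drift * dt + diffusion

# Running G independent trajectories per prompt yields the on-policy marginals
# x_t ~ rho_t^theta(. | c) used for distillation.
```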

Task-Specific Teacher Routing
$$v_{\text{target}}(x_t, t, c) = v_{\phi_k}(x_t, t, c), \quad \text{for } c \in \mathcal{T}_k$$

Each task $\mathcal{T}_k$ has a dedicated domain-expert teacher $\phi_k$. Data belonging to task k is routed exclusively to its corresponding teacher, providing dense supervision without gradient interference across domains.
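A routing sketch, assuming each prompt batch is tagged with its source task; the task keys and teacher handles below are illustrative.

```python
import torch

def target_field(teachers, task_key, x_t, t, cond):
    """Return the dense target field v_{phi_k} from the task's dedicated teacher.

    `teachers` maps an illustrative task key ("geneval", "ocr", "deqa",
    "pickscore") to a frozen teacher model; data from task k is routed
    exclusively to teacher phi_k, so gradients never mix across domains.
    """
    teacher = teachers[task_key]
    with torch.no_grad():            # teachers only provide supervision targets
        return teacher(x_t, t, cond)
```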

Dense KL Reward
$$D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{target}}\big) = \frac{\Delta t}{2}\left(\frac{\sigma_t(1-t)}{2t} + \frac{1}{\sigma_t}\right)^{2}\big\|v_\theta - v_{\text{target}}\big\|^{2}$$

Reverse KL divergence in the SDE framework reduces to a time-weighted L2 distance between student and teacher vector fields.
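In code, the per-step dense reward reduces to the time-weighted L2 term above; the sketch below assumes batched tensors of shape (B, C, H, W) and a reward = negative-KL sign convention, both of which are assumptions.

```python
def dense_kl_reward(v_student, v_target, t, sigma_t, dt):
    """Dense per-step reward from the reverse KL (sketch).

    The KL reduces to a time-weighted L2 distance between vector fields:
    (dt / 2) * (sigma_t * (1 - t) / (2 t) + 1 / sigma_t)^2 * ||v_theta - v_target||^2.
    The sign convention (reward = -KL) is illustrative.
    """
    coeff = (dt / 2.0) * (sigma_t * (1.0 - t) / (2.0 * t) + 1.0 / sigma_t) ** 2
    kl = coeff * ((v_student - v_target) ** 2).flatten(1).sum(dim=-1)   # per-sample KL
    return -kl
```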

PPO-Clipped Policy Update
$$\mathcal{J}(\theta) \approx \frac{1}{B \cdot G}\sum_{i,j,t} \min\!\Big(\rho_{t,i,j}(\theta)\, r^{\mathrm{OPD}}_{t,i,j},\ \operatorname{clip}\big(\rho_{t,i,j}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, r^{\mathrm{OPD}}_{t,i,j}\Big)$$

Clipped surrogate objective bounds the policy trust region. Gradients flow exclusively through the policy ratio ρ, preserving fine-grained credit assignment.
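A sketch of the clipped surrogate, assuming per-step Gaussian transition log-probabilities under the SDE; the names and the ε default are illustrative.

```python
import torch

def clipped_objective(log_prob_new, log_prob_old, reward_opd, eps=0.2):
    """PPO-style clipped surrogate over the dense OPD reward (sketch).

    `log_prob_*` are per-step log-likelihoods of the sampled SDE transitions;
    gradients flow only through the ratio rho, and the reward is detached.
    """
    rho = torch.exp(log_prob_new - log_prob_old.detach())        # importance ratio
    reward = reward_opd.detach()                                  # no gradient through the reward
    unclipped = rho * reward
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps) * reward
    return torch.minimum(unclipped, clipped).mean()               # maximize; use -J as the loss
```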

Total Training Loss (with MAR)
$$\mathcal{L}_{\text{Total}}(\theta) = \mathcal{L}_{\text{Policy}}(\theta) + \lambda\, \mathbb{E}_{c,\,t,\,x_t \sim \rho_t^{\theta}}\Big[\, w(t)\, \big\|v_\theta - v_{\text{aesthetic}}\big\|^{2} \,\Big]$$

MAR introduces KL regularization from a frozen aesthetic teacher as a continuous elastic anchor, preventing aesthetic degradation while absorbing functional intelligence.
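Putting the pieces together, the total loss with MAR could be assembled as below; `lam`, `w_t`, and the aesthetic-teacher output `v_aesthetic` are illustrative placeholders.

```python
def total_loss(policy_loss, v_student, v_aesthetic, w_t, lam=0.1):
    """L_Total = L_Policy + lambda * E[ w(t) * ||v_theta - v_aesthetic||^2 ]  (sketch).

    `v_aesthetic` comes from a frozen, task-agnostic aesthetic teacher and acts
    as a continuous elastic anchor; `w_t` is the per-sample time weight.
    """
    mar = (w_t * ((v_student - v_aesthetic.detach()) ** 2).flatten(1).sum(dim=-1)).mean()
    return policy_loss + lam * mar
```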

Main Results

Evaluated on SD-3.5-Medium across GenEval, OCR, DeQA, and PickScore. Training runs on 4×8 H800 GPUs for roughly 50 hours.

Overall Performance — Multi-Task Alignment on SD-3.5-M

| Model | GenEval ↑ | OCR Acc. ↑ | DeQA ↑ | PickScore ↑ | Average ↑ |
|---|---|---|---|---|---|
| SD-3.5-M (base) | 0.63 | 0.59 | 4.07 | 21.64 | 0.72 |
| + GRPO-GenEval | 0.94 | 0.65 | 4.01 | 21.53 | 0.81 |
| + GRPO-OCR | 0.64 | 0.92 | 4.06 | 21.69 | 0.80 |
| + GRPO-DeQA | 0.64 | 0.66 | 4.23 | 23.02 | 0.76 |
| + GRPO-PickScore | 0.51 | 0.69 | 4.22 | 23.19 | 0.73 |
| GRPO-Mix (3:1:1) | 0.73 | 0.83 | 4.33 | 21.84 | 0.82 |
| Ours (SFT Init) | 0.91 | 0.92 | 4.29 | 21.83 | 0.88 |
| Ours (Merge Init) | 0.92 | 0.94 | 4.35 | 23.08 | 0.90 |

Cold Start Ablation

T2I-CompBench (OOD generalization). Model merging best leverages the homogeneous teacher priors without additional training cost.

| Model | Color | Shape | 3D-Spatial | Numeracy | Overall |
|---|---|---|---|---|---|
| GRPO-Mix | 0.692 | 0.611 | 0.422 | 0.641 | 0.587 |
| Cold Start (w/o OPD) | 0.710 | 0.613 | 0.425 | 0.646 | 0.595 |
| Flow-OPD (Merge Init) | 0.719 | 0.629 | 0.457 | 0.684 | 0.618 |

MAR Ablation

MAR prevents reward hacking by anchoring optimization to a task-agnostic aesthetic manifold.

| Model | ImageReward ↑ | Aesthetic ↑ | HPS-v2.1 ↑ | QwenVL ↑ |
|---|---|---|---|---|
| SD-3.5-M (base) | 1.02 | 5.87 | 0.298 | 3.45 |
| GRPO-Mix | 1.23 | 5.93 | 0.310 | 3.88 |
| Flow-OPD w/o MAR | 1.26 | 5.89 | 0.300 | 3.82 |
| Flow-OPD (w/ MAR) | 1.36 | 6.23 | 0.330 | 4.05 |
Qualitative results comparison and additional result galleries (figures).

BibTeX

@article{flowopd2026,
  title  = {Flow-OPD: On-Policy Distillation for Flow Matching Models},
  author = {Zhen Fang and Wenxuan Huang and Yu Zeng and Yiming Zhao and Shuang Chen and Kaituo Feng and Yunlong Lin and Jie Liu and Lin Chen and Zehui Chen and Shaosheng Cao and Feng Zhao},
  year   = {2026}
}