
Flow-OPD: On-Policy Distillation for Flow Matching Models

To the best of our knowledge, we present the first integration of On-Policy Distillation into Flow Matching models, achieving strong multi-task performance across diverse generative domains.

Zhen Fang1* Wenxuan Huang*†‡ Yu Zeng1 Yiming Zhao1 Shuang Chen2 Kaituo Feng3 Yunlong Lin3 Jie Liu3 Lin Chen1 Zehui Chen1 Shaosheng Cao4‡ Feng Zhao1

1University of Science and Technology of China (USTC)  ·  2UCLA  ·  3CUHK  ·  4Xiaohongshu

*Equal Contribution  ·  †Project Leader  ·  ‡Corresponding Author

Teaser figure. Key results: +18pt average improvement over the base model; +8pt over GRPO-Mix (the strongest baseline); GenEval 0.92 (base: 0.63); OCR accuracy 0.94 (base: 0.59).

Abstract

We identify two critical bottlenecks in multi-task Flow Matching training: reward sparsity and gradient interference. Standard GRPO works in single-task settings but degrades catastrophically in multi-task settings, because compressing the high-dimensional image space into a single scalar reward yields sparse signals and divergent, interfering gradients across tasks. Flow-OPD integrates On-Policy Distillation into the Flow Matching pipeline, replacing sparse scalar rewards with dense, trajectory-level, multi-teacher vector-field supervision. Evaluated on SD-3.5-Medium, Flow-OPD achieves a +18pt average improvement over vanilla GRPO and surpasses the individual teacher models on OCR and DeQA.

Method

Flow-OPD decouples expertise acquisition from model unification through a two-stage process: Cold Start initialization followed by Multi-Teacher On-Policy Distillation.

Pipeline overview: Stage 1 (Cold Start) initializes the student from the SD-3.5-M base model via SFT or model merging; Stage 2 distills the student on-policy from task-specific teachers (GenEval, OCR, DeQA, PickScore) to produce a single unified student.

Cold Start Initialization

SFT-based: fine-tune the student on teacher trajectories to obtain a stable starting point. Model Merging: superpose anisotropic teacher priors at no training cost; merging consistently outperforms SFT across all OOD benchmarks.
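The merge-based cold start can be sketched as a simple weight superposition. The snippet below is a minimal illustration, assuming every teacher is a full SD-3.5-M checkpoint with identical keys and shapes; `merge_teachers` and the uniform coefficients are hypothetical, not the released code.

```python
import torch

def merge_teachers(teacher_state_dicts, weights=None):
    """Cold-start the student by superposing teacher checkpoints.

    Assumes every teacher shares the SD-3.5-M architecture, so all state
    dicts have identical keys and shapes. Uniform weights are illustrative.
    """
    if weights is None:
        weights = [1.0 / len(teacher_state_dicts)] * len(teacher_state_dicts)
    merged = {}
    for key in teacher_state_dicts[0]:
        merged[key] = sum(w * sd[key].to(torch.float32)
                          for w, sd in zip(weights, teacher_state_dicts))
    return merged

# Usage (hypothetical handles): initialize the student without any training.
# student.load_state_dict(merge_teachers([geneval_sd, ocr_sd, deqa_sd, pick_sd]))
```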

On-Policy Sampling (SDE)

Convert the deterministic ODE into a Stochastic Differential Equation (SDE) for stochastic exploration. Sample G trajectories per prompt to generate on-policy marginal distributions.

Multi-Teacher Dense Labeling

Each teacher acts as a Generative Reward Model (GRM), returning a full vector field v_{φ_k} rather than a scalar score. Dynamic routing weights α_k select the domain expert among the GenEval, OCR, DeQA, and PickScore teachers.

MAR: Manifold Anchor Regularization

A task-agnostic aesthetic teacher prevents background mode collapse and semantic redundancy. KL regularization from the aesthetic teacher acts as a continuous elastic anchor, decoupling functional alignment from stylistic preservation.

Core Formulations

On-Policy Sampling (SDE)
$$\mathrm{d}x_t = \left[ v_\theta(x_t, t) + \frac{\sigma_t^2}{2t}\big(x_t + (1-t)\,v_\theta(x_t, t)\big) \right]\mathrm{d}t + \sigma_t\,\mathrm{d}w$$

Converting the deterministic ODE into an SDE enables stochastic on-policy exploration. Sampling G trajectories per prompt yields the on-policy marginal distribution $x_t \sim \rho_t^{\theta}(\cdot \mid c)$.
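As a concrete illustration, one Euler-Maruyama step of this SDE might look like the sketch below; the time direction, step-size handling, and σ_t schedule are assumptions for exposition rather than the paper's implementation.

```python
import math
import torch

def sde_step(v_theta, x_t, t, dt, sigma_t, cond):
    """One Euler-Maruyama step of the on-policy sampling SDE (sketch).

    Implements dx_t = [v + (sigma_t^2 / (2t)) * (x_t + (1 - t) * v)] dt + sigma_t dW
    with v = v_theta(x_t, t, cond); conventions here are assumed, not official.
    """
    v = v_theta(x_t, t, cond)                                       # student vector field
    drift = v + (sigma_t ** 2 / (2.0 * t)) * (x_t + (1.0 - t) * v)
    diffusion = sigma_t * math.sqrt(abs(dt)) * torch.randn_like(x_t)
    return x_t + drift * dt + diffusion

# Running G independent trajectories per prompt yields the on-policy marginals
# x_t ~ rho_t^theta(. | c) used for distillation.
```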

Task-Specific Teacher Routing
$$v_{\text{target}}(x_t, t, c) = v_{\phi_k}(x_t, t, c), \quad \text{for } c \in \mathcal{T}_k$$

Each task $\mathcal{T}_k$ has a dedicated domain-expert teacher $\phi_k$. Data belonging to task k is routed exclusively to its corresponding teacher, providing dense supervision without gradient interference across domains.
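A routing sketch, assuming each prompt batch is tagged with its source task; the task keys and teacher handles below are illustrative.

```python
import torch

def target_field(teachers, task_key, x_t, t, cond):
    """Return the dense target field v_{phi_k} from the task's dedicated teacher.

    `teachers` maps an illustrative task key ("geneval", "ocr", "deqa",
    "pickscore") to a frozen teacher model; data from task k is routed
    exclusively to teacher phi_k, so gradients never mix across domains.
    """
    teacher = teachers[task_key]
    with torch.no_grad():            # teachers only provide supervision targets
        return teacher(x_t, t, cond)
```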

Dense KL Reward
$$D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{target}}\big) = \frac{\Delta t}{2}\left(\frac{\sigma_t(1-t)}{2t} + \frac{1}{\sigma_t}\right)^{2}\big\|v_\theta - v_{\text{target}}\big\|^{2}$$

Reverse KL divergence in the SDE framework reduces to a time-weighted L2 distance between student and teacher vector fields.
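In code, the per-step dense reward reduces to the time-weighted L2 term above; the sketch below assumes batched tensors of shape (B, C, H, W) and a reward = negative-KL sign convention, both of which are assumptions.

```python
def dense_kl_reward(v_student, v_target, t, sigma_t, dt):
    """Dense per-step reward from the reverse KL (sketch).

    The KL reduces to a time-weighted L2 distance between vector fields:
    (dt / 2) * (sigma_t * (1 - t) / (2 t) + 1 / sigma_t)^2 * ||v_theta - v_target||^2.
    The sign convention (reward = -KL) is illustrative.
    """
    coeff = (dt / 2.0) * (sigma_t * (1.0 - t) / (2.0 * t) + 1.0 / sigma_t) ** 2
    kl = coeff * ((v_student - v_target) ** 2).flatten(1).sum(dim=-1)   # per-sample KL
    return -kl
```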

PPO-Clipped Policy Update
$$\mathcal{J}(\theta) \approx \frac{1}{B \cdot G}\sum_{i,j,t} \min\!\Big(\rho_{t,i,j}(\theta)\, r^{\mathrm{OPD}}_{t,i,j},\ \operatorname{clip}\big(\rho_{t,i,j}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, r^{\mathrm{OPD}}_{t,i,j}\Big)$$

Clipped surrogate objective bounds the policy trust region. Gradients flow exclusively through the policy ratio ρ, preserving fine-grained credit assignment.
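A sketch of the clipped surrogate, assuming per-step Gaussian transition log-probabilities under the SDE; the names and the ε default are illustrative.

```python
import torch

def clipped_objective(log_prob_new, log_prob_old, reward_opd, eps=0.2):
    """PPO-style clipped surrogate over the dense OPD reward (sketch).

    `log_prob_*` are per-step log-likelihoods of the sampled SDE transitions;
    gradients flow only through the ratio rho, and the reward is detached.
    """
    rho = torch.exp(log_prob_new - log_prob_old.detach())        # importance ratio
    reward = reward_opd.detach()                                  # no gradient through the reward
    unclipped = rho * reward
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps) * reward
    return torch.minimum(unclipped, clipped).mean()               # maximize; use -J as the loss
```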

Total Training Loss (with MAR)
$$\mathcal{L}_{\text{Total}}(\theta) = \mathcal{L}_{\text{Policy}}(\theta) + \lambda\, \mathbb{E}_{c,\,t,\,x_t \sim \rho_t^{\theta}}\Big[\, w(t)\, \big\|v_\theta - v_{\text{aesthetic}}\big\|^{2} \,\Big]$$

MAR introduces KL regularization from a frozen aesthetic teacher as a continuous elastic anchor, preventing aesthetic degradation while absorbing functional intelligence.
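Putting the pieces together, the total loss with MAR could be assembled as below; `lam`, `w_t`, and the aesthetic-teacher output `v_aesthetic` are illustrative placeholders.

```python
def total_loss(policy_loss, v_student, v_aesthetic, w_t, lam=0.1):
    """L_Total = L_Policy + lambda * E[ w(t) * ||v_theta - v_aesthetic||^2 ]  (sketch).

    `v_aesthetic` comes from a frozen, task-agnostic aesthetic teacher and acts
    as a continuous elastic anchor; `w_t` is the per-sample time weight.
    """
    mar = (w_t * ((v_student - v_aesthetic.detach()) ** 2).flatten(1).sum(dim=-1)).mean()
    return policy_loss + lam * mar
```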

Main Results

Evaluated on SD-3.5-Medium across GenEval, OCR, DeQA, and PickScore. Training runs on 4×8 H800 GPUs for roughly 50 hours.

Overall Performance — Multi-Task Alignment on SD-3.5-M

| Model | GenEval ↑ | OCR Acc. ↑ | DeQA ↑ | PickScore ↑ | Average ↑ |
|---|---|---|---|---|---|
| SD-3.5-M (base) | 0.63 | 0.59 | 4.07 | 21.64 | 0.72 |
| + GRPO-GenEval | 0.94 | 0.65 | 4.01 | 21.53 | 0.81 |
| + GRPO-OCR | 0.64 | 0.92 | 4.06 | 21.69 | 0.80 |
| + GRPO-DeQA | 0.64 | 0.66 | 4.23 | 23.02 | 0.76 |
| + GRPO-PickScore | 0.51 | 0.69 | 4.22 | 23.19 | 0.73 |
| GRPO-Mix (3:1:1) | 0.73 | 0.83 | 4.33 | 21.84 | 0.82 |
| Ours (SFT Init) | 0.91 | 0.92 | 4.29 | 21.83 | 0.88 |
| Ours (Merge Init) | 0.92 | 0.94 | 4.35 | 23.08 | 0.90 |

Cold Start Ablation

T2I-CompBench (OOD generalization). Model merging best leverages the homogeneous teacher priors without additional training cost.

| Model | Color | Shape | 3D-Spatial | Numeracy | Overall |
|---|---|---|---|---|---|
| GRPO-Mix | 0.692 | 0.611 | 0.422 | 0.641 | 0.587 |
| Cold Start (w/o OPD) | 0.710 | 0.613 | 0.425 | 0.646 | 0.595 |
| Flow-OPD (Merge Init) | 0.719 | 0.629 | 0.457 | 0.684 | 0.618 |

MAR Ablation

MAR prevents reward hacking by anchoring optimization to a task-agnostic aesthetic manifold.

| Model | ImageReward ↑ | Aesthetic ↑ | HPS-v2.1 ↑ | QwenVL ↑ |
|---|---|---|---|---|
| SD-3.5-M (base) | 1.02 | 5.87 | 0.298 | 3.45 |
| GRPO-Mix | 1.23 | 5.93 | 0.310 | 3.88 |
| Flow-OPD w/o MAR | 1.26 | 5.89 | 0.300 | 3.82 |
| Flow-OPD (w/ MAR) | 1.36 | 6.23 | 0.330 | 4.05 |
Qualitative results comparison and additional result galleries (figures).

BibTeX

@article{flowopd2026,
  title  = {Flow-OPD: On-Policy Distillation for Flow Matching Models},
  author = {Zhen Fang and Wenxuan Huang and Yu Zeng and Yiming Zhao and Shuang Chen and Kaituo Feng and Yunlong Lin and Jie Liu and Lin Chen and Zehui Chen and Shaosheng Cao and Feng Zhao},
  year   = {2026}
}