Zhen Fang (方镇)

About Me

I am a first-year master's student at the University of Science and Technology of China (USTC), advised by Prof. Feng Zhao. I received my B.E. degree from Communication University of China in 2025. My current research focuses on Vision Language Action models, and I also have a keen interest in music production. I welcome any opportunity to connect!

My research takes a three-part approach to human-like artificial intelligence: Precision Sensing (CV), Deep Reasoning (FC & LRM), and Decisive Action (VLA & UMM). My goal is to combine these components into a seamless cognitive loop. Beyond the technical challenges, I am deeply curious about the potential of this field and eager to discover where the ultimate limits of AI lie.

Research Interests

Unified Multimodal Model 🔥🔥: Understanding vs. Generation
Agent 🔥🔥: Function Calling & Deep Research
Embodied AI: Vision Language Action
Computer Vision: image generation, image editing

News

[Aug. 2025] One paper on function calling was accepted to EMNLP 2025.
[Jul. 2024] One paper on image editing was accepted to ACMMM 2024.

Awards

[Jul. 2025] We won the THIRD prize in the Robotwin Dual-Arm Collaboration Challenge at the 2nd Meis Workshop! (CVPR 2025 Workshop)

Publications

Arxiv

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Yao Hu, Philip Torr, Feng Zhao, Wanli Ouyang

PDF Code Project Page Preprint

Arxiv

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Yu Zeng*, Wenxuan Huang*, Zhen Fang*, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao

PDF Code Project Page Preprint

Arxiv

UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

Zhen Fang* (Project Leader), Ruiyan Han*, XinYu Sun*, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao

PDF Code Project Page Preprint

Arxiv

DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

Zhen Fang*, Zhuoyang Liu*, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao

PDF Project Page Preprint

EMNLP

CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

Shiting Huang*, Zhen Fang*, Zehui Chen, Siyu Yuan, Junjie Ye, Yu Zeng, Lin Chen, Qi Mao, Feng Zhao

The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)

PDF Main

ACMMM

MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance

Qi Mao, Lan Chen, Yuchao Gu, Zhen Fang, and Mike Zheng Shou

ACM International Conference on Multimedia (ACMMM), 2024.

PDF Code Project Page Poster

TMM

StarVid: Enhancing Semantic Alignment in Video Diffusion Models via Spatial and SynTactic Guided Attention Refocusing

Yuanhang Li, Qi Mao, Lan Chen, Zhen Fang, Lei Tian, Xinyan Xiao, Libiao Jin, Hua Wu

IEEE Transactions on Multimedia (TMM)

PDF

Miscellaneous

Outside of my research, I am a creator at heart. I immerse myself in 📖 reading (Wang Xiaobo is my favorite author), 🎵 music production (Lo-fi Hip-hop & EDM; click to listen to my portfolio 🎧), ✍️ creative writing, and ⚽ table football. I thrive on the sensation of bringing something new into existence. I also enjoy exploring 🎨 visual design, 💫 anime, 📷 photography, and 🎬 movie production. For me, life is all about perception and experience!