Xiangyu Yue

Assistant Professor
Multimedia Laboratory (MMLab)
Department of Information Engineering
The Chinese University of Hong Kong

Email: xyyue [at] ie.cuhk.edu.hk

I am an Assistant Professor in the Department of Information Engineering at The Chinese University of Hong Kong, with the Multimedia Laboratory (MMLab). I received my Ph.D. in Electrical Engineering and Computer Science from the University of California, Berkeley, working with Prof. Alberto Sangiovanni-Vincentelli and Prof. Kurt Keutzer at Berkeley AI Research.

Prior to Berkeley, I received my M.S. degree from Stanford University and B.S. degree from Nanjing University. I have spent time at Google Research, Google [x] Robotics, Baidu AI Research, and Tencent AI Lab. I received the Lotfi A. Zadeh Award for my research.

Openings

Prospective students and researchers: I have multiple fully funded Ph.D. positions for 2027, as well as Post-Doc, RA, visiting student, and intern opportunities. Please email me if you are interested and highlight any funding sources or support you may already have.

Selected Publications

Google Scholar and Full List

ECCV

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Peiwen Sun*, Xudong Lu*, Huadai Liu*, Yang Bo, Dongming Wu, Huankang Guan, Minghong Cai, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Rui Liu, Xiangyu Yue

European Conference on Computer Vision (ECCV), 2026

ECCV

GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing

Zifeng Zhu, Jiaming Han, Jiaxiang Zhao, Minnan Luo, Xiangyu Yue

European Conference on Computer Vision (ECCV), 2026

VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

Distill on a Diet: Efficient Knowledge Distillation via Learnable Data Pruning

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

Twins: Learn to Predict Unified Representations with Focal Loss

MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models

MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

Elastic Diffusion Transformer

VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

RISE: Self-Improving Robot Policy with Compositional World Model

Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation

SpatialLogic-Bench: A Diagnostic Benchmark for Task-Oriented Spatiotemporal Reasoning

Probing Audio-Visual Reasoning in Multimodal Language Models through the Lens of Audio

AdaTooler-V: Adaptive Tool-Use for Images and Videos

Exploring Reasoning Reward Model for Agents

Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

StyleDoctor: Towards Specialist Reward Model for Style-centric Generation Tasks

LATTICE: Democratize High-Fidelity 3D Generation at Scale

NaTex: Seamless Texture Generation as Latent Color Diffusion

3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding

OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models

Transition Models: Rethinking the Generative Learning Objective

Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement

MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents

OneThinker: All-in-one Reasoning Model for Image and Video

VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Evolve Vision-Language-Action Model into an Agent with On-the-fly Tool-use

PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation

SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Consistent Noisy Latent Rewards for Trajectory Preference Optimization in Diffusion Models

Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations

Video-R1: Reinforcing Video Reasoning in MLLMs

Native-Resolution Image Synthesis

ReSim: Reliable World Simulation for Autonomous Driving

Learning to Integrate Diffusion ODEs by Averaging the Derivatives

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Fira: Can We Achieve Full-rank Training of LLMs under Low-rank Constraint?

From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision

Chimera: Improving Generalist Model with Domain-Specific Experts

CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation

FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions

Unleashing Vecset Diffusion Model for Fast Shape Generation

HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation

SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Scaling Omni-modal Pretraining with Multimodal Context: Advancing Universal Representation Learning Across Modalities

Breaking the Encoder Barrier for Seamless Video-Language Understanding

Learning Beyond Still Frames: Scaling Vision-Language Models with Video

RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance

UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines

OneLLM: One Framework to Align All Modalities with Language

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

Meta-Transformer: A Unified Framework for Multimodal Learning

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model