Modeling, Understanding, and Interacting with the 3D World

Abstract

The rapid rise of large language models has brought AI into people’s daily lives and is reshaping many aspects of society. It is increasingly recognized that AI’s success in the digital domain must be extended to the real 3D world, ultimately enabling robotic AI systems to live and work in physical environments. Achieving this goal requires models that can effectively model, understand, and interact with the 3D world. In this talk, I will present our recent research spanning 3D object generation, dynamic scene understanding, geometric and spatial reasoning, world models, and active vision systems. In particular, I will introduce Stream3D, a scalable framework for streaming and consistent 3D generation from sparse observations; PAGE-4D, a dynamic-aware 4D reconstruction model that jointly estimates geometry and camera motion in dynamic scenes; GeoWorld, a geometry-grounded world modeling framework that improves spatial reasoning and physical consistency in vision-language models; GEM, a geometry-enhanced world model that aligns generative dynamics with structured geometric representations for robotic manipulation; and an active vision system that enables robots to actively perceive the world, improve scene understanding, and increase manipulation success through closed-loop interaction. Together, these works highlight a pathway toward robotic AI systems that can robustly perceive, predict, and act in the real world.

About the speaker

Prof. Mengyu Wang is an Associate Professor with appointments at Harvard Medical School, the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University, the Harvard Data Science Initiative, and the Broad Institute of MIT and Harvard. Prof. Wang's research interests span generative AI for computer vision, multimodal large language model behaviors and agents, AI for robotics, AI for genomics, and other AI applications in medicine.
