The School of Computing and Data Science (https://www.cds.hku.hk/) was established by the University of Hong Kong on 1 July 2024, comprising the Department of Computer Science, the Department of Statistics and Actuarial Science, and the Department of AI and Data Science.

Abstract

Large Vision-Language Models (VLMs) have demonstrated impressive generalization in the digital realm, but translating this into reliable robot manipulation and navigation remains a fundamental challenge. This talk explores a hybrid path forward: augmenting generalist "brains" with specialist "nervous systems." I will first present two foundation model efforts: SeeDo, which leverages VLMs to interpret long-horizon human videos and generate executable task plans, and INT-ACT, an evaluation suite that diagnoses a critical intention-to-execution gap in current Vision-Language-Action (VLA) systems. This gap reveals a key generalization boundary: robust task understanding does not guarantee robust physical control. To bridge this divide, I will introduce specialist models that provide two missing ingredients: fine-grained physical understanding and scalable data acquisition for learning. EgoPAT3Dv2 grounds robot action by learning 3D human intention forecasting from real-world egocentric videos. To address the data-scaling challenge, RAP employs a real-to-sim-to-real paradigm, while CityWalker explores web-scale video to learn robust, specialized skills. I will conclude by drawing analogies from the only known generalist agents—ourselves—to offer my answer to the question posed in the title.

About the speaker

Chen Feng is an Institute Associate Professor at New York University, Director of the AI4CE Lab, and Founding Co-Director of the NYU Center for Robotics and Embodied Intelligence. His research focuses on active and collaborative robot perception and robot learning to address multidisciplinary, use-inspired challenges in construction, manufacturing, and transportation. He is dedicated to developing novel algorithms and systems that enable intelligent agents to understand and interact with dynamic, unstructured environments. Prior to NYU, he worked as a research scientist in the Computer Vision Group at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, Massachusetts, where he developed patented algorithms for localization, mapping, and 3D deep learning in autonomous vehicles and robotics. Chen earned his doctoral and master's degrees from the University of Michigan between 2010 and 2015, and his bachelor's degree in 2010 from Wuhan University. As an active contributor to the AI and robotics communities, he has published over 90 papers in top conferences and journals such as CVPR, ICCV, RA-L, ICRA, and IROS, and has served as an area chair and associate editor. In 2023, he was awarded the NSF CAREER Award. More information about his research can be found at https://ai4ce.github.io.

 

Division of Computer Science,
School of Computing and Data Science

Rm 207 Chow Yei Ching Building
The University of Hong Kong
Pokfulam Road, Hong Kong

Email: csenq@hku.hk
Telephone: 3917 3146
