Bio： Song-Chun Zhu was born in Ezhou, Hubei Province, China. He is a world-renowned expert in computer vision and artificial intelligence, as well as a statistician and applied mathematician. He graduated from the University of Science and Technology of China in 1991, began his studies in the United States in 1992, and received a PhD in computer science from Harvard University in 1996. From 2002 to 2020, he was a professor in the Departments of Statistics and Computer Science at UCLA and the director of the UCLA Center for Vision, Cognition, Learning, and Autonomy. He has published more than 300 papers in top international journals and conferences and has won many international awards in computer vision, pattern recognition, and cognitive science, including three top international awards in computer vision, among them the Marr Prize and the Helmholtz Prize. He twice served as chair of the Conference on Computer Vision and Pattern Recognition (CVPR 2012, CVPR 2019), and between 2010 and 2020 he twice served as director of a MURI, a US multi-university, interdisciplinary cooperative project, in the fields of vision, cognitive science, and artificial intelligence. Professor Zhu has long been committed to building a unified mathematical framework for computer vision, cognitive science, and artificial intelligence more broadly. After 28 years in the United States, he returned to China in September 2020 to serve as dean of the Beijing Institute for General Artificial Intelligence, Chair Professor at Peking University, and Chair Professor of Basic Science at Tsinghua University.
Title：Computer Vision: A Task-oriented and Agent-based Perspective
Bio： Bo Xu is currently the president of the Institute of Automation, Chinese Academy of Sciences, the dean of the School of Artificial Intelligence, University of Chinese Academy of Sciences, and a member of the National New Generation Artificial Intelligence Strategic Advisory Committee. He has long been engaged in the research and application of intelligent speech processing and artificial intelligence technology. He has won the "Outstanding Youth Award of Chinese Academy of Sciences", the "First Prize of Wang Xuan News Science and Technology Progress", and other awards. He has led a number of key projects funded by national support programs, the 863 and 973 Programs, and the natural science foundation. His research results have been applied on a large scale in education, broadcasting and television, and security. In recent years, his research has focused mainly on auditory models, brain-like intelligence, cognitive computing, and game intelligence.
Title：Three-modal Large Model--Exploring the Path of Universal Artificial Intelligence
Abstract： Since the introduction of text models such as GPT-3 and BERT, pre-trained models have developed rapidly, and dual-modal models that jointly learn from images and text are also emerging. These models show a powerful ability to learn different tasks automatically and to transfer quickly to data in different domains under unsupervised conditions. However, current pre-trained models ignore sound. Sound is all around us, and speech in particular is not only a means of communication between humans but also a carrier of emotions and feelings. In this talk, I will introduce "ZiDongTaiChu", the first image-text-audio three-modal large model, which incorporates the speech modality. The model maps the vision, text, and speech modalities to a unified semantic space through their respective encoders, and then learns the semantic associations and feature alignment among the modalities through a multi-head self-attention mechanism, forming a unified multi-modal knowledge representation. It supports both cross-modal understanding and cross-modal generation, while balancing its understanding and generation abilities. We propose a unified multi-level, multi-task self-supervised learning framework that operates at the token level, the modality level, and the sample level, which provides model support for a more diverse and wider range of downstream tasks. In particular, we realize audio generation from images and image generation from audio through the semantic network. The three-modal large model is an important step towards universal artificial intelligence with artistic creation, powerful interaction, and task generalization capabilities.
Bio： Larry S. Davis is a Distinguished University Professor in the Department of Computer Science and director of the Center for Automation Research (CfAR). His research focuses on object and action recognition, scene analysis, event modeling and recognition, image and video databases, tracking, human movement modeling, 3-D human motion capture, and camera networks. Davis is also affiliated with the Computer Vision Laboratory in CfAR. He served as chair of the Department of Computer Science from 1999 to 2012. He received his doctorate from the University of Maryland in 1976. He has been named an IAPR Fellow, an IEEE Fellow, and an ACM Fellow.
Bio： Yoichi Sato is a professor at the Institute of Industrial Science, the University of Tokyo. He received his B.S. degree from the University of Tokyo in 1990, and his M.S. and Ph.D. degrees in robotics from the School of Computer Science, Carnegie Mellon University, in 1993 and 1997. His research interests include first-person vision, gaze sensing and analysis, physics-based vision, and reflectance analysis. He has served or is serving in several journal editorial and conference organization roles, including for IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, and Computer Vision and Image Understanding, and as CVPR 2023 General Co-Chair, ICCV 2021 Program Co-Chair, ACCV 2018 General Co-Chair, ACCV 2016 Program Co-Chair, and ECCV 2012 Program Co-Chair.
Title：Understanding Human Activities from First-Person Perspectives
Abstract： Wearable cameras have become widely available as off-the-shelf products. First-person videos captured by wearable cameras provide close-up views of fine-grained human behavior, such as interaction with objects using hands, interaction with people, and interaction with the environment. First-person videos also provide an important clue to the intention of the person wearing the camera, such as what they are trying to do or what they are attending to. These advantages are unique to first-person videos and distinguish them from videos captured by fixed cameras such as surveillance cameras. As a result, there has been increasing interest in developing computer vision methods that use first-person videos as input. On the other hand, first-person videos pose a major challenge to computer vision due to multiple factors such as continuous and often violent camera movements, a limited field of view, and rapid illumination changes. In this talk, I will present our attempts to develop first-person vision methods for different tasks, including action recognition, future person localization, and gaze estimation.
Bio： Michael Black received his B.Sc. from the University of British Columbia (1985), his M.S. from Stanford (1989), and his Ph.D. from Yale University (1992). After post-doctoral research at the University of Toronto, he worked at Xerox PARC as a member of research staff and area manager. From 2000 to 2010 he was on the faculty of Brown University in the Department of Computer Science (Assoc. Prof. 2000-2004, Prof. 2004-2010). He is one of the founding directors at the Max Planck Institute for Intelligent Systems in Tübingen, Germany, where he leads the Perceiving Systems department. He is also a Distinguished Amazon Scholar (VP), an Honorarprofessor at the University of Tübingen, and an Adjunct Professor at Brown University. His work has won several awards, including the IEEE Computer Society Outstanding Paper Award (1991), Honorable Mention for the Marr Prize (1999 and 2005), and all three major test-of-time awards: the 2010 Koenderink Prize, the 2013 Helmholtz Prize, and the 2020 Longuet-Higgins Prize. He is a foreign member of the Royal Swedish Academy of Sciences. In 2013 he co-founded Body Labs Inc., which was acquired by Amazon in 2017.