Bio： Song-Chun Zhu was born in Ezhou, Hubei Province, China. He is a world-renowned expert in computer vision, statistics and applied mathematics, and artificial intelligence. He graduated from the University of Science and Technology of China in 1991, moved to the United States in 1992, and received a PhD in computer science from Harvard University in 1996. From 2002 to 2020 he was a professor in the Departments of Statistics and Computer Science at UCLA and director of the UCLA Center for Vision, Cognition, Learning and Autonomous Robotics. He has published more than 300 papers in top international journals and conferences and has won many international awards in computer vision, pattern recognition, and cognitive science, including top honors in computer vision such as the Marr Prize and the Helmholtz Prize. He twice served as chair of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012, CVPR 2019), and from 2010 to 2020 twice served as director of MURI, a US multi-university, interdisciplinary collaborative project spanning vision, cognitive science, and artificial intelligence. Professor Zhu has long been committed to building a unified mathematical framework for computer vision, cognitive science, and artificial intelligence more broadly. After 28 years in the United States, Professor Zhu returned to China in September 2020 to serve as dean of the Beijing Institute for General Artificial Intelligence, Chair Professor at Peking University, and Chair Professor of Basic Science at Tsinghua University.
Title：Computer Vision: A Task-oriented and Agent-based Perspective
Abstract： Over the past 40+ years, computer vision has been studied from two popular perspectives: i) geometry-based, object-centered representations in the 1970s-90s, and ii) appearance-based, view-centered representations in the 2000s-2020s. In this talk, I will argue for a third perspective: iii) agent-based, task-centered representations, which will lead to general-purpose vision systems as integrated parts of AI agents. From this perspective, vision is driven by the large number of daily tasks that AI agents need to perform, including searching, reconstruction, recognition, grasping, social communication, tool use, etc. Vision is thus viewed as a continuous computational process serving these tasks. The key concepts in this perspective are physical and social representations for functionality, physics, intentionality, causality, and utility.
Bio： Michael Black received his B.Sc. from the University of British Columbia (1985), his M.S. from Stanford (1989), and his Ph.D. from Yale University (1992). He has held positions at the University of Toronto, Xerox PARC, and Brown University. He is one of the founding directors of the Max Planck Institute for Intelligent Systems in Tübingen, Germany, where he leads the Perceiving Systems department. He is a Distinguished Amazon Scholar and an Honorarprofessor at the University of Tübingen. His work has won several awards, including the IEEE Computer Society Outstanding Paper Award (1991), Honorable Mention for the Marr Prize (1999 and 2005), the 2010 Koenderink Prize, the 2013 Helmholtz Prize, and the 2020 Longuet-Higgins Prize. He is a member of the German National Academy of Sciences Leopoldina and a foreign member of the Royal Swedish Academy of Sciences. In 2013 he co-founded Body Labs Inc., which was acquired by Amazon in 2017.
Title：Learning digital humans for the Metaverse
Abstract： The Metaverse will require artificial humans that interact with real humans as well as with real and virtual 3D worlds. This requires a real-time understanding of humans and scenes, as well as the generation of natural and appropriate behavior. We approach the problem of modeling such embodied human behavior through capture, modeling, and synthesis. First, we learn realistic and expressive 3D human avatars from 3D scans. We then train neural networks to estimate human pose and shape from images and video, focusing in particular on humans interacting with each other and with the 3D world. By capturing people in action, we are able to train neural networks to model and generate human movement and human-scene interaction. To validate our models, we synthesize virtual humans in novel 3D scenes. The goal is to produce realistic human avatars that interact with virtual worlds in ways that are indistinguishable from real humans.
Bio： Bo Xu is currently the president of the Institute of Automation, Chinese Academy of Sciences, the dean of the School of Artificial Intelligence, University of Chinese Academy of Sciences, and a member of the National New Generation Artificial Intelligence Strategic Advisory Committee. He has long been engaged in the research and application of intelligent speech processing and artificial intelligence technology. He has won awards including the Outstanding Youth Award of the Chinese Academy of Sciences and the First Prize of the Wang Xuan News Science and Technology Progress Award. He has led a number of national key projects and grants, including national key support programs, the 863 and 973 programs, and National Natural Science Foundation projects. His research results have been deployed at scale in education, broadcasting and television, and security. In recent years, he has mainly focused on auditory models, brain-inspired intelligence, cognitive computing, and game intelligence.
Title：Three-modality Fundamental Big Model - Exploring the Path to More General Artificial Intelligence
Abstract： With the advent of text models such as GPT-3 and BERT, pre-trained models have been developing rapidly, and dual-modal models for joint image-text learning are also emerging, showing a powerful ability to learn different tasks automatically and to transfer quickly to data in new domains under unsupervised conditions. However, current pre-trained models ignore sound. We are surrounded by sound, and speech in particular is not only a means of communication between humans but also carries emotion and feeling. In this talk, I will introduce "ZiDongTaiChu", the first image-text-audio three-modal large model, which includes the speech modality. The model maps the vision, text, and speech modalities into a unified semantic space through their respective encoders, and then learns the semantic associations and feature alignment among the modalities through a multi-head self-attention mechanism to form a unified multi-modal knowledge representation. It realizes not only cross-modal understanding but also cross-modal generation, and achieves a balance between the cognitive abilities of understanding and generation. We propose a unified multi-level, multi-task self-supervised learning framework operating at the token, modality, and sample levels, which supports more diverse and broader downstream tasks; in particular, through the semantic network we realize generating audio from images and generating images from audio. The three-modal large model is an important step toward general artificial intelligence with capabilities for artistic creation, powerful interaction, and task generalization.
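The core fusion idea in the abstract can be sketched in a few lines: each modality's encoder projects its native features into one shared semantic space, and multi-head self-attention over the concatenated tokens lets the modalities attend to and align with each other. The sketch below is a minimal illustration of that mechanism only, assuming toy dimensions and random weights; it is not the actual "ZiDongTaiChu" code, and `encode` is a hypothetical stand-in for the real per-modality encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 64, 4          # assumed toy sizes, not the real model's
d_head = d_model // n_heads

def encode(x, w):
    # Hypothetical stand-in "encoder": one linear projection into the
    # shared semantic space (the real encoders are deep networks).
    return x @ w

def multi_head_self_attention(tokens, wq, wk, wv, wo):
    # tokens: (n, d_model). Project to queries/keys/values, split into
    # heads, apply scaled dot-product attention, then merge the heads.
    n = tokens.shape[0]
    q = (tokens @ wq).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    k = (tokens @ wk).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    v = (tokens @ wv).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over keys
    out = (weights @ v).transpose(1, 0, 2).reshape(n, d_model)
    return out @ wo

# Per-modality features with different native dimensions (3 image tokens,
# 5 text tokens, 4 audio tokens), each projected into the shared space.
image_feats = rng.normal(size=(3, 128))
text_feats = rng.normal(size=(5, 300))
audio_feats = rng.normal(size=(4, 80))
w_img, w_txt, w_aud = (rng.normal(size=(d, d_model)) * 0.05
                       for d in (128, 300, 80))
tokens = np.concatenate([encode(image_feats, w_img),
                         encode(text_feats, w_txt),
                         encode(audio_feats, w_aud)])   # (12, d_model)

wq, wk, wv, wo = (rng.normal(size=(d_model, d_model)) * 0.05
                  for _ in range(4))
fused = multi_head_self_attention(tokens, wq, wk, wv, wo)
print(fused.shape)   # (12, 64): one aligned representation per token
```

Because every token, whatever its source modality, attends to all twelve tokens, the output rows mix image, text, and audio evidence — the cross-modal alignment the abstract describes.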
Bio： Larry S. Davis is a Distinguished University Professor in the Department of Computer Science and director of the Center for Automation Research (CfAR). His research focuses on object/action recognition, scene analysis, event modeling and recognition, image and video databases, tracking, human movement modeling, 3-D human motion capture, and camera networks. Davis is also affiliated with the Computer Vision Laboratory in CfAR. He served as chair of the Department of Computer Science from 1999 to 2012. He received his doctorate from the University of Maryland in 1976. He was named an IAPR Fellow, an IEEE Fellow, and an ACM Fellow.
Title：Compression Compatible Deep Learning
Abstract： The deep learning revolution sparked by the 2012 AlexNet paper has been transformative for the field of computer vision. Many problems that were severely limited by classical solutions are now seeing unprecedented success. The rapid proliferation of deep learning methods has led to a sharp increase in their use in consumer and embedded applications, which depend on lossy multimedia compression for the efficient storage and transmission of data in real-world scenarios. As such, there has been increased interest in deep learning solutions for multimedia compression that would allow for higher compression ratios and increased visual quality. The deep learning approach to multimedia compression, so-called learned multimedia compression, computes a compressed representation of an image or video using deep networks for the encoder and the decoder. While these techniques have enjoyed impressive academic success, their industry adoption has been essentially non-existent: classical compression techniques like JPEG and MPEG are too entrenched in modern computing to be easily replaced. In this talk we take an orthogonal approach and leverage deep learning to improve the compression fidelity of these classical algorithms, allowing the advances in deep learning to be used for multimedia compression in the context of classical methods. We first describe the underlying theory that unifies the mathematical models of compression and deep learning, allowing deep networks to operate on compressed data directly. We then discuss how deep learning can be used to correct information loss in JPEG compression in the most general setting, so that images can be encoded at high compression ratios while still maintaining visual fidelity. Lastly, we describe how deep-learning-based inference tasks, such as classification, detection, and segmentation, behave in the presence of classical compression and how to mitigate the resulting performance loss, so that images can be compressed further without accuracy loss on downstream learning tasks.
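One intuition behind letting deep networks operate on compressed data directly comes from JPEG's structure: it stores each 8x8 block as coefficients of the 2-D DCT, which is an orthonormal linear transform. A linear network layer can therefore be applied in the DCT domain by transforming its weights once — inner products are preserved, so pixel-domain and coefficient-domain outputs agree. The sketch below illustrates only this linear-algebra fact with toy data; it is an assumption-laden illustration, not the speaker's actual method or code.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis, as used for JPEG's 8x8 block transform.
    d = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5)
               * np.arange(n)[None, :]).T
    d[0] *= 1 / np.sqrt(2)
    return d * np.sqrt(2 / n)

D = dct_matrix()
rng = np.random.default_rng(0)
block = rng.uniform(0, 255, size=(8, 8))   # one toy "pixel" block
weight = rng.normal(size=(8, 8))           # one neuron of a linear layer

coeffs = D @ block @ D.T                   # JPEG-style 2-D DCT of the block
weight_dct = D @ weight @ D.T              # transform the weights once

pixel_out = np.sum(weight * block)         # layer applied in pixel space
dct_out = np.sum(weight_dct * coeffs)      # same layer, compressed domain
print(np.isclose(pixel_out, dct_out))      # True: the two domains agree
```

Since the transformed weights can be precomputed, inference can stay in the coefficient domain and skip full JPEG decoding; extending this from a single linear neuron to convolutions and nonlinear layers is where the theory mentioned in the abstract does the real work.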
Bio： Yoichi Sato is a professor at the Institute of Industrial Science, the University of Tokyo. He received his B.S. degree from the University of Tokyo in 1990, and his M.S. and Ph.D. degrees in robotics from the School of Computer Science, Carnegie Mellon University, in 1993 and 1997. His research interests include first-person vision, gaze sensing and analysis, physics-based vision, and reflectance analysis. He has served or is serving in several conference organization and journal editorial roles, including IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, Computer Vision and Image Understanding, CVPR 2023 General Co-Chair, ICCV 2021 Program Co-Chair, ACCV 2018 General Co-Chair, ACCV 2016 Program Co-Chair, and ECCV 2012 Program Co-Chair.
Title：Understanding Human Activities from First-Person Perspectives
Abstract： Wearable cameras have become widely available as off-the-shelf products. First-person videos captured by wearable cameras provide close-up views of fine-grained human behavior, such as interaction with objects using hands, interaction with people, and interaction with the environment. First-person videos also provide an important clue to the intention of the person wearing the camera, such as what they are trying to do or what they are attending to. These advantages are unique to first-person videos and distinguish them from videos captured by fixed cameras such as surveillance cameras. As a result, there has been increasing interest in developing computer vision methods that take first-person videos as input. On the other hand, first-person videos pose major challenges to computer vision due to factors such as continuous and often severe camera motion, a limited field of view, and rapid illumination changes. In this talk, I will present our attempts to develop first-person vision methods for different tasks, including action recognition, future person localization, and gaze estimation.