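Bio: Professor Song-Chun Zhu was born in Ezhou, Hubei Province. He is a world-renowned computer vision expert, statistician and applied mathematician, and artificial intelligence expert. He graduated from the University of Science and Technology of China in 1991, went to the United States to study in 1992, and received his Ph.D. in computer science from Harvard University in 1996. From 2002 to 2020 he was a professor in the Departments of Statistics and Computer Science at the University of California, Los Angeles (UCLA), and director of the UCLA Center for Vision, Cognition, Learning, and Autonomy. He has published more than 300 papers in top international journals and conferences and has received multiple international awards in computer vision, pattern recognition, and cognitive science, including three Marr Prize honors (the highest international award in computer vision) and the Helmholtz Prize. He twice served as chair of the Conference on Computer Vision and Pattern Recognition (CVPR 2012 and CVPR 2019), and between 2010 and 2020 he twice led MURI projects, multi-university interdisciplinary collaborations in the United States spanning vision, cognitive science, and artificial intelligence. Professor Zhu has long worked toward a unified mathematical framework for computer vision, cognitive science, and artificial intelligence as a whole. After 28 years in the United States, he returned to China in September 2020 to serve as Director of the Beijing Institute for General Artificial Intelligence; he is also a Chair Professor at Peking University and a Chair Professor of Basic Science at Tsinghua University.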
Talk title: Computer Vision: A Task-oriented and Agent-based Perspective
Abstract: In the past 40+ years, computer vision has been studied from two popular perspectives: i) geometry-based and object-centered representations in the 1970s-90s, and ii) appearance-based and view-centered representations in the 2000s-2020s. In this talk, I will argue for a third perspective: iii) agent-based and task-centered representations, which will lead to general-purpose vision systems as integrated parts of AI agents. From this perspective, vision is driven by the large number of daily tasks that AI agents need to perform, including searching, reconstruction, recognition, grasping, social communication, tool use, etc. Vision is thus viewed as a continuous computational process that serves these tasks. The key concepts in this perspective are physical and social representations for functionality, physics, intentionality, causality, and utility.
Bio: Michael Black received his B.Sc. from the University of British Columbia (1985), his M.S. from Stanford (1989), and his Ph.D. from Yale University (1992). He has held positions at the University of Toronto, Xerox PARC, and Brown University. He is one of the founding directors at the Max Planck Institute for Intelligent Systems in Tübingen, Germany, where he leads the Perceiving Systems department. He is a Distinguished Amazon Scholar and an Honorarprofessor at the University of Tübingen. His work has won several awards, including the IEEE Computer Society Outstanding Paper Award (1991), Honorable Mention for the Marr Prize (1999 and 2005), the 2010 Koenderink Prize, the 2013 Helmholtz Prize, and the 2020 Longuet-Higgins Prize. He is a member of the German National Academy of Sciences Leopoldina and a foreign member of the Royal Swedish Academy of Sciences. In 2013 he co-founded Body Labs Inc., which was acquired by Amazon in 2017.
Talk title: Learning digital humans for the Metaverse
Abstract: The Metaverse will require artificial humans that interact with real humans as well as with real and virtual 3D worlds. This requires a real-time understanding of humans and scenes as well as the generation of natural and appropriate behavior. We approach the problem of modeling such embodied human behavior through capture, modeling, and synthesis. First, we learn realistic and expressive 3D human avatars from 3D scans. We then train neural networks to estimate human pose and shape from images and video. Specifically, we focus on humans interacting with each other and with the 3D world. By capturing people in action, we are able to train neural networks to model and generate human movement and human-scene interaction. To validate our models, we synthesize virtual humans in novel 3D scenes. The goal is to produce realistic human avatars that interact with virtual worlds in ways that are indistinguishable from real humans.
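Abstract: Since the introduction of text models such as GPT-3 and BERT, pre-trained models have developed rapidly, and bimodal models that learn jointly from images and text have continued to emerge, showing a strong ability to learn different tasks automatically without supervision and to transfer quickly to data from different domains. However, current pre-trained models ignore audio information. Our surroundings are full of sound, and speech in particular is not only a means of communication between people but also carries emotion and feeling. This talk introduces “紫东太初” (Zidong Taichu), the first large image-text-audio tri-modal model to incorporate speech. The model maps the visual, text, and speech modalities into a unified semantic space through their respective encoders, then uses multi-head self-attention to learn semantic associations and feature alignment across modalities, forming a unified multimodal knowledge representation. It supports both cross-modal understanding and cross-modal generation while balancing these two cognitive abilities. We propose a unified multi-level, multi-task self-supervised learning framework that operates at the token level, the modality level, and the sample level. It provides a model foundation for a broader and more diverse range of downstream tasks and, in particular, enables generating audio from images and images from audio via a semantic network. The tri-modal large model is an important step toward general-purpose artificial intelligence with artistic creativity, strong interactive ability, and task generalization.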
Bio: Larry S. Davis is a Distinguished University Professor in the Department of Computer Science and director of the Center for Automation Research (CfAR). His research focuses on object and action recognition, scene analysis, event modeling and recognition, image and video databases, tracking, human movement modeling, 3-D human motion capture, and camera networks. Davis is also affiliated with the Computer Vision Laboratory in CfAR. He served as chair of the Department of Computer Science from 1999 to 2012. He received his doctorate from the University of Maryland in 1976. He was named an IAPR Fellow, an IEEE Fellow, and an ACM Fellow.
Talk title: Compression Compatible Deep Learning
Abstract: The deep learning revolution incited by the 2012 AlexNet paper has been transformative for the field of computer vision. Many problems that were severely limited under classical solutions are now seeing unprecedented success. The rapid proliferation of deep learning methods has led to a sharp increase in their use in consumer and embedded applications. One consequence of consumer and embedded applications is the need for lossy multimedia compression, which is required for efficient storage and transmission of data in these real-world scenarios. As such, there has been increased interest in a deep learning solution for multimedia compression that would allow for higher compression ratios and increased visual quality. The deep learning approach to multimedia compression, so-called Learned Multimedia Compression, involves computing a compressed representation of an image or video using a deep network for the encoder and the decoder. While these techniques have enjoyed impressive academic success, their industry adoption has been essentially non-existent. Classical compression techniques like JPEG and MPEG are too entrenched in modern computing to be easily replaced. In this talk we take an orthogonal approach and leverage deep learning to improve the compression fidelity of these classical algorithms. This allows the advances in deep learning to be used for multimedia compression in the context of classical methods. We first describe the underlying theory that unifies the mathematical models of compression and deep learning, allowing deep networks to operate on compressed data directly. We then discuss how deep learning can be used to correct information loss in JPEG compression in the most general setting. This allows images to be encoded at high compression ratios while still maintaining visual fidelity. Lastly, we describe how deep-learning-based inference tasks, such as classification, detection, and segmentation, behave in the presence of classical compression and how to mitigate performance loss. This allows images to be compressed further without accuracy loss on downstream learning tasks.
Bio: Yoichi Sato is a professor at the Institute of Industrial Science, the University of Tokyo. He received his B.S. degree from the University of Tokyo in 1990, and his M.S. and Ph.D. degrees in robotics from the School of Computer Science, Carnegie Mellon University, in 1993 and 1997. His research interests include first-person vision, gaze sensing and analysis, physics-based vision, and reflectance analysis. He has served or is serving in several journal editorial and conference organization roles, including for IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, and Computer Vision and Image Understanding, and as CVPR 2023 General Co-Chair, ICCV 2021 Program Co-Chair, ACCV 2018 General Co-Chair, ACCV 2016 Program Co-Chair, and ECCV 2012 Program Co-Chair.
Talk title: Understanding Human Activities from First-Person Perspectives
Abstract: Wearable cameras have become widely available as off-the-shelf products. First-person videos captured by wearable cameras provide close-up views of fine-grained human behavior, such as interaction with objects using hands, interaction with people, and interaction with the environment. First-person videos also provide an important clue to the intention of the person wearing the camera, such as what they are trying to do or what they are attending to. These advantages are unique to first-person videos and set them apart from videos captured by fixed cameras such as surveillance cameras. As a result, there has been increasing interest in developing computer vision methods that take first-person videos as input. On the other hand, first-person videos pose a major challenge to computer vision due to multiple factors, such as continuous and often violent camera movements, a limited field of view, and rapid illumination changes. In this talk, I will describe our attempts to develop first-person vision methods for different tasks, including action recognition, future person localization, and gaze estimation.