Introduction
What is Multimodal?

Multimodal Communicative Behaviors

Modality
The way in which something happens or is experienced
- Modality: a particular type of information and/or the representation format in which information is stored
- Sensory modality: one of the primary forms of sensation, such as vision or touch; a channel of communication
Multiple Communities and Modalities

A Historical View
Prior Research on “Multimodal”

Core Technical Challenges
Core Challenge 1: Representation

Early Examples


Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014
Representation
- Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.

- Joint representations
- Coordinated representations
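A minimal sketch (PyTorch, made-up feature sizes) contrasting the two families: a joint representation projects the concatenated modalities into a single vector, while a coordinated representation keeps separate encoders tied together by a similarity loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_img, d_txt, d_shared = 2048, 300, 512              # hypothetical feature sizes

# Joint representation: fuse by concatenation, then project to one shared vector.
joint_encoder = nn.Sequential(nn.Linear(d_img + d_txt, d_shared), nn.ReLU())

# Coordinated representation: one encoder per modality, tied by a similarity constraint.
img_encoder = nn.Linear(d_img, d_shared)
txt_encoder = nn.Linear(d_txt, d_shared)

img, txt = torch.randn(8, d_img), torch.randn(8, d_txt)     # a toy batch of paired samples

z_joint = joint_encoder(torch.cat([img, txt], dim=-1))       # single multimodal vector

z_img, z_txt = img_encoder(img), txt_encoder(txt)            # two coordinated vectors
coord_loss = 1 - F.cosine_similarity(z_img, z_txt).mean()    # pull paired embeddings together
```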
Core Challenge 2: Alignment
- Definition: Identify the direct relations between (sub)elements from two or more different modalities.

Explicit Alignment

Implicit Alignment
Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Core Challenge 3 – Translation
Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013

- Definition: Process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective.

Ahuja, C., & Morency, L. P. (2019). Language2Pose: Natural Language Grounded Pose Forecasting. Proceedings of 3DV Conference

Core Challenge 4: Fusion
- Definition: To join information from two or more modalities to perform a prediction task.


Core Challenge 5: Co-Learning
- Definition: Transfer knowledge between modalities, including their representations and predictive models.


Taxonomy of Multimodal Research
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy

Real world tasks tackled by MMML

Multimodal Research Tasks
Multimodal Research Tasks


Affective Computing
Common Topics in Affective Computing
Affective states – emotions, moods, and feelings
Cognitive states – thinking and information processing
Personality – patterns of acting, feeling, and thinking
Pathology – health, functioning, and disorders
Social processes – groups, cultures, and perception





AVEC 2011 – The First International Audio/Visual Emotion Challenge, B. Schuller et al., 2011
AVEC 2013 – The Continuous Audio/Visual Emotion and Depression Recognition Challenge, Valstar et al. 2013
Introducing the RECOLA Multimodal Corpus of Remote Collaborative and Affective Interactions, F. Ringeval et al., 2013
Multimodal Sentiment Analysis

Multi-Party Emotion Recognition

What are the Core Challenges Most Involved in Affect Recognition?

Project Example: Select-Additive Learning


Project Example: Word-Level Gated Fusion
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, Louis-Philippe Morency, Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning, ICMI 2017, https://arxiv.org/abs/1802.00924


Media Description
Given media (an image, video, or audio-visual clip), provide a free-form text description.

Large-Scale Image Captioning Dataset
- Microsoft Common Objects in COntext (MS COCO)
- 120,000 images
- Each image is accompanied by five free-form sentences describing it (at least 8 words)
- Sentences collected using crowdsourcing (Mechanical Turk)
- Also contains object detections, boundaries and keypoints
Evaluating Image Caption Generation
- Has an evaluation server
- Training and validation – 80K images (400K captions)
- Testing – 40K images (380K captions); a subset contains more captions for better evaluation, and these are kept private (to avoid over-fitting and cheating)
- Evaluation is difficult as there is no one “correct” answer for describing an image in a sentence
- Given a candidate sentence it is evaluated against a set of “ground truth” sentences
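A minimal sketch of reference-based scoring using sentence-level BLEU from NLTK (the COCO evaluation server also reports METEOR, ROUGE-L and CIDEr, not shown here); the captions are made up.

```python
# Toy example: score one candidate caption against multiple reference captions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is riding a horse on the beach".split(),
    "a person rides a brown horse along the shore".split(),
]
candidate = "a man rides a horse near the ocean".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```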
Video captioning

Video Description and Alignment
Charade Dataset: http://allenai.org/plato/charades/

How to Address the Challenge of Evaluation?

Large-Scale Description and Grounding Dataset

Multimodal QA

Multimodal QA dataset 1 – VQA (C1)

VQA 2.0

Multimodal QA – other VQA datasets

Multimodal QA – other VQA datasets (C7)
TVQA

Multimodal QA –Visual Reasoning (C8)
VCR:Visual Commonsense Reasoning

Social-IQ (A10)
Social-IQ

Project Example: Adversarial Attacks on VQA models


Multimodal Navigation
- Embedded Assistive Agents
- Language, Vision and Actions
- Many Technical Challenges
Navigating in a Virtual House

Multiple Step Instructions

Language meets Games

Project Example: Instruction Following


Project Example: Multiagent Trajectory Forecasting


Project Examples, Advice and Support
Latest List of Multimodal Datasets



Some Advice About Multimodal Research
- Think more about the research problems, and less about the datasets themselves
- Aim for generalizable models across several datasets
- Aim for models inspired by existing research, e.g. psychology
- Some areas to consider beyond performance:
- Robustness to missing/noisy modalities and adversarial attacks
- Studying social biases and creating fairer models
- Interpretable models
- Faster models for training/storage/inference
- Theoretical projects are welcome too – make sure there are also experiments to validate the theory
Some Advice About Multimodal Datasets
- If you are used to dealing with text or speech
- Space will become an issue working with image/video data
- Some datasets are in 100s of GB (compressed)
- Memory for processing it will become an issue as well
- Won’t be able to store it all in memory
- Time to extract features and train algorithms will also become an issue
- Plan accordingly!
- Sometimes tricky to experiment on a laptop (might need to do it on a subset of data)
Available Tools

List of Multimodal datasets
Affective Computing
Acted Facial Expressions in the Wild (part of the EmotiW Challenge)
AVEC challenge datasets
The Interactive Emotional Dyadic Motion Capture (IEMOCAP)
Persuasive Opinion Multimedia (POM)
Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos (MOSI)
CMU-MOSEI: Multimodal sentiment and emotion recognition
Tumblr Dataset: Sentiment and Emotion Analysis
AMHUSE Dataset: Multimodal Humor Sensing
Video Game Dataset: Multimodal Game Rating
DEAP
Continuous LIRIS-ACCEDE
Media description
MPII Movie Description dataset
Montréal Video Annotation dataset
Flickr30k Entities
Multimodal QA
VQA v2.0
Multimodal Navigation
Multimodal Dialog
Cooperative Vision-and-Dialog Navigation
Event detection
Title-based Video Summarization dataset
CrisisMMD
Multimodal Retrieval
Yahoo Flickr Creative Commons 100M
Other Multimodal Datasets
Basic Concepts – Neural Networks
Unimodal Basic Representations
Unimodal Representation – Visual Modality




Unimodal Representation – Language Modality



Unimodal Representation – Acoustic Modality


Other Unimodal Representations
Unimodal Representation – Sensors

Lee et al., Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. ICRA 2019

Unimodal Representation – Tables
Bao et al., Table-to-Text: Describing Table Region with Natural Language. AAAI 2018

Unimodal Representation – Graphs
Hamilton and Tang, Tutorial on Graph Representation Learning. AAAI 2019

Unimodal Representation – Sets
Zaheer et al., Deep Sets. NeurIPS 2017; Li et al., Point Cloud GAN. arXiv 2018

Machine Learning – Basic Concepts
Training, Testing and Dataset

Nearest Neighbor Classifier

Simple Classifier: Nearest Neighbor

Definition of K-Nearest Neighbor
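A minimal NumPy sketch of k-nearest-neighbor classification on toy 2-D points: Euclidean distance to all training samples, then a majority vote over the k closest.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]                     # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))   # -> 1
```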

Data-Driven Approach

Evaluation methods (for validation and testing)

Linear Classification: Scores and Loss
Learning model parameters
Neural Networks gradient
Gradient descent
Optimization – Practical Guidelines
CNNs and Visual Representations
Image Representations
Object-Based Visual Representation

Object Descriptors

Convolution Kernels

Object Descriptors

Facial expression analysis

Articulated Body Tracking: OpenPose

Convolutional Neural Networks
Convolutional Neural Networks

Translation Invariance

Learned vs Predefined Kernels

Convolution Math
Convolutional Neural Layer
Convolutional Neural Network
Example of CNN Architectures
Common architectures

VGGNet model

Other architectures

Residual Networks

Visualizing CNNs
Visualizing the Last CNN Layer: t-SNE

Deconvolution


CAM: Class Activation Mapping

Grad-CAM

Region-based CNNs
Object Detection (and Segmentation)

Selective Search

R-CNN

Trade-off Between Speed and Accuracy

Sequential Modeling with Convolutional Networks
3D CNN

Temporal Convolution Network (TCN)

Appendix: Tools for Automatic visual behavior analysis
OpenFace: an open source facial behavior analysis toolkit, T. Baltrušaitis et al., 2016
Image from Hachisu et al (2018). FaceLooks: A Smart Headband for Signaling Face-to-Face Behavior. Sensors.
Language Representations and RNNs
Word Representations
How to learn (word) features/representations?

Distance and similarity
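Distance/similarity between word vectors is usually measured with cosine similarity; a small sketch with made-up 3-D vectors standing in for real embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "word embeddings" (real ones are typically 100-300 dimensional).
vec = {
    "dog": np.array([0.8, 0.1, 0.1]),
    "puppy": np.array([0.7, 0.2, 0.1]),
    "car": np.array([0.1, 0.9, 0.3]),
}
print(cosine_similarity(vec["dog"], vec["puppy"]))   # high: related words
print(cosine_similarity(vec["dog"], vec["car"]))     # lower: unrelated words
```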

How to learn (word) features/representations?

How to use these word representations

Vector space models of words

Sentence Modeling
Sentence Modeling: Sequence Label Prediction

Sentence Modeling: Sequence Prediction

Sentence Modeling: Sequence Representation

Sentence Modeling: Language Model

Language Model Application: Language Generation

Language Model Application: Speech Recognition

Challenges in Sequence Modeling

Recurrent Neural Networks
Gated Recurrent Neural Networks
Syntax and Language Structure
Syntax and Language Structure


Dependency Grammar

Language Ambiguity

Recursive Neural Network
Socher et al., Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, EMNLP 2013
Stack LSTM
Dyer et al., Transition-Based Dependency Parsing with Stack Long Short-Term Memory, 2015

Multimodal Representations
Graph Representations
RECAP: Tree-based RNNs (or Recursive Neural Network)

Graphs (aka “Networks”)

Graphs – Supervised Task

Graphs – Unsupervised Task

Graph Neural Nets


Graph Neural Nets – Supervised Training

Graph Neural Nets – Neighborhood Aggregation

Kipf et al., 2017. Semi-supervised Classification with Graph Convolutional Networks. ICLR.
Li et al., 2016. Gated Graph Sequence Neural Networks. ICLR.
Duvenaud et al., 2015. Convolutional Networks on Graphs for Learning Molecular Fingerprints. NIPS.
Multimodal representations
Multimodal representations

Unsupervised Joint representations
Unsupervised representation learning

Shallow multimodal representations

Autoencoders


Deep Multimodal autoencoders
Ngiam et al., Multimodal Deep Learning, 2011

Deep Multimodal Boltzmann machines
Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014

Supervised Joint representations
Multimodal Joint Representation

Multimodal Sentiment Analysis

Unimodal, Bimodal and Trimodal Interactions

Bilinear Pooling
Tenenbaum and Freeman, 2000

Multimodal Tensor Fusion Network (TFN)
Zadeh, Jones and Morency, EMNLP 2017


From Tensor Representation to Low-rank Fusion

① Decomposition of weight tensor W

② Decomposition of Z

③ Rearranging computation
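A minimal sketch of the steps above with made-up dimensions: tensor fusion takes the outer product of modality vectors augmented with a constant 1, and the low-rank rearrangement never materializes that tensor, instead factoring the weight tensor into per-modality factors (in the spirit of TFN/LMF, not a faithful reimplementation).

```python
import torch

torch.manual_seed(0)
d_a, d_v, d_l, d_out, rank = 16, 32, 64, 8, 4            # made-up sizes
h_a, h_v, h_l = torch.randn(d_a), torch.randn(d_v), torch.randn(d_l)

# Tensor fusion: append a constant 1 so unimodal and bimodal terms survive the outer product.
one = torch.ones(1)
z_a, z_v, z_l = torch.cat([h_a, one]), torch.cat([h_v, one]), torch.cat([h_l, one])
Z = torch.einsum('i,j,k->ijk', z_a, z_v, z_l)             # (d_a+1, d_v+1, d_l+1) fusion tensor

# Low-rank fusion: never build Z; decompose the weight tensor into per-modality factors.
W_a = torch.randn(rank, d_a + 1, d_out)
W_v = torch.randn(rank, d_v + 1, d_out)
W_l = torch.randn(rank, d_l + 1, d_out)
y = (torch.einsum('rid,i->rd', W_a, z_a) *
     torch.einsum('rid,i->rd', W_v, z_v) *
     torch.einsum('rid,i->rd', W_l, z_l)).sum(dim=0)      # (d_out,), same role as W applied to Z
```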

Multimodal LSTM
Multimodal Sequence Modeling – Early Fusion

Multi-View Long Short-Term Memory (MV-LSTM)

Multi-View Long Short-Term Memory
Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016

Topologies for Multi-View LSTM

Coordinated Multimodal Representations
Coordinated Multimodal Representations
Learn (unsupervised) two or more coordinated representations from multiple modalities. A loss function is defined to bring closer these multiple representations.
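A minimal sketch of such a coordinated space, assuming precomputed image and caption features: a max-margin ranking loss (in the spirit of DeViSE, with details simplified) pulls matched pairs above mismatched ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

img_enc = nn.Linear(2048, 300)     # hypothetical feature sizes
txt_enc = nn.Linear(300, 300)

def ranking_loss(img_feats, txt_feats, margin=0.2):
    """Hinge loss: matched (i, i) pairs should score higher than mismatched (i, j) pairs."""
    v = F.normalize(img_enc(img_feats), dim=-1)
    t = F.normalize(txt_enc(txt_feats), dim=-1)
    scores = v @ t.t()                             # cosine similarity matrix over the batch
    pos = scores.diag().unsqueeze(1)               # similarity of the true pairs
    cost = (margin + scores - pos).clamp(min=0)    # margin violations for every negative
    cost.fill_diagonal_(0)                         # do not penalize the positives themselves
    return cost.mean()

loss = ranking_loss(torch.randn(8, 2048), torch.randn(8, 300))
```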

Coordinated Multimodal Embeddings
Frome et al., DeViSE: A Deep Visual-Semantic Embedding Model, NIPS 2013


Structure-preserving Loss – Multimodal Embeddings
Wang et al., Learning Deep Structure-Preserving Image-Text Embeddings, CVPR 2016

Coordinated Representations
Quick Recap




Coordinated Multimodal Representations
Learn (unsupervised) two or more coordinated representations from multiple modalities. A loss function is defined to bring closer these multiple representations.
Structured coordinated embeddings
Vendrov et al., Order-Embeddings of Images and Language, 2016

Multivariate Statistical Analysis

Random Variables

Definitions





Principal component analysis

Eigenvalues and Eigenvectors

Singular Value Decomposition (SVD)

Canonical Correlation Analysis
Canonical Correlation Analysis

Correlated Projection








Exploring Deep Correlation Networks
Deep Canonical Correlation Analysis
Andrew et al., ICML 2013


Deep Canonically Correlated Autoencoders (DCCAE)
Wang et al., ICML 2015

Deep Correlational Neural Network
Chandar et al., Neural Computation, 2015

Multi-View Clustering
Data Clustering

“Soft” Clustering: Nonnegative Matrix Factorization

Semi-NMF and Other Extensions
Ding et al., TPAMI 2015

Trigeorgis et al., TPAMI 2015
Principles of Multi-View Clustering
Yan Yang and Hao Wang, Multi-view Clustering: A Survey, Big data mining and analytics, Volume 1, Number 2, June 2018

Multi-view subspace clustering
Definition: learns a unified feature representation from all the view subspaces by assuming that all views share this representation

Deep Matrix Factorization
Li and Tang, MMML 2015

Other Multi-View Clustering Approaches
Yan Yang and Hao Wang, Multi-view Clustering: A Survey, Big data mining and analytics, Volume 1, Number 2, June 2018


Auto-Encoder in Auto-Encoder Network
Deep Canonically Correlated Autoencoders (DCCAE)


Multi-view Latent “Intact” Space

Multimodal alignment
Multimodal alignment
Explicit multimodal-alignment
Explicit alignment – the goal is to find correspondences between modalities
▪ Aligning speech signal to a transcript
▪ Aligning two out-of-sync sequences
▪ Co-referring expressions
Implicit multimodal-alignment
Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
▪ Machine Translation
▪ Cross-modal retrieval
▪ Image & Video Captioning
▪ Visual Question Answering
Explicit alignment
Let’s start unimodal – Dynamic Time Warping
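A minimal dynamic-programming DTW sketch for two 1-D sequences (squared-difference cost, no slope or band constraints), just to make the recursion concrete.

```python
import numpy as np

def dtw(x, y):
    """Return the minimum cumulative alignment cost between sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Each step may match both frames, or skip a frame in x or in y.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))   # 0.0: same shape, different speed
```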

Dynamic Time Warping continued


DTW alternative formulation


Canonical Correlation Analysis reminder


Canonical Time Warping
Canonical Time Warping for Alignment of Human Behavior, Zhou and de la Torre, 2009

Optimized by coordinate descent – fix one set of parameters, optimize the other
Canonical Time Warping for Alignment of Human Behavior, Zhou and de la Torre, 2009, NIPS

Generalized Canonical Time Warping, Zhou and de la Torre, 2016, TPAMI

Deep Canonical Time Warping
Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR


Implicit alignment
Attention models
Recent attention models can be roughly split into three major categories
Soft attention
Acts like a gate function. Deterministic inference.
Transform network
Warps the input to better align with a canonical view.
Hard attention
Includes stochastic processes. Related to reinforcement learning.
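A minimal sketch of the soft-attention case: scores over a set of candidate vectors become a softmax distribution and the output is their weighted sum, so everything stays differentiable (hard attention would instead sample one location and typically needs reinforcement-learning-style training).

```python
import torch
import torch.nn.functional as F

query = torch.randn(64)             # e.g., a decoder state
values = torch.randn(10, 64)        # e.g., 10 image-region or word features

scores = values @ query             # one relevance score per candidate
alphas = F.softmax(scores, dim=0)   # soft attention weights, sum to 1
context = alphas @ values           # differentiable weighted sum (shape: 64)
```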
Soft attention
Machine Translation
Given a sentence in one language, translate it into another
Not exactly a multimodal task – but a good start! Each language can almost be seen as a modality.
Machine Translation with RNNs

Decoder – attention model
Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015


How do we encode attention?

MT with attention

Visual captioning with soft attention
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015

Looking at more fine-grained features

- Allows latent data alignment
- Lets us inspect what the network "sees"
- Can be optimized with backpropagation
Spatial Transformer networks



Glimpse Network (Hard Attention)
Hard attention
Soft attention requires computing a representation for the whole image or sentence
Hard attention, on the other hand, forces the model to look at only one part
The main motivation was reduced computational cost rather than improved accuracy (although that happens a bit as well)
A saccade followed by a glimpse – how the human visual system works
Recurrent Models of Visual Attention, Mnih, 2014; Multiple Object Recognition with Visual Attention, Ba, 2015
Hard attention examples

Glimpse Sensor
Recurrent Models of Visual Attention, Mnih, 2014


Overall Architecture - Emission network

Recurrent model of Visual Attention (RAM)

Multi-modal alignment recap
Explicit alignment – aligns two or more modalities (or views) as an actual task. The goal is to find correspondences between modalities
- Dynamic Time Warping
- Canonical Time Warping
- Deep Canonical Time Warping
Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
- Attention models
- Soft attention
- Spatial transformer networks
- Hard attention
Alignment and Representations
Contextualized Sequence Encoding
Sequence Encoding - Contextualization



Self-Attention

Transformer Multi-Head Self-Attention
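A minimal sketch of scaled dot-product multi-head self-attention with toy dimensions; real Transformer blocks add an output projection, masking, residual connections and layer normalization, which are omitted here.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads = 64, 4
x = torch.randn(1, 5, d_model)                    # (batch, sequence length, model dim)
W_q, W_k, W_v = (nn.Linear(d_model, d_model) for _ in range(3))

Q, K, V = W_q(x), W_k(x), W_v(x)
# Split into heads: (batch, heads, seq, d_model / heads).
Q, K, V = (t.view(1, 5, n_heads, -1).transpose(1, 2) for t in (Q, K, V))

# Every position attends to every position, scaled by sqrt of the per-head dimension.
attn = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_model // n_heads), dim=-1)
out = (attn @ V).transpose(1, 2).reshape(1, 5, d_model)   # concatenate heads back together
```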

Position embeddings


Sequence-to-Sequence Using Transformer
Seq2Seq with Transformer Attentions


Contextualized Multimodal Embedding
Multimodal Embeddings

Multimodal Transformer
Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019

Cross-Modal Transformer

Language Pre-training
Token-level and Sentence-level Embeddings

Pre-Training and Fine-Tuning

BERT: Bidirectional Encoder Representations from Transformers


Pre-training BERT Model


Three Embeddings: Token + Position + Sentence

Fine-Tuning BERT





Multimodal Pre-training
VL-BERT

M-BERT

Alignment and Translation
Alignment for Speech Recognition
Architecture of Speech Recognition
slazebni.cs.illinois.edu/spring17/lec26_audio.pdf

Option 1: Sequence-to-Sequence (Seq2Seq)

Option 2: Seq2Seq with Attention

Option 3: Sequence Labeling with RNN

Speech Alignment
Connectionist Temporal Classification (CTC)
Amodei, Dario, et al. “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin.” (2015)
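A minimal training-loss sketch using PyTorch's built-in CTC loss with placeholder network outputs; shapes follow the torch.nn.CTCLoss convention (time, batch, classes), with the blank symbol at index 0.

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 30            # input frames, batch size, number of labels (incl. blank at index 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)       # per-frame label distributions (placeholder)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)   # target transcripts, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)      # marginalizes over all valid alignments of targets to frames
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```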




CTC Optimization

Visualizing CTC Predictions

Multi-View Video Alignment
Temporal Alignment using Neural Representations

Temporal Cycle-Consistency Learning



Multimodal Translation Visual Question Answering (VQA)
VQA and Attention

Co-attention
Lu et al., Hierarchical Question-Image Co-Attention for Visual Question Answering, NIPS 2016


Hierarchical Co-attention

Stacked Attentions
Yang et al., Stacked Attention Networks for Image Question Answering, CVPR 2016
VQA: Neural Module Networks
Neural Module Network
Andreas et al., Deep Compositional Question Answering with Neural Module Networks, 2016

Predefined Set of Modules

Johnson et al., CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017
End-to-End Neural Module Network
Hu et al., Learning to Reason: End-to-End Module Networks for Visual Question Answering, 2017


VQA: Neural Symbolic Networks
Neural-symbolic VQA
Kexin Yi, et al. “Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding.” NeurIPS 2018

The Neuro-symbolic Concept Learner
Jiayuan Mao, et al. “The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision.” ICLR 2019


Speech-Vision Translation: Applications
Translation 1: Visually indicated sounds
Owens et al. Visually indicated sounds, CVPR, 2016

Translation 2: The Sound of Pixels
Zhao, Hang, et al. “The sound of pixels.”, ECCV 2018


Speech2face

Generative Models
Probabilistic Graphical Models
Definition: A probabilistic graphical model (PGM) is a graph formalism for compactly modeling joint probability distributions and dependence structures over a set of random variables
Inference for Known Joint Probability Distribution


Creating a Graphical Model
Example: Inferring Emotion from Interaction Logs

Example: Bayesian Network Representation

Example: Bayesian Network Approach

Example: Dynamic Bayesian Network Approach


Bayesian Networks
Definition: A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions
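A toy two-node example (a hypothetical Rain → WetGrass network) showing how this compact specification works: the joint distribution is the product of each node's conditional distribution given its parents, and small queries can be answered by enumeration.

```python
# Toy two-node Bayesian network: Rain -> WetGrass.
# The joint factorizes over each node's parents: P(R, W) = P(R) * P(W | R).
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {True: {True: 0.9, False: 0.1},    # P(WetGrass | Rain=True)
                    False: {True: 0.1, False: 0.9}}   # P(WetGrass | Rain=False)

def joint(rain, wet):
    return P_rain[rain] * P_wet_given_rain[rain][wet]

# Inference by enumeration: P(Rain=True | WetGrass=True).
num = joint(True, True)
den = joint(True, True) + joint(False, True)
print(num / den)    # 0.18 / (0.18 + 0.08) ≈ 0.692
```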

Bayesian Network (BN)

Joint Probability in Graphical Models


Conditional Probability Distribution (CPD)

Generative Model: Naïve Bayes Classifier

Dynamic Bayesian Network
- Extends Bayesian networks to represent sequential dependencies.
- Dynamically changing or evolving over time.
- Directed graphical model of stochastic processes.
- Especially aimed at time-series modeling.

Hidden Markov Models

Factorial HMM

The Boltzmann Zipper

The Coupled HMM

Generating Data Using Neural Networks
Variational Autoencoder

Generative Adversarial Network (GAN)

GAN Training


Example: Audio to Scene
Audio to Scene Samples - wjohn1483.github.io

Example: Talking Head
https://arxiv.org/pdf/1905.08233.pdf

Bidirectional GAN

cAE-GAN


Cycle GAN

BiCycle GAN

Discriminative Graphical Models
Quick Recap
Fusion – Probabilistic Graphical Models

Restricted Boltzmann Machines
Deep Multimodal Boltzmann machines
Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014


Restricted Boltzmann Machine (RBM)
Smolensky, Information Processing in Dynamical Systems: Foundations of Harmony Theory, 1986
Undirected Graphical Model
- A generative rather than discriminative model
- Connections from every hidden unit to every visible one
- No connections within a layer (hence “Restricted”), which makes it easier to train and run inference
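A minimal sketch of the RBM conditionals and one block Gibbs step, using random placeholder weights; no contrastive-divergence training is shown.

```python
import torch

n_visible, n_hidden = 6, 4
W = torch.randn(n_visible, n_hidden) * 0.1            # visible-hidden weights (no intra-layer links)
b_v, b_h = torch.zeros(n_visible), torch.zeros(n_hidden)

v = torch.bernoulli(torch.full((n_visible,), 0.5))    # a random binary visible vector

# Because the graph is bipartite, hidden units are conditionally independent given v (and vice versa).
p_h = torch.sigmoid(v @ W + b_h)                       # P(h_j = 1 | v)
h = torch.bernoulli(p_h)
p_v = torch.sigmoid(h @ W.t() + b_v)                   # P(v_i = 1 | h): one block Gibbs step back
```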



Markov Random Fields






Example: Markov Random Field – Graphical Model

Example: Markov Random Field – Factor Graph


Conditional Random Fields
Conditional Random Fields (Factor Graphs)


Conditional Random Fields (Log-linear Model)
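A toy log-linear chain CRF with two labels and three positions: a label sequence is scored with emission and transition features, and the normalizer Z(x) is computed here by brute-force enumeration (a real implementation would use the forward algorithm).

```python
import itertools
import numpy as np

# Tiny linear-chain CRF over 2 labels and a length-3 sequence.
# score(y, x) = sum_t emission[t, y_t] + sum_t transition[y_{t-1}, y_t]
emission = np.array([[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]])   # per-position unary scores (from features of x)
transition = np.array([[0.5, -0.2], [-0.3, 0.7]])           # pairwise label-transition scores

def score(y):
    s = sum(emission[t, y[t]] for t in range(3))
    s += sum(transition[y[t - 1], y[t]] for t in range(1, 3))
    return s

# p(y | x) = exp(score(y, x)) / Z(x); Z(x) sums over all label sequences.
Z = sum(np.exp(score(y)) for y in itertools.product([0, 1], repeat=3))
print(np.exp(score((0, 1, 0))) / Z)
```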

Learning Parameters of a CRF Model

CRFs for Shallow Parsing

Latent-Dynamic CRF



Hidden Conditional Random Field

Multi-view Latent Variable Discriminative Models

CRFs and Deep Learning
Conditional Neural Fields

Deep Conditional Neural Fields

CRF and Bilinear LSTM

CNN and CRF and Bilinear LSTM

Continuous and Fully-Connected CRFs
Continuous Conditional Neural Field


High-Order Continuous Conditional Neural Field

Fully-Connected Continuous Conditional Neural Field

Fully-Connected CRF

CNN and Fully-Connected CRF

Fully Connected Deep Structured Networks
Zheng et al., 2015; Schwing and Urtasun, 2015

Sigurdsson et al., Asynchronous Temporal Fields for Action Recognition, CVPR 2017
Soft-Label Chain CRF
Phrase Grounding by Soft-Label Chain CRF
Liu J, Hockenmaier J. “Phrase Grounding by Soft-Label Chain Conditional Random Field” EMNLP 2019



Fusion, co-learning and new trends
Quick Recap: Multimodal Fusion
Multimodal fusion


Fusion – Probabilistic Graphical Models

Model-free Fusion
Model-agnostic approaches – early fusion
- Easy to implement – just concatenate the features
- Exploit dependencies between features
- Can end up very high dimensional
- More difficult to use if features have different granularities

Model-agnostic approaches – late fusion
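A minimal sketch contrasting the two model-agnostic strategies with scikit-learn and random placeholder features: early fusion concatenates features into one classifier, while late fusion trains one classifier per modality and averages their predicted probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio, X_video = rng.normal(size=(100, 20)), rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)

# Early fusion: concatenate modality features, train a single model.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Late fusion: train one model per modality, then average their predicted probabilities.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
late_probs = (clf_a.predict_proba(X_audio) + clf_v.predict_proba(X_video)) / 2
late_pred = late_probs.argmax(axis=1)
```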

Late Fusion on Multi-Layer Unimodal Classifiers

Multimodal Fusion Architecture Search (MFAS)
Perez-Rua, Vielzeuf, Pateux, Baccouche, Jurie, MFAS: Multimodal Fusion Architecture Search, CVPR 2019
Proposed solution: Explore the search space with Sequential Model-Based Optimization
- Start with simpler models first (all L=1 models) and iteratively increase the complexity (L=2, L=3,…)
- Use a surrogate function to predict performance of unseen architectures
- e.g., the performance of all the L=1 models should give us an idea of how well the L=2 models will perform


Memory-Based Fusion
Zadeh et al., Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018

Local Fusion and Kernel Functions
What is a Kernel function?
A kernel function acts as a similarity metric between data points

Non-linearly separable data

Radial Basis Function Kernel (RBF)

Some other kernels
- Histogram Intersection Kernel: good for histogram features
- String kernels: specifically for text and sentence features
- Proximity distribution kernel
- (Spatial) pyramid matching kernel
Kernel CCA

Transformer’s Attention Function
Tsai et al., Transformer Dissection: An Unified Understanding for Transformer’s Attention via the Lens of Kernel, EMNLP 2019



Multiple Kernel Learning

MKL in Unimodal Case
- Pick a family of kernels and learn which kernels are important for the classification case
- For example a set of RBF and polynomial kernels
MKL in Multimodal/Multiview Case
- Pick a family of kernels for each modality and learn which kernels are important for the classification case
- Does not need to be different modalities, often we use different views of the same modality (HOG, SIFT, etc.)
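A minimal sketch of the multiple-kernel idea with scikit-learn: one RBF kernel per modality (placeholder data), combined with fixed weights and fed to a precomputed-kernel SVM; actual MKL would learn the kernel weights jointly with the classifier.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_audio, X_video = rng.normal(size=(60, 20)), rng.normal(size=(60, 50))
y = rng.integers(0, 2, size=60)

# One RBF kernel per modality (different bandwidths could be added to the family).
K_audio = rbf_kernel(X_audio, gamma=0.05)
K_video = rbf_kernel(X_video, gamma=0.01)

# Fixed convex combination for illustration; MKL would learn these weights.
beta = np.array([0.6, 0.4])
K = beta[0] * K_audio + beta[1] * K_video

clf = SVC(kernel="precomputed").fit(K, y)     # kernel SVM on the combined Gram matrix
```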

Co-Learning
Co-Learning - The 5th Multimodal Challenge
Definition: Transfer knowledge between modalities, including their representations and predictive models.

Co-learning Example with Paired Data
ViCo: Word Embeddings from Visual Co-occurrences

ViCo: Word Embeddings from Visual Co-occurrences

Co-Learning with Paired Data: Multimodal Cyclic Translation
Paul Pu Liang, Hai Pham, et al., “Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities”, AAAI 2019
Co-Learning Example with Weakly Paired Data
End-to-End Learning of Visual Representations from Uncurated Instructional Videos Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman – CVPR 2020

Weakly Paired Data

Multiple Instance Learning Noise Contrastive Estimation
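A minimal sketch of an MIL-NCE-style objective for weakly paired clips and captions, assuming precomputed embeddings: the whole bag of candidate positives shares the numerator of the contrastive softmax (simplified from Miech et al., CVPR 2020).

```python
import torch

def mil_nce_loss(video, pos_text, neg_text, temperature=0.07):
    """
    video:    (B, D)     one embedding per clip
    pos_text: (B, P, D)  P candidate (weakly aligned) captions per clip
    neg_text: (B, N, D)  N negative captions per clip
    """
    pos = torch.einsum('bd,bpd->bp', video, pos_text) / temperature
    neg = torch.einsum('bd,bnd->bn', video, neg_text) / temperature
    # Sum the probability mass over the whole bag of positives, not just the best one.
    num = torch.logsumexp(pos, dim=1)
    den = torch.logsumexp(torch.cat([pos, neg], dim=1), dim=1)
    return -(num - den).mean()

loss = mil_nce_loss(torch.randn(4, 128), torch.randn(4, 3, 128), torch.randn(4, 20, 128))
```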

Research Trend: Few-Shot Learning and Weakly Supervised
Few-Shot Learning in RL Environment
Hill et al., Grounded Language Learning Fast and Slow. arXiv 2020

Grounded Language Learning

Weakly-Supervised Phrase Grounding
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding, EMNLP 2020

Multimodal Alignment Framework


Research Trends in Multimodal ML
- Abstraction and logic
- Multimodal reasoning
- Towards causal inference
- Understanding multimodal models
- Commonsense and coherence
- Social impact - fairness and misinformation
- Emotional and engaging interactions
- Multi-lingual multimodal grounding
Abstraction and Logic
Learning by Abstraction: The Neural State Machine
Hudson, Drew, and Christopher D. Manning. “Learning by abstraction: The neural state machine.“ NeurIPS 2019





Learning by Abstraction: The Neural State Machine

VQA under the Lens of Logic
Gokhale, Tejas, et al. “VQA-LOL: Visual question answering under the lens of logic.“, ECCV 2020


Multimodal Reasoning
Cross-Modality Relevance for Reasoning on Language and Vision
Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020


Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog
Gan, Zhe, et al. “Multi-step reasoning via recurrent dual attention for visual dialog.“ ACL 2019
- Hypothesis: The failure of visual dialog is caused by the inherent weakness of single-step reasoning.
- Intuition: Humans take a first glimpse of an image and a dialog history, before revisiting specific parts of the image/text to understand the multimodal context.
- Proposal: Apply Multi-step reasoning to visual dialog by using a recurrent (aka multi-step) version of attention (aka reasoning). This is done on both text and questions (aka, dual).


Towards Causal Inference
Visual Dialogue Expressed with Causal Graph

Two Causal Principles for Improving Visual Dialog
Qi, Jiaxin, et al. “Two causal principles for improving visual dialog.“ CVPR 2020
This paper identifies two causal principles that are holding back VisDial models.
- Harmful shortcut bias between dialog history (H) and the answer (A)
- Unobserved confounder between H, Q and A leading to spurious correlations.
By identifying and addressing these principles in a model-agnostic manner, they are able to promote any VisDial model to SOTA levels.




Studying Biases in VQA Models
Agarwal, Vedika, Rakshith Shetty, and Mario Fritz. “Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing.”


Understanding Multimodal Models
Introspecting VQA Models with Sub-Questions
Selvaraju, Ramprasaath R., et al. “SQuINTing at VQA Models: Introspecting VQA Models With Sub-Questions.”, CVPR 2020

New Dataset

SQuINTing Model

Training Multimodal Networks


Commonsense and Coherence
Emotions are Often Context Dependent
“COSMIC: COmmonSense knowledge for eMotion Identification in Conversations”, Findings of EMNLP 2020

Commonsense and Emotion Recognition
Proposed approach (COSMIC):
For each utterance, try to infer
- speaker’s intention
- effect on the speaker/listener
- reaction of the speaker/listener

Proposed Model (COSMIC)



Coherence and Commonsense
Coherence relations provide information about how the content of discourse units relate to one another.
They have been used to predict commonsense inference in text.
Cross-modal Coherence Modeling for Caption Generation
Cross-modal Coherence Modeling for Caption Generation ACL 2020

Social Impact – Fairness and Misinformation
Fair Representation Learning
Pena et al., Bias in Multimodal AI: A Testbed for Fair Automatic Recruitment. ICMI 2020

Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News, EMNLP 2020

Emotional and Engaging Interactions
Dialogue Act Classification (DAC)
“Towards Emotion-aided Multi-modal Dialogue Act Classification”, ACL 2020

Image-Chat: Engaging Grounded Conversations
Shuster, Kurt, et al. “Image-chat: Engaging grounded conversations.“ ACL 2020

Multi-Lingual Multimodal Grounding
Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge – EMNLP 2020

Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data
Cogswell, Michael, et al. “Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data.”

Connecting Language to Actions
What does interaction mean?

Sequential and Online Modeling

Planning

V+L -> A

First Major Question: Alignment
Ma et al, “Self-Monitoring Navigation Agent via Auxiliary Progress Estimation” ICLR 2019

Alignment

Lots of Data
Ku et al. Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding — EMNLP 2020

What if you make a mistake?
Ke 2019, Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation - CVPR 2019
Why does this question matter?
Because in general, we can’t supervise everything
Seven High-level Tasks

Data collection

End-to-End Models

A Shared Semantic Space
Paxton et al. Prospection: Interpretable Plans From Language By Predicting the Future ICRA 2019

Predicting the Future

Objectives

Embodiment
- Choose your own adventure — Lots of noise
- What does it mean to succeed?
- Where do concepts come from?
- What’s the role of exploration?
- Language is woefully underspecified
Multimodal Human-inspired Language Learning
Approach

Workflow

Proposed Model: Vision-language Pre-training

Recognition Model using pre-trained VLP

Grounded Multimodal Learning
Children learn in a multimodal environment. We investigate human-like learning in the following perspectives:
● Association of new information to previous (past) knowledge
● Generalization of learned knowledge to unseen (future) concepts
○ Zero-shot compositionality of the learned concepts
■ Blue + Dog -> Blue dog !?
Model


New work in progress: Video-Text Coref




Two Approaches to Latent Structure Learning

Latent Tree Formalism: Lexicalized PCFG

Probabilistic Model of Lexicalized PCFG

Latent Tree Learning: Baselines
Baselines:
○ DMV (Klein and Manning, 2004): generative model of dependency structures.
○ Compound PCFG (Kim et al., 2019): neural model to parameterize probabilistic context-free grammar using sentence-by-sentence parameters and variational training.
○ Compound PCFG w/ right-headed rule: takes the predictions of the Compound PCFG and chooses the head of the right child as the head of the parent.
○ ON-LSTM (Shen et al., 2019) and PRPN (Shen et al., 2018): two unsupervised constituency parsing models
○ VGNSL (Shi et al., 2019): unsupervised constituency parsing model with image information
Latent Template Learning: Concept

Latent Template Learning: Generative Model

Learning of the Latent Template Model



Generation Conditioned on Templates

Automatic Speech Recognition
EESEN: https://github.com/srvk/eesen
- Pre-trained models from more established speech recognition corpora (LibriSpeech and Switchboard in our case)
- 3 models: ESPnet, EESEN-WFST, and EESEN-RNNLM decoding, trained with CTC loss
- Major Challenges:
- fully annotated transcriptions are not available for evaluation
- much noisier than the pre-training datasets
- multiple speakers present
ESPnet architecture (Watanabe et al., 2018)

ASR: Seedling Dataset Samples(ESPnet vs EESEN)
ESPnet: Hey, do you want to play anything or read a book or anything a book? Okay, which book which book you want to read? The watch one little baby who is born far away. And another who is born on the very next day. And both of these babies as everyone knows.
Turn the Page. Had Ten Little Fingers ten fingers and ten little toes. There was only there was one little baby who is born in a town and another who is wrapped in either down. And both of these babies as everyone knows add ten little fingers and ten.
Have you any water recently? Get some water, please. Get some water please some water. Yeah water is delicious. Why don’t you have some? Give me some water, please.
There was one little baby who is born in the house and another who Snuffer suffered from sneezes and chills. And both of these babies with everyone knows. at ten little fingers and ten little toes just like
Object Detection: Methodology

Multimodal association in SeedlingS Corpus

Coherence and Grounding in Multimodal Communication
Commonsense and Coherence
Architecture

Grounding





