Introduction
What is Multimodal?
Multimodal Communicative Behaviors
Modality
The way in which something happens or is experienced
- Modality: a particular type of information and/or the representation format in which information is stored
- Sensory modality: one of the primary forms of sensation, such as vision or touch; a channel of communication
Multiple Communities and Modalities
A Historical View
Prior Research on “Multimodal”
Core Technical Challenges
Core Challenge 1: Representation
Early Examples
Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014
Representation
- Definition: Learning how to represent and summarize multimodal data in a way that exploits their complementarity and redundancy.
- Joint representations
- Coordinated representations (a minimal sketch contrasting the two follows below)
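To make the distinction concrete, here is a minimal PyTorch sketch (toy dimensions and module names are illustrative, not from the tutorial): a joint representation fuses modalities into a single vector, while a coordinated representation keeps separate embeddings tied together by a similarity loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy unimodal features (e.g., from a CNN and a text encoder); names are illustrative.
visual = torch.randn(8, 128)    # batch of 8 visual feature vectors
language = torch.randn(8, 300)  # batch of 8 language feature vectors

# Joint representation: project both modalities into one fused vector.
class JointRepresentation(nn.Module):
    def __init__(self, dv=128, dl=300, dz=64):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(dv + dl, dz), nn.ReLU())
    def forward(self, v, l):
        return self.fuse(torch.cat([v, l], dim=-1))  # one shared embedding

# Coordinated representation: keep separate embeddings, tie them with a similarity loss.
class CoordinatedRepresentation(nn.Module):
    def __init__(self, dv=128, dl=300, dz=64):
        super().__init__()
        self.f_v = nn.Linear(dv, dz)
        self.f_l = nn.Linear(dl, dz)
    def forward(self, v, l):
        zv = F.normalize(self.f_v(v), dim=-1)
        zl = F.normalize(self.f_l(l), dim=-1)
        # Bring paired embeddings close (cosine similarity -> 1); a ranking loss is also common.
        loss = (1 - (zv * zl).sum(dim=-1)).mean()
        return zv, zl, loss

z_joint = JointRepresentation()(visual, language)
zv, zl, coord_loss = CoordinatedRepresentation()(visual, language)
print(z_joint.shape, zv.shape, coord_loss.item())
```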
Core Challenge 2: Alignment
- Definition: Identify the direct relations between (sub)elements from two or more different modalities.
Explicit Alignment
Implicit Alignment
Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
Core Challenge 3 – Translation
Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013
- Definition: Process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective.
Ahuja, C., & Morency, L. P. (2019). Language2Pose: Natural Language Grounded Pose Forecasting. Proceedings of 3DV Conference
Core Challenge 4: Fusion
- Definition: To join information from two or more modalities to perform a prediction task.
Core Challenge 5: Co-Learning
- Definition: Transfer knowledge between modalities, including their representations and predictive models.
Taxonomy of Multimodal Research
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy
Real world tasks tackled by MMML
Multimodal Research Tasks
Multimodal Research Tasks
Affective Computing
Common Topics in Affective Computing
Affective states – emotions, moods, and feelings
Cognitive states – thinking and information processing
Personality – patterns of acting, feeling, and thinking
Pathology – health, functioning, and disorders
Social processes – groups, cultures, and perception
AVEC 2011 – The First International Audio/Visual Emotion Challenge, B. Schuller et al., 2011
AVEC 2013 – The Continuous Audio/Visual Emotion and Depression Recognition Challenge, Valstar et al. 2013
Introducing the RECOLA Multimodal Corpus of Remote Collaborative and Affective Interactions, F. Ringeval et al., 2013
Multimodal Sentiment Analysis
Multi-Party Emotion Recognition
What are the Core Challenges Most Involved in Affect Recognition?
Project Example: Select-Additive Learning
Project Example: Word-Level Gated Fusion
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, Louis-Philippe Morency, Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning, ICMI 2017, https://arxiv.org/abs/1802.00924
Media Description
Given media (an image, video, or audio-visual clip), provide a free-form text description.
Large-Scale Image Captioning Dataset
- Microsoft Common Objects in COntext (MS COCO)
- 120,000 images
- Each image is accompanied by five free-form sentences describing it (at least 8 words each)
- Sentences collected using crowdsourcing (Mechanical Turk)
- Also contains object detections, boundaries and keypoints
Evaluating Image Caption Generations
- Has an evaluation server
- Training and validation – 80K images (400K captions)
- Testing – 40K images (380K captions); a subset contains more captions for better evaluation, and these are kept private (to avoid over-fitting and cheating)
- Evaluation is difficult as there is no one “correct” answer for describing an image in a sentence
- Given a candidate sentence, it is evaluated against a set of “ground truth” sentences (a minimal scoring sketch follows below)
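As a rough illustration of scoring a candidate caption against a set of reference captions, here is a hedged sketch using NLTK's sentence-level BLEU; the captions, and the choice of BLEU over the evaluation server's full metric suite (METEOR, ROUGE-L, CIDEr), are illustrative assumptions.

```python
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Multiple "ground truth" captions for one image (illustrative strings).
references = [
    "a man riding a wave on a surfboard".split(),
    "a surfer rides a large wave in the ocean".split(),
    "a person on a surfboard riding an ocean wave".split(),
]
candidate = "a man is surfing a big wave".split()

# Score the candidate against the whole reference set; smoothing avoids zero scores
# when higher-order n-grams never match.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU against {len(references)} references: {score:.3f}")
```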
Video Captioning
Video Description and Alignment
Charade Dataset: http://allenai.org/plato/charades/
How to Address the Challenge of Evaluation?
Large-Scale Description and Grounding Dataset
Multimodal QA
Multimodal QA dataset 1 – VQA (C1)
VQA 2.0
Multimodal QA – other VQA datasets
Multimodal QA – other VQA datasets (C7)
TVQA
Multimodal QA – Visual Reasoning (C8)
VCR: Visual Commonsense Reasoning
Social-IQ (A10)
Social-IQ
Project Example: Adversarial Attacks on VQA models
Multimodal Navigation
- Embedded Assistive Agents
- Language, Vision and Actions
- Many Technical Challenges
Navigating in a Virtual House
Multiple Step Instructions
Language meets Games
Project Example: Instruction Following
Project Example: Multiagent Trajectory Forecasting
Project Examples, Advice and Support
Latest List of Multimodal Datasets
Some Advice About Multimodal Research
- Think more about the research problems, and less about the datasets themselves
- Aim for generalizable models across several datasets
- Aim for models inspired by existing research, e.g. psychology
- Some areas to consider beyond performance:
- Robustness to missing/noisy modalities, adversarial attacks
- Studying social biases and creating fairer models
- Interpretable models
- Faster models for training/storage/inference
- Theoretical projects are welcome too – make sure there are also experiments to validate theory
Some Advice About Multimodal Datasets
- If you are used to dealing with text or speech
- Space will become an issue working with image/video data
- Some datasets are in 100s of GB (compressed)
- Memory for processing it will become an issue as well
- Won’t be able to store it all in memory
- Time to extract features and train algorithms will also become an issue
- Plan accordingly!
- Sometimes tricky to experiment on a laptop (might need to do it on a subset of data)
Available Tools
List of Multimodal datasets
Affective Computing
Acted Facial Expressions in the Wild (part of the EmotiW Challenge)
AVEC challenge datasets
The Interactive Emotional Dyadic Motion Capture (IEMOCAP)
Persuasive Opinion Multimedia (POM)
Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos (MOSI)
CMU-MOSEI: Multimodal sentiment and emotion recognition
Tumblr Dataset: Sentiment and Emotion Analysis
AMHUSE Dataset: Multimodal Humor Sensing
Video Game Dataset: Multimodal Game Rating
DEAP
Continuous LIRIS-ACCEDE
Media description
MPII Movie Description dataset
Montréal Video Annotation dataset
Flickr30k Entities
Multimodal QA
VQA v2.0
Multimodal Navigation
Multimodal Dialog
Cooperative Vision-and-Dialog Navigation
Event detection
Title-based Video Summarization dataset
CrisisMMD
Multimodal Retrieval
Yahoo Flickr Creative Commons 100M
Other Multimodal Datasets
Basic Concepts – Neural Networks
Unimodal Basic Representations
Unimodal Representation – Visual Modality
Unimodal Representation – Language Modality
Unimodal Representation – Acoustic Modality
Other Unimodal Representations
Unimodal Representation – Sensors
Lee et al., Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. ICRA 2019
Unimodal Representation – Tables
Bao et al., Table-to-Text: Describing Table Region with Natural Language. AAAI 2018
Unimodal Representation – Graphs
Hamilton and Tang, Tutorial on Graph Representation Learning. AAAI 2019
Unimodal Representation – Sets
Zaheer et al., Deep Sets. NeurIPS 2017; Li et al., Point Cloud GAN. arXiv 2018
Machine Learning – Basic Concepts
Training, Testing and Dataset
Nearest Neighbor Classifier
Simple Classifier: Nearest Neighbor
Definition of K-Nearest Neighbor
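A minimal NumPy sketch of the k-nearest-neighbor idea, assuming toy 2-D points and Euclidean distance (all names and data here are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()                   # most frequent label wins

# Toy data: two clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```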
Data-Driven Approach
Evaluation methods (for validation and testing)
Linear Classification: Scores and Loss
Learning model parameters
Neural Networks gradient
Gradient descent
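A small NumPy sketch of gradient descent on a linear classifier with a softmax (cross-entropy) loss; the data, learning rate, and number of steps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 5))          # 16 examples, 5 features
y = rng.integers(0, 3, size=16)       # 3 classes
W = np.zeros((5, 3))                  # linear classifier weights
lr = 0.1

for step in range(100):
    scores = X @ W                                          # class scores
    scores -= scores.max(axis=1, keepdims=True)             # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(y)), y]).mean()      # softmax / cross-entropy loss
    dscores = probs.copy()
    dscores[np.arange(len(y)), y] -= 1                      # gradient of the loss w.r.t. scores
    dW = X.T @ dscores / len(y)                             # backprop to the weights
    W -= lr * dW                                            # gradient descent update
print(f"final loss: {loss:.3f}")
```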
Optimization – Practical Guidelines
CNNs and Visual Representations
Image Representations
Object-Based Visual Representation
Object Descriptors
Convolution Kernels
Object Descriptors
Facial expression analysis
Articulated Body Tracking: OpenPose
Convolutional Neural Networks
Convolutional Neural Networks
Translation Invariance
Learned vs Predefined Kernels
Convolution Math
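A minimal NumPy sketch of the single-channel 2-D convolution (strictly, cross-correlation) computed by a convolutional layer with "valid" padding; the image and kernel values are illustrative.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2-D cross-correlation (what deep-learning 'convolution' layers compute), 'valid' padding."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge filter
print(conv2d_valid(image, edge_kernel))          # 3x3 response map
```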
Convolutional Neural Layer
Convolutional Neural Network
Example of CNN Architectures
Common architectures
VGGNet model
Other architectures
Residual Networks
Visualizing CNNs
Visualizing the Last CNN Layer: t-SNE
Deconvolution
CAM: Class Activation Mapping
Grad-CAM
Region-based CNNs
Object Detection (and Segmentation)
Selective Search
R-CNN
Trade-off Between Speed and Accuracy
Sequential Modeling with Convolutional Networks
3D CNN
Temporal Convolution Network (TCN)
Appendix: Tools for Automatic visual behavior analysis
OpenFace: an open source facial behavior analysis toolkit, T. Baltrušaitis et al., 2016
Image from Hachisu et al. (2018). FaceLooks: A Smart Headband for Signaling Face-to-Face Behavior. Sensors.
Language Representations and RNNs
Word Representations
How to learn (word) features/representations?
Distance and similarity
How to learn (word) features/representations?
How to use these word representations
Vector space models of words
Sentence Modeling
Sentence Modeling: Sequence Label Prediction
Sentence Modeling: Sequence Prediction
Sentence Modeling: Sequence Representation
Sentence Modeling: Language Model
Language Model Application: Language Generation
Language Model Application: Speech Recognition
Challenges in Sequence Modeling
Recurrent Neural Networks
Gated Recurrent Neural Networks
Syntax and Language Structure
Syntax and Language Structure
Dependency Grammar
Language Ambiguity
Recursive Neural Network
Socher et al., Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, EMNLP 2013
Stack LSTM
Dyer et al., Transition-Based Dependency Parsing with Stack Long Short-Term Memory, 2015
Multimodal Representations
Graph Representations
RECAP: Tree-based RNNs (or Recursive Neural Network)
Graphs (aka “Networks”)
Graphs – Supervised Task
Graphs – Unsupervised Task
Graph Neural Nets
Graph Neural Nets – Supervised Training
Graph Neural Nets – Neighborhood Aggregation
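A minimal NumPy sketch of one GCN-style neighborhood-aggregation step in the spirit of Kipf et al. (symmetric normalization with self-loops); the toy graph, feature sizes, and random weights are illustrative assumptions.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: aggregate normalized neighbor features, then transform."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0)           # aggregate, project, ReLU

A = np.array([[0, 1, 0, 0],                        # 4-node toy graph (adjacency matrix)
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(4, 8))   # initial node features
W = np.random.default_rng(1).normal(size=(8, 4))   # weights (random here, learned in practice)
print(gcn_layer(A, H, W).shape)                    # -> (4, 4) updated node embeddings
```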
Kipf et al., 2017. Semi-supervised Classification with Graph Convolutional Networks. ICLR.
Li et al., 2016. Gated Graph Sequence Neural Networks. ICLR.
Duvenaud et al. 2016. Convolutional Networks on Graphs for Learning Molecular Fingerprints. ICML; Li et al. 2016. Gated Graph Sequence Neural Networks. ICLR.
Multimodal representations
Multimodal representations
Unsupervised Joint representations
Unsupervised representation learning
Shallow multimodal representations
Autoencoders
Deep Multimodal autoencoders
Ngiam et al., Multimodal Deep Learning, 2011
Deep Multimodal Boltzmann machines
Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014
Supervised Joint representations
Multimodal Joint Representation
Multimodal Sentiment Analysis
Unimodal, Bimodal and Trimodal Interactions
Bilinear Pooling
Tenenbaum and Freeman, 2000
Multimodal Tensor Fusion Network (TFN)
Zadeh, Jones and Morency, EMNLP 2017
From Tensor Representation to Low-rank Fusion
① Decomposition of weight tensor W
② Decomposition of Z
③ Rearranging computation
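A hedged PyTorch sketch of the idea: full tensor fusion builds the outer product of the (1-padded) modality embeddings, and the low-rank variant avoids materializing that tensor by combining per-modality rank-r projections. Dimensions, rank, and variable names are illustrative, not the paper's exact implementation.

```python
import torch

torch.manual_seed(0)
h_l, h_v, h_a = torch.randn(16), torch.randn(8), torch.randn(8)   # language / visual / acoustic embeddings

# Tensor Fusion: append a constant 1 so unimodal and bimodal terms survive the outer product.
one = torch.ones(1)
z = torch.einsum('i,j,k->ijk',
                 torch.cat([h_l, one]), torch.cat([h_v, one]), torch.cat([h_a, one]))
print(z.shape)   # (17, 9, 9) full fusion tensor

# Low-rank fusion: never build z explicitly; factor the weight tensor into
# rank-r modality-specific factors and combine per-modality projections elementwise.
r, d_out = 4, 32
W_l, W_v, W_a = (torch.randn(r, 17, d_out), torch.randn(r, 9, d_out), torch.randn(r, 9, d_out))
proj = (torch.einsum('i,rio->ro', torch.cat([h_l, one]), W_l)
        * torch.einsum('j,rjo->ro', torch.cat([h_v, one]), W_v)
        * torch.einsum('k,rko->ro', torch.cat([h_a, one]), W_a))
h_fused = proj.sum(dim=0)   # sum over the r rank-1 factors
print(h_fused.shape)        # (32,)
```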
Multimodal LSTM
Multimodal Sequence Modeling – Early Fusion
Multi-View Long Short-Term Memory (MV-LSTM)
Multi-View Long Short-Term Memory
Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016
Topologies for Multi-View LSTM
Coordinated Multimodal Representations
Coordinated Multimodal Representations
Learn (unsupervised) two or more coordinated representations from multiple modalities. A loss function is defined to bring closer these multiple representations.
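A minimal PyTorch sketch of such a coordinating loss, here a max-margin ranking loss over a batch of paired image and text embeddings; the dimensions, margin, and cosine-similarity choice are illustrative assumptions in the spirit of DeViSE-style embeddings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
img = F.normalize(torch.randn(8, 64), dim=-1)   # batch of image embeddings
txt = F.normalize(torch.randn(8, 64), dim=-1)   # corresponding text embeddings (row i pairs with row i)

sim = img @ txt.t()                              # cosine similarities, shape (8, 8)
pos = sim.diag().unsqueeze(1)                    # similarity of each true (image, text) pair
margin = 0.2
mask = torch.eye(8, dtype=torch.bool)            # ignore the diagonal (true pairs) in the hinge terms

# Hinge ranking loss: every mismatched pair should score at least `margin` below its true pair.
loss_i2t = F.relu(margin + sim - pos).masked_fill(mask, 0.0).mean()
loss_t2i = F.relu(margin + sim.t() - pos).masked_fill(mask, 0.0).mean()
print((loss_i2t + loss_t2i).item())
```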
Coordinated Multimodal Embeddings
Frome et al., DeViSE: A Deep Visual-Semantic Embedding Model, NIPS 2013
Structure-preserving Loss – Multimodal Embeddings
Wang et al., Learning Deep Structure-Preserving Image-Text Embeddings, CVPR 2016
Coordinated Representations
Quick Recap
Coordinated Multimodal Representations
Learn (unsupervised) two or more coordinated representations from multiple modalities. A loss function is defined to bring closer these multiple representations.
Structured coordinated embeddings
Vendrov et al., Order-Embeddings of Images and Language, 2016
Vendrov et al., Order-Embeddings of Images and Language, 2016
Multivariate Statistical Analysis
Random Variables
Definitions
Principal component analysis
Eigenvalues and Eigenvectors
Singular Value Decomposition (SVD)
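A small NumPy sketch of PCA computed through the SVD of the centered data matrix; the toy data and the number of retained components are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features
Xc = X - X.mean(axis=0)                  # center each feature

# SVD of the centered data: right singular vectors = principal directions,
# singular values relate to the variance explained by each component.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var = S**2 / (len(X) - 1)
components = Vt                          # rows are principal axes

Z = Xc @ components[:2].T                # project onto the top-2 principal components
print(explained_var[:2], Z.shape)
```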
Canonical Correlation Analysis
Canonical Correlation Analysis
Correlated Projection
Exploring Deep Correlation Networks
Deep Canonical Correlation Analysis
Andrew et al., ICML 2013
Deep Canonically Correlated Autoencoders (DCCAE)
Wang et al., ICML 2015
Deep Correlational Neural Network
Chandar et al., Neural Computation, 2015
Multi-View Clustering
Data Clustering
“Soft” Clustering: Nonnegative Matrix Factorization
Semi-NMF and Other Extensions
Ding et al., TPAMI 2015
Trigeorgis et al., TPAMI 2015
Principles of Multi-View Clustering
Yan Yang and Hao Wang, Multi-view Clustering: A Survey, Big data mining and analytics, Volume 1, Number 2, June 2018
Multi-view subspace clustering
Definition: learns a unified feature representation from all the view subspaces by assuming that all views share this representation
Deep Matrix Factorization
Li and Tang, MMML 2015
Other Multi-View Clustering Approaches
Yan Yang and Hao Wang, Multi-view Clustering: A Survey, Big data mining and analytics, Volume 1, Number 2, June 2018
Auto-Encoder in Auto-Encoder Network
Deep Canonically Correlated Autoencoders (DCCAE)
Multi-view Latent “Intact” Space
Multimodal alignment
Multimodal alignment
Explicit multimodal-alignment
Explicit alignment - goal is to find correspondences between modalities
▪ Aligning speech signal to a transcript
▪ Aligning two out-of-sync sequences
▪ Co-referring expressions
Implicit multimodal-alignment
Implicit alignment - uses internal latent alignment of modalities in order to better solve various problems
▪ Machine Translation
▪ Cross-modal retrieval
▪ Image & Video Captioning
▪ Visual Question Answering
Explicit alignment
Let’s start unimodal – Dynamic Time Warping
Dynamic Time Warping continued
DTW alternative formulation
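A minimal NumPy sketch of the classic DTW dynamic program between two 1-D sequences; the sequences and the absolute-difference cost are illustrative assumptions.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic Time Warping cost between two 1-D sequences (classic O(len(x)*len(y)) DP)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Each cell extends the cheapest of: match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

a = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0])   # same shape, different timing
print(dtw_distance(a, b))   # small cost despite the temporal misalignment
```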
Canonical Correlation Analysis reminder
Canonical Time Warping
Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009
Optimized by coordinate descent – fix one set of parameters, optimize the other
Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS
Generalized Canonical Time Warping, Zhou and De la Torre, 2016, TPAMI
Deep Canonical Time Warping
Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR
Implicit alignment
Attention models
Recent attention models can be roughly split into three major categories
Soft attention
Acts like a gate function. Deterministic inference.
Transform network
Warp the input to better align with canonical view.
Hard attention
Includes stochastic processes. Related to reinforcement learning.
Soft attention
Machine Translation
Given a sentence in one language translate it to another
Not exactly a multimodal task – but a good start! Each language can be seen almost as a modality.
Machine Translation with RNNs
Decoder – attention model
Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015
How do we encode attention?
MT with attention
Visual captioning with soft attention
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015
Looking at more fine-grained features
- Allows latent data alignment
- Lets us inspect what the network “sees”
- Can be optimized with backpropagation
Spatial Transformer networks
Glimpse Network (Hard Attention)
Hard attention
Soft attention requires computing a representation for the whole image or sentence
Hard attention, on the other hand, forces the model to look at only one part at a time
The main motivation was reduced computational cost rather than improved accuracy (although that happens a bit as well)
Saccade followed by a glimpse – how the human visual system works
Recurrent Models of Visual Attention, Mnih et al., 2014; Multiple Object Recognition with Visual Attention, Ba et al., 2015
Hard attention examples
Glimpse Sensor
Recurrent Models of Visual Attention, Mnih, 2014
Overall Architecture - Emission network
Recurrent model of Visual Attention (RAM)
Multi-modal alignment recap
Explicit alignment – aligns two or more modalities (or views) as an actual task. The goal is to find correspondences between modalities
- Dynamic Time Warping
- Canonical Time Warping
- Deep Canonical Time Warping
Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
- Attention models
- Soft attention
- Spatial transformer networks
- Hard attention
Alignment and Representations
Contextualized Sequence Encoding
Sequence Encoding - Contextualization
Self-Attention
Transformer Multi-Head Self-Attention
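A minimal PyTorch sketch of single-head scaled dot-product self-attention, followed by PyTorch's built-in multi-head module for comparison; the dimensions and random weights are illustrative.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 6, 32
x = torch.randn(seq_len, d_model)                 # one sequence of token embeddings

W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v               # queries, keys, values

# Scaled dot-product attention: every position attends to every other position.
scores = Q @ K.t() / math.sqrt(d_model)           # (seq_len, seq_len) compatibility scores
attn = F.softmax(scores, dim=-1)                  # rows sum to 1: attention weights
contextualized = attn @ V                         # weighted sum of values
print(attn.shape, contextualized.shape)           # (6, 6) (6, 32)

# The multi-head version wraps the same idea; this uses PyTorch's built-in module.
mha = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=4)
out, weights = mha(x.unsqueeze(1), x.unsqueeze(1), x.unsqueeze(1))  # (seq, batch=1, d_model)
print(out.shape)
```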
Position embeddings
Sequence-to-Sequence Using Transformer
Seq2Seq with Transformer Attentions
Contextualized Multimodal Embedding
Multimodal Embeddings
Multimodal Transformer
Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019
Cross-Modal Transformer
Language Pre-training
Token-level and Sentence-level Embeddings
Pre-Training and Fine-Tuning
BERT: Bidirectional Encoder Representations from Transformers
Pre-training BERT Model
Three Embeddings: Token + Position + Sentence
Fine-Tuning BERT
Multimodal Pre-training
VL-BERT
M-BERT
Alignment and Translation
Alignment for Speech Recognition
Architecture of Speech Recognition
slazebni.cs.illinois.edu/spring17/lec26_audio.pdf
Option 1: Sequence-to-Sequence (Seq2Seq)
Option 2: Seq2Seq with Attention
Option 3: Sequence Labeling with RNN
Speech Alignment
Connectionist Temporal Classification (CTC)
Amodei, Dario, et al. “Deep speech 2: End-to-end speech recognition in english and mandarin.” (2015)
CTC Optimization
Visualizing CTC Predictions
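A hedged sketch of how frame-level emissions are scored against a shorter label sequence with PyTorch's built-in CTC loss; the shapes, vocabulary size, and random inputs are illustrative assumptions, not a full speech pipeline.

```python
import torch
import torch.nn.functional as F

T, N, C = 50, 2, 20          # input frames, batch size, vocabulary size (class 0 = blank)
torch.manual_seed(0)
log_probs = F.log_softmax(torch.randn(T, N, C), dim=-1)    # per-frame class log-probabilities

targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # label sequences (no blanks), length 10
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC marginalizes over all alignments of the 10 labels onto the 50 frames
# (repetitions and blanks), so no frame-level alignment supervision is needed.
ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```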
Multi-View Video Alignment
Temporal Alignment using Neural Representations
Temporal Cycle-Consistency Learning
Multimodal Translation: Visual Question Answering (VQA)
VQA and Attention
Co-attention
Lu et al., Hierarchical Question-Image Co-Attention for Visual Question Answering, NIPS 2016
Hierarchical Co-attention
Stacked Attentions
Yang et al., Stacked Attention Networks for Image Question Answering, CVPR 2016
VQA: Neural Module Networks
Neural Module Network
Andreas et al., Deep Compositional Question Answering with Neural Module Networks, 2016
Predefined Set of Modules
Johnson et al., CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017
End-to-End Neural Module Network
Hu et al., Learning to Reason: End-to-End Module Networks for Visual Question Answering, 2017
VQA: Neural Symbolic Networks
Neural-symbolic VQA
Kexin Yi, et al. “Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding.” NeurIPS 2018
The Neuro-symbolic Concept Learner
Jiayuan Mao, et al. “The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision.” ICLR 2019
Speech-Vision Translation: Applications
Translation 1: Visually indicated sounds
Owens et al. Visually indicated sounds, CVPR, 2016
Translation 2: The Sound of Pixels
Zhao, Hang, et al. “The sound of pixels.”, ECCV 2018
Speech2Face
Generative Models
Probabilistic Graphical Models
Definition: A probabilistic graphical model (PGM) is a graph formalism for compactly modeling joint probability distributions and dependence structures over a set of random variables
Inference for Known Joint Probability Distribution
Creating a Graphical Model
Example: Inferring Emotion from Interaction Logs
Example: Bayesian Network Representation
Example: Bayesian Network Approach
Example: Dynamic Bayesian Network Approach
Bayesian Networks
Definition: A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions
Bayesian Network (BN)
Joint Probability in Graphical Models
Conditional Probability Distribution (CPD)
Generative Model: Naïve Bayes Classifier
Dynamic Bayesian Network
- Extends a Bayesian network to represent sequential dependencies.
- Dynamically changing or evolving over time.
- Directed graphical model of stochastic processes.
- Especially aimed at time-series modeling.
Hidden Markov Models
Factorial HMM
The Boltzmann Zipper
The Coupled HMM
Generating Data Using Neural Networks
Variational Autoencoder
Generative Adversarial Network (GAN)
GAN Training
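A compact PyTorch sketch of the alternating GAN training loop on 1-D toy data; the networks, data distribution, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # generator: noise -> sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator: sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    real = torch.randn(64, 1) * 0.5 + 2.0          # "real" data: N(2, 0.5)
    noise = torch.randn(64, 8)
    fake = G(noise)

    # Discriminator step: real -> 1, fake -> 0 (fake is detached so G is not updated here).
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator into predicting 1 on generated samples.
    g_loss = bce(D(G(noise)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(f"D loss {d_loss.item():.3f}, G loss {g_loss.item():.3f}, fake mean {G(noise).mean().item():.2f}")
```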
Example: Audio to Scene
Audio to Scene Samples - wjohn1483.github.io
Example: Talking Head
https://arxiv.org/pdf/1905.08233.pdf
Bidirectional GAN
cAE-GAN
CycleGAN
BicycleGAN
Discriminative Graphical Models
Quick Recap
Fusion – Probabilistic Graphical Models
Restricted Boltzmann Machines
Deep Multimodal Boltzmann machines
Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014
Restricted Boltzmann Machine (RBM)
Smolensky, Information Processing in Dynamical Systems: Foundations of Harmony Theory, 1986
Undirected Graphical Model
- A generative rather than discriminative model
- Connections from every hidden unit to every visible one
- No connections between units within the same layer (hence “Restricted”), which makes it easier to train and run inference
Markov Random Fields
Example: Markov Random Field – Graphical Model
Example: Markov Random Field – Factor Graph
Conditional Random Fields
Conditional Random Fields (Factor Graphs)
Conditional Random Fields (Log-linear Model)
Learning Parameters of a CRF Model
CRFs for Shallow Parsing
Latent-Dynamic CRF
Hidden Conditional Random Field
Multi-view Latent Variable Discriminative Models
CRFs and Deep Learning
Conditional Neural Fields
Deep Conditional Neural Fields
CRF and Bilinear LSTM
CNN and CRF and Bilinear LSTM
Continuous and Fully-Connected CRFs
Continuous Conditional Neural Field
High-Order Continuous Conditional Neural Field
Fully-Connected Continuous Conditional Neural Field
Fully-Connected CRF
CNN and Fully-Connected CRF
Fully Connected Deep Structured Networks
Zheng et al., 2015; Schwing and Urtasun, 2015
Sigurdsson et al., Asynchronous Temporal Fields for Action Recognition, CVPR 2017
Soft-Label Chain CRF
Phrase Grounding by Soft-Label Chain CRF
Liu J, Hockenmaier J. “Phrase Grounding by Soft-Label Chain Conditional Random Field” EMNLP 2019
Fusion, co-learning and new trends
Quick Recap: Multimodal Fusion
Multimodal fusion
Fusion – Probabilistic Graphical Models
Model-free Fusion
Model-agnostic approaches – early fusion
- Easy to implement – just concatenate the features (see the sketch after this list)
- Exploit dependencies between features
- Can end up very high dimensional
- More difficult to use if features have different granularities
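A minimal sketch of early fusion as referenced above, assuming pre-extracted unimodal feature vectors with illustrative dimensions.

```python
import torch
import torch.nn as nn

# Illustrative pre-extracted unimodal features for a batch of 4 examples.
audio = torch.randn(4, 40)     # e.g., prosodic/acoustic features
visual = torch.randn(4, 128)   # e.g., pooled CNN features
text = torch.randn(4, 300)     # e.g., averaged word embeddings

# Early fusion: concatenate, then let one model learn cross-modal dependencies.
fused = torch.cat([audio, visual, text], dim=-1)          # dimensionality adds up quickly (468 here)
classifier = nn.Sequential(nn.Linear(fused.shape[-1], 64), nn.ReLU(), nn.Linear(64, 2))
print(classifier(fused).shape)                            # (4, 2) class logits
```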
Model-agnostic approaches – late fusion
Late Fusion on Multi-Layer Unimodal Classifiers
Multimodal Fusion Architecture Search (MFAS)
Perez-Rua, Vielzeuf, Pateux, Baccouche, Frederic Jurie, MFAS: Multimodal Fusion Architecture Search, CVPR 2019
Proposed solution: Explore the search space with Sequential Model-Based Optimization
- Start with simpler models first (all L=1 models) and iteratively increase the complexity (L=2, L=3,…)
- Use a surrogate function to predict performance of unseen architectures
- e.g., the performance of all the L=1 models should give us an idea of how well the L=2 models will perform
Memory-Based Fusion
Zadeh et al., Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018
Local Fusion and Kernel Functions
What is a Kernel function?
A kernel function: Acts as a similarity metric between data points
Non-linearly separable data
Radial Basis Function Kernel (RBF)
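A small NumPy sketch of the RBF kernel as a similarity between points, and of the resulting kernel (Gram) matrix; gamma and the toy data are illustrative.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian/RBF kernel: similarity decays with squared Euclidean distance."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])   # kernel (Gram) matrix
print(np.round(K, 3))   # nearby points -> similarity near 1, distant points -> near 0
```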
Some other kernels
- Histogram intersection kernel: good for histogram features
- String kernels: specifically for text and sentence features
- Proximity distribution kernel
- (Spatial) pyramid matching kernel
Kernel CCA
Transformer’s Attention Function
Tsai et al., Transformer Dissection: An Unified Understanding for Transformer’s Attention via the Lens of Kernel, EMNLP 2019
Multiple Kernel Learning
MKL in Unimodal Case
- Pick a family of kernels and learn which kernels are important for the classification case
- For example a set of RBF and polynomial kernels
MKL in Multimodal/Multiview Case
- Pick a family of kernels for each modality and learn which kernels are important for the classification case
- Does not need to be different modalities, often we use different views of the same modality (HOG, SIFT, etc.)
Co-Learning
Co-Learning - The 5th Multimodal Challenge
Definition: Transfer knowledge between modalities, including their representations and predictive models.
Co-learning Example with Paired Data
ViCo: Word Embeddings from Visual Co-occurrences
ViCo: Word Embeddings from Visual Co-occurrences
Co-Learning with Paired Data: Multimodal Cyclic Translation
Paul Pu Liang, Hai Pham, et al., “Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities”, AAAI 2019
Co-Learning Example with Weakly Paired Data
End-to-End Learning of Visual Representations from Uncurated Instructional Videos Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman – CVPR 2020
Weakly Paired Data
Multiple Instance Learning Noise Contrastive Estimation
Research Trend: Few-Shot Learning and Weakly Supervised
Few-Shot Learning in RL Environment
Hill et al., Grounded Language Learning Fast and Slow. arXiv 2020
Grounded Language Learning
Weakly-Supervised Phrase Grounding
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding, EMNLP 2020
Multimodal Alignment Framework
Research Trends in Multimodal ML
- Abstraction and logic
- Multimodal reasoning
- Towards causal inference
- Understanding multimodal models
- Commonsense and coherence
- Social impact - fairness and misinformation
- Emotional and engaging interactions
- Multi-lingual multimodal grounding
Abstraction and Logic
Learning by Abstraction: The Neural State Machine
Hudson, Drew, and Christopher D. Manning. “Learning by abstraction: The neural state machine.“ NeurIPS 2019
Learning by Abstraction: The Neural State Machine
VQA under the Lens of Logic
Gokhale, Tejas, et al. “VQA-LOL: Visual question answering under the lens of logic.“, ECCV 2020
Multimodal Reasoning
Cross-Modality Relevance for Reasoning on Language and Vision
Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020
Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog
Gan, Zhe, et al. “Multi-step reasoning via recurrent dual attention for visual dialog.“ ACL 2019
- Hypothesis: The failure of visual dialog is caused by the inherent weakness of single-step reasoning.
- Intuition: Humans take a first glimpse of an image and a dialog history, before revisiting specific parts of the image/text to understand the multimodal context.
- Proposal: Apply Multi-step reasoning to visual dialog by using a recurrent (aka multi-step) version of attention (aka reasoning). This is done on both text and questions (aka, dual).
Towards Causal Inference
Visual Dialogue Expressed with Causal Graph
Two Causal Principles for Improving Visual Dialog
Qi, Jiaxin, et al. “Two causal principles for improving visual dialog.“ CVPR 2020
This paper identifies two causal principles that are holding back VisDial models.
- Harmful shortcut bias between dialog history (H) and the answer (A)
- Unobserved confounder between H, Q and A leading to spurious correlations.
By identifying and addressing these principles in a model-agnostic manner, they are able to promote any VisDial model to SOTA levels.
Studying Biases in VQA Models
Agarwal, Vedika, Rakshith Shetty, and Mario Fritz. “Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing.”
Understanding Multimodal Models
Introspecting VQA Models with Sub-Questions
Selvaraju, Ramprasaath R., et al. “SQuINTing at VQA Models: Introspecting VQA Models With Sub-Questions.”, CVPR 2020
New Dataset
SQuINTing Model
Training Multimodal Networks
Commonsense and Coherence
Emotions are Often Context Dependent
“COSMIC: COmmonSense knowledge for eMotion Identification in Conversations”, Findings of EMNLP 2020
Commonsense and Emotion Recognition
Proposed approach (COSMIC):
For each utterance, try to infer
- speaker’s intention
- effect on the speaker/listener
- reaction of the speaker/listener
Proposed Model (COSMIC)
Coherence and Commonsense
Coherence relations provide information about how the content of discourse units relate to one another.
They have been used to predict commonsense inference in text.
Cross-modal Coherence Modeling for Caption Generation
Cross-modal Coherence Modeling for Caption Generation ACL 2020
Social Impact – Fairness and Misinformation
Fair Representation Learning
Pena et al., Bias in Multimodal AI: A Testbed for Fair Automatic Recruitment. ICMI 2020
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News, EMNLP 2020
Emotional and Engaging Interactions
Dialogue Act Classification (DAC)
“Towards Emotion-aided Multi-modal Dialogue Act Classification”, ACL 2020
Image-Chat: Engaging Grounded Conversations
Shuster, Kurt, et al. “Image-chat: Engaging grounded conversations.“ ACL 2020
Multi-Lingual Multimodal Grounding
Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding, Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge – EMNLP 2020
Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data
Cogswell, Michael, et al. “Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data.”
Connecting Language to Actions
What does interaction mean?
Sequential and Online Modeling
Planning
V+L -> A
First Major Question: Alignment
Ma et al, “Self-Monitoring Navigation Agent via Auxiliary Progress Estimation” ICLR 2019
Alignment
Lots of Data
Ku et al. Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding — EMNLP 2020
What if you make a mistake?
Ke 2019, Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation - CVPR 2019
Why does this question matter?
Because in general, we can’t supervise everything
Seven High-level Tasks
Data collection
End-to-End Models
A Shared Semantic Space
Paxton et al. Prospection: Interpretable Plans From Language By Predicting the Future ICRA 2019
Predicting the Future
Objectives
Embodiment
- Choose your own adventure — Lots of noise
- What does it mean to succeed?
- Where do concepts come from?
- What’s the role of exploration?
- Language is woefully underspecified
Multimodal Human-inspired Language Learning
Approach
Workflow
Proposed Model: Vision-language Pre-training
Recognition Model using pre-trained VLP
Grounded Multimodal Learning
Children learn in a multimodal environment. We investigate human-like learning in the following perspectives:
● Association of new information to previous (past) knowledge
● Generalization of learned knowledge to unseen (future) concepts
○ Zero-shot compositionality of the learned concepts
■ Blue + Dog -> Blue dog !?
Model
New work in progress: Video-Text Coref
Two Approaches to Latent Structure Learning
Latent Tree Formalism: Lexicalized PCFG
Probabilistic Model of Lexicalized PCFG
Latent Tree Learning: Baselines
Baselines:
○ DMV (Klein and Manning, 2004): generative model of dependency structures.
○ Compound PCFG (Kim et al., 2019): neural model to parameterize probabilistic context-free grammar using sentence-by-sentence parameters and variational training.
○ Compound PCFG w/ right-headed rule: takes predictions of Compound PCFG and choose the head of right child as the head of the parent.
○ ON-LSTM (Shen et al., 2019) and PRPN (Shen et al., 2018): two unsupervised constituency parsing models
○ VGNSL (Shi et al., 2019): unsupervised constituency parsing model with image information
Latent Template Learning: Concept
Latent Template Learning: Generative Model
Learning of the Latent Template Model
Generation Conditioned on Templates
Automatic Speech Recognition
EESEN: https://github.com/srvk/eesen
- Pre-trained models from more established speech recognition corpora (LibriSpeech and Switchboard in our case)
- 3 models: ESPnet, EESEN-WFST, and EESEN-RNNLM decoding, trained with CTC loss
- Major Challenges:
- fully annotated transcription not available for evaluation
- much more noisy than the pre-trained datasets
- multiple speakers present
ESPnet architecture, Watanabe et al. 2018
ASR: Seedling Dataset Samples (ESPnet vs EESEN)
ESPnet: Hey, do you want to play anything or read a book or anything a book? Okay, which book which book you want to read? The watch one little baby who is born far away. And another who is born on the very next day. And both of these babies as everyone knows.
Turn the Page. Had Ten Little Fingers ten fingers and ten little toes. There was only there was one little baby who is born in a town and another who is wrapped in either down. And both of these babies as everyone knows add ten little fingers and ten.
Have you any water recently? Get some water, please. Get some water please some water. Yeah water is delicious. Why don’t you have some? Give me some water, please.
There was one little baby who is born in the house and another who Snuffer suffered from sneezes and chills. And both of these babies with everyone knows. at ten little fingers and ten little toes just like