Introduction
What is Multimodal?

Multimodal Communicative Behaviors

Modality
The way in which something happens or is experienced
- Modality: a particular type of information and/or the representation format in which information is stored
- Sensory modality: one of the primary forms of sensation, such as vision or touch; a channel of communication
Multiple Communities and Modalities

A Historical View
Prior Research on “Multimodal”

Core Technical Challenges
Core Challenge 1: Representation

Early Examples


Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014
Representation
- Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.

- Joint representations
- Coordinated representations
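A minimal sketch (PyTorch, made-up feature sizes) contrasting the two families: a joint representation projects the concatenated modalities into a single vector, while a coordinated representation keeps separate encoders tied together by a similarity loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_img, d_txt, d_shared = 2048, 300, 512              # hypothetical feature sizes

# Joint representation: fuse by concatenation, then project to one shared vector.
joint_encoder = nn.Sequential(nn.Linear(d_img + d_txt, d_shared), nn.ReLU())

# Coordinated representation: one encoder per modality, tied by a similarity constraint.
img_encoder = nn.Linear(d_img, d_shared)
txt_encoder = nn.Linear(d_txt, d_shared)

img, txt = torch.randn(8, d_img), torch.randn(8, d_txt)     # a toy batch of paired samples

z_joint = joint_encoder(torch.cat([img, txt], dim=-1))       # single multimodal vector

z_img, z_txt = img_encoder(img), txt_encoder(txt)            # two coordinated vectors
coord_loss = 1 - F.cosine_similarity(z_img, z_txt).mean()    # pull paired embeddings together
```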
Core Challenge 2: Alignment
- Definition: Identify the direct relations between (sub)elements from two or more different modalities.

Explicit Alignment

Implicit Alignment
Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Core Challenge 3 – Translation
Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013

- Definition: Process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective.

Ahuja, C., & Morency, L. P. (2019). Language2Pose: Natural Language Grounded Pose Forecasting. Proceedings of 3DV Conference

Core Challenge 4: Fusion
- Definition: To join information from two or more modalities to perform a prediction task.


Core Challenge 5: Co-Learning
- Definition: Transfer knowledge between modalities, including their representations and predictive models.


Taxonomy of Multimodal Research
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy

Real world tasks tackled by MMML

Multimodal Research Tasks
Multimodal Research Tasks


Affective Computing
Common Topics in Affective Computing
Affective states – emotions, moods, and feelings
Cognitive states – thinking and information processing
Personality – patterns of acting, feeling, and thinking
Pathology – health, functioning, and disorders
Social processes – groups, cultures, and perception





AVEC 2011 – The First International Audio/Visual Emotion Challenge, B. Schuller et al., 2011
AVEC 2013 – The Continuous Audio/Visual Emotion and Depression Recognition Challenge, Valstar et al. 2013
Introducing the RECOLA Multimodal Corpus of Remote Collaborative and Affective Interactions, F. Ringeval et al., 2013
Multimodal Sentiment Analysis

Multi-Party Emotion Recognition

What are the Core Challenges Most Involved in Affect Recognition?

Project Example: Select-Additive Learning


Project Example: Word-Level Gated Fusion
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, Louis-Philippe Morency, Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning, ICMI 2017, https://arxiv.org/abs/1802.00924


Media Description
Given media (an image, video, or audio-visual clip), provide a free-form text description.

Large-Scale Image Captioning Dataset
- Microsoft Common Objects in COntext (MS COCO)
- 120,000 images
- Each image is accompanied by five free-form sentences describing it (at least 8 words)
- Sentences collected using crowdsourcing (Mechanical Turk)
- Also contains object detections, boundaries and keypoints
Evaluating Image Caption Generation
- Has an evaluation server
- Training and validation – 80K images (400K captions)
- Testing – 40K images (380K captions); a subset contains more captions for better evaluation, and these are kept private (to avoid over-fitting and cheating)
- Evaluation is difficult as there is no one “correct” answer for describing an image in a sentence
- Given a candidate sentence it is evaluated against a set of “ground truth” sentences
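A minimal sketch of reference-based scoring using sentence-level BLEU from NLTK (the COCO evaluation server also reports METEOR, ROUGE-L and CIDEr, not shown here); the captions are made up.

```python
# Toy example: score one candidate caption against multiple reference captions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is riding a horse on the beach".split(),
    "a person rides a brown horse along the shore".split(),
]
candidate = "a man rides a horse near the ocean".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```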
Video captioning

Video Description and Alignment
Charade Dataset: http://allenai.org/plato/charades/

How to Address the Challenge of Evaluation?

Large-Scale Description and Grounding Dataset

Multimodal QA

Multimodal QA dataset 1 – VQA (C1)

VQA 2.0

Multimodal QA – other VQA datasets

Multimodal QA – other VQA datasets (C7)
TVQA

Multimodal QA –Visual Reasoning (C8)
VCR:Visual Commonsense Reasoning

Social-IQ (A10)
Social-IQ

Project Example: Adversarial Attacks on VQA models


Multimodal Navigation
- Embedded Assistive Agents
- Language, Vision and Actions
- Many Technical Challenges
Navigating in a Virtual House

Multiple Step Instructions

Language meets Games

Project Example: Instruction Following


Project Example: Multiagent Trajectory Forecasting


Project Examples, Advice and Support
Latest List of Multimodal Datasets



Some Advice About Multimodal Research
- Think more about the research problems, and less about the datasets themselves
- Aim for generalizable models across several datasets
- Aim for models inspired by existing research, e.g. psychology
- Some areas to consider beyond performance:
- Robustness to missing/noisy modalities and adversarial attacks
- Studying social biases and creating fairer models
- Interpretable models
- Faster models for training/storage/inference
- Theoretical projects are welcome too – make sure there are also experiments to validate the theory
Some Advice About Multimodal Datasets
- If you are used to dealing with text or speech
- Space will become an issue working with image/video data
- Some datasets are in 100s of GB (compressed)
- Memory for processing it will become an issue as well
- Won’t be able to store it all in memory
- Time to extract features and train algorithms will also become an issue
- Plan accordingly!
- Sometimes tricky to experiment on a laptop (might need to do it on a subset of data)
Available Tools

List of Multimodal datasets
Affective Computing
Acted Facial Expressions in the Wild (part of the EmotiW Challenge)
AVEC challenge datasets
The Interactive Emotional Dyadic Motion Capture (IEMOCAP)
Persuasive Opinion Multimedia (POM)
Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos (MOSI)
CMU-MOSEI: Multimodal sentiment and emotion recognition
Tumblr Dataset: Sentiment and Emotion Analysis
AMHUSE Dataset: Multimodal Humor Sensing
Video Game Dataset: Multimodal Game Rating
DEAP
Continuous LIRIS-ACCEDE
Media description
MPII Movie Description dataset
Montréal Video Annotation dataset
Flickr30k Entities
Multimodal QA
VQA v2.0
Multimodal Navigation
Multimodal Dialog
Cooperative Vision-and-Dialog Navigation
Event detection
Title-based Video Summarization dataset
CrisisMMD
Multimodal Retrieval
Yahoo Flickr Creative Commons 100M
Other Multimodal Datasets
Basic Concepts – Neural Networks
Unimodal Basic Representations
Unimodal Representation – Visual Modality




Unimodal Representation – Language Modality



Unimodal Representation – Acoustic Modality


Other Unimodal Representations
Unimodal Representation – Sensors

Lee et al., Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. ICRA 2019

Unimodal Representation – Tables
Bao et al., Table-to-Text: Describing Table Region with Natural Language. AAAI 2018

Unimodal Representation – Graphs
Hamilton and Tang, Tutorial on Graph Representation Learning. AAAI 2019

Unimodal Representation – Sets
Zaheer et al., Deep Sets. NeurIPS 2017; Li et al., Point Cloud GAN. arXiv 2018

Machine Learning – Basic Concepts
Training, Testing and Dataset

Nearest Neighbor Classifier

Simple Classifier: Nearest Neighbor

Definition of K-Nearest Neighbor
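A minimal NumPy sketch of k-nearest-neighbor classification on toy 2-D points: Euclidean distance to all training samples, then a majority vote over the k closest.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]                     # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))   # -> 1
```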

Data-Driven Approach

Evaluation methods (for validation and testing)

Linear Classification: Scores and Loss
Learning model parameters
Neural Networks gradient
Gradient descent
Optimization – Practical Guidelines
CNNs and Visual Representations
Image Representations
Object-Based Visual Representation

Object Descriptors

Convolution Kernels

Object Descriptors

Facial expression analysis

Articulated Body Tracking: OpenPose

Convolutional Neural Networks
Convolutional Neural Networks

Translation Invariance

Learned vs Predefined Kernels

Convolution Math
Convolutional Neural Layer
Convolutional Neural Network
Example of CNN Architectures
Common architectures

VGGNet model

Other architectures

Residual Networks

Visualizing CNNs
Visualizing the Last CNN Layer: t-SNE

Deconvolution


CAM: Class Activation Mapping

Grad-CAM

Region-based CNNs
Object Detection (and Segmentation)

Selective Search

R-CNN

Trade-off Between Speed and Accuracy

Sequential Modeling with Convolutional Networks
3D CNN

Temporal Convolution Network (TCN)

Appendix: Tools for Automatic visual behavior analysis
OpenFace: an open source facial behavior analysis toolkit, T. Baltrušaitis et al., 2016
Image from Hachisu et al (2018). FaceLooks: A Smart Headband for Signaling Face-to-Face Behavior. Sensors.
Language Representations and RNNs
Word Representations
How to learn (word) features/representations?

Distance and similarity
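Distance/similarity between word vectors is usually measured with cosine similarity; a small sketch with made-up 3-D vectors standing in for real embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "word embeddings" (real ones are typically 100-300 dimensional).
vec = {
    "dog": np.array([0.8, 0.1, 0.1]),
    "puppy": np.array([0.7, 0.2, 0.1]),
    "car": np.array([0.1, 0.9, 0.3]),
}
print(cosine_similarity(vec["dog"], vec["puppy"]))   # high: related words
print(cosine_similarity(vec["dog"], vec["car"]))     # lower: unrelated words
```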

How to learn (word) features/representations?

How to use these word representations

Vector space models of words

Sentence Modeling
Sentence Modeling: Sequence Label Prediction

Sentence Modeling: Sequence Prediction

Sentence Modeling: Sequence Representation

Sentence Modeling: Language Model

Language Model Application: Language Generation

Language Model Application: Speech Recognition

Challenges in Sequence Modeling

Recurrent Neural Networks
Gated Recurrent Neural Networks
Syntax and Language Structure
Syntax and Language Structure


Dependency Grammar

Language Ambiguity

Recursive Neural Network
Socher et al., Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, EMNLP 2013
Stack LSTM
Dyer et al., Transition-Based Dependency Parsing with Stack Long Short-Term Memory, 2015

Multimodal Representations
Graph Representations
RECAP: Tree-based RNNs (or Recursive Neural Network)

Graphs (aka “Networks”)

Graphs – Supervised Task

Graphs – Unsupervised Task

Graph Neural Nets


Graph Neural Nets – Supervised Training

Graph Neural Nets – Neighborhood Aggregation

Kipf et al., 2017. Semi-supervised Classification with Graph Convolutional Networks. ICLR.
Li et al., 2016. Gated Graph Sequence Neural Networks. ICLR.
Duvenaud et al., 2015. Convolutional Networks on Graphs for Learning Molecular Fingerprints. NIPS.
Multimodal representations
Multimodal representations

Unsupervised Joint representations
Unsupervised representation learning

Shallow multimodal representations

Autoencoders


Deep Multimodal autoencoders
Ngiam et al., Multimodal Deep Learning, 2011

Deep Multimodal Boltzmann machines
Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014

Supervised Joint representations
Multimodal Joint Representation

Multimodal Sentiment Analysis

Unimodal, Bimodal and Trimodal Interactions

Bilinear Pooling
Tenenbaum and Freeman, 2000

Multimodal Tensor Fusion Network (TFN)
Zadeh, Jones and Morency, EMNLP 2017


From Tensor Representation to Low-rank Fusion

① Decomposition of weight tensor W

② Decomposition of Z

③ Rearranging computation
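A minimal sketch of the steps above with made-up dimensions: tensor fusion takes the outer product of modality vectors augmented with a constant 1, and the low-rank rearrangement never materializes that tensor, instead factoring the weight tensor into per-modality factors (in the spirit of TFN/LMF, not a faithful reimplementation).

```python
import torch

torch.manual_seed(0)
d_a, d_v, d_l, d_out, rank = 16, 32, 64, 8, 4            # made-up sizes
h_a, h_v, h_l = torch.randn(d_a), torch.randn(d_v), torch.randn(d_l)

# Tensor fusion: append a constant 1 so unimodal and bimodal terms survive the outer product.
one = torch.ones(1)
z_a, z_v, z_l = torch.cat([h_a, one]), torch.cat([h_v, one]), torch.cat([h_l, one])
Z = torch.einsum('i,j,k->ijk', z_a, z_v, z_l)             # (d_a+1, d_v+1, d_l+1) fusion tensor

# Low-rank fusion: never build Z; decompose the weight tensor into per-modality factors.
W_a = torch.randn(rank, d_a + 1, d_out)
W_v = torch.randn(rank, d_v + 1, d_out)
W_l = torch.randn(rank, d_l + 1, d_out)
y = (torch.einsum('rid,i->rd', W_a, z_a) *
     torch.einsum('rid,i->rd', W_v, z_v) *
     torch.einsum('rid,i->rd', W_l, z_l)).sum(dim=0)      # (d_out,), same role as W applied to Z
```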

Multimodal LSTM
Multimodal Sequence Modeling – Early Fusion

Multi-View Long Short-Term Memory (MV-LSTM)

Multi-View Long Short-Term Memory
Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016

Topologies for Multi-View LSTM

Coordinated Multimodal Representations
Coordinated Multimodal Representations
Learn (unsupervised) two or more coordinated representations from multiple modalities. A loss function is defined to bring closer these multiple representations.
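A minimal sketch of such a coordinated space, assuming precomputed image and caption features: a max-margin ranking loss (in the spirit of DeViSE, with details simplified) pulls matched pairs above mismatched ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

img_enc = nn.Linear(2048, 300)     # hypothetical feature sizes
txt_enc = nn.Linear(300, 300)

def ranking_loss(img_feats, txt_feats, margin=0.2):
    """Hinge loss: matched (i, i) pairs should score higher than mismatched (i, j) pairs."""
    v = F.normalize(img_enc(img_feats), dim=-1)
    t = F.normalize(txt_enc(txt_feats), dim=-1)
    scores = v @ t.t()                             # cosine similarity matrix over the batch
    pos = scores.diag().unsqueeze(1)               # similarity of the true pairs
    cost = (margin + scores - pos).clamp(min=0)    # margin violations for every negative
    cost.fill_diagonal_(0)                         # do not penalize the positives themselves
    return cost.mean()

loss = ranking_loss(torch.randn(8, 2048), torch.randn(8, 300))
```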

Coordinated Multimodal Embeddings
Frome et al., DeViSE: A Deep Visual-Semantic Embedding Model, NIPS 2013


Structure-preserving Loss – Multimodal Embeddings
Wang et al., Learning Deep Structure-Preserving Image-Text Embeddings, CVPR 2016

Coordinated Representations
Quick Recap




Coordinated Multimodal Representations
Learn (unsupervised) two or more coordinated representations from multiple modalities. A loss function is defined to bring closer these multiple representations.
Structured coordinated embeddings
Vendrov et al., Order-Embeddings of Images and Language, 2016

Multivariate Statistical Analysis

Random Variables

Definitions





Principal component analysis

Eigenvalues and Eigenvectors

Singular Value Decomposition (SVD)

Canonical Correlation Analysis
Canonical Correlation Analysis

Correlated Projection








Exploring Deep Correlation Networks
Deep Canonical Correlation Analysis
Andrew et al., ICML 2013


Deep Canonically Correlated Autoencoders (DCCAE)
Wang et al., ICML 2015

Deep Correlational Neural Network
Chandar et al., Neural Computation, 2015

Multi-View Clustering
Data Clustering

“Soft” Clustering: Nonnegative Matrix Factorization

Semi-NMF and Other Extensions
Ding et al., TPAMI 2015

Trigeorgis et al., TPAMI 2015
Principles of Multi-View Clustering
Yan Yang and Hao Wang, Multi-view Clustering: A Survey, Big data mining and analytics, Volume 1, Number 2, June 2018

Multi-view subspace clustering
Definition: learns a unified feature representation from all the view subspaces by assuming that all views share this representation

Deep Matrix Factorization
Li and Tang, MMML 2015

Other Multi-View Clustering Approaches
Yan Yang and Hao Wang, Multi-view Clustering: A Survey, Big data mining and analytics, Volume 1, Number 2, June 2018


Auto-Encoder in Auto-Encoder Network
Deep Canonically Correlated Autoencoders (DCCAE)


Multi-view Latent “Intact” Space

Multimodal alignment
Multimodal alignment
Explicit multimodal-alignment
Explicit alignment – the goal is to find correspondences between modalities
▪ Aligning speech signal to a transcript
▪ Aligning two out-of-sync sequences
▪ Co-referring expressions
Implicit multimodal-alignment
Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
▪ Machine Translation
▪ Cross-modal retrieval
▪ Image & Video Captioning
▪ Visual Question Answering
Explicit alignment
Let’s start unimodal – Dynamic Time Warping
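A minimal dynamic-programming DTW sketch for two 1-D sequences (squared-difference cost, no slope or band constraints), just to make the recursion concrete.

```python
import numpy as np

def dtw(x, y):
    """Return the minimum cumulative alignment cost between sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Each step may match both frames, or skip a frame in x or in y.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))   # 0.0: same shape, different speed
```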

Dynamic Time Warping continued


DTW alternative formulation


Canonical Correlation Analysis reminder


Canonical Time Warping
Canonical Time Warping for Alignment of Human Behavior, Zhou and de la Torre, 2009

Optimized by coordinate descent – fix one set of parameters, optimize the other
Canonical Time Warping for Alignment of Human Behavior, Zhou and de la Torre, 2009, NIPS

Generalized Canonical Time Warping, Zhou and de la Torre, 2016, TPAMI

Deep Canonical Time Warping
Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR


Implicit alignment
Attention models
Recent attention models can be roughly split into three major categories
Soft attention
Acts like a gate function. Deterministic inference.
Transform network
Warps the input to better align with a canonical view.
Hard attention
Includes stochastic processes. Related to reinforcement learning.
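A minimal sketch of the soft-attention case: scores over a set of candidate vectors become a softmax distribution and the output is their weighted sum, so everything stays differentiable (hard attention would instead sample one location and typically needs reinforcement-learning-style training).

```python
import torch
import torch.nn.functional as F

query = torch.randn(64)             # e.g., a decoder state
values = torch.randn(10, 64)        # e.g., 10 image-region or word features

scores = values @ query             # one relevance score per candidate
alphas = F.softmax(scores, dim=0)   # soft attention weights, sum to 1
context = alphas @ values           # differentiable weighted sum (shape: 64)
```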
Soft attention
Machine Translation
Given a sentence in one language, translate it into another
Not exactly a multimodal task – but a good start! Each language can almost be seen as a modality.
Machine Translation with RNNs

Decoder – attention model
Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015


How do we encode attention?

MT with attention

Visual captioning with soft attention
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015

Looking at more fine-grained features

- Allows latent data alignment
- Lets us inspect what the network "sees"
- Can be optimized with backpropagation
Spatial Transformer networks



Glimpse Network (Hard Attention)
Hard attention
Soft attention requires computing a representation for the whole image or sentence
Hard attention, on the other hand, forces the model to look at only one part
The main motivation was reduced computational cost rather than improved accuracy (although that happens a bit as well)
A saccade followed by a glimpse – how the human visual system works
Recurrent Models of Visual Attention, Mnih, 2014; Multiple Object Recognition with Visual Attention, Ba, 2015
Hard attention examples

Glimpse Sensor
Recurrent Models of Visual Attention, Mnih, 2014


Overall Architecture - Emission network

Recurrent model of Visual Attention (RAM)

Multi-modal alignment recap
Explicit alignment – aligns two or more modalities (or views) as an actual task. The goal is to find correspondences between modalities
- Dynamic Time Warping
- Canonical Time Warping
- Deep Canonical Time Warping
Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
- Attention models
- Soft attention
- Spatial transformer networks
- Hard attention
Alignment and Representations
Contextualized Sequence Encoding
Sequence Encoding - Contextualization



Self-Attention

Transformer Multi-Head Self-Attention
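A minimal sketch of scaled dot-product multi-head self-attention with toy dimensions; real Transformer blocks add an output projection, masking, residual connections and layer normalization, which are omitted here.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads = 64, 4
x = torch.randn(1, 5, d_model)                    # (batch, sequence length, model dim)
W_q, W_k, W_v = (nn.Linear(d_model, d_model) for _ in range(3))

Q, K, V = W_q(x), W_k(x), W_v(x)
# Split into heads: (batch, heads, seq, d_model / heads).
Q, K, V = (t.view(1, 5, n_heads, -1).transpose(1, 2) for t in (Q, K, V))

# Every position attends to every position, scaled by sqrt of the per-head dimension.
attn = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_model // n_heads), dim=-1)
out = (attn @ V).transpose(1, 2).reshape(1, 5, d_model)   # concatenate heads back together
```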

Position embeddings


Sequence-to-Sequence Using Transformer
Seq2Seq with Transformer Attentions


Contextualized Multimodal Embedding
Multimodal Embeddings

Multimodal Transformer
Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019

Cross-Modal Transformer

Language Pre-training
Token-level and Sentence-level Embeddings

Pre-Training and Fine-Tuning

BERT: Bidirectional Encoder Representations from Transformers


Pre-training BERT Model


Three Embeddings: Token + Position + Sentence

Fine-Tuning BERT





Multimodal Pre-training
VL-BERT

M-BERT

Alignment and Translation
Alignment for Speech Recognition
Architecture of Speech Recognition
slazebni.cs.illinois.edu/spring17/lec26_audio.pdf

Option 1: Sequence-to-Sequence (Seq2Seq)

Option 2: Seq2Seq with Attention

Option 3: Sequence Labeling with RNN

Speech Alignment
Connectionist Temporal Classification (CTC)
Amodei, Dario, et al. “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin.” (2015)
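A minimal training-loss sketch using PyTorch's built-in CTC loss with placeholder network outputs; shapes follow the torch.nn.CTCLoss convention (time, batch, classes), with the blank symbol at index 0.

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 30            # input frames, batch size, number of labels (incl. blank at index 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)       # per-frame label distributions (placeholder)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)   # target transcripts, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)      # marginalizes over all valid alignments of targets to frames
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```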




CTC Optimization

Visualizing CTC Predictions

Multi-View Video Alignment
Temporal Alignment using Neural Representations

Temporal Cycle-Consistency Learning



Multimodal Translation Visual Question Answering (VQA)
VQA and Attention

Co-attention
Lu et al., Hierarchical Question-Image Co-Attention for Visual Question Answering, NIPS 2016


Hierarchical Co-attention

Stacked Attentions
Yang et al., Stacked Attention Networks for Image Question Answering, CVPR 2016
VQA: Neural Module Networks
Neural Module Network
Andreas et al., Deep Compositional Question Answering with Neural Module Networks, 2016

Predefined Set of Modules

Johnson et al., CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017
End-to-End Neural Module Network
Hu et al., Learning to Reason: End-to-End Module Networks for Visual Question Answering, 2017


VQA: Neural Symbolic Networks
Neural-symbolic VQA
Kexin Yi, et al. “Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding.” NeurIPS 2018

The Neuro-symbolic Concept Learner
Jiayuan Mao, et al. “The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision.” ICLR 2019


Speech-Vision Translation: Applications
Translation 1: Visually indicated sounds
Owens et al. Visually indicated sounds, CVPR, 2016

Translation 2: The Sound of Pixels
Zhao, Hang, et al. “The sound of pixels.”, ECCV 2018


Speech2face

Generative Models
Probabilistic Graphical Models
Definition: A probabilistic graphical model (PGM) is a graph formalism for compactly modeling joint probability distributions and dependence structures over a set of random variables
Inference for Known Joint Probability Distribution


Creating a Graphical Model
Example: Inferring Emotion from Interaction Logs

Example: Bayesian Network Representation

Example: Bayesian Network Approach

Example: Dynamic Bayesian Network Approach


Bayesian Networks
Definition: A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions
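A toy two-node example (a hypothetical Rain → WetGrass network) showing how this compact specification works: the joint distribution is the product of each node's conditional distribution given its parents, and small queries can be answered by enumeration.

```python
# Toy two-node Bayesian network: Rain -> WetGrass.
# The joint factorizes over each node's parents: P(R, W) = P(R) * P(W | R).
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {True: {True: 0.9, False: 0.1},    # P(WetGrass | Rain=True)
                    False: {True: 0.1, False: 0.9}}   # P(WetGrass | Rain=False)

def joint(rain, wet):
    return P_rain[rain] * P_wet_given_rain[rain][wet]

# Inference by enumeration: P(Rain=True | WetGrass=True).
num = joint(True, True)
den = joint(True, True) + joint(False, True)
print(num / den)    # 0.18 / (0.18 + 0.08) ≈ 0.692
```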

Bayesian Network (BN)

Joint Probability in Graphical Models


Conditional Probability Distribution (CPD)

Generative Model: Naïve Bayes Classifier

Dynamic Bayesian Network
- Extends Bayesian networks to represent sequential dependencies.
- Dynamically changing or evolving over time.
- Directed graphical model of stochastic processes.
- Especially aimed at time-series modeling.

Hidden Markov Models

Factorial HMM

The Boltzmann Zipper

The Coupled HMM

Generating Data Using Neural Networks
Variational Autoencoder

Generative Adversarial Network (GAN)

GAN Training


Example: Audio to Scene
Audio to Scene Samples - wjohn1483.github.io

Example: Talking Head
https://arxiv.org/pdf/1905.08233.pdf

Bidirectional GAN

cAE-GAN


Cycle GAN

BiCycle GAN

Discriminative Graphical Models
Quick Recap
Fusion – Probabilistic Graphical Models

Restricted Boltzmann Machines
Deep Multimodal Boltzmann machines
Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014


Restricted Boltzmann Machine (RBM)
Smolensky, Information Processing in Dynamical Systems: Foundations of Harmony Theory, 1986
Undirected Graphical Model
- A generative rather than discriminative model
- Connections from every hidden unit to every visible one
- No connections within a layer (hence “Restricted”), which makes it easier to train and run inference
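A minimal sketch of the RBM conditionals and one block Gibbs step, using random placeholder weights; no contrastive-divergence training is shown.

```python
import torch

n_visible, n_hidden = 6, 4
W = torch.randn(n_visible, n_hidden) * 0.1            # visible-hidden weights (no intra-layer links)
b_v, b_h = torch.zeros(n_visible), torch.zeros(n_hidden)

v = torch.bernoulli(torch.full((n_visible,), 0.5))    # a random binary visible vector

# Because the graph is bipartite, hidden units are conditionally independent given v (and vice versa).
p_h = torch.sigmoid(v @ W + b_h)                       # P(h_j = 1 | v)
h = torch.bernoulli(p_h)
p_v = torch.sigmoid(h @ W.t() + b_v)                   # P(v_i = 1 | h): one block Gibbs step back
```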



Markov Random Fields






Example: Markov Random Field – Graphical Model

Example: Markov Random Field – Factor Graph


Conditional Random Fields
Conditional Random Fields (Factor Graphs)


Conditional Random Fields (Log-linear Model)
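A toy log-linear chain CRF with two labels and three positions: a label sequence is scored with emission and transition features, and the normalizer Z(x) is computed here by brute-force enumeration (a real implementation would use the forward algorithm).

```python
import itertools
import numpy as np

# Tiny linear-chain CRF over 2 labels and a length-3 sequence.
# score(y, x) = sum_t emission[t, y_t] + sum_t transition[y_{t-1}, y_t]
emission = np.array([[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]])   # per-position unary scores (from features of x)
transition = np.array([[0.5, -0.2], [-0.3, 0.7]])           # pairwise label-transition scores

def score(y):
    s = sum(emission[t, y[t]] for t in range(3))
    s += sum(transition[y[t - 1], y[t]] for t in range(1, 3))
    return s

# p(y | x) = exp(score(y, x)) / Z(x); Z(x) sums over all label sequences.
Z = sum(np.exp(score(y)) for y in itertools.product([0, 1], repeat=3))
print(np.exp(score((0, 1, 0))) / Z)
```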

Learning Parameters of a CRF Model

CRFs for Shallow Parsing

Latent-Dynamic CRF



Hidden Conditional Random Field

Multi-view Latent Variable Discriminative Models

CRFs and Deep Learning
Conditional Neural Fields

Deep Conditional Neural Fields

CRF and Bilinear LSTM

CNN and CRF and Bilinear LSTM

Continuous and Fully-Connected CRFs
Continuous Conditional Neural Field


High-Order Continuous Conditional Neural Field

Fully-Connected Continuous Conditional Neural Field

Fully-Connected CRF

CNN and Fully-Connected CRF

Fully Connected Deep Structured Networks
Zheng et al., 2015; Schwing and Urtasun, 2015

Sigurdsson et al., Asynchronous Temporal Fields for Action Recognition, CVPR 2017
Soft-Label Chain CRF
Phrase Grounding by Soft-Label Chain CRF
Liu J, Hockenmaier J. “Phrase Grounding by Soft-Label Chain Conditional Random Field” EMNLP 2019



Fusion, co-learning and new trends
Quick Recap: Multimodal Fusion
Multimodal fusion


Fusion – Probabilistic Graphical Models

Model-free Fusion
Model-agnostic approaches – early fusion
- Easy to implement – just concatenate the features
- Exploit dependencies between features
- Can end up very high dimensional
- More difficult to use if features have different granularities

Model-agnostic approaches – late fusion
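A minimal sketch contrasting the two model-agnostic strategies with scikit-learn and random placeholder features: early fusion concatenates features into one classifier, while late fusion trains one classifier per modality and averages their predicted probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio, X_video = rng.normal(size=(100, 20)), rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)

# Early fusion: concatenate modality features, train a single model.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Late fusion: train one model per modality, then average their predicted probabilities.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
late_probs = (clf_a.predict_proba(X_audio) + clf_v.predict_proba(X_video)) / 2
late_pred = late_probs.argmax(axis=1)
```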

Late Fusion on Multi-Layer Unimodal Classifiers

Multimodal Fusion Architecture Search (MFAS)
Perez-Rua, Vielzeuf, Pateux, Baccouche, Jurie, MFAS: Multimodal Fusion Architecture Search, CVPR 2019
Proposed solution: Explore the search space with Sequential Model-Based Optimization
- Start with simpler models first (all L=1 models) and iteratively increase the complexity (L=2, L=3,…)
- Use a surrogate function to predict performance of unseen architectures
- e.g., the performance of all the L=1 models should give us an idea of how well the L=2 models will perform


Memory-Based Fusion
Zadeh et al., Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018

Local Fusion and Kernel Functions
What is a Kernel function?
A kernel function acts as a similarity metric between data points

Non-linearly separable data

Radial Basis Function Kernel (RBF)

Some other kernels
- Histogram Intersection Kernel: good for histogram features
- String kernels: specifically for text and sentence features
- Proximity distribution kernel
- (Spatial) pyramid matching kernel
Kernel CCA

Transformer’s Attention Function
Tsai et al., Transformer Dissection: An Unified Understanding for Transformer’s Attention via the Lens of Kernel, EMNLP 2019



Multiple Kernel Learning

MKL in Unimodal Case
- Pick a family of kernels and learn which kernels are important for the classification case
- For example a set of RBF and polynomial kernels
MKL in Multimodal/Multiview Case
- Pick a family of kernels for each modality and learn which kernels are important for the classification case
- Does not need to be different modalities, often we use different views of the same modality (HOG, SIFT, etc.)
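A minimal sketch of the multiple-kernel idea with scikit-learn: one RBF kernel per modality (placeholder data), combined with fixed weights and fed to a precomputed-kernel SVM; actual MKL would learn the kernel weights jointly with the classifier.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_audio, X_video = rng.normal(size=(60, 20)), rng.normal(size=(60, 50))
y = rng.integers(0, 2, size=60)

# One RBF kernel per modality (different bandwidths could be added to the family).
K_audio = rbf_kernel(X_audio, gamma=0.05)
K_video = rbf_kernel(X_video, gamma=0.01)

# Fixed convex combination for illustration; MKL would learn these weights.
beta = np.array([0.6, 0.4])
K = beta[0] * K_audio + beta[1] * K_video

clf = SVC(kernel="precomputed").fit(K, y)     # kernel SVM on the combined Gram matrix
```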

Co-Learning
Co-Learning - The 5th Multimodal Challenge
Definition: Transfer knowledge between modalities, including their representations and predictive models.

Co-learning Example with Paired Data
ViCo: Word Embeddings from Visual Co-occurrences

ViCo: Word Embeddings from Visual Co-occurrences

Co-Learning with Paired Data: Multimodal Cyclic Translation
Paul Pu Liang, Hai Pham, et al., “Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities”, AAAI 2019
Co-Learning Example with Weakly Paired Data
End-to-End Learning of Visual Representations from Uncurated Instructional Videos Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman – CVPR 2020

Weakly Paired Data

Multiple Instance Learning Noise Contrastive Estimation
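A minimal sketch of an MIL-NCE-style objective for weakly paired clips and captions, assuming precomputed embeddings: the whole bag of candidate positives shares the numerator of the contrastive softmax (simplified from Miech et al., CVPR 2020).

```python
import torch

def mil_nce_loss(video, pos_text, neg_text, temperature=0.07):
    """
    video:    (B, D)     one embedding per clip
    pos_text: (B, P, D)  P candidate (weakly aligned) captions per clip
    neg_text: (B, N, D)  N negative captions per clip
    """
    pos = torch.einsum('bd,bpd->bp', video, pos_text) / temperature
    neg = torch.einsum('bd,bnd->bn', video, neg_text) / temperature
    # Sum the probability mass over the whole bag of positives, not just the best one.
    num = torch.logsumexp(pos, dim=1)
    den = torch.logsumexp(torch.cat([pos, neg], dim=1), dim=1)
    return -(num - den).mean()

loss = mil_nce_loss(torch.randn(4, 128), torch.randn(4, 3, 128), torch.randn(4, 20, 128))
```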

Research Trend: Few-Shot Learning and Weakly Supervised
Few-Shot Learning in RL Environment
Hill et al., Grounded Language Learning Fast and Slow. arXiv 2020

Grounded Language Learning

Weakly-Supervised Phrase Grounding
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding, EMNLP 2020

Multimodal Alignment Framework


Research Trends in Multimodal ML
- Abstraction and logic
- Multimodal reasoning
- Towards causal inference
- Understanding multimodal models
- Commonsense and coherence
- Social impact - fairness and misinformation
- Emotional and engaging interactions
- Multi-lingual multimodal grounding
Abstraction and Logic
Learning by Abstraction: The Neural State Machine
Hudson, Drew, and Christopher D. Manning. “Learning by abstraction: The neural state machine.“ NeurIPS 2019





Learning by Abstraction: The Neural State Machine

VQA under the Lens of Logic
Gokhale, Tejas, et al. “VQA-LOL: Visual question answering under the lens of logic.“, ECCV 2020


Multimodal Reasoning
Cross-Modality Relevance for Reasoning on Language and Vision
Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020


Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog
Gan, Zhe, et al. “Multi-step reasoning via recurrent dual attention for visual dialog.“ ACL 2019
- Hypothesis: The failure of visual dialog is caused by the inherent weakness of single-step reasoning.
- Intuition: Humans take a first glimpse of an image and a dialog history, before revisiting specific parts of the image/text to understand the multimodal context.
- Proposal: Apply Multi-step reasoning to visual dialog by using a recurrent (aka multi-step) version of attention (aka reasoning). This is done on both text and questions (aka, dual).


Towards Causal Inference
Visual Dialogue Expressed with Causal Graph

Two Causal Principles for Improving Visual Dialog
Qi, Jiaxin, et al. “Two causal principles for improving visual dialog.“ CVPR 2020
This paper identifies two causal principles that are holding back VisDial models.
- Harmful shortcut bias between dialog history (H) and the answer (A)
- Unobserved confounder between H, Q and A leading to spurious correlations.
By identifying and addressing these principles in a model-agnostic manner, they are able to promote any VisDial model to SOTA levels.




Studying Biases in VQA Models
Agarwal, Vedika, Rakshith Shetty, and Mario Fritz. “Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing.”


Understanding Multimodal Models
Introspecting VQA Models with Sub-Questions
Selvaraju, Ramprasaath R., et al. “SQuINTing at VQA Models: Introspecting VQA Models With Sub-Questions.”, CVPR 2020

New Dataset

SQuINTing Model

Training Multimodal Networks


Commonsense and Coherence
Emotions are Often Context Dependent
“COSMIC: COmmonSense knowledge for eMotion Identification in Conversations”, Findings of EMNLP 2020

Commonsense and Emotion Recognition
Proposed approach (COSMIC):
For each utterance, try to infer
- speaker’s intention
- effect on the speaker/listener
- reaction of the speaker/listener

Proposed Model (COSMIC)



Coherence and Commonsense
Coherence relations provide information about how the content of discourse units relate to one another.
They have been used to predict commonsense inference in text.
Cross-modal Coherence Modeling for Caption Generation
Cross-modal Coherence Modeling for Caption Generation ACL 2020

Social Impact – Fairness and Misinformation
Fair Representation Learning
Pena et al., Bias in Multimodal AI: A Testbed for Fair Automatic Recruitment. ICMI 2020

Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News, EMNLP 2020

Emotional and Engaging Interactions
Dialogue Act Classification (DAC)
“Towards Emotion-aided Multi-modal Dialogue Act Classification”, ACL 2020

Image-Chat: Engaging Grounded Conversations
Shuster, Kurt, et al. “Image-chat: Engaging grounded conversations.“ ACL 2020

Multi-Lingual Multimodal Grounding
Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge – EMNLP 2020

Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data
Cogswell, Michael, et al. “Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data.”

Connecting Language to Actions
What does interaction mean?

Sequential and Online Modeling

Planning

V+L -> A

First Major Question: Alignment
Ma et al, “Self-Monitoring Navigation Agent via Auxiliary Progress Estimation” ICLR 2019

Alignment

Lots of Data
Ku et al. Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding — EMNLP 2020

What if you make a mistake?
Ke 2019, Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation - CVPR 2019
Why does this question matter?
Because in general, we can’t supervise everything
Seven High-level Tasks

Data collection

End-to-End Models

A Shared Semantic Space
Paxton et al. Prospection: Interpretable Plans From Language By Predicting the Future ICRA 2019

Predicting the Future

Objectives

Embodiment
- Choose your own adventure — Lots of noise
- What does it mean to succeed?
- Where do concepts come from?
- What’s the role of exploration?
- Language is woefully underspecified
Multimodal Human-inspired Language Learning
Approach

Workflow

Proposed Model: Vision-language Pre-training

Recognition Model using pre-trained VLP

Grounded Multimodal Learning
Children learn in a multimodal environment. We investigate human-like learning in the following perspectives:
● Association of new information to previous (past) knowledge
● Generalization of learned knowledge to unseen (future) concepts
○ Zero-shot compositionality of the learned concepts
■ Blue + Dog -> Blue dog !?
Model


New work in progress: Video-Text Coref




Two Approaches to Latent Structure Learning

Latent Tree Formalism: Lexicalized PCFG

Probabilistic Model of Lexicalized PCFG

Latent Tree Learning: Baselines
Baselines:
○ DMV (Klein and Manning, 2004): generative model of dependency structures.
○ Compound PCFG (Kim et al., 2019): neural model to parameterize probabilistic context-free grammar using sentence-by-sentence parameters and variational training.
○ Compound PCFG w/ right-headed rule: takes the predictions of the Compound PCFG and chooses the head of the right child as the head of the parent.
○ ON-LSTM (Shen et al., 2019) and PRPN (Shen et al., 2018): two unsupervised constituency parsing models
○ VGNSL (Shi et al., 2019): unsupervised constituency parsing model with image information
Latent Template Learning: Concept

Latent Template Learning: Generative Model

Learning of the Latent Template Model



Generation Conditioned on Templates

Automatic Speech Recognition
EESEN: https://github.com/srvk/eesen
- Pre-trained models from more established speech recognition corpora (LibriSpeech and Switchboard in our case)
- 3 models: ESPnet, EESEN-WFST, and EESEN-RNNLM decoding, trained with CTC loss
- Major Challenges:
- fully annotated transcriptions are not available for evaluation
- much noisier than the pre-training datasets
- multiple speakers present
ESPnet architecture (Watanabe et al., 2018)

ASR: Seedling Dataset Samples(ESPnet vs EESEN)
ESPnet: Hey, do you want to play anything or read a book or anything a book? Okay, which book which book you want to read? The watch one little baby who is born far away. And another who is born on the very next day. And both of these babies as everyone knows.
Turn the Page. Had Ten Little Fingers ten fingers and ten little toes. There was only there was one little baby who is born in a town and another who is wrapped in either down. And both of these babies as everyone knows add ten little fingers and ten.
Have you any water recently? Get some water, please. Get some water please some water. Yeah water is delicious. Why don’t you have some? Give me some water, please.
There was one little baby who is born in the house and another who Snuffer suffered from sneezes and chills. And both of these babies with everyone knows. at ten little fingers and ten little toes just like
Object Detection: Methodology

Multimodal association in SeedlingS Corpus

Coherence and Grounding in Multimodal Communication
Commonsense and Coherence
Architecture

Grounding





