Introduction
Machine Learning with Graphs
Why is Graph Deep Learning Hard?
- Networks are complex
- Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
- No fixed node ordering or reference point
- Often dynamic and have multimodal features
Deep Learning in Graphs
Representation Learning
(Supervised) Machine Learning Lifecycle: hand-designing this feature and that feature, every single time!
Map nodes to d-dimensional embeddings such that similar nodes in the network are embedded close together
Course Outline
We are going to cover various topics in Machine Learning and Representation Learning for graph structured data:
- Traditional methods: Graphlets, Graph Kernels
- Methods for node embeddings: DeepWalk, Node2Vec
- Graph Neural Networks: GCN, GraphSAGE, GAT, Theory of GNNs
- Knowledge graphs and reasoning: TransE, BetaE
- Deep generative models for graphs: GraphRNN
- Applications to Biomedicine, Science, Industry
Applications of Graph ML
Different Types of Tasks
Classic Graph ML Tasks
- Node classification: Predict a property of a node
- Example: Categorize online users / items
- Link prediction: Predict whether there are missing links between two nodes
- Example: Knowledge graph completion
- Graph classification: Categorize different graphs
- Example: Molecule property prediction
- Clustering: Detect if nodes form a community
- Example: Social circle detection
- Other tasks:
- Graph generation: Drug discovery
- Graph evolution: Physical simulation
Node-level ML Tasks
Protein Folding
A protein chain acquires its native 3D structure
The Protein Folding Problem
Computationally predict a protein’s 3D structure based solely on its amino acid sequence
Alphafold: Impact
- Key idea: “Spatial graph”
- Nodes: Amino acids in a protein sequence
- Edges: Proximity between amino acids (residues)
Edge-level ML Tasks
Recommender Systems
- Users interact with items
- Watch movies, buy merchandise, listen to music
- Nodes: Users and items
- Edges: User-item interactions
- Goal: Recommend items users might like
Pinsage: Graph-based Recommender
Ying et al., Graph Convolutional Neural Networks for Web-Scale Recommender Systems, KDD 2018
Drug Side Effects
- Many patients take multiple drugs to treat complex or co-existing diseases:
- 46% of people ages 70-79 take more than 5 drugs
- Many patients take more than 20 drugs to treat heart disease, depression, insomnia, etc.
- Task: Given a pair of drugs predict adverse side effects
Biomedical Graph Link Prediction
Subgraph-level ML Tasks
Traffic Prediction
Road Network as a Graph
- Nodes: Road segments
- Edges: Connectivity between road segments
- Prediction: Estimated Time of Arrival (ETA)
Graph-level ML Tasks
Drug Discovery
- Antibiotics are small molecular graphs
- Nodes: Atoms
- Edges: Chemical bonds
Deep Learning for Antibiotic Discovery
Stokes et al., A Deep Learning Approach to Antibiotic Discovery, Cell 2020
- A Graph Neural Network graph classification model
- Predict promising molecules from a pool of candidates
Molecule Generation/Optimization
Physics Simulation
Sanchez-Gonzalez et al., Learning to simulate complex physics with graph networks, ICML 2020
- Physical simulation as a graph:
- Nodes: Particles
- Edges: Interaction between particles
Simulation Learning Framework
A graph evolution task:
- Goal: Predict how a graph will evolve over time
Choice of Graph Representation
Components of a Network
Choosing a Proper Representation
- If you connect individuals that work with each other, you will explore a professional network
- If you connect those that have a sexual relationship, you will be exploring sexual networks
- If you connect scientific papers that cite each other, you will be studying the citation network
- If you connect all papers with the same word in the title, what will you be exploring? It is a network, nevertheless
How do you define a graph?
- How to build a graph:
- What are nodes?
- What are edges?
- Choice of the proper network representation of a given domain/problem determines our ability to use networks successfully:
- In some cases, there is a unique, unambiguous representation
- In other cases, the representation is by no means unique
- The way you assign links will determine the nature of the question you can study
Directed vs Undirected Graphs
Heterogeneous Graphs
Node Degrees
Bipartite Graph
- Bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets
- Examples:
- Authors-to-Papers (they authored)
- Actors-to-Movies (they appeared in)
- Users-to-Movies (they rated)
- Recipes-to-Ingredients (they contain)
- “Folded” networks:
- Author collaboration networks
- Movie co-rating networks
Folded/Projected Bipartite Graphs
Representing Graphs: Adjacency Matrix
Networks are Sparse Graphs
Representing Graphs: Edge list
Representing Graphs: Adjacency list
Adjacency list:
- Easier to work with if the network is large and sparse
- Allows us to quickly retrieve all neighbors of a given node
1:
2: 3, 4
3: 2, 4
4: 5
5: 1, 2
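As a minimal sketch, the adjacency list above can be stored as a plain Python dict, which makes retrieving a node's neighbors a constant-time lookup:

```python
# The adjacency list above as a plain Python dict (directed: node 1 has no out-neighbors).
adj = {
    1: [],
    2: [3, 4],
    3: [2, 4],
    4: [5],
    5: [1, 2],
}

print(adj[3])   # retrieve all neighbors of node 3 -> [2, 4]
```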
Node and Edge Attributes
- Possible options:
- Weight (e.g., frequency of communication)
- Ranking (best friend, second best friend…)
- Type (friend, relative, co-worker)
- Sign: Friend vs. Foe, Trust vs. Distrust
- Properties depending on the structure of the rest of the graph: Number of common friends
More Types of Graphs
Connectivity of Undirected Graphs
- Connected (undirected) graph:
- Any two vertices can be joined by a path
- A disconnected graph is made up by two or more connected components
Connectivity: Example
The adjacency matrix of a network with several components can be written in a block-diagonal form, so that nonzero elements are confined to squares, with all other elements being zero:
Connectivity of Directed Graphs
- Strongly connected directed graph
- has a path from each node to every other node and vice versa (e.g., A-B path and B-A path)
- Weakly connected directed graph
- is connected if we disregard the edge directions
Strongly connected components (SCCs) can be identified, but not every node is part of a nontrivial strongly connected component.
Traditional Methods for Machine Learning in Graphs
Traditional Methods for Machine Learning in Graphs
Traditional ML Pipeline
Design features for nodes/links/graphs
Obtain features for all training data
Feature Design
Using effective features over graphs is the key to achieving good model performance.
Traditional ML pipeline uses hand-designed features.
In this lecture, we overview the traditional features for:
- Node-level prediction
- Link-level prediction
- Graph-level prediction
For simplicity, we focus on undirected graphs.
Goal: Make predictions for a set of objects
Design choices:
- Features: d-dimensional vectors
- Objects: Nodes, edges, sets of nodes, entire graphs
- Objective function:
- What task are we aiming to solve?
Node-level Tasks and Features
Node-level Features: Overview
- Goal: Characterize the structure and position of a node in the network:
- Node degree
- Node centrality
- Clustering coefficient
- Graphlets
Node Degree
Node Centrality
- Node degree counts the neighboring nodes without capturing their importance.
- Node centrality 𝑐_𝑣 takes the node importance in a graph into account
- Different ways to model importance:
- Eigenvector centrality
- Betweenness centrality
- Closeness centrality
- and many others…
Eigenvector centrality
Betweenness centrality
Closeness centrality
Clustering Coefficient
Graphlets
Goal: Describe network structure around node 𝑢
- Graphlets are small subgraphs that describe the structure of node 𝑢’s network neighborhood
Analogy:
Degree counts #(edges) that a node touches
Clustering coefficient counts #(triangles) that a node touches.
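A minimal sketch, assuming the NetworkX library, of the degree, centrality, and clustering-coefficient features discussed in this section, computed on a standard example graph (graphlet counts need specialized tooling and are omitted):

```python
import networkx as nx

G = nx.karate_club_graph()                     # small standard social network

degree      = dict(G.degree())                 # node degree
eigenvector = nx.eigenvector_centrality(G)     # eigenvector centrality
betweenness = nx.betweenness_centrality(G)     # betweenness centrality
closeness   = nx.closeness_centrality(G)       # closeness centrality
clustering  = nx.clustering(G)                 # clustering coefficient

v = 0
print(degree[v], eigenvector[v], betweenness[v], closeness[v], clustering[v])
```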
Graphlet Degree Vector (GDV): Graphlet-based features for nodes
- GDV counts #(graphlets) that a node touches
Considering graphlets of size 2-5 nodes we get:
- Vector of 73 coordinates is a signature of a node that describes the topology of the node’s neighborhood
Graphlet degree vector provides a measure of a node’s local network topology:
- Comparing vectors of two nodes provides a more detailed measure of local topological similarity than node degrees or clustering coefficient.
Induced Subgraph Isomorphism
Def: Induced subgraph is another graph, formed from a subset of vertices and all of the edges connecting the vertices in that subset.
Def: Graph Isomorphism
- Two graphs which contain the same number of nodes connected in the same way are said to be isomorphic.
Graphlet Degree Vector (GDV): A count vector of graphlets rooted at a given node
Node-level Feature: Summary
We have introduced different ways to obtain node features.
They can be categorized as:
- Importance-based features:
- Node degree
- Different node centrality measures
- Structure-based features:
- Node degree
- Clustering coefficient
- Graphlet count vector
Importance-based features: capture the importance of a node in a graph
- Node degree:
- Simply counts the number of neighboring nodes
- Node centrality:
- Models importance of neighboring nodes in a graph
- Different modeling choices: eigenvector centrality, betweenness centrality, closeness centrality
Importance-based features are useful for predicting influential nodes in a graph
- Example: predicting celebrity users in a social network
Structure-based features: Capture topological properties of local neighborhood around a node.
- Node degree:
- Counts the number of neighboring nodes
- Clustering coefficient:
- Measures how connected neighboring nodes are
- Graphlet degree vector:
- Counts the occurrences of different graphlets
Structure-based features are useful for predicting a particular role a node plays in a graph:
- Example: Predicting protein functionality in a protein-protein interaction network
Link Prediction Task and Features
Recap
- The task is to predict new links based on the existing links.
- At test time, node pairs (with no existing links) are ranked, and top 𝐾 node pairs are predicted.
- The key is to design features for a pair of nodes.
Two formulations of the link prediction task:
Link Prediction via Proximity
- Methodology:
- For each pair of nodes (x,y) compute score c(x,y)
- For example, c(x,y) could be the # of common neighbors of x and y
- Sort pairs (x,y) by the decreasing score c(x,y)
- Predict top n pairs as new links
- See which of these links actually appear in 𝐺[𝑡1, 𝑡1′]
Link-level Features: Overview
- Distance-based feature
- Local neighborhood overlap
- Global neighborhood overlap
Distance-based Features
Shortest-path distance between two nodes
However, this does not capture the degree of neighborhood overlap:
- Node pair (B, H) has 2 shared neighboring nodes, while pairs (B, E) and (A, B) only have 1 such node
Local Neighborhood Overlap
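A minimal sketch, assuming NetworkX, of typical local neighborhood-overlap scores for a node pair; the specific scores shown (common neighbors, Jaccard coefficient, Adamic-Adar) are standard choices assumed here for illustration:

```python
import math
import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")])
u, v = "A", "C"
Nu, Nv = set(G[u]), set(G[v])

common_neighbors = len(Nu & Nv)                                 # |N(u) ∩ N(v)|
jaccard = len(Nu & Nv) / len(Nu | Nv)                           # normalized overlap
adamic_adar = sum(1 / math.log(G.degree(z)) for z in Nu & Nv)   # shared hubs count less

print(common_neighbors, jaccard, adamic_adar)
```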
Global Neighborhood Overlap
Limitation of local neighborhood features:
Metric is always zero if the two nodes do not have any neighbors in common.
However, the two nodes may still potentially be connected in the future.
Global neighborhood overlap metrics resolve the limitation by considering the entire graph.
- Katz index: count the number of walks of all lengths between a given pair of nodes.
- Q: How to compute #walks between two nodes?
- Use powers of the graph adjacency matrix!
Intuition: Powers of Adj Matrices
Computing #walks between two nodes
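A minimal NumPy sketch of the Katz index: since A^l[u, v] counts walks of length l between u and v, the geometric series over all lengths has the closed form (I − βA)⁻¹ − I, which converges when β is below the reciprocal of A's largest eigenvalue (the example matrix and β are illustrative):

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

beta = 0.1                                  # discount factor for longer walks
I = np.eye(A.shape[0])
S = np.linalg.inv(I - beta * A) - I         # Katz scores for every node pair

print(S[0, 2])                              # Katz index between nodes 0 and 2
```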
Summary
- Distance-based features:
- Uses the shortest path length between two nodes but does not capture how neighborhood overlaps.
- Local neighborhood overlap:
- Captures how many neighboring nodes are shared by two nodes.
- Becomes zero when no neighbor nodes are shared.
- Global neighborhood overlap:
- Uses global graph structure to score two nodes.
- Katz index counts #walks of all lengths between two nodes.
Graph-level Features and Graph Kernels
Graph-level Features
Goal: We want features that characterize the structure of an entire graph
Background: Kernel Methods
- Kernel methods are widely-used for traditional ML for graph-level prediction.
- Idea: Design kernels instead of feature vectors.
- A quick introduction to Kernels:
Overview
- Graph Kernels: Measure similarity between two graphs:
- Graphlet Kernel [1]
- Weisfeiler-Lehman Kernel [2]
- Other kernels are also proposed in the literature (beyond the scope of this lecture)
- Random-walk kernel
- Shortest-path graph kernel
- And many more…
[1] Shervashidze, Nino, et al. “Efficient graphlet kernels for large graph comparison.” Artificial Intelligence and Statistics. 2009.
[2] Shervashidze, Nino, et al. “Weisfeiler-lehman graph kernels.” Journal of Machine Learning Research 12.9 (2011).
Graph Kernel: Key Idea
- Goal: Design graph feature vector 𝜙(𝐺)
- Key idea: Bag-of-Words (BoW) for a graph
- Recall: BoW simply uses the word counts as features for documents (no ordering considered).
- Naïve extension to a graph: Regard nodes as words.
- Since both graphs have 4 red nodes, we get the same feature vector for two different graphs…
What if we use Bag of node degrees?
Both Graphlet Kernel and Weisfeiler-Lehman (WL) Kernel use Bag-of-* representation of graph, where * is more sophisticated than node degrees!
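Before the graphlet and WL kernels, a minimal sketch (NetworkX and NumPy assumed) of the bag-of-node-degrees baseline mentioned above:

```python
import networkx as nx
import numpy as np

def degree_histogram_feature(G, max_degree=5):
    # Count how many nodes have each degree (degrees above max_degree are clipped).
    counts = np.zeros(max_degree + 1)
    for _, d in G.degree():
        counts[min(d, max_degree)] += 1
    return counts

G1 = nx.path_graph(4)    # degrees: 1, 2, 2, 1
G2 = nx.cycle_graph(4)   # degrees: 2, 2, 2, 2
print(degree_histogram_feature(G1))   # [0. 2. 2. 0. 0. 0.]
print(degree_histogram_feature(G2))   # [0. 0. 4. 0. 0. 0.]
```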
Graphlet Features
- Key idea: Count the number of different graphlets in a graph.
- Note: Definition of graphlets here is slightly different from node-level features.
- The two differences are:
- Nodes in graphlets here do not need to be connected (allows for isolated nodes)
- The graphlets here are not rooted.
- Examples in the next slide illustrate this.
Graphlet Kernel
Limitations: Counting graphlets is expensive!
Counting size-𝑘 graphlets for a graph of size 𝑛 by enumeration takes 𝑂(𝑛^𝑘) time.
This is unavoidable in the worst-case since subgraph isomorphism test (judging whether a graph is a subgraph of another graph) is NP-hard.
If a graph’s node degree is bounded by 𝑑, an 𝑂(𝑛𝑑^(𝑘−1)) algorithm exists to count all the graphlets of size 𝑘.
Can we design a more efficient graph kernel?
Weisfeiler-Lehman Kernel
- Goal: Design an efficient graph feature descriptor 𝜙(𝐺)
- Idea: Use neighborhood structure to iteratively enrich node vocabulary.
- Generalized version of Bag of node degrees since node degrees are one-hop neighborhood information.
- Algorithm to achieve this: Color refinement
Example of color refinement given two graphs
After color refinement, WL kernel counts number of nodes with a given color.
The WL kernel value is computed by the inner product of the color count vectors:
- WL kernel is computationally efficient
- The time complexity for color refinement at each step is linear in #(edges), since it involves aggregating neighboring colors.
- When computing a kernel value, only colors that appear in the two graphs need to be tracked.
- Thus, #(colors) is at most the total number of nodes.
- Counting colors takes linear-time w.r.t. #(nodes).
- In total, time complexity is linear in #(edges)
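A minimal sketch of K-step color refinement and the resulting bag-of-colors kernel value; Python's built-in hash is used here as a stand-in for an injective color-relabeling function, which is an implementation assumption:

```python
from collections import Counter
import networkx as nx

def wl_color_counts(G, K=3):
    colors = {v: 0 for v in G}                       # start with a uniform color
    for _ in range(K):
        # New color = hash of (own color, sorted multiset of neighbor colors)
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in G[v]))))
                  for v in G}
    return Counter(colors.values())                  # bag-of-colors

def wl_kernel(G1, G2, K=3):
    c1, c2 = wl_color_counts(G1, K), wl_color_counts(G2, K)
    return sum(c1[c] * c2[c] for c in c1)            # inner product of color counts

print(wl_kernel(nx.cycle_graph(5), nx.path_graph(5)))
```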
Summary
Graphlet Kernel
- Graph is represented as Bag-of-graphlets
- Computationally expensive
Weisfeiler-Lehman Kernel
- Apply 𝐾-step color refinement algorithm to enrich node colors
- Different colors capture different 𝐾-hop neighborhood structures
- Graph is represented as Bag-of-colors
- Computationally efficient
- Closely related to Graph Neural Networks (as we will see!)
Traditional ML Pipeline
- Hand-crafted feature + ML model
Hand-crafted features for graph data
- Node-level:
- Node degree, centrality, clustering coefficient, graphlets
- Link-level:
- distance-based feature
- local/global neighborhood overlap
- Graph-level:
- Graphlet kernel, WL kernel
Node Embeddings
Goal: Efficient task-independent feature learning for machine learning with graphs!
Why Embedding?
- Task: Map nodes into an embedding space
- Similarity of embeddings between nodes indicates their similarity in the network. For example:
- Both nodes are close to each other (connected by an edge)
- Encode network information
- Potentially used for many downstream predictions
2D embedding of nodes of the Zachary’s Karate Club network:
Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014
Node Embeddings: Encoder and Decoder
Setup
- Assume we have a graph G:
- V is the vertex set.
- A is the adjacency matrix (assume binary).
- For simplicity: No node features or extra information is used
Embedding Nodes
Goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the graph
Learning Node Embeddings
- Encoder maps from nodes to embeddings
- Define a node similarity function (i.e., a measure of similarity in the original network)
- Decoder 𝐃𝐄𝐂 maps from embeddings to the similarity score
- Optimize the parameters of the encoder so that:
Two Key Components
“Shallow” Encoding
Simplest encoding approach: Encoder is just an embedding-lookup
- Each node is assigned a unique embedding vector (i.e., we directly optimize the embedding of each node)
- Many methods: DeepWalk, node2vec
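A minimal PyTorch sketch of shallow encoding: the encoder is just a row lookup in a trainable embedding matrix Z (the sizes are illustrative):

```python
import torch
import torch.nn as nn

num_nodes, d = 100, 16
Z = nn.Embedding(num_nodes, d)       # one trainable d-dimensional vector per node

node_ids = torch.tensor([0, 5, 7])
z = Z(node_ids)                      # ENC(v) = the row of Z for node v
print(z.shape)                       # torch.Size([3, 16])
```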
Framework Summary
Encoder + Decoder Framework
How to Define Node Similarity
- Key choice of methods is how they define node similarity.
- Should two nodes have a similar embedding if they…
- are linked?
- share neighbors?
- have similar “structural roles”?
- We will now learn node similarity definition that uses random walks, and how to optimize embeddings for such a similarity measure.
Note on Node Embeddings
- This is an unsupervised/self-supervised way of learning node embeddings.
- We are not utilizing node labels
- We are not utilizing node features
- The goal is to directly estimate a set of coordinates (i.e., the embedding) of a node so that some aspect of the network structure (captured by DEC) is preserved.
- These embeddings are task independent
- They are not trained for a specific task but can be used for any task.
Random Walk Approaches for Node Embeddings
Notation
Random Walk
Random-walk Embeddings
Why Random Walks?
- Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information
- Idea: if a random walk starting from node 𝑢 visits 𝑣 with high probability, 𝑢 and 𝑣 are similar (high-order multi-hop information)
- Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks
Unsupervised Feature Learning
- Intuition: Find embedding of nodes in 𝑑-dimensional space that preserves similarity
- Idea: Learn node embedding such that nearby nodes are close together in the network
- Given a node 𝑢, how do we define nearby nodes?
Feature Learning as Optimization
But doing this naively is too expensive!
Negative Sampling
Stochastic Gradient Descent
Stochastic Gradient Descent: Instead of evaluating gradients over all examples, evaluate it for each individual training example.
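A minimal PyTorch sketch of one SGD step on a random-walk objective with negative sampling; the node pair, the number of negatives k, and the learning rate are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

num_nodes, d, k = 100, 16, 5
Z = torch.randn(num_nodes, d, requires_grad=True)   # embedding matrix
opt = torch.optim.SGD([Z], lr=0.01)

u, v = 3, 17                                        # a pair that co-occurred on a walk
neg = torch.randint(0, num_nodes, (k,))             # k negative samples

pos_score = (Z[u] * Z[v]).sum()                     # dot-product similarity
neg_scores = Z[neg] @ Z[u]
loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_scores).sum()

opt.zero_grad()
loss.backward()
opt.step()
```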
Random Walks: Summary
How should we randomly walk?
- So far we have described how to optimize embeddings given a random walk strategy R
- What strategies should we use to run these random walks?
- Simplest idea: Just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2014)
- The issue is that such a notion of similarity is too constrained
- How can we generalize this?
Reference: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD
Overview of node2vec
Reference: Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD
- Goal: Embed nodes with similar network neighborhoods close in the feature space.
- We frame this goal as a maximum likelihood optimization problem, independent of the downstream prediction task.
- Key observation: Flexible notion of network neighborhood 𝑁_𝑅(𝑢) of node 𝑢 leads to rich node embeddings
- Develop a biased 2nd-order random walk 𝑅 to generate the network neighborhood 𝑁_𝑅(𝑢) of node 𝑢
node2vec: Biased Walks
Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016).
BFS vs. DFS
- Biased fixed-length random walk 𝑹 that given a node 𝒖 generates neighborhood 𝑵_𝑹(𝒖)
- Two parameters:
- Return parameter 𝒑:
- Return back to the previous node
- In-out parameter 𝒒:
- Moving outwards (DFS) vs. inwards (BFS)
- Intuitively, 𝑞 is the “ratio” of BFS vs. DFS
Biased Random Walks
- Biased 2nd-order random walks explore network neighborhoods:
- The random walk just traversed edge (𝑠1, 𝑤) and is now at 𝑤
- Insight: Neighbors of 𝑤 can only be:
Idea: Remember where the walk came from
- Walker came over edge (𝐬𝟏, 𝐰) and is at 𝐰. Where to go next?
- 𝑝, 𝑞 model transition probabilities
- 𝑝 … return parameter
- 𝑞 … “walk away” parameter
node2vec algorithm
1) Compute random walk probabilities
2) Simulate 𝑟 random walks of length 𝑙 starting from each node 𝑢
3) Optimize the node2vec objective using Stochastic Gradient Descent
Linear-time complexity
All 3 steps are individually parallelizable
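A minimal sketch of the biased 2nd-order walk step; the helper next_step and the chosen p, q values are illustrative assumptions rather than the reference node2vec implementation:

```python
import random
import networkx as nx

def next_step(G, prev, curr, p=1.0, q=1.0):
    neighbors = list(G[curr])
    weights = []
    for x in neighbors:
        if x == prev:                  # return to the previous node
            weights.append(1.0 / p)
        elif G.has_edge(x, prev):      # stay at the same distance from prev (BFS-like)
            weights.append(1.0)
        else:                          # move further away from prev (DFS-like)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

G = nx.karate_club_graph()
walk = [0, list(G[0])[0]]              # first step is unbiased
for _ in range(8):
    walk.append(next_step(G, walk[-2], walk[-1], p=1.0, q=0.5))
print(walk)
```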
Other Random Walk Ideas
- Different kinds of biased random walks:
- Alternative optimization schemes:
- Network preprocessing techniques:
Summary so far
Core idea: Embed nodes so that distances in embedding space reflect node similarities in the original network.
Different notions of node similarity:
- Naïve: similar if two nodes are connected
- Neighborhood overlap (covered in Lecture 2)
- Random walk approaches (covered today)
So which method should I use?
No one method wins in all cases….
- E.g., node2vec performs better on node classification while alternative methods perform better on link prediction (Goyal and Ferrara, 2017 survey).
Random walk approaches are generally more efficient.
In general: Must choose definition of node similarity that matches your application.
Embedding Entire Graphs
Embedding Entire Graphs
Approach 1
Simple (but effective) approach 1:
- Run a standard graph embedding technique on the (sub)graph 𝐺.
- Then just sum (or average) the node embeddings in the (sub)graph 𝐺.
Used by Duvenaud et al., 2016 to classify molecules based on their graph structure
Approach 2
Approach 2: Introduce a “virtual node” to represent the (sub)graph and run a standard graph embedding technique
Proposed by Li et al., 2016 as a general technique for subgraph embedding
Approach 3: Anonymous Walk Embeddings
States in anonymous walks correspond to the index of the first time we visited the node in a random walk
Anonymous Walk Embeddings, ICML 2018
Agnostic to the identity of the nodes visited (hence anonymous)
Example: Random walk w1 :
Number of Walks Grows
Simple Use of Anonymous Walks
Simulate anonymous walks 𝑤𝑖 of 𝑙 steps and record their counts.
Represent the graph as a probability distribution over these walks.
For example:
- Set 𝑙 = 3
- Then we can represent the graph as a 5-dim vector
- Since there are 5 anonymous walks 𝑤𝑖 of length 3: 111, 112, 121, 122, 123
- 𝒛_𝑮[𝑖] = probability of anonymous walk 𝑤𝑖 in graph 𝐺.
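A minimal sketch (NetworkX assumed) of converting sampled random walks into anonymous walks and estimating the distribution 𝒛_𝑮; the walk length and sample size are illustrative:

```python
import random
from collections import Counter
import networkx as nx

def anonymize(walk):
    # Replace each node by the index of its first visit: [C, A, C] -> (1, 2, 1)
    first_seen = {}
    return tuple(first_seen.setdefault(v, len(first_seen) + 1) for v in walk)

def random_walk(G, start, length):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(G[walk[-1]])))
    return walk

G = nx.karate_club_graph()
m = 1000
counts = Counter(anonymize(random_walk(G, random.choice(list(G)), 3)) for _ in range(m))
z_G = {w: c / m for w, c in counts.items()}   # distribution over anonymous walks
print(z_G)
```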
Sampling anonymous walks: Generate independently a set of 𝑚 random walks.
Represent the graph as a probability distribution over these walks.
How many random walks 𝑚 do we need?
New idea: Learn Walk Embeddings
Rather than simply representing each walk by the fraction of times it occurs, we learn embedding 𝒛_𝒊 of anonymous walk 𝒘_𝒊 .
How to embed walks?
Idea: Embed walks s.t. the next walk can be predicted
Anonymous Walk Embeddings, ICML 2018
Summary
- We discussed 3 approaches to graph embeddings:
- Approach 1: Embed nodes and sum/avg them
- Approach 2: Create super-node that spans the (sub) graph and then embed that node.
- Approach 3: Anonymous Walk Embeddings
- Idea 1: Sample the anon. walks and represent the graph as fraction of times each anon walk occurs.
- Idea 2: Learn graph embedding together with anonymous walk embeddings.
Preview: Hierarchical Embeddings
- We will discuss more advanced ways to obtain graph embeddings in Lecture 8.
- We can hierarchically cluster nodes in graphs, and sum/avg the node embeddings according to these clusters.
How to Use Embeddings
- Encoder-decoder framework:
- Encoder: embedding lookup
- Decoder: predict score based on embedding to match node similarity
- Node similarity measure: (biased) random walk
- Examples: DeepWalk, Node2Vec
- Extension to Graph embedding: Node embedding aggregation and Anonymous Walk Embeddings
Graph as Matrix: PageRank, Random Walks and Embeddings
PageRank (a.k.a. the Google Algorithm)
Example: The Web as a Graph
- Q: What does the Web “look like” at a global level?
- Web as a graph:
- Nodes = web pages
- Edges = hyperlinks
- Side issue: What is a node?
- Dynamic pages created on the fly
- “dark matter” – inaccessible database generated pages
What Does the Web Look Like?
- How is the Web linked?
- What is the “map” of the Web?
- Web as a directed graph [Broder et al. 2000]:
- Given node v, what nodes can v reach?
- What other nodes can reach v?
Ranking Nodes on the Graph
- Not all web pages are equally “important”: www.thispersondoesnotexist.com vs. www.stanford.edu
- There is large diversity in the web-graph node connectivity.
- So, let’s rank the pages using the web graph link structure!
Link Analysis Algorithms
- We will cover the following Link Analysis approaches to compute the importance of nodes in a graph:
- PageRank
- Personalized PageRank (PPR)
- Random Walk with Restarts
Links as Votes
- Idea: Links as votes
- Page is more important if it has more links
- In-coming links? Out-going links?
- Think of in-links as votes:
- www.stanford.edu has 23,400 in-links
- thispersondoesnotexist.com has 1 in-link
- Are all in-links equal?
- Links from important pages count more
- Recursive question!
PageRank: The “Flow” Model
- A “vote” from an important page is worth more:
- Each link’s vote is proportional to the importance of its source page
- If page 𝑖 with importance 𝑟_𝑖 has 𝑑_𝑖 out-links, each link gets 𝑟_𝑖 / 𝑑_𝑖 votes
- Page 𝑗’s own importance 𝑟_𝑗 is the sum of the votes on its in-links
PageRank: Matrix Formulation
Connection to Random Walk
- Imagine a random web surfer:
- At any time 𝒕, surfer is on some page 𝑖
- At time 𝒕 + 𝟏, the surfer follows an out-link from 𝒊 uniformly at random
- Ends up on some page 𝒋 linked from 𝒊
- Process repeats indefinitely
- Let:
- 𝒑(𝒕) … vector whose 𝑖^th coordinate is the prob. that the surfer is at page 𝑖 at time 𝑡
- So, 𝒑(𝒕) is a probability distribution over pages
The Stationary Distribution
Recall Eigenvector of A Matrix
Summary
PageRank:
- Measures importance of nodes in a graph using the link structure of the web
- Models a random web surfer using the stochastic adjacency matrix 𝑴
- PageRank solves 𝒓 = 𝑴𝒓 where 𝒓 can be viewed as both the principal eigenvector of 𝑴 and as the stationary distribution of a random walk over the graph
PageRank: How to Solve?
Power Iteration Method
Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks
About 50 iterations is sufficient to estimate the limiting solution.
Three Questions
- Does this converge?
- Does it converge to what we want?
- Are results reasonable?
Problems
Two problems:
(1) Some pages are dead ends (have no out-links)
Such pages cause importance to “leak out”
(2) Spider traps (all out-links are within the group)
Eventually spider traps absorb all importance
Does this converge?
Does it converge to what we want?
Solution to Spider Traps
Solution to Dead Ends
- Teleports: Follow random teleport links with total probability 1.0 from dead-ends
- Adjust matrix accordingly
Why Teleports Solve the Problem?
- Why are dead-ends and spider traps a problem and why do teleports solve the problem?
- Spider-traps are not a problem, but with traps PageRank scores are not what we want
- Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
- Dead-ends are a problem
- The matrix is not column stochastic so our initial assumptions are not met
- Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go
Solution: Random Teleports
Random Teleports (β = 0.8)
Summary
- PageRank solves for 𝒓 = 𝑮𝒓 and can be efficiently computed by power iteration of the stochastic adjacency matrix (𝑮)
- Adding random uniform teleportation solves issues of dead-ends and spider-traps
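A minimal NumPy sketch of power iteration with random teleports, r ← βMr + (1 − β)/N, on a small column-stochastic matrix (the example matrix and β = 0.8 are illustrative):

```python
import numpy as np

# Column j holds the out-link distribution of page j (column-stochastic M).
M = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.5, 0.5, 0.0]])
beta, N = 0.8, M.shape[0]

r = np.full(N, 1.0 / N)                  # start from the uniform distribution
for _ in range(50):                      # ~50 iterations usually suffices
    r = beta * (M @ r) + (1 - beta) / N
print(r, r.sum())                        # importance scores; still sums to 1
```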
Random Walk with Restarts and Personalized PageRank
Example: Recommendation
Given: A bipartite graph representing user and item interactions (e.g. purchase)
Goal: Proximity on graphs
- What items should we recommend to a user who interacts with item Q?
- Intuition: if items Q and P are interacted by similar users, recommend P when user interacts with Q
Which is more related A,A’ or B,B’?
Node proximity Measurements
Which is more related A,A’, B,B’ or C,C’?
Proximity on Graphs
PageRank:
- Ranks nodes by “importance”
- Teleports with uniform probability to any node in the network
Personalized PageRank:
- Ranks proximity of nodes to the teleport nodes 𝑺
Proximity on graphs:
- Q: What is most related item to Item Q?
- Random Walks with Restarts
- Teleport back to the starting node: 𝑺 = {𝑸}
Idea: Random Walks
- Idea
- Every node has some importance
- Importance gets evenly split among all edges and pushed to the neighbors:
- Given a set of QUERY_NODES, we simulate a random walk:
- Make a step to a random neighbor and record the visit (visit count)
- With probability ALPHA, restart the walk at one of the QUERY_NODES
- The nodes with the highest visit count have highest proximity to the QUERY_NODES
Random Walk Algorithm
Benefits
Why is this a good solution?
Because the “similarity” considers:
- Multiple connections
- Multiple paths
- Direct and indirect connections
- Degree of the node
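A minimal simulation sketch of Random Walk with Restarts; the restart probability ALPHA, the step budget, and the query set are illustrative assumptions:

```python
import random
from collections import Counter
import networkx as nx

def random_walk_with_restart(G, query_nodes, alpha=0.15, num_steps=10000):
    visits = Counter()
    curr = random.choice(query_nodes)
    for _ in range(num_steps):
        if random.random() < alpha:               # restart at a query node
            curr = random.choice(query_nodes)
        else:                                     # step to a random neighbor
            curr = random.choice(list(G[curr]))
        visits[curr] += 1                         # record the visit
    return visits

G = nx.karate_club_graph()
proximity = random_walk_with_restart(G, query_nodes=[0])
print(proximity.most_common(5))                   # nodes with highest proximity to node 0
```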
Summary: PageRank Variants
PageRank:
- Teleports to any node
- All nodes have the same probability of the surfer landing there: 𝑺 = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
Topic-Specific PageRank aka Personalized PageRank:
- Teleports to a specific set of nodes
- Nodes can have different probabilities of the surfer landing there: 𝑺 = [0.1,0, 0,0.2, 0, 0,0.5,0,0, 0.2]
Random Walk with Restarts:
- Topic-Specific PageRank where the teleport is always to the same node: 𝑺 = [0, 0, 0, 0, 𝟏, 0, 0, 0, 0, 0]
- A graph is naturally represented as a matrix
- We defined a random walk process over the graph
- Random surfer moving across the links and with random teleportation
- Stochastic adjacency matrix M
- PageRank = Limiting distribution of the surfer location, which represents node importance
- Corresponds to the leading eigenvector of transformed adjacency matrix M.
Matrix Factorization and Node Embeddings
Embeddings and Matrix Factorization
Recall: encoder as an embedding lookup
Connection to Matrix Factorization
Random Walk-based Similarity
- DeepWalk and node2vec have a more complex node similarity definition based on random walks
- DeepWalk is equivalent to matrix factorization of the following complex matrix expression:
Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec, WSDM 2018
Node2vec can also be formulated as a matrix factorization (albeit a more complex matrix)
Limitations
Limitations of node embeddings via matrix factorization and random walks
- Cannot obtain embeddings for nodes not in the training set
- Nodes 1 and 11 are structurally similar: part of one triangle, degree 2, …
- However, they have very different embeddings.
- It’s unlikely that a random walk will reach node 11 from node 1.
- DeepWalk and node2vec do not capture structural similarity.
- Cannot utilize node, edge and graph features
Solution to these limitations: Deep Representation Learning and Graph Neural Networks
Summary
- PageRank
- Measures importance of nodes in graph
- Can be efficiently computed by power iteration of adjacency matrix
- Personalized PageRank (PPR)
- Measures importance of nodes with respect to a particular node or set of nodes
- Can be efficiently computed by random walk
- Node embeddings based on random walks can be expressed as matrix factorization
- Viewing graphs as matrices plays a key role in all above algorithms!
Message Passing and Node Classification
Example Node Classification
- Given labels of some nodes
- Let’s predict labels of unlabeled nodes
- This is called semi-supervised node classification
Correlations Exist in Networks
Behaviors of nodes are correlated across the links of the network
Correlation: Nearby nodes have the same color (belonging to the same class)
Two explanations for why behaviors of nodes in networks are correlated:
Social Homophily
- Homophily: The tendency of individuals to associate and bond with similar others
- “Birds of a feather flock together”
- It has been observed in a vast array of network studies, based on a variety of attributes (e.g., age, gender, organizational role, etc.)
- Example: Researchers who focus on the same research area are more likely to establish a connection (meeting at conferences, interacting in academic talks, etc.)
Example of homophily
- Online social network
- Nodes = people
- Edges = friendship
- Node color = interests (sports, arts, etc.)
- People with the same interest are more closely connected due to homophily
Social Influence Example
- Influence: Social connections can influence the individual characteristics of a person.
- Example: I recommend my musical preferences to my friends, until one of them grows to like my same favorite genres!
How do we leverage node correlations in networks?
Classification with Network Data
How do we leverage this correlation observed in networks to help predict node labels?
How do we predict the labels for the nodes in grey?
Motivation
Similar nodes are typically close together or directly connected in the network:
- Guilt-by-association: If I am connected to a node with label X, then I am likely to have label X as well.
- Example: Malicious/benign web page: Malicious web pages link to one another to increase visibility, look credible, and rank higher in search engines
Classification label of a node v in network may depend on:
- Features of v
- Labels of the nodes in v’s neighborhood
- Features of the nodes in v’s neighborhood
Semi-supervised Learning
Formal setting:
- Given:
- Graph
- Few labeled nodes
- Find: Class (red/green) of remaining nodes
- Main assumption: There is homophily in the network
Example task:
Problem Setting
Example applications
- Many applications under this setting:
- Document classification
- Part of speech tagging
- Link prediction
- Optical character recognition
- Image/3D data segmentation
- Entity resolution in sensor networks
- Spam and fraud detection
We focus on semi-supervised binary node classification
We will introduce three approaches:
- Relational classification
- Iterative classification
- Correct & Smooth
Relational Classification
Probabilistic Relational Classifier
Challenges:
- Convergence is not guaranteed
- Model cannot use node feature information
Example: Initialization
Initialization:
- All labeled nodes with their labels
- All unlabeled nodes 0.5 (belonging to class 1 with probability 0.5)
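A minimal sketch of the probabilistic relational classifier on an example graph; an unweighted average over neighbors is assumed (weighted edges would weight the average), and the labeled nodes are illustrative:

```python
import networkx as nx

G = nx.karate_club_graph()
labels = {0: 1.0, 33: 0.0}                    # a few labeled nodes (class 1 / class 0)
P = {v: labels.get(v, 0.5) for v in G}        # unlabeled nodes start at 0.5

for _ in range(10):                           # iterate (convergence is not guaranteed)
    for v in G:
        if v not in labels:                   # labeled nodes stay fixed
            P[v] = sum(P[u] for u in G[v]) / G.degree(v)

print(sorted(P.items())[:5])
```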
Example: 1st Iteration, Update Node
Example: After 1st Iteration
After Iteration 1 (a round of updates for all unlabeled nodes)
Example: After 2nd Iteration
Example: After 3rd Iteration
Convergence
Iterative Classification
Relational classifier does not use node attributes.
How can one leverage them?
Computing the Summary Z_v
Architecture of Iterative Classifiers
Example: Web Page Classification
- Input: Graph of web pages
- Node: Web page
- Edge: Hyper-link between web pages
- Directed edge: a page points to another page
- Node features: Webpage description
- For simplicity, we only consider two binary features
- Task: Predict the topic of the webpage
Baseline: Train a classifier (e.g., linear classifier) to classify pages based on node attributes
Iterative Classifier
Step 1
Step 2
Step 3.1
Step 3.2
Final Prediction
Summary
- We talked about 2 approaches to collective classification
- Relational classification
- Iteratively update probabilities of node belonging to a label class based on its neighbors
- Iterative classification
- Improve over collective classification to handle attribute/feature information
- Classify node v based on its features as well as labels of neighbors
Collective Classification: Correct & Smooth
Correct & Smooth
Combining Label Propagation and Simple Models Out-performs Graph Neural Networks
OGB leaderboard snapshot at Oct 1st, 2021
Setting: A partially labeled graph and features over nodes.
C&S follows the three-step procedure:
- Train base predictor
- Use the base predictor to predict soft labels of all nodes.
- Post-process the predictions using graph structure to obtain the final predictions of all nodes.
Train Base Predictor
Train a base predictor that predicts soft labels (class probabilities) over all nodes.
- Labeled nodes are used for train/validation data.
- Base predictor can be simple:
- Linear model/Multi-Layer-Perceptron(MLP) over node features
Predict Over All Nodes
Given a trained base predictor, we apply it to obtain soft labels for all the nodes.
- We expect these soft labels to be decently accurate.
- Can we use graph structure to post-process the predictions to make them more accurate?
Post-process Predictions
C&S uses the 2-step procedure to post-process the soft predictions.
- Correct step
- Smooth step
C&S Post-processing: Correct Step
- The key idea is that we expect errors in the base prediction to be positively correlated along edges in the graph.
- In other words, an error at node u increases the chance of a similar error at neighbors of u.
- Thus, we should “spread” such uncertainty over the graph
Intuition of Correct & Smooth
Correct Step
- Compute training errors of nodes.
- Training error: Ground-truth label minus soft label. Defined as 0 for unlabeled nodes.
Smooth Step
- Smoothen the corrected soft labels along the edges.
- Assumption: Neighboring nodes tend to share the same labels.
- Note: For training nodes, we use the ground-truth hard labels instead of the soft labels.
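A minimal NumPy sketch of the two post-processing steps; the diffusion iteration and the hyperparameters (α, scale s) are illustrative assumptions, not the exact formulation of the C&S paper:

```python
import numpy as np

def diffuse(X, A_hat, alpha, num_iters=50):
    # Propagate X over the normalized adjacency while staying anchored to X.
    out = X.copy()
    for _ in range(num_iters):
        out = (1 - alpha) * X + alpha * (A_hat @ out)
    return out

def correct_and_smooth(Z, Y_train, train_mask, A_hat, alpha1=0.8, alpha2=0.8, s=1.0):
    # Correct step: diffuse the training errors and add them back (scaled by s).
    E = np.zeros_like(Z)
    E[train_mask] = Y_train - Z[train_mask]
    Z_corrected = Z + s * diffuse(E, A_hat, alpha1)
    # Smooth step: clamp training nodes to their hard labels, then diffuse.
    Z_corrected[train_mask] = Y_train
    return diffuse(Z_corrected, A_hat, alpha2)

A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
deg = A.sum(1)
A_hat = np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)   # D^-1/2 A D^-1/2
Z = np.array([[0.7, 0.3], [0.6, 0.4], [0.4, 0.6]])        # base-model soft labels
Y_train = np.array([[1.0, 0.0]])                          # ground truth for node 0
train_mask = np.array([0])
print(correct_and_smooth(Z, Y_train, train_mask, A_hat))
```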
Toy Example Summary
Our toy example shows that C&S successfully improves base model performance using graph structure.
C&S on a Real-world Dataset
C&S significantly improves the performance of the base model (MLP).
C&S outperforms Smooth-only (no correct step) baseline.
Summary
- Correct & Smooth (C&S) uses graph structure to post-process the soft node labels predicted by any base model.
- Correction step: Diffuse and correct for the training errors of the base predictor.
- Smooth step: Smoothen the prediction of the base predictor.
- C&S achieves strong performance on semi-supervised node classification
We learned how to leverage correlation in graphs to make prediction on nodes.
Key techniques:
- Relational classification
- Iterative classification
- Correct & Smooth
Graph Neural Networks
Basics of Deep Learning
Deep Learning for Graphs
Content
- Local network neighborhoods:
- Describe aggregation strategies
- Define computation graphs
- Stacking multiple layers:
- Describe the model, parameters, training
- How to fit the model?
- Simple example for unsupervised and supervised training
Setup
A Naive Approach
- Join adjacency matrix and features
- Feed them into a deep neural net:
Issues with this idea:
- O(|V|) parameters
- Not applicable to graphs of different sizes
- Sensitive to node ordering
Idea: Convolutional Networks
Goal is to generalize convolutions beyond simple lattices
Leverage node features/attributes (e.g., text, images)
Real-world Graphs
But our graphs look like this:
There is no fixed notion of locality or sliding window on the graph
Graph is permutation invariant
Permutation Invariance
- Graph does not have a canonical order of the nodes!
- We can have many different order plans.
Graph and node representations should be the same for Order plan 1 and Order plan 2
What does it mean by “graph representation is same for two order plans”?
Permutation Equivariance
For node representation
Graph Neural Network Overview
Graph neural networks consist of multiple permutation equivariant / invariant functions
Next: Design graph neural networks that are permutation invariant / equivariant by passing and aggregating information from neighbors
Graph Convolutional Networks
Idea: Node’s neighborhood defines a computation graph
Learn how to propagate information across the graph to compute node features
Idea: Aggregate Neighbors
Key idea: Generate node embeddings based on local network neighborhoods
Intuition: Nodes aggregate information from their neighbors using neural networks
Deep Model: Many Layers
Model can be of arbitrary depth:
Nodes have embeddings at each layer
Layer-0 embedding of node v is its input feature, x_v
Layer-k embedding gets information from nodes that are k hops away
Neighborhood Aggregation
Neighborhood aggregation: Key distinctions are in how different approaches aggregate information across the layers
Basic approach: Average information from neighbors and apply a neural network
Equivariant Property
The target node (blue) has the same computation graph for different order plans
Training the Model
Need to define a loss function on the embeddings.
Model Parameters
Matrix Formulation
Many aggregations can be performed efficiently by (sparse) matrix operations
Note: not all GNNs can be expressed in matrix form, when aggregation function is complex
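A minimal PyTorch sketch of the matrix-form mean aggregation, H′ = σ(D⁻¹AHW^T + HB^T); SimpleGCNLayer is an illustrative name, not a library class:

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # neighbor-aggregation weights
        self.B = nn.Linear(in_dim, out_dim, bias=False)   # self-transformation weights

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)     # node degrees
        neighbor_mean = (A @ H) / deg                     # D^-1 A H
        return torch.relu(self.W(neighbor_mean) + self.B(H))

A = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
H = torch.randn(3, 8)
print(SimpleGCNLayer(8, 16)(A, H).shape)                  # torch.Size([3, 16])
```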
How to Train A GNN
Unsupervised Training
Node similarity can be anything from Lecture 3, e.g., a loss based on:
- Random walks (node2vec, DeepWalk, struc2vec)
- Matrix factorization
- Node proximity in the graph
Supervised Training
Directly train the model for a supervised task (e.g., node classification)
Model Design: Overview
Inductive Capability
The same aggregation parameters are shared for all nodes:
The number of model parameters is sublinear in |V| and we can generalize to unseen nodes!
Inductive node embedding => Generalize to entirely unseen graphs
E.g., train on protein interaction graph from model organism A and generate embeddings on newly collected data about organism B
Many application settings constantly encounter previously unseen nodes:
- E.g., Reddit, YouTube, Google Scholar
Need to generate new embeddings “on the fly”
GNNs subsume CNNs and Transformers
Architecture Comparison
How does GNNs compare to prominent architectures such as Convolutional Neural Nets, and Transformers?
GNN vs. CNN
CNN can be seen as a special GNN with fixed neighbor size and ordering:
- The size of the filter is pre-defined for a CNN.
- The advantage of a GNN is that it processes arbitrary graphs with different degrees for each node
CNN is not permutation equivariant.
- Switching the order of pixels will lead to different outputs
Transformer
[Attention is all you need. Vaswani et al., NeurIPS 2017]
GNN vs. Transformer
A nice blog post on this: https://towardsdatascience.com/transformers-are-graph-neural-networks-bca9f75412aa
Summary
- Basics of neural networks
- Loss, Optimization, Gradient, SGD, non-linearity, MLP
- Idea for Deep Learning for Graphs
- Multiple layers of embedding transformation
- At every layer, use the embedding at previous layer as the input
- Aggregation of neighbors and self-embeddings
- Graph Convolutional Network
- Mean aggregation; can be expressed in matrix form
- GNN is a general architecture
- CNN and Transformer can be viewed as a special GNN
Graph Neural Networks (2)
A General Perspective on Graph Neural Networks
A General GNN Framework
J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020
Summary
A Single Layer of a GNN
A Single GNN Layer
Message Computation
Aggregation
Issue
Putting things together
Classical GNN Layers: GCN
T. Kipf, M. Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017
Graph Convolutional Networks (GCN)
How to write this as Message + Aggregation?
GraphSAGE
Hamilton et al. Inductive Representation Learning on Large Graphs, NeurIPS 2017
L2 Normalization
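A minimal PyTorch sketch of a GraphSAGE-style layer: mean-aggregate neighbors, concatenate with the node's own embedding, transform, then apply ℓ2 normalization; SAGELayer is an illustrative name:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGELayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (A @ H) / deg                          # mean aggregation over neighbors
        h = torch.relu(self.lin(torch.cat([H, agg], dim=1)))
        return F.normalize(h, p=2, dim=1)            # L2 normalization to unit length

A = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
H = torch.randn(3, 8)
print(SAGELayer(8, 16)(A, H).norm(dim=1))            # all ones after normalization
```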
GAT
Graph Attention Networks
Not all of a node’s neighbors are equally important
Attention is inspired by cognitive attention.
The attention a_uv focuses on the important parts of the input data and fades out the rest.
Idea: the NN should devote more computing power to that small but important part of the data.
Which part of the data is more important depends on the context and is learned through training.
Can we do better than simple neighborhood aggregation?
Attention Mechanism
What is the form of attention mechanism a?
- The approach is agnostic to the choice of a
- E.g., use a simple single-layer neural network
- a has trainable parameters (weights in the Linear layer)
Multi-head attention: Stabilizes the learning process of attention mechanism
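A minimal single-head PyTorch sketch of computing attention coefficients over a node's neighbors with a single-layer network a; the dense all-pairs formulation is an assumption for clarity (practical implementations score only existing edges):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

in_dim, out_dim, n = 8, 16, 3
W = nn.Linear(in_dim, out_dim, bias=False)      # shared linear transformation
a = nn.Linear(2 * out_dim, 1, bias=False)       # single-layer attention mechanism

A = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
H = torch.randn(n, in_dim)
Wh = W(H)

# e[v, u] = a(Wh_v || Wh_u), masked to existing edges, then softmax per node
pairs = torch.cat([Wh.unsqueeze(1).expand(-1, n, -1),
                   Wh.unsqueeze(0).expand(n, -1, -1)], dim=-1)
e = F.leaky_relu(a(pairs).squeeze(-1))
e = e.masked_fill(A == 0, float('-inf'))
alpha = torch.softmax(e, dim=1)                 # attention weights over neighbors
H_next = alpha @ Wh                             # attention-weighted aggregation
print(alpha)
```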
Benefits of Attention Mechanism
GNN Layers in Practice
Many modern deep learning modules can be incorporated into a GNN layer
- Batch Normalization:
- Stabilize neural network training
- Dropout:
- Prevent overfitting
- Attention/Gating:
- Control the importance of a message
- More:
- Any other useful deep learning modules
Batch Normalization
- Goal: Stabilize neural networks training
- Idea: Given a batch of inputs (node embeddings)
- Re-center the node embeddings into zero mean
- Re-scale the variance into unit variance
Dropout
- Goal: Regularize a neural net to prevent overfitting.
- Idea:
- During training: with some probability p, randomly set neurons to zero (turn off)
- During testing: Use all the neurons for computation
Activation(Non-linearity)
Summary:
- Modern deep learning modules can be included into a GNN layer for better performance
- Designing novel GNN layers is still an active research frontier!
- Suggested resources: You can explore diverse GNN designs or try out your own ideas in GraphGym
Stacking Layers of a GNN
- How to construct a Graph Neural Network?
- The standard way: Stack GNN layers sequentially
The Over-smoothing Problem
- The Issue of stacking many GNN layers
- GNN suffers from the over-smoothing problem
- The over-smoothing problem: all the node embeddings converge to the same value
- This is bad because we want to use node embeddings to differentiate nodes
- Why does the over-smoothing problem happen?
Receptive Field of a GNN
- Receptive field: the set of nodes that determine the embedding of a node of interest
- In a k-layer GNN, each node has a receptive field of k-hop neighborhood
- Receptive field overlap for two nodes
- The number of shared neighbors grows quickly when we increase the number of hops (number of GNN layers)
- We can explain over-smoothing via the notion of receptive field
- We know that the embedding of a node is determined by its receptive field
- If two nodes have highly-overlapped receptive fields, then their embeddings are highly similar
- Stack many GNN layers -> nodes will have highly-overlapped receptive fields -> node embeddings will be highly similar -> suffer from the over-smoothing problem
- Next: how do we overcome the over-smoothing problem?
Design GNN Layer Connectivity
- What do we learn from the over-smoothing problem?
- Lesson 1: Be cautious when adding GNN layers
- Unlike neural networks in other domains (CNN for image classification), adding more GNN layers does not always help
- Step 1: Analyze the necessary receptive field to solve your problem. E.g., by computing the diameter of the graph
- Step 2: Set number of GNN layers L to be a bit more than the receptive field we like. Do not set L to be unnecessarily large!
- Question: How to enhance the expressive power of a GNN, if the number of GNN layers is small?
Expressive Power for Shallow GNNs
- How to make a shallow GNN more expressive?
- Solution 1: Increase the expressive power within each GNN layer
- In our previous examples, each transformation or aggregation function only includes one linear layer
- We can make aggregation / transformation become a deep neural network!
- Solution 2: Add layers that do not pass messages
- A GNN does not necessarily only contain GNN layers
- E.g., we can add MLP layers (applied to each node) before and after GNN layers, as pre-process layers and post-process layers
- Pre-processing layers: Important when encoding node features is necessary.
- E.g., when nodes represent images/text
- Post-processing layers: Important when reasoning / transformation over node embeddings is needed
- E.g., graph classification, knowledge graphs
- In practice, adding these layers works great
What if my problem still requires many GNN layers?
Lesson 2: Add skip connections in GNNs
Observation from over-smoothing: Node embeddings in earlier GNN layers can sometimes better differentiate nodes
Solution: We can increase the impact of earlier layers on the final node embeddings, by adding shortcuts in GNN
He et al. Deep Residual Learning for Image Recognition, CVPR 2015
Idea of Skip Connections
Why do skip connections work?
Veit et al. Residual Networks Behave Like Ensembles of Relatively Shallow Networks, ArXiv 2016
Example: GCN with Skip Connections
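A minimal PyTorch sketch of a GCN-style layer with a residual skip connection (F(x) + x); GCNLayerWithSkip is an illustrative name:

```python
import torch
import torch.nn as nn

class GCNLayerWithSkip(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim)

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        out = torch.relu(self.W((A @ H) / deg))   # standard mean-aggregation layer F(x)
        return out + H                            # skip connection: F(x) + x

A = torch.tensor([[0., 1.], [1., 0.]])
H = torch.randn(2, 8)
print(GCNLayerWithSkip(8)(A, H).shape)            # torch.Size([2, 8])
```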
Other Options of Skip Connections
Xu et al. Representation learning on graphs with jumping knowledge networks, ICML 2018
GNN Augmentation and Training
Graph Augmentation for GNNs
Our assumption so far has been
- Raw input graph = computational graph
Reasons for breaking this assumption
Features:
- The input graph lacks features
Graph structure:
- The graph is too sparse -> inefficient message passing
- The graph is too dense -> message passing is too costly
- The graph is too large -> cannot fit the computational graph into a GPU
It’s unlikely that the input graph happens to be the optimal computation graph for embeddings
Graph Augmentation Approaches
- Graph Feature augmentation
- The input graph lacks features -> feature augmentation
- Graph Structure augmentation
- The graph is too sparse -> Add virtual nodes / edges
- The graph is too dense -> Sample neighbors when doing message passing
- The graph is too large -> Sample subgraphs to compute embeddings
- Will cover later in lecture: Scaling up GNNs
Why do we need feature augmentation?
(1) Input graph does not have node features
This is common when we only have the adjacency matrix. Standard approaches:
a) Assign constant values to nodes
b) Assign unique IDs to nodes
These IDs are converted into one-hot vectors
Feature augmentation: constant vs. one-hot
J. You, J. Gomes-Selman, R. Ying, J. Leskovec. Identity-aware Graph Neural Networks, AAAI 2021
(2) Certain structures are hard to learn by GNN
Example: Cycle count feature:
Can GNN learn the length of a cycle that v1 resides in?
Unfortunately, no
v1 cannot differentiate which graph it resides in
- Because all the nodes in the graph have degree of 2
- The computational graphs will be the same binary tree
Solution:
- We can use cycle count as augmented node features
Other commonly used augmented features:
- Node degree
- Clustering coefficient
- PageRank
- Centrality
- …
Add Virtual Nodes/ Edges
Motivation: Augment sparse graphs
(1) Add virtual edges
- Common approach: Connect 2-hop neighbors via virtual edges
- Intuition: Instead of using the adjacency matrix A for GNN computation, use A + A^2
Use cases: Bipartite graphs
- Author-to-papers (they authored)
- 2-hop virtual edges make an author-author collaboration graph
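A minimal NumPy sketch of adding 2-hop virtual edges via A + A² on a tiny author-paper bipartite graph (the example graph is illustrative):

```python
import numpy as np

# Rows/cols 0-1 are authors, 2-3 are papers.
# Author 0 wrote papers 2 and 3; author 1 wrote paper 3.
A = np.array([[0, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0]])

A_aug = A + A @ A          # use A + A^2 for message passing
# A^2 links the two authors through their shared paper: a virtual collaboration edge.
print((A @ A)[0, 1] > 0)   # True
```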
(2) Add virtual nodes
The virtual node will connect to all the nodes in the graph
Suppose in a sparse graph, two nodes have shortest path distance of 10
After adding the virtual node, all the nodes will have a distance of two
- Node A – Virtual node – Node B
Benefits: Greatly improves message passing in sparse graphs
Node Neighborhood Sampling
Hamilton et al. Inductive Representation Learning on Large Graphs, NeurIPS 2017
Previously:
- All the nodes are used for message passing
New idea: (Randomly) sample a node’s neighborhood for message passing
For example, we can randomly choose 2 neighbors to pass messages in a given layer
- Only nodes B and D will pass messages to A
In the next layer when we compute the embeddings, we can sample different neighbors
- Only nodes C and D will pass messages to A
In expectation, we get embeddings similar to the case where all the neighbors are used
- Benefits: Greatly reduces computational cost
- Allows for scaling to large graphs (more about this later)
- And in practice it works great!
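A minimal sketch (NetworkX assumed) of sampling at most k neighbors per node for message passing, resampled at every layer; sample_neighbors and k = 2 are illustrative assumptions:

```python
import random
import networkx as nx

def sample_neighbors(G, node, k=2):
    neighbors = list(G[node])
    return random.sample(neighbors, k) if len(neighbors) > k else neighbors

G = nx.karate_club_graph()
print(sample_neighbors(G, 0, k=2))   # only these neighbors pass messages to node 0 this layer
```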
Prediction with GNNs
GNN Training Pipeline
Pipeline
(1) Different prediction heads:
- Node-level tasks
- Edge-level tasks
- Graph-level tasks
GNN Prediction Heads
Idea: Different task levels require different prediction heads
Node-level
Node-level prediction: We can directly make prediction using node embeddings!
Edge-level
Edge-level prediction: Make prediction using pairs of node embeddings
(1) Concatenation + Linear
(2) Dot product
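A minimal PyTorch sketch of the two edge-level heads listed above, applied to a pair of node embeddings h_u, h_v (dimensions are illustrative):

```python
import torch
import torch.nn as nn

d = 16
h_u, h_v = torch.randn(d), torch.randn(d)

# (1) Concatenation + Linear head (can output k scores for k-way edge prediction)
head = nn.Linear(2 * d, 1)
score_concat = head(torch.cat([h_u, h_v]))

# (2) Dot-product head: a single similarity score (e.g., for link prediction)
score_dot = torch.dot(h_u, h_v)

print(score_concat.item(), score_dot.item())
```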
Graph-level
Graph-level prediction: Make prediction using all the node embeddings in our graph
K. Xu, W. Hu, J. Leskovec, S. Jegelka. How Powerful Are Graph Neural Networks, ICLR 2019
Issue of Global Pooling
Issue: Global pooling over a (large) graph will lose information
Hierarchical Global Pooling
A solution: Let’s aggregate all the node embeddings hierarchically
Ying et al. Hierarchical Graph Representation Learning with Differentiable Pooling, NeurIPS 2018
DiffPool idea:
- Hierarchically pool node embedding
Leverage 2 independent GNNs at each level
- GNN A: Compute node embeddings
- GNN B: Compute the cluster that a node belongs to
GNNs A and B at each level can be executed in parallel
For each Pooling layer
- Use clustering assignments from GNN B to aggregate node embeddings generated by GNN A
- Create a single new node for each cluster, maintaining edges between clusters to generate a new pooled network
Jointly train GNN A and GNN B
Training Graph Neural Networks
Pipeline
(2) Where does ground-truth come from?
- Supervised labels
- Unsupervised signals
Supervised Vs Unsupervised
- Supervised learning on graphs
- Labels come from external sources
- E.g., predict drug likeness of a molecular graph
- Unsupervised learning on graphs
- Signals come from graphs themselves
- E.g., link prediction: predict if two nodes are connected
- Sometimes the differences are blurry
- We still have “supervision” in unsupervised learning
- E.g., train a GNN to predict node clustering coefficient
- An alternative name for “unsupervised” is “self-supervised”
Supervised Labels on Graphs
Supervised labels come from the specific use cases. For example:
Advice: Reduce your task to node / edge / graph labels, since they are easy to work with
- E.g., if we know that some nodes form a cluster, we can treat the cluster that a node belongs to as a node label
Unsupervised Signals on Graphs
- The problem: sometimes we only have a graph, without any external labels
- The solution: “self-supervised learning”, we can find supervision signals within the graph.
- For example, we can let GNN predict the following:
These tasks do not require any external labels!
Pipeline
(3) How do we compute the final loss?
- Classification loss
- Regression loss
Settings for GNN Training
Classification or Regression
Classification Loss
As discussed in lecture 6, cross entropy (CE) is a very common loss function in classification
Regression Loss
For regression tasks we often use Mean Squared Error (MSE) a.k.a. L2 loss
Pipeline
(4) How do we measure the success of a GNN?
- Accuracy
- ROC AUC
Evaluation Metrics: Regression
- We use standard evaluation metrics for GNN
- (Content below can be found in any ML course)
- In practice we will use sklearn for implementation
- Suppose we make predictions for N data points
- Evaluate regression tasks on graphs:
Evaluation Metrics: Classification
Evaluate classification tasks on graphs:
Metrics for Binary Classification
ROC Curve: Captures the tradeoff in TPR and FPR as the classification threshold is varied for a binary classifier.
ROC AUC: Area under the ROC Curve.
Intuition: The probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
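A minimal scikit-learn sketch of computing ROC AUC from binary labels and classifier scores (the example values are illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]     # higher score = more likely positive

print(roc_auc_score(y_true, y_score))    # 1.0 would mean a perfect ranking
```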
Setting up GNN Prediction Tasks
Pipeline
(5) How do we split our dataset into train / validation / test set?
Dataset Split: Fixed / Random Split
- Fixed split: We will split our dataset once
- Training set: used for optimizing GNN parameters
- Validation set: develop model/hyperparameters
- Test set: held out until we report final performance
- A concern: sometimes we cannot guarantee that the test set will really be held out
- Random split: we will randomly split our dataset into training / validation / test
- We report average performance over different random seeds
Why Splitting Graphs is Special
Suppose we want to split an image dataset
- Image classification: Each data point is an image
- Here data points are independent
- Image 5 will not affect our prediction on image 1
Splitting a graph dataset is different!
- Node classification: Each data point is a node
- Here data points are NOT independent
- Node 5 will affect our prediction on node 1, because it will participate in message passing -> affect node 1’s embedding
What are our options?
Solution 1 (Transductive setting): The input graph can be observed in all the dataset splits (training, validation and test set).
We will only split the (node) labels
- At training time, we compute embeddings using the entire graph, and train using node 1&2’s labels
- At validation time, we compute embeddings using the entire graph, and evaluate on node 3&4’s labels
Solution 2 (Inductive setting): We break the edges between splits to get multiple graphs
- Now we have 3 graphs that are independent. Node 5 will not affect our prediction on node 1 any more
- At training time, we compute embeddings using the graph over node 1&2, and train using node 1&2’s labels
- At validation time, we compute embeddings using the graph over node 3&4, and evaluate on node 3&4’s labels
Transductive setting: training / validation / test sets are on the same graph
- The dataset consists of one graph
- The entire graph can be observed in all dataset splits, we only split the labels
- Only applicable to node / edge prediction tasks
Inductive setting: training / validation / test sets are on different graphs
- The dataset consists of multiple graphs
- Each split can only observe the graph(s) within the split. A successful model should generalize to unseen graphs
- Applicable to node / edge / graph tasks
Graph Classification
Only the inductive setting is well defined for graph classification
- Because we have to test on unseen graphs
- Suppose we have a dataset of 5 graphs. Each split will contain independent graph(s)
Link Prediction
Goal of link prediction: predict missing edges
Setting up link prediction is tricky:
- Link prediction is an unsupervised / self-supervised task. We need to create the labels and dataset splits on our own
- Concretely, we need to hide some edges from the GNN and then let the GNN predict whether the edges exist
For link prediction, we will split edges twice
Step 1: Assign 2 types of edges in the original graph
Message edges: Used for GNN message passing
Supervision edges: Used for computing objectives
After step 1:
Only message edges will remain in the graph
Supervision edges are used as supervision for edge predictions made by the model, will not be fed into GNN!
Step 2: Split edges into train / validation / test
- Option 1: Inductive link prediction split
- Suppose we have a dataset of 3 graphs. Each inductive split will contain an independent graph
- In the train, validation, or test set, each graph will have 2 types of edges: message edges + supervision edges. Supervision edges are not the input to GNN
Option 2: Transductive link prediction split:
This is the default setting when people talk about link prediction
Suppose we have a dataset of 1 graph
By definition of “transductive”, the entire graph can be observed in all dataset splits
- But since edges are both part of graph structure and the supervision, we need to hold out validation / test edges
- To train on the training set, we further need to hold out supervision edges within the training set
Next: we will show the exact settings
Summary: Transductive link prediction split
- Note: Link prediction settings are tricky and complex. You may find papers do link prediction differently.
- Luckily, we have full support in PyG and GraphGym
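A hedged sketch of such a split using PyG’s RandomLinkSplit transform (the toy graph is illustrative; exact arguments and returned fields may differ across PyG versions):

```python
import torch
from torch_geometric.data import Data
from torch_geometric.transforms import RandomLinkSplit

# Toy directed graph: 6 nodes, 10 edges, random 16-dim node features.
edge_index = torch.tensor([[0, 1, 2, 3, 4, 5, 0, 1, 2, 3],
                           [1, 2, 3, 4, 5, 0, 2, 3, 4, 5]])
data = Data(x=torch.randn(6, 16), edge_index=edge_index)

transform = RandomLinkSplit(num_val=0.1, num_test=0.2, is_undirected=False)
train_data, val_data, test_data = transform(data)
# In each split, edge_index holds the message edges, while edge_label_index /
# edge_label hold the supervision edges (with sampled negative edges).
```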
GNN Training Pipeline
Implementation resources:
- DeepSNAP provides core modules for this pipeline
- GraphGym further implements the full pipeline to facilitate GNN design
Summary
- GNN Layer:
- Transformation + Aggregation
- Classic GNN layers: GCN, GraphSAGE, GAT
- Layer connectivity:
- The over-smoothing problem
- Solution: skip connections
- Graph Augmentation:
- Feature augmentation
- Structure augmentation
- Learning Objectives
- The full training pipeline of a GNN
When Things Don't Go As Planned
General Tips
- Data preprocessing is important:
- Node attributes can vary a lot!
- E.g. probability ranges over (0,1), but some inputs could have a much larger range, say (−1000, 1000)
- Use normalization
- Optimizer:
- ADAM is relatively robust to learning rate
- Activation function
- ReLU activation function often works well
- Other alternatives: LeakyReLU, SWISH, rational activation
- No activation function at your output layer:
- Include bias term in every layer
- Embedding dimensions:
- 32, 64 and 128 are often good starting points
Debugging Deep Networks
- Debug issues: Loss/accuracy not converging during training
- Check pipeline (e.g. in PyTorch we need zero_grad)
- Adjust hyperparameters such as learning rate
- Pay attention to weight parameter initialization
- Important for model development:
- Overfit on (part of) training data:
- With a small training dataset, loss should be essentially close to 0, with an expressive neural network
- If a neural network cannot overfit a single data point, something is wrong (see the sketch below)
- Scrutinize loss function!
- Scrutinize visualizations!
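A minimal sketch of the overfitting sanity check mentioned above (the helper name is hypothetical; note the zero_grad call, which is easy to forget in PyTorch):

```python
import torch

def overfit_small_batch(model, batch, labels, loss_fn, steps=200, lr=1e-3):
    """Sanity check: an expressive model should drive the loss close to 0 on a
    tiny batch; if it cannot, something in the pipeline is likely wrong."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(steps):
        opt.zero_grad()                      # reset gradients every iteration
        loss = loss_fn(model(batch), labels)
        loss.backward()
        opt.step()
    return loss.item()
```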
How Expressive are Graph Neural Networks?
Theory of GNNs
How powerful are GNNs?
- Many GNN models have been proposed (e.g., GCN, GAT, GraphSAGE, design space).
- What is the expressive power (ability to distinguish different graph structures) of these GNN models?
- How to design a maximally expressive GNN model?
Background: Many GNN Models
Many GNN models have been proposed:
- GCN, GraphSAGE, GAT, Design Space etc.
GNN Model Example
- GCN (mean-pool) [Kipf and Welling ICLR 2017]
- GraphSAGE (max-pool) [Hamilton et al. NeurIPS 2017]
Note: Node Colors
- We use same/different node colors to represent nodes with same/different features.
- For example, the graph below assumes all the nodes share the same feature.
- Key question: How well can a GNN distinguish different graph structures?
Local Neighborhood Structures
We specifically consider local neighborhood structures around each node in a graph.
- Example: Nodes 1 and 5 have different neighborhood structures because they have different node degrees.
We specifically consider local neighborhood structures around each node in a graph.
- Example: Nodes 1 and 4 both have the same node degree of 2. However, they still have different neighborhood structures because their neighbors have different node degrees
Node 1 has neighbors of degrees 2 and 3. Node 4 has neighbors of degrees 1 and 3.
We specifically consider local neighborhood structures around each node in a graph.
- Example: Nodes 1 and 2 have the same neighborhood structure because they are symmetric within the graph.
Node 1 has neighbors of degrees 2 and 3. Node 2 has neighbors of degrees 2 and 3. And even if we go a step deeper to 2nd hop neighbors, both nodes have the same degrees (Node 4 of degree 2)
- Key question: Can GNN node embeddings distinguish different node’s local neighborhood structures?
- If so, when? If not, when will a GNN fail?
- Next: We need to understand how a GNN captures local neighborhood structures.
- Key concept: Computational graph
Computational Graph
In each layer, a GNN aggregates neighboring node embeddings.
A GNN generates node embeddings through a computational graph defined by the neighborhood.
- Ex: Node 1’s computational graph (2-layer GNN)
- Ex: Nodes 1 and 2’s computational graphs.
- But GNN only sees node features (not IDs):
A GNN will generate the same embedding for nodes 1 and 2 because:
- Computational graphs are the same.
- Node features (colors) are identical.
Note: GNN does not care about node ids, it just aggregates features vectors of different nodes.
In general, different local neighborhoods define different computational graphs
Computational graphs are identical to rooted subtree structures around each node.
GNN‘s node embeddings capture rooted subtree structures.
Most expressive GNN maps different rooted subtrees into different node embeddings (represented by different colors).
Recall: Injective Function
How Expressive is a GNN?
Most expressive GNN should map subtrees to the node embeddings injectively.
Key observation: Subtrees of the same depth can be recursively characterized from the leaf nodes to the root nodes.
If each step of GNN’s aggregation can fully retain the neighboring information, the generated node embeddings can distinguish different rooted subtrees.
In other words, most expressive GNN would use an injective neighbor aggregation function at each step.
Maps different neighbors to different embeddings.
Summary so far
- To generate a node embedding, GNNs use a computational graph corresponding to a subtree rooted around each node.
- GNN can fully distinguish different subtree structures if every step of its neighbor aggregation is injective.
Designing the Most Powerful Graph Neural Network
Expressive Power of GNNs
- Key observation: Expressive power of GNNs can be characterized by that of neighbor aggregation functions they use.
- A more expressive aggregation function leads to a more expressive GNN.
- Injective aggregation function leads to the most expressive GNN.
- Next:
- Theoretically analyze expressive power of aggregation functions.
Neighbor Aggregation
Observation: Neighbor aggregation can be abstracted as a function over a multi-set (a set with repeating elements).
Next: We analyze aggregation functions of two popular GNN models
- GCN (mean-pool) [Kipf & Welling ICLR 2017]
- Take the element-wise mean, followed by a linear function and ReLU activation, i.e., max(0, x).
- Theorem [Xu et al. ICLR 2019]
- GCN’s aggregation function cannot distinguish different multi-sets with the same color proportion.
For simplicity, we assume node colors are represented by one-hot encoding.
Example: If there are two distinct colors:
This assumption is sufficient to illustrate how GCN fails.
Failure case illustration
- GraphSAGE (max-pool) [Hamilton et al. NeurIPS 2017]
- Apply an MLP, then take element-wise max.
- Theorem [Xu et al. ICLR 2019]
- GraphSAGE’s aggregation function cannot distinguish different multi-sets with the same set of distinct colors.
Failure case illustration
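A small numeric illustration of the two failure cases, using one-hot "colors" as toy features:

```python
import torch

a, b = torch.tensor([1., 0.]), torch.tensor([0., 1.])   # two one-hot "colors"

s1 = torch.stack([a, b])          # multi-set {a, b}
s2 = torch.stack([a, a, b, b])    # multi-set {a, a, b, b}

print(s1.mean(0), s2.mean(0))               # both [0.5, 0.5]: mean-pool cannot tell them apart
print(s1.max(0).values, s2.max(0).values)   # both [1, 1]: max-pool cannot either
print(s1.sum(0), s2.sum(0))                 # [1, 1] vs [2, 2]: sum-pool distinguishes them
```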
Summary So Far
We analyzed the expressive power of GNNs.
Main takeaways:
- Expressive power of GNNs can be characterized by that of the neighbor aggregation function.
- Neighbor aggregation is a function over multi-sets (sets with repeating elements)
- GCN and GraphSAGE’s aggregation functions fail to distinguish some basic multi-sets; hence not injective.
- Therefore, GCN and GraphSAGE are not maximally powerful GNNs.
Designing the Most Expressive GNNs
- Our goal: Design maximally powerful GNNs in the class of message-passing GNNs.
- This can be achieved by designing injective neighbor aggregation function over multi-sets.
- Here, we design a neural network that can model an injective multi-set function.
Injective Multi-set Function
Universal Approximation Theorem
Injective Multi-set Function
We have arrived at a neural network that can model any injective multi-set function.
In practice, MLP hidden dimensionality of 100 to 500 is sufficient.
Most Expressive GNN
Graph Isomorphism Network (GIN) [Xu et al. ICLR 2019]
Apply an MLP, element-wise sum, followed by another MLP
- Theorem [Xu et al. ICLR 2019]
- GIN‘s neighbor aggregation function is injective.
- No failure cases!
- GIN is THE most expressive GNN in the class of message-passing GNNs!
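A hedged sketch of GIN-style aggregation using PyG’s GINConv layer, which sums the features of a node and its neighbors and then applies an MLP (toy data; the exact API may vary by version):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv

mlp = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64))
conv = GINConv(mlp)  # element-wise sum over the neighborhood, then the MLP

x = torch.randn(5, 16)                                   # 5 nodes, 16-dim features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])  # toy edges
h = conv(x, edge_index)                                  # [5, 64] node embeddings
```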
Full Model of GIN
So far: We have described the neighbor aggregation part of GIN.
We now describe the full model of GIN by relating it to the WL graph kernel (a traditional way of obtaining graph-level features).
- We will see how GIN is a “neural network” version of the WL graph kernel.
Relation to WL Graph Kernel
Recall: Color refinement algorithm in WL kernel.
Color Refinement
Example of color refinement given two graphs
- Process continues until a stable coloring is reached
- Two graphs are considered isomorphic if they have the same set of colors.
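A minimal sketch of WL color refinement (hash-based, with the adjacency given as a dict; names are illustrative):

```python
from collections import Counter

def wl_colors(adj, init_colors, num_iters=3):
    """Color refinement (sketch). adj: {node: [neighbors]},
    init_colors: {node: hashable color}. Returns the refined node colors."""
    colors = dict(init_colors)
    for _ in range(num_iters):
        colors = {
            v: hash((colors[v], tuple(sorted(colors[u] for u in adj[v]))))
            for v in adj
        }
    return colors

def same_color_multiset(c1, c2):
    """Two graphs pass the WL test (are considered isomorphic by it)
    if their multi-sets of colors match."""
    return Counter(c1.values()) == Counter(c2.values())
```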
The Complete GIN Model
GIN uses a neural network to model the injective HASH function.
Specifically, we will model the injective function over the tuple:
Theorem (Xu et al. ICLR 2019)
Any injective function over the tuple
GIN and WL Graph Kernel
GIN can be understood as a differentiable neural version of the WL graph kernel:
Advantages of GIN over the WL graph kernel are:
- Node embeddings are low-dimensional; hence, they can capture the fine-grained similarity of different nodes.
- Parameters of the update function can be learned for the downstream tasks.
Expressive Power of GIN
- Because of the relation between GIN and the WL graph kernel, their expressive power is exactly the same.
- If two graphs can be distinguished by GIN, they can be also distinguished by the WL kernel, and vice versa.
- How powerful is this?
- WL kernel has been both theoretically and empirically shown to distinguish most of the real-world graphs [Cai et al. 1992].
- Hence, GIN is also powerful enough to distinguish most of the real graphs!
Summary of the Lecture
- We design a neural network that can model injective multi-set function.
- We use the neural network for neighbor aggregation function and arrive at GIN—the most expressive GNN model.
- The key is to use element-wise sum pooling, instead of mean-/max-pooling.
- GIN is closely related to the WL graph kernel.
- Both GIN and WL graph kernel can distinguish most of the real graphs!
The Power of Pooling
Can expressive power of GNNs be improved?
There are basic graph structures that existing GNN framework cannot distinguish, such as difference in cycles.
GNNs’ expressive power can be improved to resolve the above problem. [You et al. AAAI 2021, Li et al. NeurIPS 2020]
Summary
GNNs and their connection to injective functions on multi-sets.
Most powerful GNN is equivalent to WL graph isomorphism test.
GIN is the most powerful GNN.
- The sum aggregator is more expressive than the mean aggregator, which is more expressive than the max aggregator.
Heterogeneous Graphs and Knowledge Graph Embeddings
Heterogeneous Graphs and Relational GCN (RGCN)
Heterogeneous Graphs
A heterogeneous graph is defined as
Relational GCN
Schlichtkrull et al., Modeling Relational Data with Graph Convolutional Networks, ESWC 2018
We will extend GCN to handle heterogeneous graphs with multiple edge/relation types
We start with a directed graph with one relation
- How do we run GCN and update the representation of the target node A on this graph?
What if the graph has multiple relation types?
Use different neural network weights for different relation types
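A hedged sketch of this idea (one weight matrix per relation type plus a self-loop transform, with mean-normalized neighbor messages); this is a simplified illustration, not the exact RGCN implementation, and PyG also provides an RGCNConv layer:

```python
import torch
import torch.nn as nn

class SimpleRGCNLayer(nn.Module):
    """Sketch of an RGCN-style layer: a separate weight matrix per relation type."""
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.self_weight = nn.Linear(in_dim, out_dim)
        self.rel_weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations)])

    def forward(self, h, neighbors_by_rel):
        # h: [n, in_dim]; neighbors_by_rel[r] maps node v -> list of its relation-r neighbors
        rows = []
        for v in range(h.size(0)):
            msg = self.self_weight(h[v])
            for r, nbrs in enumerate(neighbors_by_rel):
                us = nbrs.get(v, [])
                if us:
                    # mean-normalized sum of relation-r neighbors, transformed by W_r
                    msg = msg + self.rel_weights[r](h[us].mean(dim=0))
            rows.append(torch.relu(msg))
        return torch.stack(rows)
```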
Definition
RGCN: Scalability
Block Diagonal Matrices
Key insight: make the weights sparse
Key insight: Share weights across different relations!
Example: Entity/Node Classification
Goal: Predict the label of a given node
Example Link Prediction
Summary of RGCN
- Relational GCN, a graph neural network for heterogeneous graphs
- Can perform entity classification as well as link prediction tasks.
- Ideas can easily be extended into RGNN (RGraphSAGE, RGAT, etc.)
Knowledge Graphs: KG Completion with Embeddings
Knowledge Graphs (KG)
Knowledge in graph form:
Capture entities, types, and relationships
- Nodes are entities
- Nodes are labeled with their types
- Edges between two nodes capture relationships between entities
- KG is an example of a heterogeneous graph
Example: Bibliographic Networks
- Node types: paper, title, author, conference, year
- Relation types: pubWhere, pubYear, hasTitle, hasAuthor, cite
Example: Bio Knowledge Graphs
- Node types: drug, disease, adverse event, protein, pathways
- Relation types: has_func, causes, assoc, treats, is_a
Knowledge Graphs in Practice
- Examples of knowledge graphs
- Google Knowledge Graph
- Amazon Product Graph
- Facebook Graph API
- IBM Watson
- Microsoft Satori
- Project Hanover/Literome
- LinkedIn Knowledge Graph
- Yandex Object Answer
Applications of Knowledge Graphs
- Serving information
- Question answering and conversation agents
Knowledge Graph Datasets
- Publicly available KGs:
- FreeBase, Wikidata, DBpedia, YAGO, NELL, etc.
- Common characteristics:
- Massive: millions of nodes and edges
- Incomplete: many true edges are missing
Example: Freebase
Knowledge Graph Completion: TransE, TransR, DistMult, ComplEx
KG Completion Task
- Given an enormous KG, can we complete the KG?
- For a given (head, relation), we predict missing tails.
- (Note this is slightly different from link prediction task)
KG Representation
TransE
Bordes, et al., Translating embeddings for modeling multi-relational data, NeurIPS 2013
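For reference, TransE scores a triple (h, r, t) by how close the translated head h + r is to the tail t; a minimal sketch of this standard scoring function:

```python
import torch

def transe_score(h, r, t, p=2):
    """TransE: a plausible triple has h + r close to t, so the negative
    translation distance serves as the score."""
    return -torch.norm(h + r - t, p=p, dim=-1)
```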
Connectivity Patterns in KG
Relations in a heterogeneous KG have different properties
Can we categorize these relation patterns?
Are KG embedding methods (e.g., TransE) expressive enough to model these patterns?
Relation Patterns
Antisymmetric Relations in TransE
Inverse Relations in TransE
Composition in TransE
Limitation: Symmetric Relations
Limitation: 1-to-n Relations
Knowledge Graph Completion: TransR
TransR
Lin, et al., Learning entity and relation embeddings for knowledge graph completion, AAAI 2015
TransE models translation of any relation in the same embedding space.
Can we design a new space for each relation and do translation in relation-specific space?
Symmetric Relations in TransR
1-to-n Relations in TransR
Inverse Relations in TransR
Composition Relations in TransR
High-level intuition: TransR models a triple with linear functions, so they are chainable
Knowledge Graph Completion: DistMult
New Idea: Bilinear Modeling
DistMult
1-to-n Relations in DistMult
Symmetric Relations in DistMult
Limitation: Antisymmetric Relations
Limitation: Inverse Relations
Limitation: Composition Relations
Knowledge Graph Completion: ComplEx
ComplEx
Trouillon et al, Complex Embeddings for Simple Link Prediction, ICML 2016
Based on DistMult, ComplEx embeds entities and relations in a complex vector space
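A minimal sketch of the DistMult and ComplEx scoring functions (both bilinear forms; the tensors below are toy complex embeddings):

```python
import torch

def distmult_score(h, r, t):
    """DistMult (bilinear): <h, r, t> = sum_i h_i * r_i * t_i (symmetric in h and t)."""
    return (h * r * t).sum(dim=-1)

def complex_score(h, r, t):
    """ComplEx: the same bilinear form with complex embeddings and the conjugate
    of the tail; the real part is the score, so antisymmetry can be modeled."""
    return torch.real((h * r * torch.conj(t)).sum(dim=-1))

h, r, t = (torch.randn(4, dtype=torch.cfloat) for _ in range(3))
print(complex_score(h, r, t), complex_score(t, r, h))  # generally different values
```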
Antisymmetric Relations in ComplEx
Symmetric Relations in ComplEx
Inverse Relations in ComplEx
Composition and 1-to-n
Expressiveness of All Models
Properties and expressive power of different KG completion methods:
KG Embeddings in Practice
Sun et al, RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space, ICLR 2019
- Different KGs may have drastically different relation patterns!
- There is no single embedding method that works for all KGs; use the table to select models
- Try TransE for a quick run if the target KG does not have many symmetric relations
- Then use more expressive models, e.g., ComplEx, RotatE (TransE in Complex space)
Summary
- Link prediction / Graph completion is one of the prominent tasks on knowledge graphs
- Introduce TransE / TransR / DistMult / ComplEx models with different embedding space and expressiveness
Reasoning in Knowledge Graphs using Embeddings
Example KG: Biomedicine
Predictive Queries on KG
Can we do multi-hop reasoning, i.e., answer complex queries on an incomplete, massive KG?
Predictive One-hop Queries
We can formulate knowledge graph completion problems as answering one-hop queries.
Path Queries
Generalize one-hop queries to path queries by adding more relations on the path.
Query plan of path queries is a chain.
Question: “What proteins are associated with adverse events caused by Fulvestrant?”
Query: (e:Fulvestrant, (r:Causes, r:Assoc))
Given a KG, how do we answer a path query?
Traversing Knowledge Graphs
- We answer path queries by traversing the KG: “What proteins are associated with adverse events caused by Fulvestrant?”
- Query: (e:Fulvestrant, (r:Causes, r:Assoc))
However, KGs are incomplete
- Answering queries seems easy: Just traverse the graph.
- But KGs are incomplete and unknown:
- Many relations between entities are missing or are incomplete
- For example, we lack all the biomedical knowledge
- Enumerating all the facts takes non-trivial time and cost, we cannot hope that KGs will ever be fully complete
- Due to KG incompleteness, one is not able to identify all the answer entities
Can we first do KG completion and then traverse the completed (probabilistic) KG?
Task: Predictive Queries
We need a way to answer path-based queries over an incomplete knowledge graph.
We want our approach to implicitly impute and account for the incomplete KG.
Task: Predictive queries
- Want to be able to answer arbitrary queries while implicitly imputing for the missing information
- Generalization of the link prediction task
Answering Predictive Queries on Knowledge Graphs
General Idea
Map queries into embedding space. Learn to reason in that space
- Embed query into a single point in the Euclidean space: answer nodes are close to the query.
- Query2Box: Embed query into a hyper-rectangle (box) in the Euclidean space: answer nodes are enclosed in the box.
[Embedding Logical Queries on Knowledge Graphs. Hamilton, et al., NeurIPS 2018]
[Query2box: Reasoning over Knowledge Graphs in Vector Space Using Box Embeddings. Ren, et al., ICLR 2020]
Idea: Traversing KG in Vector Space
Guu, et al., Traversing knowledge graphs in vector space, EMNLP 2015
Key idea: Embed queries!
Traversing KG in Vector Space
The embedding process only involves vector addition, independent of # entities in the KG!
Embed path queries in vector space.
- Question: “What proteins are associated with adverse events caused by Fulvestrant?”
- Query: (e:Fulvestrant, (r:Causes , r:Assoc))
Insights:
- We can train TransE to optimize knowledge graph completion objective (Lecture 10)
- Since TransE naturally handles compositional relations, it can handle path queries by translating in the latent space over multiple hops, i.e., by adding the relation embeddings (see the sketch below).
- For TransR / DistMult / ComplEx, since they cannot handle compositional relations, they cannot be easily extended to handle path queries.
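A hedged sketch of this TransE-style path-query embedding (function names are illustrative):

```python
import torch

def embed_path_query(anchor_emb, relation_embs):
    """TransE-style path query: start at the anchor entity embedding and add
    each relation embedding along the path (pure vector addition)."""
    q = anchor_emb
    for r in relation_embs:
        q = q + r
    return q

def answer_query(q, entity_embs, k=5):
    """Answer entities are those whose embeddings are closest to the query point."""
    dists = torch.norm(entity_embs - q, dim=-1)
    return torch.topk(-dists, k).indices
```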
Conjunctive Queries
Can we answer more complex queries with logic conjunction operation?
- Conjunctive Queries: “What are drugs that cause Short of Breath and treat diseases associated with protein ESR2?” ((e:ESR2, (r:Assoc, r:TreatedBy)), (e:Short of Breath, (r:CausedBy)))
How do we answer the question by KG traversal?
Traverse KG from anchor nodes: ESR2 and Short of Breath:
- How can we use embeddings to implicitly impute the missing (ESR2, Assoc, Breast Cancer)?
- Intuition: ESR2 interacts with both BRCA1 and ESR1. Both proteins are associated with breast cancer
Traversing KG in Vector Space
“What are drugs that cause Short of Breath and treat diseases associated with protein ESR2?”
((e:ESR2, (r:Assoc, r:TreatedBy)), (e:Short of Breath, (r:CausedBy)))
Each intermediate node represents a set of entities; how do we represent it? How do we define the intersection operation in the latent space?
Query2box: Reasoning over KGs Using Box Embeddings
Conjunctive Queries
How can we answer more complex queries with logical conjunction operation?
(1) Each intermediate node represents a set of entities; how do we represent it?
(2) How do we define the intersection operation in the latent space?
Box Embeddings
Embed queries with hyper-rectangles (boxes)
Key Insight: Intersection
- Intersection of boxes is well-defined!
- When we traverse the KG to find the answers, each step produces a set of reachable entities.
- How can we better model these sets?
- Boxes are a powerful abstraction, as we can project the center and control the offset to model the set of entities enclosed in the box
Embed with Box Embedding
Projection Operator
Embed with Box Embedding
How do we take intersection of boxes?
Intersection Operator
Use box intersection operator
Entity-to-box Distance
Extending to Union Operation
- Can we embed complex queries with union? E.g.: “What drug can treat breast cancer or lung cancer?”
- Conjunctive queries + disjunction is called Existential Positive First-order (EPFO) queries. We’ll refer to them as AND-OR queries.
- Can we also design a disjunction operator and embed AND-OR queries in low-dimensional vector space?
Embedding AND-OR Queries
- Can we embed AND-OR queries in a low-dimensional vector space?
- No! Intuition: Allowing union over arbitrary queries requires high-dimensional embeddings!
- Example:
- Example 2:
Can we embed AND-OR queries in low-dimensional vector space?
Since we cannot embed AND-OR queries in low-dimensional space, can we still handle them?
- Key idea: take all unions out and only do union at the last step!
Disjunctive Normal Form
Any AND-OR query can be transformed into equivalent DNF, i.e., disjunction of conjunctive queries.
Distance Between q and an Entity
How to Train Query2box
Training Overview
How do we obtain a query, its answers, and its negative answers from the KG to train the parameters?
How to split the KG for query answering?
Training
Query Generation from Templates
Generate queries from multiple query templates:
- How can we generate a complex query?
- We start with a query template
- Query template can be viewed as an abstraction of the query
- We generate a query by instantiating every variable with a concrete entity and relation from the KG
- E.g., instantiate Anchor1 with ESR2 (a node on KG)
- E.g., instantiate Rel1 with Assoc (an edge on KG)
- How to instantiate query template given a KG?
Example of Query2box
Visualization
- What do box embeddings actually learn? Example: “List male instrumentalists who play string instruments”
- We use t-SNE to reduce the embedding space to a 2-dimensional space, in order to visualize the query results
Embedding Space
Summary
We introduce answering predictive queries on large knowledge graphs.
The key idea is to embed queries by navigating the embedding space!
We embed the query by composing learned operators
Embedding of the query is close to its answers in the embedding space
Fast Neural Subgraph Matching and Counting
Subgraphs
Subgraphs are the building blocks of networks:
They have the power to characterize and discriminate networks
Building Blocks of Networks
Subgraphs and Motifs
Definition
Two ways to formalize “network building blocks”
Alternate terminology: “induced subgraph”
Alternate terminology: “non-induced subgraph” or just “subgraph”
The best definition depends on the domain! Examples:
- Chemistry: Node-induced (functional groups)
- Knowledge graphs: Often edge-induced (focus is on edges representing logical relations)
Graph Isomorphism
Graph isomorphism problem: Check whether two graphs are identical:
Case Example of Subgraphs
All non-isomorphic, connected, undirected graphs of size 4
All non-isomorphic, connected, directed graphs of size 3
Network Motifs
Network motifs: “recurring, significant patterns of interconnections”
How to define a network motif:
- Pattern: Small (node-induced) subgraph
- Recurring: Found many times, i.e., with high frequency
- Significant: More frequent than expected, i.e., than in randomly generated graphs
- How to define frequency? How to define random graphs?
Motifs: Induced Subgraphs
Why Do We Need Motifs?
Subgraph Frequency
What if the dataset contains multiple graphs, and we want to compute frequency of subgraphs in the dataset?
Solution: Treat the dataset as a giant graph G_T with disconnected components corresponding to individual graphs
Defining Motif Significance
To define significance, we need to have a null-model (i.e., point of comparison).
Key idea: Subgraphs that occur in a real network much more often than in a random network have functional significance.
Defining Random Graphs
New Model: Configuration Model
Alternative for Spokes: Switching
Intuition: Motifs are overrepresented in a network when compared to random graphs:
- Step 1: Count motifs in the given graph (G_real)
- Step 2: Generate random graphs with similar statistics (e.g. number of nodes, edges, degree sequence), and count motifs in the random graphs
- Step 3: Use statistical measures to evaluate how significant each motif is
- Use Z-score
Z-score for Statistical Significance
For each subgraph:
z-score metric is capable of classifying the subgraph “significance”:
Negative values indicate under-representation
Positive values indicate over-representation
We create a network significance profile:
- A feature vector with values for all subgraph types
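A minimal sketch of the z-score and significance-profile computation, assuming motif counts for G_real and for a set of random graphs are already available (the profile is the z-score vector scaled to unit length; names are illustrative):

```python
import numpy as np

def motif_z_scores(counts_real, counts_rand):
    """counts_real: [num_motif_types] counts in G_real.
    counts_rand:  [num_random_graphs, num_motif_types] counts in the random graphs."""
    mean_rand = counts_rand.mean(axis=0)
    std_rand = counts_rand.std(axis=0)
    return (counts_real - mean_rand) / std_rand

def significance_profile(z):
    """Normalize the z-score vector to unit length so that profiles of graphs
    with very different sizes can be compared."""
    return z / np.sqrt((z ** 2).sum())
```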
Next: Compare profiles of different graphs with random graphs:
- Regulatory network (gene regulation)
- Neuronal network (synaptic connections)
- World Wide Web (hyperlinks between pages)
- Social network (friendships)
- Language networks (word adjacency)
Summary: Detecting Motifs
Variations on the Motif Concept
Extensions:
- Directed and undirected
- Colored and uncolored
- Temporal and static motifs
Variations on the concept:
- Different frequency concepts
- Different significance metrics
- Under-Representation (anti-motifs)
- Different null models
Subgraphs and motifs are the building blocks of graphs
- Subgraph isomorphism and counting are NP-hard
Understanding which motifs are frequent or significant in a dataset gives insight into the unique characteristics of that domain
Use random graphs as null model to evaluate the significance of motif via Z-score
Neural Subgraph Matching
Given:
- Large target graph (can be disconnected)
- Query graph (connected)
Decide:
- Is a query graph a subgraph in the target graph?
Isomorphism as an ML Task
- Large target graph (can be disconnected)
- Query graph (has to be connected)
- Use GNN to predict subgraph isomorphism:
- Intuition: Exploit the geometric shape of embedding space to capture the properties of subgraph isomorphism
Task Setup
Consider a binary prediction: Return True if query is isomorphic to a subgraph of the target graph, else return False
Finding node correspondences between Q and T is another challenging problem, which will not be covered in this lecture.
Overview of the Approach
Neural Architecture for Subgraphs
(1) We are going to work with node-anchored definitions:
(2) We are going to work with node-anchored neighborhoods:
Why Anchor?
Order Embedding Space
Why Order Embedding Space?
Subgraph isomorphism relationship can be nicely encoded in order embedding space
All properties have their counter-parts in the order embedding space
Order Constraint
- We use a GNN to learn to embed neighborhoods and preserve the order embedding structure
- What loss function should we use, so that the learned order embedding reflects the subgraph relationship?
- We design loss functions based on the order constraint:
- Order constraint specifies the ideal order embedding property that reflects subgraph relationships
We specify the order constraint to ensure that the subgraph properties are preserved in the order embedding space
Loss Function
GNN Embeddings are learned by minimizing a max-margin loss
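One common formulation of this loss (a hedged sketch; the margin value and names are illustrative): the order violation E(G_q, G_t) is zero when the query embedding is element-wise below the target embedding, positive pairs minimize it, and negative pairs are pushed to violate it by at least a margin:

```python
import torch

def order_violation(z_q, z_t):
    """E(G_q, G_t): how much the order constraint z_q <= z_t (element-wise) is violated."""
    return torch.clamp(z_q - z_t, min=0).pow(2).sum(dim=-1)

def max_margin_loss(z_q, z_t, is_subgraph, margin=1.0):
    """Positive pairs (query is a subgraph): push the violation to 0.
    Negative pairs: push the violation to be at least `margin`."""
    e = order_violation(z_q, z_t)
    return torch.where(is_subgraph, e, torch.clamp(margin - e, min=0)).mean()
```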
Training Neural Subgraph Matching
Training Details
- How many training examples to sample?
- At every iteration, we sample new training pairs
- Benefit: Every iteration, the model sees different subgraph examples
- Improves performance and avoids overfitting, since there is an exponential number of possible subgraphs to sample from
- How deep is the BFS sampling?
- A hyper-parameter that trades off runtime and performance
- Usually use 3-5, depending on size of the dataset
Subgraph Predictions on New Graphs
Summary: Neural Subgraph Matching
- Neural subgraph matching uses a machine learning-based approach to learn the NP-hard problem of subgraph isomorphism
- Given query and target graph, it embeds both graphs into an order embedding space
- Using these embeddings, it then computes E(G_q, G_t) to determine whether query is a subgraph of the target
- Embedding graphs within an order embedding space allows subgraph isomorphism to be efficiently represented and tested by the relative positions of graph embeddings
Finding Frequent Subgraphs
Intro
Generally, finding the most frequent size-k motifs requires solving two challenges:
1) Enumerating all size-k connected subgraphs
2) Counting #(occurrences of each subgraph type)
Why is it Hard?
- Just knowing whether a certain subgraph exists in a graph is a hard computational problem!
- Subgraph isomorphism is NP-complete
- Computation time grows exponentially as the size of the subgraphs increases
- Feasible motif size for traditional methods is relatively small (3 to 7)
Solution with Representation Learning
- Finding frequent subgraph patterns is computationally hard
- Combinatorial explosion of number of possible patterns
- Counting subgraph frequency is NP-hard
- Representation learning can tackle these challenges:
- Combinatorial explosion -> organize the search space
- Subgraph isomorphism -> prediction using GNN
Representation learning can tackle these challenges:
- Counting #(occurrences of each subgraph type)
- Solution: Use GNN to “predict” the frequency of the subgraph.
- Enumerating all size-k connected subgraphs
- Solution: Don’t enumerate subgraphs but construct a size-k subgraph incrementally
- Note: We are only interested in high frequency subgraphs
Problem Setup: Frequent Motif Mining
SPMiner: Overview
SPMiner: A neural model to identify frequent motifs
Key Idea
Motif Frequency Estimation
SPMiner Search Procedure
Termination: Upon reaching a desired motif size, take the subgraph of the target graph induced by S.
Results: Small Motifs
- Ground-truth: Find most frequent 10 motifs in dataset by brute-force exact enumeration (expensive)
- Question: Can the model identify frequent motifs?
- Result: The model identifies 9 and 8 of the top 10 motifs, respectively.
Experiments: Large motifs
- Question: How do the frequencies of the identified motif compare?
- Result: SPMiner identifies motifs that appear 10-100x more frequently than the baselines
Summary
- Subgraphs and motifs are important concepts that provide insights into the structure of graphs. Their frequency can be used as features for nodes/graphs.
- We covered neural approaches to predicting the subgraph isomorphism relationship.
- Order embeddings have desirable properties and can be used to encode subgraph relations
- Neural embedding-guided search in the order embedding space enables an ML model to identify motifs that are much more frequent than those found by existing methods
GNNs for Recommender Systems
Recommender Systems Task and Evaluation
Preliminary of Recommendation
Information Explosion in the era of Internet
- 10K+ movies in Netflix
- 12M products in Amazon
- 70M+ music tracks in Spotify
- 10B+ videos on YouTube
- 200B+ pins (images) in Pinterest
Personalized recommendation (i.e., suggesting a small number of interesting items for each user) is critical for users to effectively explore the content of their interest.
Recommender System as a Graph
A recommender system can be naturally modeled as a bipartite graph
- A graph with two node types: users and items.
- Edges connect users and items
- Indicates user-item interaction (e.g., click, purchase, review etc.)
- Often associated with timestamp (timing of the interaction).
Recommendation Task
- Given
- Past user-item interactions
- Task
- Predict new items each user will interact with in the future.
- Can be cast as a link prediction problem.
- Predict new user-item interaction edges given the past edges.
Top-k Recommendation
Evaluation Metric: Recall@K
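A minimal sketch of Recall@K for a single user, assuming a top-K recommendation list and the user's held-out positive items:

```python
def recall_at_k(recommended_topk, positive_items):
    """Recall@K for one user: the fraction of the user's held-out positive items
    that appear in the top-K recommendation list."""
    hits = len(set(recommended_topk) & set(positive_items))
    return hits / max(len(positive_items), 1)

# Example: 2 of the user's 3 positive items are in the top-4 list -> 0.667
print(recall_at_k([7, 2, 9, 4], [2, 4, 5]))
```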
Recommender Systems Embedding-based Models
Notation
Score Function
Training Objective
Embedding-based models have three kinds of parameters:
Training objective: Optimize the model parameters to achieve high recall@K on seen (i.e., training) user-item interactions
- We hope this objective would lead to high recall@K on unseen (i.e., test) interactions.
Surrogate Loss Functions
- The vanilla training objective (training recall@K) is not differentiable.
- Cannot apply efficient gradient-based optimization.
- Two surrogate loss functions are widely-used to enable efficient gradient-based optimization.
- Binary loss
- Bayesian Personalized Ranking (BPR) loss
- Surrogate losses are differentiable and should align well with the original training objective.
Binary Loss
Binary loss pushes the scores of positive edges higher than those of negative edges.
- This aligns with the training recall metric since positive edges need to be recalled.
Issue:
- In the binary loss, the scores of ALL positive edges are pushed higher than those of ALL negative edges.
- This would unnecessarily penalize model predictions even if the training recall metric is perfect.
- Why? (example in the next slide)
- Key insight: The binary loss is non-personalized in the sense that the positive/negative edges are considered across ALL users at once.
- However, the recall metric is inherently personalized (defined for each user).
- The non-personalized binary loss is overly-stringent for the personalized recall metric.
Desirable Surrogate Loss
Lesson learned: Surrogate loss function should be defined in a personalized manner.
For each user, we want the scores of positive items to be higher than those of the negative items
We do not care about the score ordering across users.
Bayesian Personalized Ranking (BPR) loss achieves this!
Bayesian Personalized Ranking (BPR) loss is a personalized surrogate loss that aligns better with the recall@K metric.
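A hedged sketch of the BPR loss for a batch of (positive score, sampled negative score) pairs, where each pair belongs to one user:

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores):
    """Bayesian Personalized Ranking: for each user, push the score of a positive
    item above the score of a sampled negative item (the ranking is per-user)."""
    return -F.logsigmoid(pos_scores - neg_scores).mean()
```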
Summary
- We have introduced
- Recall@K as a metric for personalized recommendation
- Embedding-based models
- Three kinds of parameters to learn
- user encoder to generate user embeddings
- item encoder to generate item embeddings
- score function to predict the user-item interaction likelihood.
- Three kinds of parameters to learn
- Surrogate loss functions to achieve the high recall metric.
- Embedding-based models have achieved SoTA in recommender systems.
- Why do they work so well?
Why Embedding Models Work?
- Underlying idea:
- Collaborative filtering
- Recommend items for a user by collecting preferences of many other similar users.
- Similar users tend to prefer similar items.
- Key question: How to capture similarity between users/items?
Embedding-based models can capture similarity of users/items!
- Low-dimensional embeddings cannot simply memorize all user-item interaction data.
- Embeddings are forced to capture similarity between users/items to fit the data.
- This allows the models to make effective prediction on unseen user-item interactions.
This Lecture: GNNS for Recsys
Neural Graph Collaborative Filtering (NGCF) [Wang et al. 2019], LightGCN [He et al. 2020]
- Improve the conventional collaborative filtering models (i.e., matrix factorization) by explicitly modeling graph structure using GNNs.
PinSAGE [Ying et al. 2018]
- Use GNNs to generate high-quality embeddings by simultaneously capturing rich node attributes (e.g., images) and the graph structure.
Neural Graph Collaborative Filtering
Conventional Collaborative Filtering
Limitations of MF
The model itself does not explicitly capture graph structure
- The graph structure is only implicitly captured in the training objective.
Only the first-order graph structure (i.e., edges) is captured in the training objective.
- High-order graph structure (e.g., k-hop paths between two nodes) is not explicitly captured.
Motivation
- We want a model that…
- explicitly captures graph structure (beyond implicitly through the training objective)
- captures high-order graph structure (beyond the first-order edge connectivity structure)
- GNNs are a natural approach to achieve both!
- Neural Graph Collaborative Filtering (NGCF) [Wang et al. 2019]
- LightGCN [He et al. 2020]
- A simplified and improved version of NGCF
NGCF: Overview
Neural Graph Collaborative Filtering (NGCF) explicitly incorporates high-order graph structure when generating user/item embeddings.
Key idea: Use a GNN to generate graph-aware user/item embeddings.
Framework
- Given: User-item bipartite graph.
- NGCF framework:
- Prepare shallow learnable embedding for each node.
- Use multi-layer GNNs to propagate embeddings along the bipartite graph.
- High-order graph structure is captured.
- Final embeddings are explicitly graph-aware!
- Two kinds of learnable params are jointly learned:
- Shallow user/item embeddings
- GNN’s parameters
Initial Node Embeddings
Neighbor Aggregation
Final Embeddings and Score Function
NGCF: Summary
- Conventional collaborative filtering uses shallow user/item embeddings.
- The embeddings do not explicitly model graph structure.
- The training objective does not model high-order graph structure.
- NGCF uses a GNN to propagate the shallow embeddings.
- The embeddings are explicitly aware of high-order graph structure.
LightGCN
Motivation
- Recall: NGCF jointly learns two kinds of parameters:
- Shallow user/item embeddings
- GNN’s parameters
- Observation: Shallow learnable embeddings are already quite expressive.
- Can we simplify the GNN used in NGCF (e.g., remove its learnable parameters)?
- Answer: Yes!
- Bonus: Simplification improves the recommendation performance!
- Overview of the idea:
- Adjacency matrix for a bipartite graph
- Matrix formulation of GCN
- Simplification of GCN by removing non-linearity
- Related: SGC for scalable GNN [Wu et al. 2019]
Adjacency and Embedding Matrices
Adjacency matrix of an (undirected) bipartite graph.
Shallow embedding matrix
Matrix Formulation of GCN
Simplifying GCN
Multi-scale Diffusion
LightGCN: Model Overview
Given:
- Adjacency matrix A
- Initial learnable embedding matrix E
Average the embedding matrices at different scales
Score function:
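A minimal sketch of LightGCN-style diffusion and scoring, assuming a dense adjacency matrix A over all user and item nodes and an initial embedding matrix E0 (names and the dense formulation are illustrative; real implementations use sparse operations):

```python
import torch

def lightgcn_embeddings(A, E0, num_layers=3):
    """LightGCN diffusion (sketch): repeatedly multiply by the degree-normalized
    adjacency and average the embedding matrices at all scales (including E0)."""
    deg = A.sum(dim=1).clamp(min=1)
    d_inv_sqrt = deg.pow(-0.5)
    A_norm = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)
    embs, E = [E0], E0
    for _ in range(num_layers):
        E = A_norm @ E
        embs.append(E)
    return torch.stack(embs).mean(dim=0)

def score(user_emb, item_emb):
    return (user_emb * item_emb).sum(dim=-1)   # inner-product score
```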
LightGCN: Intuition
Question: Why does the simple diffusion propagation work well?
Answer: The diffusion directly encourages the embeddings of similar users/items to be similar.
- Similar users share many common neighbors (items) and are expected to have similar future preferences (interact with similar items)
LightGCN and GCN/C&S
LightGCN and MF: Comparison
- Both LightGCN and Matrix Factorization (MF) learn a unique embedding for each user/item.
- The difference is that
- MF directly uses the shallow user/item embeddings for scoring.
- LightGCN uses the diffused user/item embeddings for scoring.
- LightGCN performs better than MF but is also more computationally expensive due to the additional diffusion step.
- The final embedding of a user/item is obtained by aggregating embeddings of its multi-hop neighboring nodes.
Summary
- LightGCN simplifies NGCF by removing the learnable parameters of GNNs.
- Learnable parameters are all in the shallow input node embeddings.
- Diffusion propagation only involves matrix-vector multiplication.
- The simplification leads to better empirical performance than NGCF
PinSAGE
Motivation
P2P recommendation
PinSAGE: Pin Embeddings
- Unifies visual, textual, and graph information.
- The largest industry deployment of Graph Convolutional Networks
- Huge Adoption across Pinterest
- Works for fresh content and is available in a few seconds after pin creation
Graph Convolutional Neural Networks for Web-Scale Recommender Systems
PinSage graph convolutional network:
Goal: Generate embeddings for nodes in a large-scale Pinterest graph containing billions of objects
Key Idea: Borrow information from nearby nodes
- E.g., bed rail Pin might look like a garden fence, but gates and beds are rarely adjacent in the graph
Pin embeddings are essential to various tasks like recommendation of Pins, classification, ranking
Services like “Related Pins”, “Search”, “Shopping”, “Ads”
Harnessing Pins and Boards
PinSAGE: Graph Neural Network
- Graph has tens of billions of nodes and edges
- Further resolves embeddings across the Pinterest graph
PinSAGE: Methods for Scaling Up
In addition to the GNN model, the PinSAGE paper introduces several methods to scale the GNN to a billion-scale recommender system (e.g., Pinterest).
- Shared negative samples across users in a mini-batch
- Hard negative samples
- Curriculum learning
- Mini-batch training of GNNs on a large-graph (to be covered in the future lecture)
PinSAGE Model
Training Data
1B+ repin pairs:
From Related Pins surface
Capture semantic relatedness
Goal: Embed such pairs to be “neighbors”
Example positive training pairs (Q,X):
Shared Negative Samples
Curriculum Learning
- Key insight: It is effective to make the negative samples gradually harder in the process of training.
- At n-th epoch, we add n − 1 hard negative items.
- #(Hard negatives) gradually increases in the process of training.
- The model will gradually learn to make finer-grained predictions
Idea: use harder and harder negative samples
Include more and more hard negative samples for each epoch
Hard Negatives
Challenge: Industrial recsys needs to make extremely fine-grained predictions.
#Total items: Up to billions.
#Items to recommend for each user: 10 to 100.
Issue: The shared negative items are randomly sampled from all items
- Most of them are “easy negatives”, i.e., a model does not need to be fine-grained to distinguish them from positive items.
We need a way to sample “hard negatives” to force the model to be fine-grained
- For each user node, the hard negatives are item nodes that are close (but not connected) to the user node in the graph.
- Hard negatives for user u ∈ U are obtained as follows:
- Compute personalized page rank (PPR) for user u.
- Sort items in the descending order of their PPR scores.
- Randomly sample item nodes that are ranked high but not too high, e.g., ranked 2000th to 5000th.
- Item nodes that are close but not too close (connected) to the user node.
- The hard negatives for each user are used in addition to the shared negatives.
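A hedged sketch of PPR-based hard-negative sampling with networkx (the rank band follows the slide; function and variable names are illustrative):

```python
import random
import networkx as nx

def hard_negatives(G, user, item_nodes, lo=2000, hi=5000, num_samples=100):
    """Sample hard negatives for `user` (sketch): rank items by Personalized
    PageRank from the user and sample from a 'close but not too close' band."""
    ppr = nx.pagerank(G, personalization={user: 1.0})
    connected = set(G.neighbors(user))               # exclude items already linked to the user
    ranked = sorted((i for i in item_nodes if i not in connected),
                    key=lambda i: ppr.get(i, 0.0), reverse=True)
    band = ranked[lo:hi]
    return random.sample(band, min(num_samples, len(band)))
```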
PinSAGE: Negative Sampling
- (q, p) positive pairs are given but various methods to sample negatives to form (q, p, n)
- Distance Weighted Sampling (Wu et al., 2017): Sample negatives so that the query-negative distance distribution is approximately U[0.5, 1.4]
Fine-grained Object Similarity
Compare against Prod
PinSAGE: Summary
- PinSAGE uses GNNs to generate high-quality user/item embeddings that capture both the rich node attributes and graph structure.
- The PinSAGE model is effectively trained using sophisticated negative sampling strategies.
- PinSAGE is successfully deployed at Pinterest, a billion-scale image content recommendation service.
- Uncovered in this lecture: How to scale up GNNs to large-scale graphs. Will be covered in a later lecture