CS224W-Machine Learning with Graphs

CS224W | Home (stanford.edu)


Machine Learning with Graphs

Why is Graph Deep Learning Hard?

  • Networks are complex
    • Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
    • No fixed node ordering or reference point
    • Often dynamic and have multimodal features

Deep Learning in Graphs

Representation Learning

(Supervised) Machine Learning Lifecycle: This feature, that feature. Every single time!

Map nodes to d-dimensional embeddings such that similar nodes in the network are embedded close together

Course Outline

We are going to cover various topics in Machine Learning and Representation Learning for graph structured data:

  • Traditional methods: Graphlets, Graph Kernels
  • Methods for node embeddings: DeepWalk, Node2Vec
  • Graph Neural Networks: GCN, GraphSAGE, GAT, Theory of GNNs
  • Knowledge graphs and reasoning: TransE, BetaE
  • Deep generative models for graphs: GraphRNN
  • Applications to Biomedicine, Science, Industry

Applications of Graph ML

Different Types of Tasks

Classic Graph ML Tasks

  • Node classification: Predict a property of a node
    • Example: Categorize online users / items
  • Link prediction: Predict whether there are missing links between two nodes
    • Example: Knowledge graph completion
  • Graph classification: Categorize different graphs
    • Example: Molecule property prediction
  • Clustering: Detect if nodes form a community
    • Example: Social circle detection
  • Other tasks:
    • Graph generation: Drug discovery
    • Graph evolution: Physical simulation

Node-level ML Tasks

Protein Folding

A protein chain acquires its native 3D structure

The Protein Folding Problem

Computationally predict a protein’s 3D structure based solely on its amino acid sequence

Alphafold: Impact

  • Key idea: “Spatial graph”
    • Nodes: Amino acids in a protein sequence
    • Edges: Proximity between amino acids (residues)

Edge-level ML Tasks

Recommender Systems

  • Users interacts with items
    • Watch movies, buy merchandise, listen to music
    • Nodes: Users and items
    • Edges: User-item interactions
  • Goal: Recommend items users might like

Pinsage: Graph-based Recommender

Ying et al., Graph Convolutional Neural Networks for Web-Scale Recommender Systems, KDD 2018

Drug Side Effects

  • Many patients take multiple drugs to treat complex or co-existing diseases:
    • 46% of people ages 70-79 take more than 5 drugs
    • Many patients take more than 20 drugs to treat heart disease, depression, insomnia, etc.
  • Task: Given a pair of drugs predict adverse side effects

Zitnik et al., Modeling Polypharmacy Side Effects with Graph Convolutional Networks, Bioinformatics 2018

Subgraph-level ML Tasks

Traffic Prediction

Road Network as a Graph

  • Nodes: Road segments
  • Edges: Connectivity between road segments
  • Prediction: Time of Arrival (ETA)

Graph-level ML Tasks

Drug Discovery

  • Antibiotics are small molecular graphs
    • Nodes: Atoms
    • Edges: Chemical bonds

Konaklieva, Monika I. “Molecular targets of β-lactam-based antimicrobials: beyond the usual suspects.” Antibiotics 3.2 (2014): 128-142.

Deep Learning for Antibiotic Discovery

Stokeset al., A Deep Learning Approach to Antibiotic Discovery, Cell 2020

  • A Graph Neural Network graph classification model
  • Predict promising molecules from a pool of candidates

Molecule Generation/Optimization

Youet al., Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation, NeurIPS 2018

Physics Simulation

Sanchez-Gonzalez et al., Learning to simulate complex physics with graph networks, ICML 2020

  • Physical simulation as a graph:
    • Nodes: Particles
    • Edges: Interaction between particles

Simulation Learning Framework

A graph evolution task:

  • Goal: Predict how a graph will evolve over

Choice of Graph Representation

Components of a Network

Choosing a Proper Representation

  • If you connect individuals that work with each other, you will explore a professional network
  • If you connect those that have a sexual relationship, you will be exploring sexual networks
  • If you connect scientific papers that cite each other, you will be studying the citation network
  • If you connect all papers with the same word in the title, what will you be exploring? It is a network, nevertheless

How do you define a graph

  • How to build a graph:
    • What are nodes?
    • What are edges?
  • Choice of the proper network representation of a given domain/problem determines our ability to use networks successfully:
    • In some cases, there is a unique, unambiguous representation
    • In other cases, the representation is by no means unique
    • The way you assign links will determine the nature of the question you can study

Directed vs Undirected Graphs

Heterogeneous Graphs

Node Degrees

Bipartite Graph

  • Bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets
  • Examples:
    • Authors-to-Papers (they authored)
    • Actors-to-Movies (they appeared in)
    • Users-to-Movies (they rated)
    • Recipes-to-Ingredients (they contain)
  • “Folded” networks:
    • Author collaboration networks
    • Movie co-rating networks

FoldedProjected Bipartite Graphs

Representing Graphs: Adjacency Matrix

Networks are Sparse Graphs

Representing Graphs: Edge list

Representing Graphs: Adjacency list

  • Adjacency list:

    • Easier to work with if network is

      • Large

      • Sparse

    • Allows us to quickly retrieve all neighbors of a given node

      • 1:

      • 2: 3, 4

      • 3: 2, 4

      • 4: 5

      • 5: 1, 2

Node and Edge Attributes

  • Possible options:
    • Weight (e.g., frequency of communication)
    • Ranking (best friend, second best friend…)
    • Type (friend, relative, co-worker)
    • Sign: Friend vs. Foe, Trust vs. Distrust
    • Properties depending on the structure of the rest of the graph: Number of common friends

More Types of Graphs

Connectivity of Undirected Graphs

  • Connected (undirected) graph:
    • Any two vertices can be joined by a path
  • A disconnected graph is made up by two or more connected components

Connectivity: Example

The adjacency matrix of a network with several components can be written in a block- diagonal form, so that nonzero elements are confined to squares, with all other elements being zero:

Connectivity of Directed Graphs

  • Strongly connected directed graph
    • has a path from each node to every other node and vice versa (e.g., A-B path and B-A path)
  • Weakly connected directed graph
    • is connected if we disregard the edge directions

Strongly connected components (SCCs) can be identified, but not every node is part of a nontrivial strongly connected component.

Traditional Methods for Machine Learning in Graphs

Traditional Methods for Machine Learning in Graphs

Traditional ML Pipeline

Design features for nodes/links/graphs

Obtain features for all training data

Feature Desian

  • Using effective features over graphs is the key to achieving good model performance.

  • Traditional ML pipeline uses hand-designed features.

  • In this lecture, we overview the traditional features for:

    • Node-level prediction
    • Link-level prediction
    • Graph-level prediction
  • For simplicity, we focus on undirected graphs.

  • Goal: Make predictions for a set of objects

  • Design choices:

    • Features: d-dimensional vectors
    • Objects: Nodes, edges, sets of nodes, entire graphs
    • Objective function:
      • What task are we aiming to solve?

Node-level Tasks and Features

Node-level Features: Overview

  • Goal: Characterize the structure and position of a node in the network:
    • Node degree
    • Node centrality
    • Clustering coefficient
    • Graphlets

Node Degree

Node Centrality

  • Node degree counts the neighboring nodes without capturing their importance.
  • Node centrality 𝑐_𝑣 takes the node importance in a graph into account
  • Different ways to model importance:
    • Engienvector centrality
    • Betweenness centrality
    • Closeness centrality
    • and many others…

Eigenvector centrality

Betweenness centrality

Closeness centrality

Clustering Coefficient


  • Goal: Describe network structure around node 𝑢

    • Graphlets are small subgraphs that describe the structure of node 𝑢’s network neighborhood Analogy:
  • Degree counts #(edges) that a node touches

  • Clustering coefficient counts #(triangles) that a node touches.

  • Graphlet Degree Vector (GDV): Graphlet-base features for nodes

    • GDV counts #(graphlets) that a node touches
  • Considering graphlets of size 2-5 nodes we get:

    • Vector of 73 coordinates is a signature of a node that describes the topology of node’s neighborhood
  • Graphlet degree vector provides a measure of a node’s local network topology:

    • Comparing vectors of two nodes provides a more detailed measure of local topological similarity than node degrees or clustering coefficient.

Induced Subgraph Isomorphism

Def: Induced subgraph is another graph, formed from a subset of vertices and all of the edges connecting the vertices in that subset.

Def: Graph Isomorphism(同构)

  • Two graphs which contain the same number of nodes connected in the same way are said to be isomorphic.

Graphlet Degree Vector (GDV): A count vector of graphlets rooted at a given node

Node-level Feature: Summary

  • We have introduced different ways to obtain node features.

  • They can be categorized as:

    • Importance-based features:
      • Node degree
      • Different node centrality measures
    • Structure-based features:
      • Node degree
      • Clustering coefficient
      • Graphlet count vector
  • Importance-based features: capture the importance of a node in a graph

    • Node degree:
      • Simply counts the number of neighboring nodes
    • Node centrality:
      • Models importance of neighboring nodes in a graph
      • Different modeling choices: eigenvector centrality, betweenness centrality, closeness centrality
  • Useful for predicting influential nodes in a graph

    • Example: predicting celebrity users in a social network
  • Structure-based features: Capture topological properties of local neighborhood around a node.

    • Node degree:
      • Counts the number of neighboring nodes
    • Clustering coefficient:
      • Measures how connected neighboring nodes are
    • Graphlet degree vector:
      • Counts the occurrences of different graphlets
  • Useful for predicting a particular role a node plays in a graph:

    • Example: Predicting protein functionality in a protein-protein interaction network


  • The task is to predict new links based on the existing links.
  • At test time, node pairs (with no existing links) are ranked, and top 𝐾 node pairs are predicted.
  • The key is to design features for a pair of nodes.

Two formulations of the link prediction task:

  • Methodology:
    • For each pair of nodes (x,y) compute score c(x,y)
      • For example, c(x,y) could be the # of common neighbors of x and y
    • Sort pairs (x,y) by the decreasing score c(x,y)
    • Predict top n pairs as new links
    • See which of these links actually appear in 𝐺[𝑡1 ,𝑡1 ′ ]

  • Distance-based feature
  • Local neighborhood overlap
  • Global neighborhood overlap

Distance-based Features

Shortest-path distance between two nodes

However, this does not capture the degree of neighborhood overlap:

  • Node pair (B, H) has 2 shared neighboring nodes, while pairs (B, E) and (A, B) only have 1 such node

Local Neighborhood Overlap

Global Neighborhood Overlap

  • Limitation of local neighborhood features:

    • Metric is always zero if the two nodes do not have any neighbors in common.

    • However, the two nodes may still potentially be connected in the future.

  • Global neighborhood overlap metrics resolve the limitation by considering the entire graph.

  • Katz index: count the number of walks of all lengths between a given pair of nodes.
  • Q: How to compute #walks between two nodes?
    • Use powers of the graph adjacency matrix!

Intuition: Powers of Adj Matrices

Computing #walks between two nodes


  • Distance-based features:
    • Uses the shortest path length between two nodes but does not capture how neighborhood overlaps.
  • Local neighborhood overlap:
    • Captures how many neighboring nodes are shared by two nodes.
    • Becomes zero when no neighbor nodes are shared.
  • Global neighborhood overlap:
    • Uses global graph structure to score two nodes.
    • Katz index counts #walks of all lengths between two nodes.

Graph-level Features and Graph Kernels

Graph-level Features

Goal: We want features that characterize the structure of an entire graph

Backaround Kernel Methods

  • Kernel methods are widely-used for traditional ML for graph-level prediction.
  • Idea: Design kernels instead of feature vectors.
  • A quick introduction to Kernels:


  • Graph Kernels: Measure similarity between two graphs:
    • Graphlet Kernel [1]
    • Weisfeiler-Lehman Kernel [2]
    • Other kernels are also proposed in the literature (beyond the scope of this lecture)
      • Random-walk kernel
      • Shortest-path graph kernel
      • And many more…

[1] Shervashidze, Nino, et al. “Efficient graphlet kernels for large graph comparison.” Artificial Intelligence and Statistics. 2009.

[2] Shervashidze, Nino, et al. “Weisfeiler-lehman graph kernels.” Journal of Machine Learning Research 12.9 (2011).

Graph Kernel: Key Idea

  • Goal: Design graph feature vector 𝜙(𝐺)
  • Key idea: Bag-of-Words (BoW) for a graph
    • Recall: BoW simply uses the word counts as features for documents (no ordering considered).
    • Naïve extension to a graph: Regard nodes as words.
    • Since both graphs have 4 red nodes, we get the same feature vector for two different graphs…

What if we use Bag of node degrees?

Both Graphlet Kernel and Weisfeiler-Lehman (WL) Kernel use Bag-of-* representation of graph, where * is more sophisticated than node degrees!

Graphlet Features

  • Key idea: Count the number of different graphlets in a graph.
    • Note: Definition of graphlets here is slightly different from node-level features.
    • The two differences are:
      • Nodes in graphlets here do not need to be connected (allows for isolated nodes)
      • The graphlets here are not rooted.
      • Examples in the next slide illustrate this.

Graphlet Kernel

  • Limitations: Counting graphlets is expensive!

  • Counting size-𝑘 graphlets for a graph with size 𝑛 by enumeration takes 𝑛^𝑘 .

  • This is unavoidable in the worst-case since subgraph isomorphism test (judging whether a graph is a subgraph of another graph) is NP-hard.

  • If a graph’s node degree is bounded by 𝑑, an 𝑂(𝑛𝑑^(𝑘−1)) algorithm exists to count all the graphlets of size 𝑘.

    Can we design a more efficient graph kernel?

Weisfeiler-lehman Kernel

  • Goal: Design an efficient graph feature descriptor 𝜙 (𝐺)
  • Idea: Use neighborhood structure to iteratively enrich node vocabulary.
  • Generalized version of Bag of node degrees since node degrees are one-hop neighborhood information.
  • Algorithm to achieve this: Color refinement

Example of color refinement given two graphs

After color refinement, WL kernel counts number of nodes with a given color.

The WL kernel value is computed by the inner product of the color count vectors:

  • WL kernel is computationally efficient
    • The time complexity for color refinement at each step is linear in #(edges), since it involves aggregating neighboring colors.
  • When computing a kernel value, only colors appeared in the two graphs need to be tracked.
    • Thus, #(colors) is at most the total number of nodes.
  • Counting colors takes linear-time w.r.t. #(nodes).
  • In total, time complexity is linear in #(edges)


  • Graphlet Kernel

    • Graph is represented as Bag-of-graphlets
    • Computationally expensive
  • Weisfeiler-Lehman Kernel

    • Apply 𝐾-step color refinement algorithm to enrich node colors
      • Different colors capture different 𝐾-hop neighborhood structures
    • Graph is represented as Bag-of-colors
    • Computationally efficient
    • Closely related to Graph Neural Networks (as we will see!)
  • Traditional ML Pipeline

    • Hand-crafted feature + ML model
  • Hand-crafted features for graph data

    • Node-level:
      • Node degree, centrality, clustering coefficient, graphlets
    • Link-level:
      • distance-based feature
      • local/global neighborhood overlap
    • Graph-level:
      • Graphlet kernel, WL kernel

Node Embeddings

Goal: Efficient task-independent feature learning for machine learning with graphs!

Why Embedding?

  • Task: Map nodes into an embedding space
    • Similarity of embeddings between nodes indicates their similarity in the network. For example:
      • Both nodes are close to each other (connected by an edge)
    • Encode network information
    • Potentially used for many downstream predictions

2D embedding of nodes of the Zachary’s Karate Club network:

Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014

Node Embeddings Encoder and Decoder


  • Assume we have a graph G:
    • V is the vertex set.
    • A is the adjacency matrix (assume binary).
    • For simplicity: No node features or extra information is used

Embedding Nodes

Goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the graph

Learning Node Embeddings

  1. Encoder maps from nodes to embeddings
  2. Define a node similarity function (i.e., a measure of similarity in the original network)
  3. Decoder 𝐃𝐄𝐂 maps from embeddings to the similarity score
  4. Optimize the parameters of the encoder so that:

Two Key Components


Simplest encoding approach: Encoder is just an embedding-lookup

  • Each node is assigned a unique embedding vector (i.e., we directly optimize the embedding of each node)
  • Many methods: DeepWalk, node2vec

Framework Summary

Encoder + Decoder Framework

How to Define Node Similarit

  • Key choice of methods is how they define node similarity.
  • Should two nodes have a similar embedding if they…
    • are linked?
    • share neighbors?
    • have similar “structural roles”?
  • We will now learn node similarity definition that uses random walks, and how to optimize embeddings for such a similarity measure.

Note on Node Embeddings

  • This is unsupervised/self-supervisedway of learning node embeddings.
    • We are not utilizing node labels
    • We are not utilizing node features
    • The goal is to directly estimate a set of coordinates (i.e., the embedding) of a node so that some aspect of the network structure (captured by DEC) is preserved.
  • These embeddings are task independent
    • They are not trained for a specific task but can be used for any task.

Random Walk Approaches for Node Embeddings


Random Walk

Random-walk Embeddings

Why Random Walks?

  1. Expressivity: Flexible stochastic(随机的) definition of node similarity that incorporates both local and higher-order neighborhood information Idea: if random walk starting from node 𝑢 visits 𝑣 with high probability, 𝑢 and 𝑣 are similar (high-order multi-hop information)
  2. Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks

Unsupervised Feature Learning

  • Intuition: Find embedding of nodes in 𝑑-dimensional space that preserves similarity
  • Idea: Learn node embedding such that nearby nodes are close together in the network
  • Given a node 𝑢, how do we define nearby nodes?

Feature Learning as Optimization

But doing this naively is too expensive!

Negative Sampling

Stochastic Gradient Descent

Stochastic Gradient Descent: Instead of evaluating gradients over all examples, evaluate it for each individual training example.

Random Walks:Summary

How should we randomly walk?

  • So far we have described how to optimize embeddings given a random walk strategy R -
  • What strategies should we use to run these random walks?
    • Simplest idea: Just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2013)
      • The issue is that such notion of similarity is too constrained
  • How can we generalize this?

Reference: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD

Overview of node2vec

Reference: Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD

  • Goal: Embed nodes with similar network neighborhoods close in the feature space.
  • We frame this goal as a maximum likelihood optimization problem, independent to the downstream prediction task.
  • Key observation: Flexible notion of network neighborhood 𝑁_𝑅(𝑢) of node 𝑢 leads to rich node embeddings
  • Develop biased 2^nd order random walk 𝑅 to generate network neighborhood 𝑁_𝑅(𝑢) of node 𝑢

node2vec: Biased Walks

Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016).


  • Biased fixed-length random walk 𝑹 that given a node 𝒖 generates neighborhood 𝑵_𝑹(𝒖)
    • Two parameters:
      • Return parameter 𝒑:
        • Return back to the previous node
    • In-out parameter 𝒒:
      • Moving outwards (DFS) vs. inwards (BFS)
      • Intuitively, 𝑞 is the “ratio” of BFS vs. DFS

Biased Random Walks

  • Biased 2nd -order random walks explore network neighborhoods:
    • Rnd. walk just traversed edge (𝑠1 , 𝑤) and is now at 𝑤
    • Insight: Neighbors of 𝑤 can only be:

Idea: Remember where the walk came from

  • Walker came over edge (𝐬𝟏, 𝐰) and is at 𝐰. Where to go next?
  • 𝑝, 𝑞 model transition probabilities
    • 𝑝 … return parameter
    • 𝑞 … ”walk away” parameter

  • Walker came over edge (𝐬𝟏, 𝐰) and is at 𝐰. Where to go next?

node2vec algorithm

1) Compute random walk probabilities

2) Simulate 𝑟 random walks of length 𝑙 starting from each node 𝑢

3) Optimize the node2vec objective using Stochastic Gradient Descent

Linear-time complexity

All 3 steps are individually parallelizable

Other Random Walk Ideas

Summary so far

  • Core idea: Embed nodes so that distances in embedding space reflect node similarities in the original network.

  • Different notions of node similarity:

    • Naïve: similar if two nodes are connected
    • Neighborhood overlap (covered in Lecture 2)
    • Random walk approaches (covered today)
  • So what method should I use..?

  • No one method wins in all cases….

  • Random walk approaches are generally more efficient.

  • In general: Must choose definition of node similarity that matches your application.

Embedding Entire Graphs

Embedding Entire Graphs

Approach l

  • Simple (but effective) approach 1:

    • Run a standard graph embedding technique on the (sub)graph 𝐺.
    • Then just sum (or average) the node embeddings in the (sub)graph 𝐺.

Used by Duvenaud et al., 2016 to classify molecules based on their graph structure

Approach 2

Approach 2: Introduce a “virtual node” to represent the (sub)graph and run a standard graph embedding technique

Proposed by Li et al., 2016 as a general technique for subgraph embedding

Approach 3: Anonymous Walk Embeddings

States in anonymous(匿名的) walks correspond to the index of the first time we visited the node in a random walk

Anonymous Walk Embeddings, ICML 2018

  • Agnostic to the identity of the nodes visited (hence anonymous)

  • Example: Random walk w1 :

Number of Walks Grows

Simple Use of Anonymous Walks

  • Simulate anonymous walks 𝑤𝑖 of 𝑙 steps and record their counts.

  • Represent the graph as a probability distribution over these walks.

  • For example:

    • Set 𝑙 = 3
    • Then we can represent the graph as a 5-dim vector
      • Since there are 5 anonymous walks 𝑤𝑖 of length 3: 111, 112, 121, 122, 123
    • 𝒛_𝑮[𝑖] = probability of anonymous walk 𝑤𝑖 in graph 𝐺.
  • Sampling anonymous walks: Generate independently a set of 𝑚 random walks.

  • Represent the graph as a probability distribution over these walks.

  • How many random walks 𝑚 do we need?

New idea: Learn Walk Embeddings

Rather than simply representing each walk by the fraction of times it occurs, we learn embedding 𝒛_𝒊 of anonymous walk 𝒘_𝒊 .

How to embed walks?

Idea: Embed walks s.t. the next walk can be predicted

Anonymous Walk Embeddings, ICML 2018


  • We discussed 3 ideas to graph embeddings:
  • Approach 1: Embed nodes and sum/avg them
  • Approach 2: Create super-node that spans the (sub) graph and then embed that node.
  • Approach 3: Anonymous Walk Embeddings
    • Idea 1: Sample the anon. walks and represent the graph as fraction of times each anon walk occurs.
    • Idea 2: Learn graph embedding together with anonymous walk embeddings.

Preview: Hierarchical Embeddings

  • We will discuss more advanced ways to obtain graph embeddings in Lecture 8.
  • We can hierarchically cluster nodes in graphs, and sum/avg the node embeddings according to these clusters.

How to Use Embeddings

  • Encoder-decoder framework:
    • Encoder: embedding lookup
    • Decoder: predict score based on embedding to match node similarity
  • Node similarity measure: (biased) random walk
    • Examples: DeepWalk, Node2Vec
  • Extension to Graph embedding: Node embedding aggregation and Anonymous Walk Embeddings

Graph as Matrix: Pagerank, Random Walks and Embeddings

Pagerank(aka the Google Algorithm)

Example: The Web as a Graph

  • Q: What does the Web “look like” at a global level?
  • Web as a graph:
    • Nodes = web pages
    • Edges = hyperlinks
    • Side issue: What is a node?
      • Dynamic pages created on the fly
      • “dark matter” – inaccessible database generated pages

What Does the Web Look Like?

  • How is the Web linked?
  • What is the “map” of the Web?
  • Web as a directed graph [Broder et al. 2000]:
    • Given node v, what nodes can v reach?
    • What other nodes can reach v?

Ranking Nodes on the Graph

  • We will cover the following Link Analysis approaches to compute the importance of nodes in a graph:
    • PageRank
    • Personalized PageRank (PPR)
    • Random Walk with Restarts
  • Idea: Links as votes
    • Page is more important if it has more links
      • In-coming links? Out-going links?
  • Think of in-links as votes:
  • Are all in-links equal?
    • Links from important pages count more
    • Recursive question!

Pagerank: The”Flow”Model

  • A “vote” from an important page is worth more:
    • Each link’s vote is proportional to the importance of its source page
    • If page i with importance ri has di out-links, each link gets ri / di votes
    • Page j’s own importance rj is the sum of the votes on its in-links

Pagerank: Matrix Formulation

Connection to Random Walk

  • Imagine a random web surfer:
    • At any time 𝒕, surfer is on some page 𝑖
    • At time 𝒕 + 𝟏, the surfer follows an out-link from 𝒊 uniformly at random
    • Ends up on some page 𝒋 linked from 𝒊
    • Process repeats indefinitely
  • Let:
    • 𝒑(𝒕) … vector whose 𝑖^th coordinate is the prob. that the surfer is at page 𝑖 at time 𝑡
    • So, 𝒑(𝒕) is a probability distribution over pages

The Stationary Distribution

Recall Eigenvector of A Matrix



  • Measures importance of nodes in a graph using the link structure of the web
  • Models a random web surfer using the stochastic adjacency matrix 𝑴
  • PageRank solves 𝒓 = 𝑴𝒓 where 𝒓 can be viewed as both the principle eigenvector of 𝑴 and as the stationary distribution of a random walk over the graph

Pagerank: How to solve?

Power Iteration Method

Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks

About 50 iterations is sufficient to estimate the limiting solution.

Three Ouestions

  • Does this converge?
  • Does it converge to what we want?
  • Are results reasonable?


Two problems:

(1) Some pages are dead ends (have no out-links)

​ Such pages cause importance to “leak out”

(2) Spider traps (all out-links are within the group)

​ Eventually spider traps absorb all importance

Does this converge

Does it converge to what we want?

Solution to Spider Traps

Solution to Dead Ends

  • Teleports: Follow random teleport links with total probability 1.0 from dead-ends
    • Adjust matrix accordingly

Why Teleports Solve the Problem?

  • Why are dead-ends and spider traps a problem and why do teleports solve the problem?
  • Spider-traps are not a problem, but with traps PageRank scores are not what we want
    • Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
  • Dead-ends are a problem
    • The matrix is not column stochastic so our initial assumptions are not met
    • Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go

Solution: Random Teleports

Random Teleports(beta=0.8)


  • PageRank solves for 𝒓 = 𝑮𝒓 and can be efficiently computed by power iteration of the stochastic adjacency matrix (𝑮)
  • Adding random uniform teleportation solves issues of dead-ends and spider-traps

Random Walk with Restarts and Personalized Pagerank

Example: Recommendation

Given: A bipartite graph representing user and item interactions (e.g. purchase)

Goal: Proximity on graphs

  • What items should we recommend to a user who interacts with item Q?
  • Intuition: if items Q and P are interacted by similar users, recommend P when user interacts with Q

Which is more related A,A’ or B,B’?

Node proximity Measurements

Which is more related A,A’, B,B’ or C,C’?

Proximity on Graphs


  • Ranks nodes by “importance”
  • Teleports(传送) with uniform probability to any node in the network

Personalized PageRank:

  • Ranks proximity of nodes to the teleport nodes 𝑺

Proximity(接近) on graphs:

  • Q: What is most related item to Item Q?
  • Random Walks with Restarts
    • Teleport back to the starting node: 𝑺 = {𝑸}

Idea: Random Walks

  • Idea
    • Every node has some importance
    • Importance gets evenly split among all edges and pushed to the neighbors:
  • Given a set of QUERY_NODES, we simulate a random walk:
    • Make a step to a random neighbor and record the visit (visit count)
    • With probability ALPHA, restart the walk at one of the QUERY_NODES
    • The nodes with the highest visit count have highest proximity to the QUERY_NODES


  • Every node has some importance
    • Importance gets evenly split among all edges and pushed to the neighbors
    • Given a set of QUERY NODES Q, simulate a random walk:

Random Walk Algorithm


Why is this a good solution?

Because the “similarity” considers:

  • Multiple connections
  • Multiple paths
  • Direct and indirect connections
  • Degree of the node

Summary: Page Rank Variants

  • PageRank:

    • Teleports to any node
    • Nodes can have the same probability of the surfer landing: 𝑺 = [0.1,0.1,0.1, 0.1,0.1,0.1, 0.1,0.1,0.1, 0.1]
  • Topic-Specific PageRank aka Personalized PageRank:

    • Teleports to a specific set of nodes
    • Nodes can have different probabilities of the surfer landing there: 𝑺 = [0.1,0, 0,0.2, 0, 0,0.5,0,0, 0.2]
  • Random Walk with Restarts:

    • Topic-Specific PageRank where teleport is always to the same node: 𝑺 = [0,0, 0,0, 𝟏, 0, 0, 0, 0, 0,0]
  • A graph is naturally represented as a matrix
  • We defined a random walk process over the graph
    • Random surfer moving across the links and with random teleportation
    • Stochastic adjacency matrix M
  • PageRank = Limiting distribution of the surfer location represented node importance
    • Corresponds to the leading eigenvector of transformed adjacency matrix M.

Matrix Factorization and Node Embeddings

Embeddings Matrix Factorization(因式分解)

Recall: encoder as an embedding lookup

Connection to Matrix Factorization

Random Walk-based Similarity

  • DeepWalk and node2vec have a more complex node similarity definition based on random walks
  • DeepWalk is equivalent to matrix factorization of the following complex matrix expression:

Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec, WSDM 18

Node2vec can also be formulated as a matrix factorization (albeit a more complex matrix)


Limitations of node embeddings via matrix factorization and random walks

  • Cannot obtain embeddings for nodes not in the training set

  • Node 1 and 11 are structurally similar – part of one triangle, degree 2, …
  • However, they have very different embeddings.
    • It’s unlikely that a random walk will reach node 11 from node 1.
  • DeepWalk and node2vec do not capture structural similarity.
  • Cannot utilize node, edge and graph features

Solution to these limitations: Deep Representation Learning and Graph Neural Networks


  • PageRank
    • Measures importance of nodes in graph
    • Can be efficiently computed by power iteration of adjacency matrix
  • Personalized PageRank (PPR)
    • Measures importance of nodes with respect to a particular node or set of nodes
    • Can be efficiently computed by random walk
  • Node embeddings based on random walks can be expressed as matrix factorization
  • Viewing graphs as matrices plays a key role in all above algorithms!

Message Passing and Node Classification

Example Node Classification

  • Given labels of some nodes
  • Let’s predict labels of unlabeled nodes
  • This is called semi-supervised node classification

Correlations(相关性) Exist in Networks

  • Behaviors of nodes are correlated across the links of the network

  • Correlation: Nearby nodes have the same color (belonging to the same class)

  • Two explanations for why behaviors of nodes in networks are correlated:

Social Homophily(趋同性)

  • Homophily: The tendency of individuals to associate and bond with similar others
    • “Birds of a feather flock together”
    • It has been observed in a vast array of network studies, based on a variety of attributes (e.g., age, gender, organizational role, etc.)
    • Example: Researchers who focus on the same research area are more likely to establish a connection (meeting at conferences, interacting in academic talks, etc.)

Example of homophily

  • Online social network
    • Nodes = people
    • Edges = friendship
    • Node color = interests (sports, arts, etc.)
  • People with the same interest are more closely connected due to homophily

Social Influence Example

  • Influence: Social connections can influence the individual characteristics of a person.
    • Example: I recommend my musical preferences to my friends, until one of them grows to like my same favorite genres!

How do we leverage node correlations in networks?

Classification with Network Data

How do we leverage(影响力) this correlation observed in networks to help predict node labels?

How do we predict the labels for the nodes in grey?


  • Similar nodes are typically close together or directly connected in the network:

    • Guilt-by-association: If I am connected to a node with label X, then I am likely to have label X as well.
    • Example: Malicious/benign web page: Malicious web pages link to one another to increase visibility, look credible, and rank higher in search engines
  • Classification label of a node v in network may depend on:

    • Features of v
    • Labels of the nodes in v’s neighborhood
    • Features of the nodes in v’s neighborhood

Semi-supervised Learning

  • Formal setting:

    • Given:
      • Graph
      • Few labeled nodes
    • Find: Class (red/green) of remaining nodes
    • Main assumption: There is homophily in the network
  • Example task:

Problem Setting

Example applications

  • Many applications under this setting:
    • Document classification
    • Part of speech tagging
    • Link prediction
    • Optical character recognition
    • Image/3D data segmentation
    • Entity resolution in sensor networks
    • Spam and fraud detection

We focus on semi-supervised binary node classification

We will introduce three approaches:

  • Relational classification
  • Iterative classification
  • Correct & Smooth

Relational Classification

Probabilistic Relational Classifier


  • Convergence is not guaranteed
  • Model cannot use node feature information

Example: Initialization


  • All labeled nodes with their labels
  • All unlabeled nodes 0.5 (belonging to class 1 with probability 0.5)

Example: lst Iteration, Update Node

Example: After 1st Iteration

After Iteration 1 (a round of updates for all unlabeled nodes)

Example: After 2nd Iteration

Example: After 3] Iteration


Iterative(迭代的) Classification

Relational classifier does not use node attributes.

How can one leverage them?

Computing the Summary Z_v

Architecture of Iterative Classifiers

Example: Web Page Classification

  • Input: Graph of web pages
  • Node: Web page
  • Edge: Hyper-link between web pages
  • Directed edge: a page points to another page
    • Node features: Webpage description
  • For simplicity, we only consider two binary features
    • Task: Predict the topic of the webpage

Baseline: Train a classifier (e.g., linear classifier) to classify pages based on node attributes

Iterative Classifier

step 1

step 2

step 3. 1

step 3. 2

Final Prediction


  • We talked about 2 approaches to collective classification
  • Relational classification
    • Iteratively update probabilities of node belonging to a label class based on its neighbors
  • Iterative classification
    • Improve over collective classification to handle attribute/feature information
    • Classify node v based on its features as well as labels of neighbors

Collective Classification Correct & Smooth

Correct & Smooth

Combining Label Propagation and Simple Models Out-performs Graph Neural Networks

OGB leaderboard snapshot at Oct 1st, 2021

Setting: A partially labeled graph and features over nodes.

C&S follows the three-step procedure:

  1. Train base predictor
  2. Use the base predictor to predict soft labels of all nodes.
  3. Post-process the predictions using graph structure to obtain the final predictions of all nodes.

Train Base Predictor

Train a base predictor that predict soft labels (class probabilities) over all nodes.

  • Labeled nodes are used for train/validation data.
  • Base predictor can be simple:
    • Linear model/Multi-Layer-Perceptron(MLP) over node features

Predict Over All Nodes

Given a trained base predictor, we apply it to obtain soft labels for all the nodes.

  • We expect these soft labels to be decently accurate.
  • Can we use graph structure to post-process the predictions to make them more accurate?

Post-process Predictions

C&S uses the 2-step procedure to post-process the soft predictions.

  1. Correct step
  2. Smooth step

C&S Post-processing: Correct Step

  • The key idea is that we expect errors in the base prediction to be positively correlated along edges in the graph.
  • In other words, an error at node u increases the chance of a similar error at neighbors of u.
  • Thus, we should “spread” such uncertainty over the graph

Intuition of Correct & Smooth

Correct Step

  • Compute training errors of nodes.
    • Training error: Ground-truth label minus soft label. Defined as 0 for unlabeled nodes.

Zhu et al. ICML 2013

Smooth Step

  • Smoothen the corrected soft labels along the edges.
    • Assumption: Neighboring nodes tend to share the same labels.
    • Note: For training nodes, we use the ground-truth hard labels instead of the soft labels.

Toy Example Summary

Our toy example shows that C&S successfully improves base model performance using graph structure.

C&S on a Real-world Dataset

C&S significantly improves the performance of the base model (MLP).

C&S outperforms Smooth-only (no correct step) baseline.


  • Correct & Smooth (C&S) uses graph structure to post-process the soft node labels predicted by any base model.
  • Correction step: Diffuse and correct for the training errors of the base predictor.
  • Smooth step: Smoothen the prediction of the base predictor.
  • C&S achieves strong performance on semisupervised node classification

We learned how to leverage correlation in graphs to make prediction on nodes.

Key techniques:

  • Relational classification
  • Iterative classification
  • Correct & Smooth

Graph Neural Networks

Basics of Deep Learning

Deep Learning for Graphs


  • Local network neighborhoods:
    • Describe aggregation strategies
    • Define computation graphs
  • Stacking multiple layers:
    • Describe the model, parameters, training
    • How to fit the model?
    • Simple example for unsupervised and supervised training


A Naive Approach

  • Join adjacency matrix and features
  • Feed them into a deep neural net:

Issues with this idea:

  • O(|V|) parameters
  • Not applicable to graphs of different sizes
  • Sensitive to node ordering

Idea: Convolutional Networks

Goal is to generalize convolutions beyond simple lattices(格子)

Leverage node features/attributes (e.g., text, images)

Real-world Graphs

But our graphs look like this:

There is no fixed notion of locality or sliding window on the graph

Graph is permutation invariant

Permutation Invariance(置换不变性)

  • Graph does not have a canonical order of the nodes!
  • We can have many different order plans.

Graph and node representations should be the same for Order plan 1 and Order plan 2

What does it mean by “graph representation is same for two order plans”?

Permutation Equivariance(置换等变)

For node representation

Graph Neural Network Overview

Graph neural networks consist of multiple permutation equivariant / invariant functions

Next: Design graph neural networks that are permutation invariant / equivariant by passing and aggregating information from neighbors

Graph Convolutional Networks

Idea: Node’s neighborhood defines a computation graph

Learn how to propagate information across the graph to compute node features

Idea: Aggregate Neiahbors

Key idea: Generate node embeddings based on local network neighborhoods

Intuition: Nodes aggregate information from their neighbors using neural networks

Deep Model: Many Layers

Model can be of arbitrary depth:

Nodes have embeddings at each layer

Layer-0 embedding of node v is its input feature, x_v

Layer-k embedding gets information from nodes that are k hops away

Neighborhood Aggregation(聚集)

Neighborhood aggregation: Key distinctions are in how different approaches aggregate information across the layers

Basic approach: Average information from neighbors and apply a neural network

Equivariant Property

The target node (blue) has the same computation graph for different order plans

Training the Model

Need to define a loss function on the embeddings.

Model Parameters

Matrix Formulation

Many aggregations can be performed efficiently by (sparse) matrix operations

Note: not all GNNs can be expressed in matrix form, when aggregation function is complex

How to Train A GNN

Unsupervised Training

Node similarity can be anything from Lecture 3, e.g., a loss based on:

  • Random walks (node2vec, DeepWalk, struc2vec)
  • Matrix factorization
  • Node proximity in the grap

Supervised Training

Directly train the model for a supervised task (e.g., node classification)

Directly train the model for a supervised task (e.g., node classification)

Model Design: Overview

Inductive Capability

The same aggregation parameters are shared for all nodes:

  • The number of model parameters is sublinear in |V| and we can generalize to unseen nodes!

Inductive node embedding => Generalize to entirely unseen graphs

E.g., train on protein interaction graph from model organism A and generate embeddings on newly collected data about organism B

Many application settings constantly encounter previously unseen nodes:

  • E.g., Reddit, YouTube, Google Scholar

Need to generate new embeddings “on the fly”

GNNs subsume CNNs and Transformers

Architecture Comparison

How does GNNs compare to prominent architectures such as Convolutional Neural Nets, and Transformers?


CNN can be seen as a special GNN with fixed neighbor size and ordering:

  • The size of the filter is pre-defined for a CNN.
  • The advantage of GNN is it processes arbitrary graphs with different degrees for each node

CNN is not permutation equivariant.

  • Switching the order of pixels will leads to different outputs


[Attention is all you need. Vaswani et al., NeurIPS 2017]

GNN vs, Transformer

A nice blog plot for this: https://towardsdatascience.com/transformers-are-graph-neural-networks-bca9f75412aa


  • Basics of neural networks
    • Loss, Optimization, Gradient, SGD, non-linearity, MLP
  • Idea for Deep Learning for Graphs
    • Multiple layers of embedding transformation
    • At every layer, use the embedding at previous layer as the input
    • Aggregation of neighbors and self-embeddings
  • Graph Convolutional Network
    • Mean aggregation; can be expressed in matrix form
  • GNN is a general architecture
    • CNN and Transformer can be viewed as a special GNN

Graph Neural Networks[2]

A General Perspective on Graph Neural Networks

A Genera GNN Framework

J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020


A Single Layer of a GNN

A Single GNN Layer

Message Computation



Putting things together

Classical GNN Layers: GCN

T. Kipf, M. Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017

Graph Convolutional Networks (GCN)

How to write this as Message + Aggregation?


Hamilton et al. Inductive Representation Learning on Large Graphs, NeurIPS 2017

L2 Normalization


Graph Attention Networks

Not all node’s neighbors are equally important

  • Attention is inspired by cognitive attention.

  • The attention a_uv focuses on the important parts of the input data and fades out the rest.

    • Idea: the NN should devote more computing power on that small but important part of the data.

    • Which part of the data is more important depends on the context and is learned through training.

Can we do better than simple neighborhood aggregation?

Attention Mechanism

What is the form of attention mechanism a?

  • The approach is agnostic to the choice of a
  • E.g., use a simple single-layer neural network
  • a have trainable parameters (weights in the Linear layer)

Multi-head attention: Stabilizes the learning process of attention mechanism

Benefits of Attention Mechanism

GNN Layers in Practice

Many modern deep learning modules can be incorporated into a GNN layer

  • Batch Normalization:
    • Stabilize neural network training
  • Dropout:
    • Prevent overfitting
  • Attention/Gating:
    • Control the importance of a message
  • More:
    • Any other useful deep learning modules

Batch Normalization

S. Loffe, C.Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015

  • Goal: Stabilize neural networks training
  • Idea: Given a batch of inputs (node embeddings)
    • Re-center the node embeddings into zero mean
    • Re-scale the variance into unit variance


  • Goal: Regularize a neural net to prevent overfitting.
  • Idea:
    • During training: with some probability p, randomly set neurons to zero (turn off)
    • During testing: Use all the neurons for computation



  • Modern deep learning modules can be included into a GNN layer for better performance
  • Designing novel GNN layers is still an active research frontier!
  • Suggested resources: You can explore diverse GNN designs or try out your own ideas in GraphGym

Stacking Layers of a GNN

  • How to construct a Graph Neural Network?
  • The standard way: Stack GNN layers sequentially

The Over-smoothing Problem

  • The Issue of stacking many GNN layers
    • GNN suffers from the over-smoothing problem
  • The over-smoothing problem: all the node embeddings converge to the same value
    • This is bad because we want to use node embeddings to differentiate nodes
  • Why does the over-smoothing problem happen?

Receptive Field of a GNN

  • Receptive field: the set of nodes that determine the embedding of a node of interest
    • In a k-layer GNN, each node has a receptive field of k-hop neighborhood

  • Receptive field overlap for two nodes
    • The shared neighbors quickly grows when we increase the number of hops (num of GNN layers)

  • We can explain over-smoothing via the notion of receptive field
    • We knew the embedding of a node is determined by its receptive field
      • If two nodes have highly-overlapped receptive fields, then their embeddings are highly similar
  • Stack many GNN layers -> nodes will have highlyoverlapped receptive fields -> node embeddings will be highly similar -> suffer from the over-smoothing problem
  • Next: how do we overcome over-smoothing problem?

Design GNN Layer Connectivity

  • What do we learn from the over-smoothing problem?
  • Lesson 1: Be cautious when adding GNN layers
    • Unlike neural networks in other domains (CNN for image classification), adding more GNN layers do not always help
    • Step 1: Analyze the necessary receptive field to solve your problem. E.g., by computing the diameter of the graph
    • Step 2: Set number of GNN layers L to be a bit more than the receptive field we like. Do not set L to be unnecessarily large!
  • Question: How to enhance the expressive power of a GNN, if the number of GNN layers is small?

Expressive Power for Shallow GNNS

  • How to make a shallow GNN more expressive?
  • Solution 1: Increase the expressive power within each GNN layer
    • In our previous examples, each transformation or aggregation function only include one linear layer
    • We can make aggregation / transformation become a deep neural network!

  • Solution 2: Add layers that do not pass messages
    • A GNN does not necessarily only contain GNN layers
      • E.g., we can add MLP layers (applied to each node) before and after GNN layers, as pre-process layers and post-process layers
    • Pre-processing layers: Important when encoding node features is necessary.
    • E.g., when nodes represent images/text
    • Post-processing layers: Important when reasoning / transformation over node embeddings are needed
    • E.g., graph classification, knowledge graphs
    • In practice, adding these layers works great

What if my problem still requires many GNN layers?

  • Lesson 2: Add skip connections in GNNs

    • Observation from over-smoothing: Node embeddings in earlier GNN layers can sometimes better differentiate nodes

    • Solution: We can increase the impact of earlier layers on the final node embeddings, by adding shortcuts in GNN

He et al. Deep Residual Learning for Image Recognition, CVPR 2015

Idea of Skip Connections

Why do skip connections work?

Veit et al. Residual Networks Behave Like Ensembles of Relatively Shallow Networks, ArXiv 2016

Example: GCN with Skip Connections

Other Options of Skip Connections

Xu et al. Representation learning on graphs with jumping knowledge networks, ICML 2018

GNN Augmentation(增强) and Training

Graph Augmentation for GNNS

Our assumption so far has been

  • Raw input graph = computational graph

Reasons for breaking this assumption

  • Features:

    • The input graph lacks features
  • Graph structure:

    • The graph is too sparse(稀疏的) -> inefficient message passing
    • The graph is too dense -> message passing is too costly
    • The graph is too large -> cannot fit the computational graph into a GPU
  • It’s unlikely that the input graph happens to be the optimal computation graph for embeddings

Graph Augmentation Approaches

  • Graph Feature augmentation
    • The input graph lacks features -> feature augmentation
  • Graph Structure augmentation
    • The graph is too sparse -> Add virtual nodes / edges
    • The graph is too dense -> Sample neighbors when doing message passing
    • The graph is too large -> Sample subgraphs to compute embeddings
  • Will cover later in lecture: Scaling up GNNs

Why do we need feature augmentation?

(1) Input graph does not have node features

This is common when we only have the adj. matrix ¡ Standard approaches:

​ a) Assign constant values to node

​ b) Assign unique IDs to nodes

​ These IDs are converted into one-hot vectors

Feature augmentation: constant vs. one-hot

J. You, J. Gomes-Selman, R. Ying, J. Leskovec. Identity-aware Graph Neural Networks, AAAI 2021

(2) Certain structures are hard to learn by GNN

​ Example: Cycle count feature:

​ Can GNN learn the length of a cycle that v1 resides in?

​ Unfortunately, no

​ v1 cannot differentiate which graph it resides in

  • Because all the nodes in the graph have degree of 2
  • The computational graphs will be the same binary tree


  • We can use cycle count as augmented node features

Other commonly used augmented features:

  • Node degree
  • Clustering coefficient
  • PageRank
  • Centrality

Add Virtual Nodes/ Edges

Motivation: Augment sparse graphs

(1) Add virtual edges

  • Common approach: Connect 2-hop neighbors via virtual edges
  • Intuition: Instead of using adj. matrix A for GNN computation, use A + A^2,

Use cases: Bipartite graphs

  • Author-to-papers (they authored)
  • 2-hop virtual edges make an author-author collaboration graph

(2) Add virtual nodes

  • The virtual node will connect to all the nodes in the graph

    • Suppose in a sparse graph, two nodes have shortest path distance of 10

    • After adding the virtual node, all the nodes will have a distance of two

      • Node A – Virtual node – Node B
  • Benefits: Greatly improves message passing in sparse graphs

Node Neighborhood Sampling

Hamilton et al. Inductive Representation Learning on Large Graphs, NeurIPS 2017


  • All the nodes are used for message passing

New idea: (Randomly) sample a node’s neighborhood for message passing

For example, we can randomly choose 2 neighbors to pass messages in a given layer

  • Only nodes B and D will pass messages to A

In the next layer when we compute the embeddings, we can sample different neighbors

  • Only nodes C and D will pass messages to A

In expectation, we get embeddings similar to the case where all the neighbors are used

  • Benefits: Greatly reduces computational cost
    • Allows for scaling to large graphs (more about this later)
  • And in practice it works great!

Prediction with GNNS

GNN Training Pipeline


(1) Different prediction heads:

  • Node-level tasks
  • Edge-level tasks
  • Graph-level tasks

GNN Prediction Heads

Idea: Different task levels require different prediction heads


Node-level prediction: We can directly make prediction using node embeddings!


Edge-level prediction: Make prediction using pairs of node embeddings

(1) Concatenation + Linear

(2) Dot product


Graph-level prediction: Make prediction using all the node embeddings in our graph

K. Xu, W. Hu, J. Leskovec, S. Jegelka. How Powerful Are Graph Neural Networks, ICLR 2019

Issue of Global Pooling

Issue: Global pooling over a (large) graph will lose information

Hierarchical Global Pooling

A solution: Let’s aggregate all the node embeddings hierarchically(等级的)

Ying et al. Hierarchical Graph Representation Learning with Differentiable Pooling, NeurIPS 2018

DiffPool idea:

  • Hierarchically pool node embedding

Leverage 2 independent GNNs at each level

  • GNN A: Compute node embeddings
  • GNN B: Compute the cluster that a node belongs to

GNNs A and B at each level can be executed in parallel

For each Pooling layer

  • Use clustering assignments from GNN B to aggregate node embeddings generated by GNN A
  • Create a single new node for each cluster, maintaining edges between clusters to generated a new pooled network

Jointly train GNN A and GNN B

Training Graph Neural Networks


(2) Where does ground-truth come from?

  • Supervised labels
  • Unsupervised signals

Supervised Vs Unsupervised

  • Supervised learning on graphs
    • Labels come from external sources
      • E.g., predict drug likeness of a molecular graph
  • Unsupervised learning on graphs
    • Signals come from graphs themselves
      • E.g., link prediction: predict if two nodes are connected
  • Sometimes the differences are blurry
    • We still have “supervision” in unsupervised learning
      • E.g., train a GNN to predict node clustering coefficient
    • An alternative name for “unsupervised” is “self-supervised”

Supervised Labels on Graphs

Supervised labels come from the specific use cases. For example:

Advice: Reduce your task to node / edge / graph labels, since they are easy to work with

  • E.g., we knew some nodes form a cluster. We can treat the cluster that a node belongs to as a node label

Unsupervised Signals on Graphs

  • The problem: sometimes we only have a graph, without any external labels
  • The solution: “self-supervised learning”, we can find supervision signals within the graph.
    • For example, we can let GNN predict the following:

These tasks do not require any external labels!


(3) How do we compute the final loss?

  • Classification loss
  • Regression loss

Settings for GNN Training

Classification or Rearession

Classification LOSS

As discussed in lecture 6, cross entropy (CE) is a very common loss function in classification

Regression Loss

For regression tasks we often use Mean Squared Error (MSE) a.k.a. L2 loss


(4) How do we measure the success of a GNN?

  • Accuracy

Evaluation Metrics: Rearession

  • We use standard evaluation metrics for GNN
    • (Content below can be found in any ML course)
    • In practice we will use sklearn for implementation
    • Suppose we make predictions for N data points
  • Evaluate regression tasks on graphs:

Evaluation Metrics: Classification

Evaluate classification tasks on graphs:

Metrics for Binary Classification

ROC Curve: Captures the tradeoff in TPR and FPR as the classification threshold is varied for a binary classifier.

ROC AUC: Area under the ROC Curve.

Intuition: The probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one

Setting-up GNN Prediction Tasks


(5) How do we split our dataset into train / validation / test set?

Dataset Split: Fixed/ Random Split

  • Fixed split: We will split our dataset once
    • Training set: used for optimizing GNN parameters
    • Validation set: develop model/hyperparameters
    • Test set: held out until we report final performance
  • A concern: sometimes we cannot guarantee that the test set will really be held out
  • Random split: we will randomly split our dataset into training / validation / test
    • We report average performance over different random seeds

Why Splitting Graphs is Special

Suppose we want to split an image dataset

  • Image classification: Each data point is an image
  • Here data points are independent
    • Image 5 will not affect our prediction on image 1

Splitting a graph dataset is different!

  • Node classification: Each data point is a node
  • Here data points are NOT independent
    • Node 5 will affect our prediction on node 1, because it will participate in message passing -> affect node 1’s embedding

What are our options?

Solution 1 (Transductive(传导的) setting): The input graph can be observed in all the dataset splits (training, validation and test set).

We will only split the (node) labels

  • At training time, we compute embeddings using the entire graph, and train using node 1&2’s labels
  • At validation time, we compute embeddings using the entire graph, and evaluate on node 3&4’s labels

Solution 2 (Inductive(归纳法的) setting): We break the edges between splits to get multiple graphs

  • Now we have 3 graphs that are independent. Node 5 will not affect our prediction on node 1 any more
  • At training time, we compute embeddings using the graph over node 1&2, and train using node 1&2’s labels
  • At validation time, we compute embeddings using the graph over node 3&4, and evaluate on node 3&4’s labels

Transductive setting: training / validation / test sets are on the same graph

  • The dataset consists of one graph
  • The entire graph can be observed in all dataset splits, we only split the labels
  • Only applicable to node / edge prediction tasks

Inductive setting: training / validation / test sets are on different graphs

  • The dataset consists of multiple graphs
  • Each split can only observe the graph(s) within the split. A successful model should generalize to unseen graphs
  • Applicable to node / edge / graph tasks

Graph Classification

Only the inductive setting is well defined for graph classification

  • Because we have to test on unseen graphs
  • Suppose we have a dataset of 5 graphs. Each split will contain independent graph(s)

Goal of link prediction: predict missing edges

Setting up link prediction is tricky:

  • Link prediction is an unsupervised / self-supervised task. We need to create the labels and dataset splits on our own
  • Concretely, we need to hide some edges from the GNN and the let the GNN predict if the edges exist

For link prediction, we will split edges twice

Step 1: Assign 2 types of edges in the original graph

  • Message edges: Used for GNN message passing

  • Supervision edges: Use for computing objectives

  • After step 1:

    • Only message edges will remain in the graph

    • Supervision edges are used as supervision for edge predictions made by the model, will not be fed into GNN!

Step 2: Split edges into train / validation / test

  • Option 1: Inductive link prediction split
    • Suppose we have a dataset of 3 graphs. Each inductive split will contain an independent grap
    • Suppose we have a dataset of 3 graphs. Each inductive split will contain an independent graph
    • In train or val or test set, each graph will have 2 types of edges: message edges + supervision edges § Supervision edges are not the input to GNN

  • Option 2: Transductive link prediction split:

    • This is the default setting when people talk about link prediction

    • Suppose we have a dataset of 1 graph

    • By definition of “transductive”, the entire graph can be observed in all dataset splits

      • But since edges are both part of graph structure and the supervision, we need to hold out validation / test edges
      • To train the training set, we further need to hold out supervision edges for the training set
    • Next: we will show the exact settings

Summary: Transductive link prediction split

  • Note: Link prediction settings are tricky and complex. You may find papers do link prediction differently.
  • Luckily, we have full support in PyG and GraphGym

GNN Training Pipeline

Implementation resources:

  • DeepSNAP provides core modules for this pipeline
  • GraphGym further implements the full pipeline to facilitate GNN design


  • GNN Layer:
    • Transformation + Aggregation
    • Classic GNN layers: GCN, GraphSAGE, GAT
  • Layer connectivity:
    • The over-smoothing problem
    • Solution: skip connections
  • Graph Augmentation:
    • Feature augmentation
    • Structure augmentation
  • Learning Objectives
    • The full training pipeline of a GNN

When Things Dont Go As Planned

General Tips

  • Data preprocessing is important:
    • Node attributes can vary a lot!
      • E.g. probability ranges (0,1), but some inputs could have much larger range, say (−1000, 1000)
    • Use normalization
  • Optimizer:
    • ADAM is relatively robust to learning rate
  • Activation function
    • ReLU activation function often works well
    • Other alternatives: LeakyReLU, SWISH, rational activation
    • No activation function at your output layer:
  • Include bias term in every layer
  • Embedding dimensions:
    • 32, 64 and 128 are often good starting points

Debugging Deep Networks

  • Debug issues: Loss/accuracy not converging during training
    • Check pipeline (e.g. in PyTorch we need zero_grad)
    • Adjust hyperparameters such as learning rate
    • Pay attention to weight parameter initialization
  • Important for model development:
    • Overfit on (part of) training data:
      • With a small training dataset, loss should be essentially close to 0, with an expressive neural network
      • If neural network cannot overfit a single data point, something is wrong
    • Scrutinize loss function!
    • Scrutinize visualizations!

Setting-up GNN Prediction Tasks

How Expressive are Graph Neural Networks?

Theory of GNNS

How powerful are GNNs?

  • Many GNN models have been proposed (e.g., GCN, GAT, GraphSAGE, design space).
  • What is the expressive power (ability to distinguish different graph structures) of these GNN models?
  • How to design a maximally expressive GNN model?

Background: Many GNN Models

Many GNN models have been proposed:

  • GCN, GraphSAGE, GAT, Design Space etc.

GNN Model Example

  • GCN (mean-pool) [Kipf and Welling ICLR 2017]

  • GraphSAGE (max-pool) [Hamilton et al. NeurIPS 2017]

Note: Node Colors

  • We use node same/different colors to represent nodes with same/different features.
    • For example, the graph below assumes all the nodes share the same feature.

  • Key question: How well can a GNN distinguish different graph structures?

Local Neighborhood Structures

We specifically consider local neighborhood structures around each node in a graph.

  • Example: Nodes 1 and 5 have different neighborhood structures because they have different node degrees.

We specifically consider local neighborhood structures around each node in a graph. 1 2 3 5 4

  • Example: Nodes 1 and 4 both have the same node degree of 2. However, they still have different neighborhood structures because their neighbors have different node degrees

Node 1 has neighbors of degrees 2 and 3. Node 4 has neighbors of degrees 1 and 3.

We specifically consider local neighborhood structures around each node in a graph. 1 2 3 5 4

  • Example: Nodes 1 and 2 have the same neighborhood structure because they are symmetric within the graph.

Node 1 has neighbors of degrees 2 and 3. Node 2 has neighbors of degrees 2 and 3. And even if we go a step deeper to 2nd hop neighbors, both nodes have the same degrees (Node 4 of degree 2)

  • Key question: Can GNN node embeddings distinguish different node’s local neighborhood structures?
    • If so, when? If not, when will a GNN fail?
  • Next: We need to understand how a GNN captures local neighborhood structures.
    • Key concept: Computational graph

Computational Graph

In each layer, a GNN aggregates neighboring node embeddings.

A GNN generates node embeddings through a computational graph defined by the neighborhood.

  • Ex: Node 1’s computational graph (2-layer GNN)

  • Ex: Nodes 1 and 2’s computational graphs.

  • Ex: Nodes 1 and 2’s computational graphs.
  • But GNN only sees node features (not IDs):

A GNN will generate the same embedding for nodes 1 and 2 because:

  • Computational graphs are the same.
  • Node features (colors) are identical.

Note: GNN does not care about node ids, it just aggregates features vectors of different nodes.

In general, different local neighborhoods define different computational graphs

Computational graphs are identical to rooted subtree structures around each node.

GNN‘s node embeddings capture rooted subtree structures.

Most expressive GNN maps different rooted subtrees into different node embeddings (represented by different colors).

Recall: Iniective Function

How Expressive is a GNN?

Most expressive GNN should map subtrees to the node embeddings injectively.

Key observation: Subtrees of the same depth can be recursively characterized from the leaf nodes to the root nodes.

If each step of GNN’s aggregation can fully retain the neighboring information, the generated node embeddings can distinguish different rooted subtrees.

In other words, most expressive GNN would use an injective neighbor aggregation function at each step.

Maps different neighbors to different embeddings.

Summary so far

  • To generate a node embedding, GNNs use a computational graph corresponding to a subtree rooted around each node.

  • GNN can fully distinguish different subtree structures if every step of its neighbor aggregation is injective.

Designing the Most Powerful Graph Neural Network

Expressive Power of GNNS

  • Key observation: Expressive power of GNNs can be characterized by that of neighbor aggregation functions they use.
    • A more expressive aggregation function leads to a more expressive a GNN.
    • Injective aggregation function leads to the most expressive GNN.
  • Next:
    • Theoretically analyze expressive power of aggregation functions.

Neighbor Aggregation

Observation: Neighbor aggregation can be abstracted as a function over a multi-set (a set with repeating elements).

Next: We analyze aggregation functions of two popular GNN models

  • GCN (mean-pool) [Kipf & Welling ICLR 2017]
    • Take element-wise mean, followed by linear function and ReLU activation, i.e., max(0, x).
    • Theorem [Xu et al. ICLR 2019]
      • GCN’s aggregation function cannot distinguish different multi-sets with the same color proportion.

For simplicity, we assume node colors are represented by one-hot encoding.

Example) If there are two distinct colors:

This assumption is sufficient to illustrate how GCN fails.

Failure case illustration

  • GraphSAGE (max-pool) [Hamilton et al. NeurIPS 2017]
    • Apply an MLP, then take element-wise max.
    • Theorem [Xu et al. ICLR 2019]
      • GraphSAGE’s aggregation function cannot distinguish different multi-sets with the same set of distinct colors.

Failure case illustration

Summary So Far

We analyzed the expressive power of GNNs.

Main takeaways:

  • Expressive power of GNNs can be characterized by that of the neighbor aggregation function.
  • Neighbor aggregation is a function over multi-sets (sets with repeating elements)
  • GCN and GraphSAGE’s aggregation functions fail to distinguish some basic multi-sets; hence not injective.
  • Therefore, GCN and GraphSAGE are not maximally powerful GNNs.

Designing Most Expressive GNNS

  • Our goal: Design maximally powerful GNNs in the class of message-passing GNNs.
  • This can be achieved by designing injective neighbor aggregation function over multi-sets.
  • Here, we design a neural network that can model injective multiset function.

Injective Multi-set Function

Universal Approximation Theorem

Injective Multi-set Function

We have arrived at a neural network that can model any injective multi-set function.

In practice, MLP hidden dimensionality of 100 to 500 is sufficient.

Most Expressive GNN

Graph Isomorphism Network (GIN) [Xu et al. ICLR 2019]

Apply an MLP, element-wise sum, followed by another MLP

  • Theorem [Xu et al. ICLR 2019]
    • GIN‘s neighbor aggregation function is injective.
  • No failure cases!
  • GIN is THE most expressive GNN in the class of message-passing GNNs!

Full Model of GIN

So far: We have described the neighbor aggregation part of GIN.

We now describe the full model of GIN by relating it to WL graph kernel (traditional way of obtaining graph-level features).

  • We will see how GIN is a “neural network” version of the WL graph kernel.

Relation to WL Graph Kernel

Recall: Color refinement algorithm in WL kernel.

Color Refinement

Example of color refinement given two graphs

Example of color refinement given two graphs

  • Process continues until a stable coloring is reached
  • Two graphs are considered isomorphic if they have the same set of colors.

The Complete GIN Mode

GIN uses a neural network to model the injective HASH function.

Specifically, we will model the injective function over the tuple:

Theorem (Xu et al. ICLR 2019)

Any injective function over the tuple

GIN and WL Graph Kernel

GIN can be understood as differentiable neural version of the WL graph Kernel:

Advantages of GIN over the WL graph kernel are:

  • Node embeddings are low-dimensional; hence, they can capture the fine-grained similarity of different nodes.
  • Parameters of the update function can be learned for the downstream tasks.

Expressive Power of GIN

  • Because of the relation between GIN and the WL graph kernel, their expressive is exactly the same.
    • If two graphs can be distinguished by GIN, they can be also distinguished by the WL kernel, and vice versa.
  • How powerful is this?
    • WL kernel has been both theoretically and empirically shown to distinguish most of the real-world graphs [Cai et al. 1992].
    • Hence, GIN is also powerful enough to distinguish most of the real graphs!

Summary of the Lecture

  • We design a neural network that can model injective multi-set function.
  • We use the neural network for neighbor aggregation function and arrive at GIN—the most expressive GNN model.
  • The key is to use element-wise sum pooling, instead of mean-/max-pooling.
  • GIN is closely related to the WL graph kernel.
  • Both GIN and WL graph kernel can distinguish most of the real graphs!

The Power of Pooling

Can expressive power of GNNs be improved?

There are basic graph structures that existing GNN framework cannot distinguish, such as difference in cycles.

GNNs’ expressive power can be improved to resolve the above problem. [You et al. AAAI 2021, Li et al. NeurIPS 2020]


GNNs and connection to bijective functions on sets.

Most powerful GNN is equivalent to WL graph isomorphism test.

GIN is the most powerful GNN.

  • Sum aggregator is more powerful than mean is more powerful than max.

Heterogeneous Graphs and Knowledge Graph Embeddings

Heterogeneous(异构) Graphs and Relational GCN (RGCN)

Heterogeneous(异构) Graphs

A heterogeneous graph is defined as

Relational GCN

Schlichtkrull et al., Modeling Relational Data with Graph Convolutional Networks, ESWC 2018

We will extend GCN to handle heterogeneous graphs with multiple edge/relation types

We start with a directed graph with one relation

  • How do we run GCN and update the representation of the target node A on this graph?

What if the graph has multiple relation types?

What if the graph has multiple relation types?

Use different neural network weights for different relation types


RGCN: Scalability

Block Diagonal Matrices

Key insight: make the weights sparse

Key insight: Share weights across different relations!

Example: Entity/Node Classification

Goal: Predict the label of a given node

Summary of RGCN

  • Relational GCN, a graph neural network for heterogeneous graphs
  • Can perform entity classification as well as link prediction tasks.
  • Ideas can easily be extended into RGNN (RGraphSAGE, RGAT, etc.)

Knowledge Graphs KG Completion with Embeddings

Knowledge Graphs( KG)

Knowledge in graph form:

Capture entities, types, and relationships

  • Nodes are entities
  • Nodes are labeled with their types
  • Edges between two nodes capture relationships between entities
  • KG is an example of a heterogeneous graph

Example: Bibliographic Networks

  • Node types: paper, title, author, conference, year
  • Relation types: pubWhere, pubYear, hasTitle, hasAuthor, cite

Example: Bio Knowledge Graphs

  • Node types: drug, disease, adverse event, protein, pathways
  • Relation types: has_func, causes, assoc, treats, is_a

Knowledge Graphs in Practice

  • Examples of knowledge graphs
  • Google Knowledge Graph
  • Amazon Product Graph
  • Facebook Graph API
  • IBM Watson
  • Microsoft Satori
  • Project Hanover/Literome
  • LinkedIn Knowledge Graph
  • Yandex Object Answer

Applications of Knowledge Graphs

  • Serving information
  • Question answering and conversation agents

Knowledge Graph Datasets

  • Publicly available KGs:
    • FreeBase, Wikidata, Dbpedia, YAGO, NELL, etc.
  • Common characteristics:
    • Massive: millions of nodes and edges
    • Incomplete: many true edges are missing

Example: Freebase

Knowledge Graph Completion TransE, TransR, DistMul, ComplEx

KG Completion Task

  • Given an enormous KG, can we complete the KG?
  • For a given (head, relation), we predict missing tails.
    • (Note this is slightly different from link prediction task)

KG Representation


Bordes, et al., Translating embeddings for modeling multi-relational data, NeurIPS 2013

Connectivity Patterns in KG

Relations in a heterogeneous KG have different properties

Can we categorize these relation patterns?

Are KG embedding methods (e.g., TransE) expressive enough to model these patterns?

Relation Patterns

Antisymmetric Relations in Transe

Inverse Relations in TransE

Composition in Transe

Limitation Symmetric Relations

Limitation: 1-to-n Relations

Knowledge Graph Completion : TransR


Lin, et al., Learning entity and relation embeddings for knowledge graph completion,AAAI 2015

TransE models translation of any relation in the same embedding space.

Can we design a new space for each relation and do translation in relation-specific space?

Symmetric Relations in TransR

1-to-n Relations in TransR

Inverse Relations in TransR

Composition Relations in TransR

High-level intuition: TransR models a triple with linear functions, they are chainable

Knowledge Graph Completion DistMult

New Idea: Bilinear Modeling

Yang et al, Embedding Entities and Relations for Learning and Inference in Knowledge Bases, ICLR 2015


1-to-n Relations in DistMult

Symmetric Relations in DistMult

Limitation: Antisymmetric Relations

Limitation: Inverse Relations

Limitation: Composition Relations

Knowledge Graph Completion ComplEx


Trouillon et al, Complex Embeddings for Simple Link Prediction, ICML 2016

Based on Distmult, ComplEx embeds entities and relations in Complex vector space

Antisymmetric Relations in Complex

Symmetric Relations in Complex

Inverse Relations in ComplEx

Composition and 1-to-n

Expressiveness of All Models

Properties and expressive power of different KG completion methods:

KG Embeddings in Practice

Sun et al, RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space, ICLR 2019

  • Different KGs may have drastically different relation patterns!
  • There is not a general embedding that works for all KGs, use the table to select models
  • Try TransE for a quick run if the target KG does not have much symmetric relations
  • Then use more expressive models, e.g., ComplEx, RotatE (TransE in Complex space)


  • Link prediction / Graph completion is one of the prominent tasks on knowledge graphs
  • Introduce TransE / TransR / DistMult / ComplEx models with different embedding space and expressiveness

Reasoning in Knowledge Graphs using Embeddings

Example KG: Biomedicine

Predictive Queries on KG

Can we do multi-hop reasoning, i.e., answer complex queries on an incomplete, massive KG?

Predictive One-hop Queries

We can formulate knowledge graph completion problems as answering one-hop queries.

Path Queries

Generalize one-hop queries to path queries by adding more relations on the path.

Query plan of path queries is a chain.

Question: “What proteins are associated with adverse events caused by Fulvestrant?”

Query: (e:Fulvestrant, (r:Causes, r:Assoc)) Given a KG, how to answer a path query?

Traversing Knowledge Graphs

  • We answer path queries by traversing the KG: “What proteins are associated with adverse events caused by Fulvestrant?”
  • Query: (e:Fulvestrant, (r:Causes, r:Assoc))

However, KGS are incomplete

  • Answering queries seems easy: Just traverse the graph.
  • But KGs are incomplete and unknown:
    • Many relations between entities are missing or are incomplete
      • For example, we lack all the biomedical knowledge
      • Enumerating all the facts takes non-trivial time and cost, we cannot hope that KGs will ever be fully complete
  • Due to KG incompleteness, one is not able to identify all the answer entities

Can we first do KG completion and then traverse the completed (probabilistic) KG?

Task: Predictive Oueries

We need a way to answer path-based queries over an incomplete knowledge graph.

We want our approach to implicitly impute and account for the incomplete KG.

Task: Predictive queries

  • Want to be able to answer arbitrary queries while implicitly imputing for the missing information
  • Generalization of the link prediction task

Answering Predictive Queries on Knowledge Graphs

General Idea

Map queries into embedding space. Learn to reason in that space

  • Embed query into a single point in the Euclidean space: answer nodes are close to the query.
  • Query2Box: Embed query into a hyper-rectangle (box) in the Euclidean space: answer nodes are enclosed in the box.

[Embedding Logical Queries on Knowledge Graphs. Hamilton, et al., NeurIPS 2018]

[Query2box: Reasoning over Knowledge Graphs in Vector Space Using Box Embeddings. Ren, et al., ICLR 2020]

Idea: Traversing KG in Vector Space

Guu, et al., Traversing knowledge graphs in vector space, EMNLP 2015

Key idea: Embed queries!

Traversing KG in Vector Space

The embedding process only involves vector addition, independent of # entities in the KG!

Embed path queries in vector space.

  • Question: “What proteins are associated with adverse events caused by Fulvestrant?”
  • Query: (e:Fulvestrant, (r:Causes , r:Assoc))


  • We can train TransE to optimize knowledge graph completion objective (Lecture 10)
  • Since TransE can naturally handle compositional relations, it can handle path queries by translating in the latent space for multiple hops using addition of relation embeddings.
  • For TransR / DistMult / ComplEx, since they cannot handle compositional relations, they cannot be easily extended to handle path queries.

Conjunctive Queries

Can we answer more complex queries with logic conjunction operation?

  • Conjunctive Queries: “What are drugs that cause Short of Breath and treat diseases associated with protein ESR2?” ((e:ESR2, (r:Assoc, r:TreatedBy)), (e:Short of Breath, (r:CausedBy))

How do we answer the question by KG traversal?

Traverse KG from anchor nodes: ESR2 and Short of Breath:

  • How can we use embeddings to implicitly impute the missing (ESR2, Assoc, Breast Cancer)?
  • Intuition: ESR2 interacts with both BRCA1 and ESR1. Both proteins are associated with breast cancer

Traversing KG in Vector Space

“What are drugs that cause Short of Breath and treat diseases associated with protein ESR2?”

((e:ESR2, (r:Assoc, r:TreatedBy)), (e:Short of Breath, (r:CausedBy))

Each intermediate node represents a set of entities, how do we represent it? How do we define the intersection operation in the latent space?

Qvery2box: Reasoning over KGS Using Box Embeddings

Conjunctive Queries

How can we answer more complex queries with logical conjunction operation?

(1) Each intermediate node represents a set of entities; how do we represent it?

(2) How do we define the intersection operation in the latent space?

Box Embeddings

Ren et al., Query2box: Reasoning over Knowledge Graphs in Vector Space Using Box Embeddings, ICLR 2020.

Embed queries with hyper-rectangles (boxes)

Key Insight: Intersection(交点)

  • Intersection of boxes is well-defined!
  • When we traverse the KG to find the answers, each step produces a set of reachable entities.
  • How can we better model these sets?
    • Boxes are a powerful abstraction, as we can project the center and control the offset to model the set of entities enclosed in the box

Embed with Box Embedding

Projection(预测) Operator

Embed with Box Embedding

How do we take intersection of boxes?

Intersection Operator

Use box intersection operator

Entity-to-box Distance

Extending to Union Operation

  • Can we embed complex queries with union? E.g.: “What drug can treat breast cancer or lung cancer?”
  • Conjunctive queries + disjunction is called Existential Positive First-order (EPFO) queries. We’ll refer to them as AND-OR queries.
  • Can we also design a disjunction operator and embed AND-OR queries in low-dimensional vector space?

Embedding AND-OR Queries

  • Can we embed AND-OR queries in a low-dimensional vector space?
  • No! Intuition: Allowing union over arbitrary queries requires high-dimensional embeddings!
  • Example:

  • Example 2:

Can we embed AND-OR queries in low-dimensional vector space?

Since we cannot embed AND-OR queries in lowdimensional space, can we still handle them?

  • Key idea: take all unions out and only do union at the last step!

Disjunctive Normal Form

Any AND-OR query can be transformed into equivalent DNF, i.e., disjunction of conjunctive queries.

Distance Between q and an Entity

How to Train Query2box

Training Overview

How to achieve a query, its answers, its negative answers from the KG to train the parameters?

How to split the KG for query answering?


Query Generation from Templates

Generate queries from multiple query templates:

  • How can we generate a complex query?
  • We start with a query template
  • Query template can be viewed as an abstraction of the query
  • We generate a query by instantiating every variable with a concrete entity and relation from the KG
    • E.g., instantiate Anchor1 with ESR2 (a node on KG)
    • E.g., instantiate Rel1 with Assoc (an edge on KG)
  • How to instantiate query template given a KG?

How to instantiate a query template given a KG?

Example of Query2box


  • What do box embeddings actually learn? Example: “List male instrumentalists who play string instruments”
  • We use t-SNE to reduce the embedding space to a 2-dimensional space, in order to visualize the query results

Embedding Space


We introduce answering predictive queries on large knowledge graphs.

The key idea is to embed queries by navigating the embedding space!

We embed the query by composing learned operators

Embedding of the query is close to its answers in the embedding space

Fast Neural Subgraph Matching and Counting


Subgraphs are the building blocks of networks:

They have the power to characterize and discriminate networks

Building Blocks of Networks

Subgraphs and Motifs


Two ways to formalize “network building blocks”

Alternate terminology: “induced subgraph”

Alternate terminology: “non-induced subgraph” or just “subgraph”

The best definition depends on the domain! Examples:

  • Chemistry: Node-induced (functional groups)
  • Knowledge graphs: Often edge-induced (focus is on edges representing logical relations)

Graph Isomorphism(同构关系)

Graph isomorphism problem: Check whether two graphs are identical:

Case Example of Subgraphs

All non-isomorphic, connected, undirected graphs of size 4

All non-isomorphic, connected, directed graphs of size 3

Network Motifs

Network motifs: “recurring, significant patterns of interconnections”

How to define a network motif:

  • Pattern: Small (node-induced) subgraph
  • Recurring: Found many times, i.e., with high frequency
  • Significant: More frequent than expected, i.e., in randomly generated graphs?
  • How to define frequency? How to define random graphs?

Motifs: Induced Subaraphs

Why Do We Need Motifs?

Subgraph Frequency

What if the dataset contains multiple graphs, and we want to compute frequency of subgraphs in the dataset?

Solution: Treat the dataset as a giant graph G_T with disconnected components corresponding to individual graphs

Defining Motif Significance

To define significance, we need to have a null-model (i.e., point of comparison).

Key idea: Subgraphs that occur in a real network much more often than in a random network have functional significance.

Defining Random Graphs

New Model: Configuration Mode

Alternative for Spokes: Switching

Intuition: Motifs are overrepresented in a network when compared to random graphs:

  • Step 1: Count motifs in the given graph (G_real)
  • Step 2: Generate random graphs with similar statistics (e.g. number of nodes, edges, degree sequence), and count motifs in the random graphs
  • Step 3: Use statistical measures to evaluate how significant is each motif
    • Use Z-score

Z-score for Statistical Sianificance

  • For each subgraph:

    • z-score metric is capable of classifying the subgraph “significance”:

      • Negative values indicate under-representation

      • Positive values indicate over-representation

  • We create a network significance profile:

    • A feature vector with values for all subgraph types
  • Next: Compare profiles of different graphs with random graphs:

    • Regulatory network (gene regulation)
    • Neuronal network (synaptic connections)
    • World Wide Web (hyperlinks between pages)
    • Social network (friendships)
    • Language networks (word adjacency)

Summary: Detecting Motifs(主题)

Variations on the Motif Concept

  • Extensions:

    • Directed and undirected
    • Colored and uncolored
    • Temporal and static motifs
  • Variations on the concept:

    • Different frequency concepts
    • Different significance metrics
    • Under-Representation (anti-motifs)
    • Different null models
  • Subgraphs and motifs are the building blocks of graphs

    • Subgraph isomorphism and counting are NP-hard
  • Understanding which motifs are frequent or significant in a dataset gives insight into the unique characteristics of that domain

  • Use random graphs as null model to evaluate the significance of motif via Z-score

Neural Subgraph Matching


  • Large target graph (can be disconnected)
  • Query graph (connected)


  • Is a query graph a subgraph in the target graph?

Isomorphism as an ML Task

  • Large target graph (can be disconnected)
  • Query graph (has to be connected)
  • Use GNN to predict subgraph isomorphism:
  • Intuition: Exploit the geometric shape of embedding space to capture the properties of subgraph isomorphism

Task Setup

Consider a binary prediction: Return True if query is isomorphic to a subgraph of the target graph, else return False

Finding node correspondences between Q and T is another challenging problem, which will not be covered in this lecture.

Overview of the Approach

Neural Architecture for Subgraphs

(1) We are going to work with node-anchored definitions:

(2) We are going to work with node-anchored neighborhoods:

Why Anchor?

Order Embedding Space

Why Order Embedding Space?

Subgraph isomorphism relationship can be nicely encoded in order embedding space

All properties have their counter-parts in the order embedding space

Order Constraint

  • We use a GNN to learn to embed neighborhoods and preserve the order embedding structure
  • What loss function should we use, so that the learned order embedding reflects the subgraph relationship?
  • We design loss functions based on the order constraint:
    • Order constraint specifies the ideal order embedding property that reflects subgraph relationships

We specify the order constraint to ensure that the subgraph properties are preserved in the order embedding space

Loss Function

GNN Embeddings are learned by minimizing a max-margin loss

Training Neural Subgraph Matching

Training Details

  • How many training examples to sample?
    • At every iteration, we sample new training pairs
    • Benefit: Every iteration, the model sees different subgraph examples
    • Improves performance and avoids overfitting – since there are exponential number of possible subgraphs to sample from
  • How deep is the BFS sampling?
    • A hyper-parameter that trades off runtime and performance
    • Usually use 3-5, depending on size of the dataset

Subgraph Predictions on New Graphs

Summary: Neural Subgraph Matching

  • Neural subgraph matching uses a machine learning-based approach to learn the NP-hard problem of subgraph isomorphism
    • Given query and target graph, it embeds both graphs into an order embedding space
    • Using these embeddings, it then computes E(G_q, G_t) to determine whether query is a subgraph of the target
  • Embedding graphs within an order embedding space allows subgraph isomorphism to be efficiently represented and tested by the relative positions of graph embeddings

Finding Frequent Subgraphs


Generally, finding the most frequent size-k motifs requires solving two challenges:

1) Enumerating all size-k connected subgraphs
2) Counting #(occurrences of each subgraph type)

Why is it Hard?

  • Just knowing if a certain subgraph exists in a graph, is a hard computational problem!
    • Subgraph isomorphism is NP-complete
  • Computation time grows exponentially as the size of the subgraphs increases
    • Feasible motif size for traditional methods is relatively small (3 to 7)

Solution with Representation Learning

  • Finding frequent subgraph patterns is computationally hard
    • Combinatorial explosion of number of possible patterns
    • Counting subgraph frequency is NP-hard
  • Representation learning can tackle these challenges:
    • Combinatorial explosion -> organize the search space
    • Subgraph isomorphism -> prediction using GNN

Representation learning can tackle these challenges:

  • Counting #(occurrences of each subgraph type)
    • Solution: Use GNN to “predict” the frequency of the subgraph.
  • Enumerating all size-k connected subgraphs
    • Solution: Don’t enumerate subgraphs but construct a size-k subgraph incrementally
      • Note: We are only interested in high frequency subgraphs

Problem Setup: Frequent Motif Mining

SpMiner: Overview

SPMiner: A neural model to identify frequent motifs

Key Idea

Motif Frequency Estimation

Spminer Search Procedure

Termination: Upon reaching a desired motif size, take the subgraph of the target graph induced by S.

Results: Small Motifs

  • Ground-truth: Find most frequent 10 motifs in dataset by brute-force exact enumeration (expensive)
  • Question: Can the model identify frequent motifs?
  • Result: The model identifies 9 and 8 of the top 10 motifs, respectively.

Experiments: Large motifs

  • Question: How do the frequencies of the identified motif compare?
  • Result: SPMiner identifies motifs that appear 10-100x more frequently than the baselines


  • Subgraphs and motifs are important concepts that provide insights into the structure of graphs. Their frequency can be used as features for nodes/graphs.
  • We covered neural approaches to prediction subgraph isomorphism relationship.
  • Order embeddings have desirable properties and can be used to encode subgraph relations
  • Neural embedding-guided search in order embedding space can enable ML model to identify motifs much more frequent than existing methods

GNNS for Recommender Systems

Recommender Systems Task and Evaluation

Preliminary of Recommendation

Information Explosion in the era of Internet

  • 10K+ movies in Netflix
  • 12M products in Amazon
  • 70M+ music tracks in Spotify
  • 10B+ videos on YouTube
  • 200B+ pins (images) in Pinterest

Personalized recommendation (i.e., suggesting a small number of interesting items for each user) is critical for users to effectively explore the content of their interest.

Recommender System as a Graph

Recommender system can be naturally modeled as a bipartite graph

  • A graph with two node types: users and items.
  • Edges connect users and items
    • Indicates user-item interaction (e.g., click, purchase, review etc.)
    • Often associated with timestamp (timing of the interaction).

Recommendation Task

  • Given
    • Past user-item interactions
  • Task
    • Predict new items each user will interact in the future.
    • Can be cast as link prediction problem.
      • Predict new user-item interaction edges given the past edges.

Top-k Recommendation

Evaluation Metric: Recall@k(1)

Recommender Systems Embedding-based Models


Score Function

Training Objective

Embedding-based models have three kinds of parameters:

Training objective: Optimize the model parameters to achieve high recall@K on seen (i.e., training) user-item interactions

  • We hope this objective would lead to high recall@K on unseen (i.e., test) interactions.

Surrogate(代理) Loss Functions

  • The vanilla training objective (training recall@K) is not differentiable.
    • Cannot apply efficient gradient-based optimization.
  • Two surrogate loss functions are widely-used to enable efficient gradient-based optimization.
    • Binary loss
    • Bayesian Personalized Ranking (BPR) loss
  • Surrogate losses are differentiable and should align well with the original training objective.

Binary Loss

Binary loss pushes the scores of positive edges higher than those of negative edges.

  • This aligns with the training recall metric since positive edges need to be recalled.


  • In the binary loss, the scores of ALL positive edges are pushed higher than those of ALL negative edges.
  • This would unnecessarily penalize model predictions even if the training recall metric is perfect.
  • Why? (example in the next slide)

  • Key insight: The binary loss is non-personalized in the sense that the positive/negative edges are considered across ALL users at once.
  • However, the recall metric is inherently personalized (defined for each user).
    • The non-personalized binary loss is overly-stringent for the personalized recall metric.

Desirable Surrogate Loss

Lesson learned: Surrogate loss function should be defined in a personalized manner.

  • For each user, we want the scores of positive items to be higher than those of the negative items

  • We do not care about the score ordering across users.

Bayesian Personalized Ranking (BPR) loss achieves this!

Bayesian Personalized Ranking (BPR) loss is a personalized surrogate loss that aligns better with the recall@K metric.


  • We have introduced
    • Recall@K as a metric for personalized recommendation
    • Embedding-based models
      • Three kinds of parameters to learn
        • user encoder to generate user embeddings
        • item encoder to generate item embeddings
        • score function to predict the user-item interaction likelihood.
    • Surrogate loss functions to achieve the high recall metric.
  • Embedding-based models have achieved SoTA in recommender systems.
    • Why do they work so well?

Why Embedding Models Work?

  • Underlying idea:
  • Collaborative filtering
    • Recommend items for a user by collecting preferences of many other similar users.
    • Similar users tend to prefer similar items.
  • Key question: How to capture similarity between users/items?

Embedding-based models can capture similarity of users/items!

  • Low-dimensional embeddings cannot simply memorize all user-item interaction data.
  • Embeddings are forced to capture similarity between users/items to fit the data.
  • This allows the models to make effective prediction on unseen user-item interactions.

This Lecture: GNNS for Recsys

Neural Graph Collaborative Filtering (NGCF) [Wang et al. 2019], LightGCN [He et al. 2020]

  • Improve the conventional collaborative filtering models (i.e., matrix factorization) by explicitly modeling graph structure using GNNs.

PinSAGE [Ying et al. 2018]

  • Use GNNs to generate high-quality embeddings by simultaneously capturing rich node attributes (e.g., images) and the graph structure.

Neural Graph Collaborative Filtering

Conventional Collaborative Filtering

Limitations of MF

The model itself does not explicitly capture graph structure

  • The graph structure is only implicitly captured in the training objective.

Only the first-order graph structure (i.e., edges) is captured in the training objective.

  • High-order graph structure (e.g., k-hop paths between two nodes) is not explicitly captured.


  • We want a model that…
    • explicitly captures graph structure (beyond implicitly through the training objective)
    • captures high-order graph structure (beyond the first-order edge connectivity structure)
  • GNNs are a natural approach to achieve both!
    • Neural Graph Collaborative Filtering (NGCF) [Wang et al. 2019]
    • LightGCN [He et al. 2020]
      • A simplified and improved version of NGCF

NGCF: Overview

Neural Graph Collaborative Filtering (NGCF) explicitly incorporates high-order graph structure when generating user/item embeddings.

Key idea: Use a GNN to generate graph-aware user/item embeddings.


  • Given: User-item bipartite graph.
  • NGCF framework:
    • Prepare shallow learnable embedding for each node.
    • Use multi-layer GNNs to propagate embeddings along the bipartite graph.
      • High-order graph structure is captured.
    • Final embeddings are explicitly graph-aware!
  • Two kinds of learnable params are jointly learned:
    • Shallow user/item embeddings
    • GNN’s parameters

Initial Node Embeddings

Neighbor Aggregation

Final Embeddings and Score Function

NGCF: Summary

  • Conventional collaborative filtering uses shallow user/item embeddings.
    • The embeddings do not explicitly model graph structure.
    • The training objective does not model high-order graph structure.
  • NGCF uses a GNN to propagate the shallow embeddings.
    • The embeddings are explicitly aware of high-order graph structure.



  • Recall: NGCF jointly learns two kinds of parameters:
    • Shallow user/item embeddings
    • GNN’s parameters
  • Observation: Shallow learnable embeddings are already quite expressive.
  • Can we simplify the GNN used in NGCF (e.g., remove its learnable parameters)?
    • Answer: Yes!
    • Bonus: Simplification improves the recommendation performance!
  • Overview of the idea:
    • Adjacency matrix for a bipartite graph
    • Matrix formulation of GCN
    • Simplification of GCN by removing non-linearity
      • Related: SGC for scalable GNN [Wu et al. 2019]

Adjacency and Embedding Matrices

Adjacency matrix of a (undirected) bipartite graph.

Shallow embedding matrix

Matrix Formulation of GCN

Simplifying GCN

Multi-scale Diffusion

LightGCN: Model Overview


  • Adjacency matrix A
  • Initial learnable embedding matrix E

Average the embedding matrices at different scales

Score function:

LightGCN: Intuition

Question: Why does the simple diffusion propagation work well?

Answer: The diffusion directly encourages the embeddings of similar users/items to be similar.

  • Similar users share many common neighbors (items) and are expected to have similar future preferences (interact with similar items)

Lightgcn and GCN/C&S

LIGHTGCN and MF: Comparison

  • Both LightGCN and Matrix Factorization (MF) learn a unique embedding for each user/item.
  • The difference is that
    • MF directly uses the shallow user/item embeddings for scoring.
    • LightGCN uses the diffused user/item embeddings for scoring.
  • LightGCN performs better than MF but are also more computationally expensive due to the additional diffusion step.
    • The final embedding of a user/item is obtained by aggregating embeddings of its multi-hop neighboring nodes.


  • LightGCN simplifies NGCF by removing the learnable parameters of GNNs.
  • Learnable parameters are all in the shallow input node embeddings.
    • Diffusion propagation only involves matrix-vector multiplication.
    • The simplification leads to better empirical performance than NGCF



P2P recommendation

PINSAGE: Pin Embedding

  • Unifies visual, textual, and graph information.
  • The largest industry deployment of a Graph Convolutional Networks
  • Huge Adoption across Pinterest
  • Works for fresh content and is available in a few seconds after pin creation

Graph Convolutional Neural Networks for Web-Scale Recommender Systems

PinSage graph convolutional network:

  • Goal: Generate embeddings for nodes in a large-scale Pinterest graph containing billions of objects

  • Key Idea: Borrow information from nearby nodes

    • E.g., bed rail Pin might look like a garden fence, but gates and beds are rarely adjacent in the graph
  • Pin embeddings are essential to various tasks like recommendation of Pins, classification, ranking

  • Services like “Related Pins”, “Search”, “Shopping”, “Ads”

Harnessing Pins and Boards

PINSAGE: Graph Neural Network

  • Graph has tens of billions of nodes and edges
  • Further resolves embeddings across the Pinterest graph

PINSAGE: Methods for Scaling Up

In addition to the GNN model, the PinSAGE paper introduces several methods to scale the GNN to a billion-scale recommender system (e.g., Pinterest).

  • Shared negative samples across users in a mini-batch
  • Hard negative samples
  • Curriculum learning
  • Mini-batch training of GNNs on a large-graph (to be covered in the future lecture)


Training Data

1+B repin pairs:

  • From Related Pins surface

  • Capture semantic relatedness

  • Goal: Embed such pairs to be “neighbors”

    Example positive training pairs (Q,X):

Shared Negative Samples

Curriculum Learning

  • Key insight: It is effective to make the negative samples gradually harder in the process of training.
  • At n-th epoch, we add n − 1 hard negative items.
    • #(Hard negatives) gradually increases in the process of training.
  • The model will gradually learn to make finer-grained predictions

Idea: use harder and harder negative samples

Include more and more hard negative samples for each epoch

Hard Negatives

Challenge: Industrial recsys needs to make extremely fine-grained predictions.

  • #Total items: Up to billions.

  • #Items to recommend for each user: 10 to 100.

Issue: The shared negative items are randomly sampled from all items

  • Most of them are “easy negatives”, i.e., a model does not need to be fine-grained to distinguish them from positive items.

We need a way to sample “hard negatives” to force the model to be fine-grained

  • For each user node, the hard negatives are item nodes that are close (but not connected) to the user node in the graph.
  • Hard negatives for user u ∈ U are obtained as follows:
    • Compute personalized page rank (PPR) for user u.
    • Sort items in the descending order of their PPR scores.
    • Randomly sample item nodes that are ranked high but not too high, e.g., 2000th —5000th .
      • Item nodes that are close but not too close (connected) to the user node.
  • The hard negatives for each user are used in addition to the shared negatives.

PINSAGE: Negative Sampling

  • (q, p) positive pairs are given but various methods to sample negatives to form (q, p, n)
  • Distance Weighted Sampling (Wu et al., 2017) - Sample negatives so that query-negative distance distribution is approx U[0.5, 1.4]

Fine-grained Object Similarity

Compare against Prod

PINSAGE: Summary

  • PinSAGE uses GNNs to generate high-quality user/item embeddings that capture both the rich node attributes and graph structure.
  • The PinSAGE model is effectively trained using sophisticated negative sampling strategies.
  • PinSAGE is successfully deployed at Pinterest, a billion-scale image content recommendation service.
    • Uncovered in this lecture: How to scale up GNNs to large-scale graphs. Will be covered in a later lecture

文章作者: 杰克成
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 杰克成 !