Lesson

Deep Learning

发布日期: 2021-11-12

更新日期: 2021-11-25

文章字数: 15.8k

阅读时长: 98 分

阅读次数:

CS224W | Home (stanford.edu)

Introduction

Machine Learning with Graphs

Why is Graph Deep Learning Hard?

Networks are complex
- Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
- No fixed node ordering or reference point
- Often dynamic and have multimodal features

Deep Learning in Graphs

Representation Learning

(Supervised) Machine Learning Lifecycle: This feature, that feature. Every single time!

Map nodes to d-dimensional embeddings such that similar nodes in the network are embedded close together

Course Outline

We are going to cover various topics in Machine Learning and Representation Learning for graph structured data:

Traditional methods: Graphlets, Graph Kernels
Methods for node embeddings: DeepWalk, Node2Vec
Graph Neural Networks: GCN, GraphSAGE, GAT, Theory of GNNs
Knowledge graphs and reasoning: TransE, BetaE
Deep generative models for graphs: GraphRNN
Applications to Biomedicine, Science, Industry

Applications of Graph ML

Different Types of Tasks

Classic Graph ML Tasks

Node classification: Predict a property of a node
- Example: Categorize online users / items
Link prediction: Predict whether there are missing links between two nodes
- Example: Knowledge graph completion
Graph classification: Categorize different graphs
- Example: Molecule property prediction
Clustering: Detect if nodes form a community
- Example: Social circle detection
Other tasks:
- Graph generation: Drug discovery
- Graph evolution: Physical simulation

Node-level ML Tasks

Protein Folding

A protein chain acquires its native 3D structure

The Protein Folding Problem

Computationally predict a protein’s 3D structure based solely on its amino acid sequence

Alphafold: Impact

Key idea: “Spatial graph”
- Nodes: Amino acids in a protein sequence
- Edges: Proximity between amino acids (residues)

Edge-level ML Tasks

Recommender Systems

Users interacts with items
- Watch movies, buy merchandise, listen to music
- Nodes: Users and items
- Edges: User-item interactions
Goal: Recommend items users might like

Pinsage: Graph-based Recommender

Ying et al., Graph Convolutional Neural Networks for Web-Scale Recommender Systems, KDD 2018

Drug Side Effects

Many patients take multiple drugs to treat complex or co-existing diseases:
- 46% of people ages 70-79 take more than 5 drugs
- Many patients take more than 20 drugs to treat heart disease, depression, insomnia, etc.
Task: Given a pair of drugs predict adverse side effects

Biomedical Graph Link Prediction

Zitnik et al., Modeling Polypharmacy Side Effects with Graph Convolutional Networks, Bioinformatics 2018

Subgraph-level ML Tasks

Traffic Prediction

Road Network as a Graph

Nodes: Road segments
Edges: Connectivity between road segments
Prediction: Time of Arrival (ETA)

Graph-level ML Tasks

Drug Discovery

Antibiotics are small molecular graphs
- Nodes: Atoms
- Edges: Chemical bonds

Konaklieva, Monika I. “Molecular targets of β-lactam-based antimicrobials: beyond the usual suspects.” Antibiotics 3.2 (2014): 128-142.

Deep Learning for Antibiotic Discovery

Stokeset al., A Deep Learning Approach to Antibiotic Discovery, Cell 2020

A Graph Neural Network graph classification model
Predict promising molecules from a pool of candidates

Molecule Generation/Optimization

Youet al., Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation, NeurIPS 2018

Physics Simulation

Sanchez-Gonzalez et al., Learning to simulate complex physics with graph networks, ICML 2020

Physical simulation as a graph:
- Nodes: Particles
- Edges: Interaction between particles

Simulation Learning Framework

A graph evolution task:

Goal: Predict how a graph will evolve over

Choice of Graph Representation

Components of a Network

Choosing a Proper Representation

If you connect individuals that work with each other, you will explore a professional network
If you connect those that have a sexual relationship, you will be exploring sexual networks
If you connect scientific papers that cite each other, you will be studying the citation network
If you connect all papers with the same word in the title, what will you be exploring? It is a network, nevertheless

How do you define a graph

How to build a graph:
- What are nodes?
- What are edges?
Choice of the proper network representation of a given domain/problem determines our ability to use networks successfully:
- In some cases, there is a unique, unambiguous representation
- In other cases, the representation is by no means unique
- The way you assign links will determine the nature of the question you can study

Directed vs Undirected Graphs

Heterogeneous Graphs

Node Degrees

Bipartite Graph

Bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets
Examples:
- Authors-to-Papers (they authored)
- Actors-to-Movies (they appeared in)
- Users-to-Movies (they rated)
- Recipes-to-Ingredients (they contain)
“Folded” networks:
- Author collaboration networks
- Movie co-rating networks

FoldedProjected Bipartite Graphs

Representing Graphs: Adjacency Matrix

Networks are Sparse Graphs

Representing Graphs: Edge list

Representing Graphs: Adjacency list

Adjacency list:
- Easier to work with if network is
  - Large
  - Sparse
- Allows us to quickly retrieve all neighbors of a given node
  - 1:
  - 2: 3, 4
  - 3: 2, 4
  - 4: 5
  - 5: 1, 2

Node and Edge Attributes

Possible options:
- Weight (e.g., frequency of communication)
- Ranking (best friend, second best friend…)
- Type (friend, relative, co-worker)
- Sign: Friend vs. Foe, Trust vs. Distrust
- Properties depending on the structure of the rest of the graph: Number of common friends

More Types of Graphs

Connectivity of Undirected Graphs

Connected (undirected) graph:
- Any two vertices can be joined by a path
A disconnected graph is made up by two or more connected components

Connectivity: Example

The adjacency matrix of a network with several components can be written in a block- diagonal form, so that nonzero elements are confined to squares, with all other elements being zero:

Connectivity of Directed Graphs

Strongly connected directed graph
- has a path from each node to every other node and vice versa (e.g., A-B path and B-A path)
Weakly connected directed graph
- is connected if we disregard the edge directions

Strongly connected components (SCCs) can be identified, but not every node is part of a nontrivial strongly connected component.

Traditional Methods for Machine Learning in Graphs

Traditional ML Pipeline

Design features for nodes/links/graphs

Obtain features for all training data

Feature Desian

Using effective features over graphs is the key to achieving good model performance.
Traditional ML pipeline uses hand-designed features.
In this lecture, we overview the traditional features for:
- Node-level prediction
- Link-level prediction
- Graph-level prediction
For simplicity, we focus on undirected graphs.
Goal: Make predictions for a set of objects
Design choices:
- Features: d-dimensional vectors
- Objects: Nodes, edges, sets of nodes, entire graphs
- Objective function:
  - What task are we aiming to solve?

Node-level Tasks and Features

Node-level Features: Overview

Goal: Characterize the structure and position of a node in the network:
- Node degree
- Node centrality
- Clustering coefficient
- Graphlets

Node Degree

Node Centrality

Node degree counts the neighboring nodes without capturing their importance.
Node centrality 𝑐_𝑣 takes the node importance in a graph into account
Different ways to model importance:
- Engienvector centrality
- Betweenness centrality
- Closeness centrality
- and many others…

Eigenvector centrality

Betweenness centrality

Closeness centrality

Clustering Coefficient

Graphlets

Goal: Describe network structure around node 𝑢
- Graphlets are small subgraphs that describe the structure of node 𝑢’s network neighborhood Analogy:
Degree counts #(edges) that a node touches
Clustering coefficient counts #(triangles) that a node touches.
Graphlet Degree Vector (GDV): Graphlet-base features for nodes
- GDV counts #(graphlets) that a node touches
Considering graphlets of size 2-5 nodes we get:
- Vector of 73 coordinates is a signature of a node that describes the topology of node’s neighborhood
Graphlet degree vector provides a measure of a node’s local network topology:
- Comparing vectors of two nodes provides a more detailed measure of local topological similarity than node degrees or clustering coefficient.

Induced Subgraph Isomorphism

Def: Induced subgraph is another graph, formed from a subset of vertices and all of the edges connecting the vertices in that subset.

Def: Graph Isomorphism(同构)

Two graphs which contain the same number of nodes connected in the same way are said to be isomorphic.

Graphlet Degree Vector (GDV): A count vector of graphlets rooted at a given node

Node-level Feature: Summary

We have introduced different ways to obtain node features.
They can be categorized as:
- Importance-based features:
  - Node degree
  - Different node centrality measures
- Structure-based features:
  - Node degree
  - Clustering coefficient
  - Graphlet count vector
Importance-based features: capture the importance of a node in a graph
- Node degree:
  - Simply counts the number of neighboring nodes
- Node centrality:
  - Models importance of neighboring nodes in a graph
  - Different modeling choices: eigenvector centrality, betweenness centrality, closeness centrality
Useful for predicting influential nodes in a graph
- Example: predicting celebrity users in a social network
Structure-based features: Capture topological properties of local neighborhood around a node.
- Node degree:
  - Counts the number of neighboring nodes
- Clustering coefficient:
  - Measures how connected neighboring nodes are
- Graphlet degree vector:
  - Counts the occurrences of different graphlets
Useful for predicting a particular role a node plays in a graph:
- Example: Predicting protein functionality in a protein-protein interaction network

Link Prediction Task and Features

Recap

The task is to predict new links based on the existing links.
At test time, node pairs (with no existing links) are ranked, and top 𝐾 node pairs are predicted.
The key is to design features for a pair of nodes.

Two formulations of the link prediction task:

Link Prediction via Proximity

Methodology:
- For each pair of nodes (x,y) compute score c(x,y)
  - For example, c(x,y) could be the # of common neighbors of x and y
- Sort pairs (x,y) by the decreasing score c(x,y)
- Predict top n pairs as new links
- See which of these links actually appear in 𝐺[𝑡1 ,𝑡1 ′ ]

Link-level Features: Overview

Distance-based feature
Local neighborhood overlap
Global neighborhood overlap

Distance-based Features

Shortest-path distance between two nodes

However, this does not capture the degree of neighborhood overlap:

Node pair (B, H) has 2 shared neighboring nodes, while pairs (B, E) and (A, B) only have 1 such node

Local Neighborhood Overlap

Global Neighborhood Overlap

Limitation of local neighborhood features:
- Metric is always zero if the two nodes do not have any neighbors in common.
- However, the two nodes may still potentially be connected in the future.
Global neighborhood overlap metrics resolve the limitation by considering the entire graph.

Katz index: count the number of walks of all lengths between a given pair of nodes.
Q: How to compute #walks between two nodes?
- Use powers of the graph adjacency matrix!

Intuition: Powers of Adj Matrices

Computing #walks between two nodes

Summary

Distance-based features:
- Uses the shortest path length between two nodes but does not capture how neighborhood overlaps.
Local neighborhood overlap:
- Captures how many neighboring nodes are shared by two nodes.
- Becomes zero when no neighbor nodes are shared.
Global neighborhood overlap:
- Uses global graph structure to score two nodes.
- Katz index counts #walks of all lengths between two nodes.

Graph-level Features and Graph Kernels

Graph-level Features

Goal: We want features that characterize the structure of an entire graph

Backaround Kernel Methods

Kernel methods are widely-used for traditional ML for graph-level prediction.
Idea: Design kernels instead of feature vectors.
A quick introduction to Kernels:

Overview

Graph Kernels: Measure similarity between two graphs:
- Graphlet Kernel [1]
- Weisfeiler-Lehman Kernel [2]
- Other kernels are also proposed in the literature (beyond the scope of this lecture)
  - Random-walk kernel
  - Shortest-path graph kernel
  - And many more…

[1] Shervashidze, Nino, et al. “Efficient graphlet kernels for large graph comparison.” Artificial Intelligence and Statistics. 2009.

[2] Shervashidze, Nino, et al. “Weisfeiler-lehman graph kernels.” Journal of Machine Learning Research 12.9 (2011).

Graph Kernel: Key Idea

Goal: Design graph feature vector 𝜙(𝐺)
Key idea: Bag-of-Words (BoW) for a graph
- Recall: BoW simply uses the word counts as features for documents (no ordering considered).
- Naïve extension to a graph: Regard nodes as words.
- Since both graphs have 4 red nodes, we get the same feature vector for two different graphs…

What if we use Bag of node degrees?

Both Graphlet Kernel and Weisfeiler-Lehman (WL) Kernel use Bag-of-* representation of graph, where * is more sophisticated than node degrees!

Graphlet Features

Key idea: Count the number of different graphlets in a graph.
- Note: Definition of graphlets here is slightly different from node-level features.
- The two differences are:
  - Nodes in graphlets here do not need to be connected (allows for isolated nodes)
  - The graphlets here are not rooted.
  - Examples in the next slide illustrate this.

Graphlet Kernel

Limitations: Counting graphlets is expensive!
Counting size-𝑘 graphlets for a graph with size 𝑛 by enumeration takes 𝑛^𝑘 .
This is unavoidable in the worst-case since subgraph isomorphism test (judging whether a graph is a subgraph of another graph) is NP-hard.
If a graph’s node degree is bounded by 𝑑, an 𝑂(𝑛𝑑^(𝑘−1)) algorithm exists to count all the graphlets of size 𝑘.

Can we design a more efficient graph kernel?

Weisfeiler-lehman Kernel

Goal: Design an efficient graph feature descriptor 𝜙 (𝐺)
Idea: Use neighborhood structure to iteratively enrich node vocabulary.
Generalized version of Bag of node degrees since node degrees are one-hop neighborhood information.
Algorithm to achieve this: Color refinement

Example of color refinement given two graphs

After color refinement, WL kernel counts number of nodes with a given color.

The WL kernel value is computed by the inner product of the color count vectors:

WL kernel is computationally efficient
- The time complexity for color refinement at each step is linear in #(edges), since it involves aggregating neighboring colors.
When computing a kernel value, only colors appeared in the two graphs need to be tracked.
- Thus, #(colors) is at most the total number of nodes.
Counting colors takes linear-time w.r.t. #(nodes).
In total, time complexity is linear in #(edges)

Summary

Graphlet Kernel
- Graph is represented as Bag-of-graphlets
- Computationally expensive
Weisfeiler-Lehman Kernel
- Apply 𝐾-step color refinement algorithm to enrich node colors
  - Different colors capture different 𝐾-hop neighborhood structures
- Graph is represented as Bag-of-colors
- Computationally efficient
- Closely related to Graph Neural Networks (as we will see!)
Traditional ML Pipeline
- Hand-crafted feature + ML model
Hand-crafted features for graph data
- Node-level:
  - Node degree, centrality, clustering coefficient, graphlets
- Link-level:
  - distance-based feature
  - local/global neighborhood overlap
- Graph-level:
  - Graphlet kernel, WL kernel

Node Embeddings

Goal: Efficient task-independent feature learning for machine learning with graphs!

Why Embedding?

Task: Map nodes into an embedding space
- Similarity of embeddings between nodes indicates their similarity in the network. For example:
  - Both nodes are close to each other (connected by an edge)
- Encode network information
- Potentially used for many downstream predictions

2D embedding of nodes of the Zachary’s Karate Club network:

Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014

Node Embeddings Encoder and Decoder

Setup

Assume we have a graph G:
- V is the vertex set.
- A is the adjacency matrix (assume binary).
- For simplicity: No node features or extra information is used

Embedding Nodes

Goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the graph

Learning Node Embeddings

Encoder maps from nodes to embeddings
Define a node similarity function (i.e., a measure of similarity in the original network)
Decoder 𝐃𝐄𝐂 maps from embeddings to the similarity score
Optimize the parameters of the encoder so that:

Two Key Components

“shallow”Encoding

Simplest encoding approach: Encoder is just an embedding-lookup

Each node is assigned a unique embedding vector (i.e., we directly optimize the embedding of each node)
Many methods: DeepWalk, node2vec

Framework Summary

Encoder + Decoder Framework

How to Define Node Similarit

Key choice of methods is how they define node similarity.
Should two nodes have a similar embedding if they…
- are linked?
- share neighbors?
- have similar “structural roles”?
We will now learn node similarity definition that uses random walks, and how to optimize embeddings for such a similarity measure.

Note on Node Embeddings

This is unsupervised/self-supervisedway of learning node embeddings.
- We are not utilizing node labels
- We are not utilizing node features
- The goal is to directly estimate a set of coordinates (i.e., the embedding) of a node so that some aspect of the network structure (captured by DEC) is preserved.
These embeddings are task independent
- They are not trained for a specific task but can be used for any task.

Random Walk Approaches for Node Embeddings

Notation

Random Walk

Random-walk Embeddings

Why Random Walks?

Expressivity: Flexible stochastic(随机的) definition of node similarity that incorporates both local and higher-order neighborhood information Idea: if random walk starting from node 𝑢 visits 𝑣 with high probability, 𝑢 and 𝑣 are similar (high-order multi-hop information)
Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks

Unsupervised Feature Learning

Intuition: Find embedding of nodes in 𝑑-dimensional space that preserves similarity
Idea: Learn node embedding such that nearby nodes are close together in the network
Given a node 𝑢, how do we define nearby nodes?

Feature Learning as Optimization

But doing this naively is too expensive!

Negative Sampling

Stochastic Gradient Descent

Stochastic Gradient Descent: Instead of evaluating gradients over all examples, evaluate it for each individual training example.

Random Walks：Summary

How should we randomly walk?

So far we have described how to optimize embeddings given a random walk strategy R -
What strategies should we use to run these random walks?
- Simplest idea: Just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2013)
  - The issue is that such notion of similarity is too constrained
How can we generalize this?

Reference: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD

Overview of node2vec

Reference: Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD

Goal: Embed nodes with similar network neighborhoods close in the feature space.
We frame this goal as a maximum likelihood optimization problem, independent to the downstream prediction task.
Key observation: Flexible notion of network neighborhood 𝑁_𝑅(𝑢) of node 𝑢 leads to rich node embeddings
Develop biased 2^nd order random walk 𝑅 to generate network neighborhood 𝑁_𝑅(𝑢) of node 𝑢

node2vec: Biased Walks

Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016).

BFS VS.DFS

Biased fixed-length random walk 𝑹 that given a node 𝒖 generates neighborhood 𝑵_𝑹(𝒖)
- Two parameters:
  - Return parameter 𝒑:
    - Return back to the previous node
- In-out parameter 𝒒:
  - Moving outwards (DFS) vs. inwards (BFS)
  - Intuitively, 𝑞 is the “ratio” of BFS vs. DFS

Biased Random Walks

Biased 2nd -order random walks explore network neighborhoods:
- Rnd. walk just traversed edge (𝑠1 , 𝑤) and is now at 𝑤
- Insight: Neighbors of 𝑤 can only be:

Idea: Remember where the walk came from

Walker came over edge (𝐬𝟏, 𝐰) and is at 𝐰. Where to go next?
𝑝, 𝑞 model transition probabilities
- 𝑝 … return parameter
- 𝑞 … ”walk away” parameter

Walker came over edge (𝐬𝟏, 𝐰) and is at 𝐰. Where to go next?

node2vec algorithm

1) Compute random walk probabilities

2) Simulate 𝑟 random walks of length 𝑙 starting from each node 𝑢

3) Optimize the node2vec objective using Stochastic Gradient Descent

Linear-time complexity

All 3 steps are individually parallelizable

Other Random Walk Ideas

Different kinds of biased random walks:
- Based on node attributes (Dong et al., 2017).
- Based on learned weights (Abu-El-Haija et al., 2017)
Alternative optimization schemes:
- Directly optimize based on 1-hop and 2-hop random walk probabilities (as in LINE from Tang et al. 2015).
Network preprocessing techniques:
- Run random walks on modified versions of the original network (e.g., Ribeiro et al. 2017’s struct2vec, Chen et al. 2016’s HARP).

Summary so far

Core idea: Embed nodes so that distances in embedding space reflect node similarities in the original network.
Different notions of node similarity:
- Naïve: similar if two nodes are connected
- Neighborhood overlap (covered in Lecture 2)
- Random walk approaches (covered today)
So what method should I use..?
No one method wins in all cases….
- E.g., node2vec performs better on node classification while alternative methods perform better on link prediction (Goyal and Ferrara, 2017 survey).
Random walk approaches are generally more efficient.
In general: Must choose definition of node similarity that matches your application.

Embedding Entire Graphs

Approach l

Simple (but effective) approach 1:
- Run a standard graph embedding technique on the (sub)graph 𝐺.
- Then just sum (or average) the node embeddings in the (sub)graph 𝐺.

Used by Duvenaud et al., 2016 to classify molecules based on their graph structure

Approach 2

Approach 2: Introduce a “virtual node” to represent the (sub)graph and run a standard graph embedding technique

Proposed by Li et al., 2016 as a general technique for subgraph embedding

Approach 3: Anonymous Walk Embeddings

States in anonymous(匿名的) walks correspond to the index of the first time we visited the node in a random walk

Anonymous Walk Embeddings, ICML 2018

Agnostic to the identity of the nodes visited (hence anonymous)
Example: Random walk w1 :

Number of Walks Grows

Simple Use of Anonymous Walks

Simulate anonymous walks 𝑤𝑖 of 𝑙 steps and record their counts.
Represent the graph as a probability distribution over these walks.
For example:
- Set 𝑙 = 3
- Then we can represent the graph as a 5-dim vector
  - Since there are 5 anonymous walks 𝑤𝑖 of length 3: 111, 112, 121, 122, 123
- 𝒛_𝑮[𝑖] = probability of anonymous walk 𝑤𝑖 in graph 𝐺.
Sampling anonymous walks: Generate independently a set of 𝑚 random walks.
Represent the graph as a probability distribution over these walks.
How many random walks 𝑚 do we need?

New idea: Learn Walk Embeddings

Rather than simply representing each walk by the fraction of times it occurs, we learn embedding 𝒛_𝒊 of anonymous walk 𝒘_𝒊 .

How to embed walks?

Idea: Embed walks s.t. the next walk can be predicted

Anonymous Walk Embeddings, ICML 2018

Summary

We discussed 3 ideas to graph embeddings:
Approach 1: Embed nodes and sum/avg them
Approach 2: Create super-node that spans the (sub) graph and then embed that node.
Approach 3: Anonymous Walk Embeddings
- Idea 1: Sample the anon. walks and represent the graph as fraction of times each anon walk occurs.
- Idea 2: Learn graph embedding together with anonymous walk embeddings.

Preview: Hierarchical Embeddings

We will discuss more advanced ways to obtain graph embeddings in Lecture 8.
We can hierarchically cluster nodes in graphs, and sum/avg the node embeddings according to these clusters.

How to Use Embeddings

Encoder-decoder framework:
- Encoder: embedding lookup
- Decoder: predict score based on embedding to match node similarity
Node similarity measure: (biased) random walk
- Examples: DeepWalk, Node2Vec
Extension to Graph embedding: Node embedding aggregation and Anonymous Walk Embeddings

Graph as Matrix: Pagerank, Random Walks and Embeddings

Pagerank(aka the Google Algorithm)

Example: The Web as a Graph

Q: What does the Web “look like” at a global level?
Web as a graph:
- Nodes = web pages
- Edges = hyperlinks
- Side issue: What is a node?
  - Dynamic pages created on the fly
  - “dark matter” – inaccessible database generated pages

What Does the Web Look Like?

How is the Web linked?
What is the “map” of the Web?
Web as a directed graph [Broder et al. 2000]:
- Given node v, what nodes can v reach?
- What other nodes can reach v?

Ranking Nodes on the Graph

All web pages are not equally “important” www.thispersondoesnotexist.com vs. www.stanford.edu
There is large diversity in the web-graph node connectivity.
So, let’s rank the pages using the web graph link structure!

Link Analysis Algorithms

We will cover the following Link Analysis approaches to compute the importance of nodes in a graph:
- PageRank
- Personalized PageRank (PPR)
- Random Walk with Restarts

Links as Votes

Idea: Links as votes
- Page is more important if it has more links
  - In-coming links? Out-going links?
Think of in-links as votes:
- www.stanford.edu has 23,400 in-links
- thispersondoesnotexist.com has 1 in-link
Are all in-links equal?
- Links from important pages count more
- Recursive question!

Pagerank: The”Flow”Model

A “vote” from an important page is worth more:
- Each link’s vote is proportional to the importance of its source page
- If page i with importance ri has di out-links, each link gets ri / di votes
- Page j’s own importance rj is the sum of the votes on its in-links

Pagerank: Matrix Formulation

Connection to Random Walk

Imagine a random web surfer:
- At any time 𝒕, surfer is on some page 𝑖
- At time 𝒕 + 𝟏, the surfer follows an out-link from 𝒊 uniformly at random
- Ends up on some page 𝒋 linked from 𝒊
- Process repeats indefinitely
Let:
- 𝒑(𝒕) … vector whose 𝑖^th coordinate is the prob. that the surfer is at page 𝑖 at time 𝑡
- So, 𝒑(𝒕) is a probability distribution over pages

The Stationary Distribution

Recall Eigenvector of A Matrix

Summary

PageRank:

Measures importance of nodes in a graph using the link structure of the web
Models a random web surfer using the stochastic adjacency matrix 𝑴
PageRank solves 𝒓 = 𝑴𝒓 where 𝒓 can be viewed as both the principle eigenvector of 𝑴 and as the stationary distribution of a random walk over the graph

Pagerank: How to solve?

Power Iteration Method

Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks

About 50 iterations is sufficient to estimate the limiting solution.

Three Ouestions

Does this converge?
Does it converge to what we want?
Are results reasonable?

Problems

Two problems:

(1) Some pages are dead ends (have no out-links)

Such pages cause importance to “leak out”

(2) Spider traps (all out-links are within the group)

Eventually spider traps absorb all importance

Does this converge

Does it converge to what we want?

Solution to Spider Traps

Solution to Dead Ends

Teleports: Follow random teleport links with total probability 1.0 from dead-ends
- Adjust matrix accordingly

Why Teleports Solve the Problem?

Why are dead-ends and spider traps a problem and why do teleports solve the problem?
Spider-traps are not a problem, but with traps PageRank scores are not what we want
- Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
Dead-ends are a problem
- The matrix is not column stochastic so our initial assumptions are not met
- Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go

Solution: Random Teleports

Random Teleports(beta=0.8)

Summary

PageRank solves for 𝒓 = 𝑮𝒓 and can be efficiently computed by power iteration of the stochastic adjacency matrix (𝑮)
Adding random uniform teleportation solves issues of dead-ends and spider-traps

Random Walk with Restarts and Personalized Pagerank

Example: Recommendation

Given: A bipartite graph representing user and item interactions (e.g. purchase)

Goal: Proximity on graphs

What items should we recommend to a user who interacts with item Q?
Intuition: if items Q and P are interacted by similar users, recommend P when user interacts with Q

Which is more related A,A’ or B,B’?

Node proximity Measurements

Which is more related A,A’, B,B’ or C,C’?

Proximity on Graphs

PageRank:

Ranks nodes by “importance”
Teleports(传送) with uniform probability to any node in the network

Personalized PageRank:

Ranks proximity of nodes to the teleport nodes 𝑺

Proximity(接近) on graphs:

Q: What is most related item to Item Q?
Random Walks with Restarts
- Teleport back to the starting node: 𝑺 = {𝑸}

Idea: Random Walks

Idea
- Every node has some importance
- Importance gets evenly split among all edges and pushed to the neighbors:
Given a set of QUERY_NODES, we simulate a random walk:
- Make a step to a random neighbor and record the visit (visit count)
- With probability ALPHA, restart the walk at one of the QUERY_NODES
- The nodes with the highest visit count have highest proximity to the QUERY_NODES

Idea:

Every node has some importance
- Importance gets evenly split among all edges and pushed to the neighbors
- Given a set of QUERY NODES Q, simulate a random walk:

Random Walk Algorithm

Benefits

Why is this a good solution?

Because the “similarity” considers:

Multiple connections
Multiple paths
Direct and indirect connections
Degree of the node

Summary: Page Rank Variants

PageRank:
- Teleports to any node
- Nodes can have the same probability of the surfer landing: 𝑺 = [0.1,0.1,0.1, 0.1,0.1,0.1, 0.1,0.1,0.1, 0.1]
Topic-Specific PageRank aka Personalized PageRank:
- Teleports to a specific set of nodes
- Nodes can have different probabilities of the surfer landing there: 𝑺 = [0.1,0, 0,0.2, 0, 0,0.5,0,0, 0.2]
Random Walk with Restarts:
- Topic-Specific PageRank where teleport is always to the same node: 𝑺 = [0,0, 0,0, 𝟏, 0, 0, 0, 0, 0,0]

A graph is naturally represented as a matrix
We defined a random walk process over the graph
- Random surfer moving across the links and with random teleportation
- Stochastic adjacency matrix M
PageRank = Limiting distribution of the surfer location represented node importance
- Corresponds to the leading eigenvector of transformed adjacency matrix M.

Matrix Factorization and Node Embeddings

Embeddings Matrix Factorization(因式分解)

Recall: encoder as an embedding lookup

Connection to Matrix Factorization

Random Walk-based Similarity

DeepWalk and node2vec have a more complex node similarity definition based on random walks
DeepWalk is equivalent to matrix factorization of the following complex matrix expression:

Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec, WSDM 18

Node2vec can also be formulated as a matrix factorization (albeit a more complex matrix)

Limitations

Limitations of node embeddings via matrix factorization and random walks

Cannot obtain embeddings for nodes not in the training set

Node 1 and 11 are structurally similar – part of one triangle, degree 2, …
However, they have very different embeddings.
- It’s unlikely that a random walk will reach node 11 from node 1.
DeepWalk and node2vec do not capture structural similarity.
Cannot utilize node, edge and graph features

Solution to these limitations: Deep Representation Learning and Graph Neural Networks

Summary

PageRank
- Measures importance of nodes in graph
- Can be efficiently computed by power iteration of adjacency matrix
Personalized PageRank (PPR)
- Measures importance of nodes with respect to a particular node or set of nodes
- Can be efficiently computed by random walk
Node embeddings based on random walks can be expressed as matrix factorization
Viewing graphs as matrices plays a key role in all above algorithms!

Message Passing and Node Classification

Example Node Classification

Given labels of some nodes
Let’s predict labels of unlabeled nodes
This is called semi-supervised node classification

Correlations(相关性) Exist in Networks

Behaviors of nodes are correlated across the links of the network
Correlation: Nearby nodes have the same color (belonging to the same class)
Two explanations for why behaviors of nodes in networks are correlated:

Homophily: The tendency of individuals to associate and bond with similar others
- “Birds of a feather flock together”
- It has been observed in a vast array of network studies, based on a variety of attributes (e.g., age, gender, organizational role, etc.)
- Example: Researchers who focus on the same research area are more likely to establish a connection (meeting at conferences, interacting in academic talks, etc.)

Example of homophily

Online social network
- Nodes = people
- Edges = friendship
- Node color = interests (sports, arts, etc.)
People with the same interest are more closely connected due to homophily

Influence: Social connections can influence the individual characteristics of a person.
- Example: I recommend my musical preferences to my friends, until one of them grows to like my same favorite genres!

How do we leverage node correlations in networks?

Classification with Network Data

How do we leverage(影响力) this correlation observed in networks to help predict node labels?

How do we predict the labels for the nodes in grey?

Motivation

Similar nodes are typically close together or directly connected in the network:
- Guilt-by-association: If I am connected to a node with label X, then I am likely to have label X as well.
- Example: Malicious/benign web page: Malicious web pages link to one another to increase visibility, look credible, and rank higher in search engines
Classification label of a node v in network may depend on:
- Features of v
- Labels of the nodes in v’s neighborhood
- Features of the nodes in v’s neighborhood

Semi-supervised Learning

Formal setting:
- Given:
  - Graph
  - Few labeled nodes
- Find: Class (red/green) of remaining nodes
- Main assumption: There is homophily in the network
Example task:

Problem Setting

Example applications

Many applications under this setting:
- Document classification
- Part of speech tagging
- Link prediction
- Optical character recognition
- Image/3D data segmentation
- Entity resolution in sensor networks
- Spam and fraud detection

We focus on semi-supervised binary node classification

We will introduce three approaches:

Relational classification
Iterative classification
Correct & Smooth

Relational Classification

Probabilistic Relational Classifier

Challenges:

Convergence is not guaranteed
Model cannot use node feature information

Example: Initialization

Initialization:

All labeled nodes with their labels
All unlabeled nodes 0.5 (belonging to class 1 with probability 0.5)

Example: lst Iteration, Update Node

Example: After 1st Iteration

After Iteration 1 (a round of updates for all unlabeled nodes)

Example: After 2nd Iteration

Example: After 3] Iteration

Convergence

Iterative(迭代的) Classification

Relational classifier does not use node attributes.

How can one leverage them?

Computing the Summary Z_v

Architecture of Iterative Classifiers

Example: Web Page Classification

Input: Graph of web pages
Node: Web page
Edge: Hyper-link between web pages
Directed edge: a page points to another page
- Node features: Webpage description
For simplicity, we only consider two binary features
- Task: Predict the topic of the webpage

Baseline: Train a classifier (e.g., linear classifier) to classify pages based on node attributes

Iterative Classifier

step 1

step 2

step 3. 1

step 3. 2

Final Prediction

Summary

We talked about 2 approaches to collective classification
Relational classification
- Iteratively update probabilities of node belonging to a label class based on its neighbors
Iterative classification
- Improve over collective classification to handle attribute/feature information
- Classify node v based on its features as well as labels of neighbors

Collective Classification Correct & Smooth

Correct & Smooth

Combining Label Propagation and Simple Models Out-performs Graph Neural Networks

OGB leaderboard snapshot at Oct 1st, 2021

Setting: A partially labeled graph and features over nodes.

C&S follows the three-step procedure:

Train base predictor
Use the base predictor to predict soft labels of all nodes.
Post-process the predictions using graph structure to obtain the final predictions of all nodes.

Train Base Predictor

Train a base predictor that predict soft labels (class probabilities) over all nodes.

Labeled nodes are used for train/validation data.
Base predictor can be simple:
- Linear model/Multi-Layer-Perceptron(MLP) over node features

Predict Over All Nodes

Given a trained base predictor, we apply it to obtain soft labels for all the nodes.

We expect these soft labels to be decently accurate.
Can we use graph structure to post-process the predictions to make them more accurate?

Post-process Predictions

C&S uses the 2-step procedure to post-process the soft predictions.

Correct step
Smooth step

C&S Post-processing: Correct Step

The key idea is that we expect errors in the base prediction to be positively correlated along edges in the graph.
In other words, an error at node u increases the chance of a similar error at neighbors of u.
Thus, we should “spread” such uncertainty over the graph

Intuition of Correct & Smooth

Correct Step

Compute training errors of nodes.
- Training error: Ground-truth label minus soft label. Defined as 0 for unlabeled nodes.

Zhu et al. ICML 2013

Smooth Step

Smoothen the corrected soft labels along the edges.
- Assumption: Neighboring nodes tend to share the same labels.
- Note: For training nodes, we use the ground-truth hard labels instead of the soft labels.

Toy Example Summary

Our toy example shows that C&S successfully improves base model performance using graph structure.

C&S on a Real-world Dataset

C&S significantly improves the performance of the base model (MLP).

C&S outperforms Smooth-only (no correct step) baseline.

Summary

Correct & Smooth (C&S) uses graph structure to post-process the soft node labels predicted by any base model.
Correction step: Diffuse and correct for the training errors of the base predictor.
Smooth step: Smoothen the prediction of the base predictor.
C&S achieves strong performance on semisupervised node classification

We learned how to leverage correlation in graphs to make prediction on nodes.

Key techniques:

Relational classification
Iterative classification
Correct & Smooth

Graph Neural Networks

Basics of Deep Learning

Deep Learning for Graphs

Content

Local network neighborhoods:
- Describe aggregation strategies
- Define computation graphs
Stacking multiple layers:
- Describe the model, parameters, training
- How to fit the model?
- Simple example for unsupervised and supervised training

Setup

A Naive Approach

Join adjacency matrix and features
Feed them into a deep neural net:

Issues with this idea:

O(|V|) parameters
Not applicable to graphs of different sizes
Sensitive to node ordering

Idea: Convolutional Networks

Goal is to generalize convolutions beyond simple lattices（格子）

Leverage node features/attributes (e.g., text, images)

Real-world Graphs

But our graphs look like this:

There is no fixed notion of locality or sliding window on the graph

Graph is permutation invariant

Permutation Invariance（置换不变性）

Graph does not have a canonical order of the nodes!
We can have many different order plans.

Graph and node representations should be the same for Order plan 1 and Order plan 2

What does it mean by “graph representation is same for two order plans”?

Permutation Equivariance（置换等变）

For node representation

Graph Neural Network Overview

Graph neural networks consist of multiple permutation equivariant / invariant functions

Next: Design graph neural networks that are permutation invariant / equivariant by passing and aggregating information from neighbors

Graph Convolutional Networks

Idea: Node’s neighborhood defines a computation graph

Learn how to propagate information across the graph to compute node features

Idea: Aggregate Neiahbors

Key idea: Generate node embeddings based on local network neighborhoods

Intuition: Nodes aggregate information from their neighbors using neural networks

Deep Model: Many Layers

Model can be of arbitrary depth:

Nodes have embeddings at each layer

Layer-0 embedding of node v is its input feature, x_v

Layer-k embedding gets information from nodes that are k hops away

Neighborhood Aggregation(聚集)

Neighborhood aggregation: Key distinctions are in how different approaches aggregate information across the layers

Basic approach: Average information from neighbors and apply a neural network

Equivariant Property

The target node (blue) has the same computation graph for different order plans

Training the Model

Need to define a loss function on the embeddings.

Model Parameters

Matrix Formulation

Many aggregations can be performed efficiently by (sparse) matrix operations

Note: not all GNNs can be expressed in matrix form, when aggregation function is complex

How to Train A GNN

Unsupervised Training

Node similarity can be anything from Lecture 3, e.g., a loss based on:

Random walks (node2vec, DeepWalk, struc2vec)
Matrix factorization
Node proximity in the grap

Supervised Training

Directly train the model for a supervised task (e.g., node classification)

Model Design: Overview

Inductive Capability

The same aggregation parameters are shared for all nodes:

The number of model parameters is sublinear in |V| and we can generalize to unseen nodes!

Inductive node embedding => Generalize to entirely unseen graphs

E.g., train on protein interaction graph from model organism A and generate embeddings on newly collected data about organism B

Many application settings constantly encounter previously unseen nodes:

E.g., Reddit, YouTube, Google Scholar

Need to generate new embeddings “on the fly”

GNNs subsume CNNs and Transformers

Architecture Comparison

How does GNNs compare to prominent architectures such as Convolutional Neural Nets, and Transformers?

GNN VS. CNN

CNN can be seen as a special GNN with fixed neighbor size and ordering:

The size of the filter is pre-defined for a CNN.
The advantage of GNN is it processes arbitrary graphs with different degrees for each node

CNN is not permutation equivariant.

Switching the order of pixels will leads to different outputs

Transformer

[Attention is all you need. Vaswani et al., NeurIPS 2017]

GNN vs, Transformer

A nice blog plot for this: https://towardsdatascience.com/transformers-are-graph-neural-networks-bca9f75412aa

Summary

Basics of neural networks
- Loss, Optimization, Gradient, SGD, non-linearity, MLP
Idea for Deep Learning for Graphs
- Multiple layers of embedding transformation
- At every layer, use the embedding at previous layer as the input
- Aggregation of neighbors and self-embeddings
Graph Convolutional Network
- Mean aggregation; can be expressed in matrix form
GNN is a general architecture
- CNN and Transformer can be viewed as a special GNN

Graph Neural Networks[2]

A General Perspective on Graph Neural Networks

A Genera GNN Framework

J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020

Summary

A Single Layer of a GNN

A Single GNN Layer

Message Computation

Aggregation

Issue

Putting things together

Classical GNN Layers: GCN

T. Kipf, M. Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017

Graph Convolutional Networks (GCN)

How to write this as Message + Aggregation?

GraphSAGE

Hamilton et al. Inductive Representation Learning on Large Graphs, NeurIPS 2017

L2 Normalization

GAT

Graph Attention Networks

Not all node’s neighbors are equally important

Attention is inspired by cognitive attention.
The attention a_uv focuses on the important parts of the input data and fades out the rest.
- Idea: the NN should devote more computing power on that small but important part of the data.
- Which part of the data is more important depends on the context and is learned through training.

Can we do better than simple neighborhood aggregation?

Attention Mechanism

What is the form of attention mechanism a?

The approach is agnostic to the choice of a
E.g., use a simple single-layer neural network
a have trainable parameters (weights in the Linear layer)

Multi-head attention: Stabilizes the learning process of attention mechanism

Benefits of Attention Mechanism

GNN Layers in Practice

Many modern deep learning modules can be incorporated into a GNN layer

Batch Normalization:
- Stabilize neural network training
Dropout:
- Prevent overfitting
Attention/Gating:
- Control the importance of a message
More:
- Any other useful deep learning modules

Batch Normalization

S. Loffe, C.Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015

Goal: Stabilize neural networks training
Idea: Given a batch of inputs (node embeddings)
- Re-center the node embeddings into zero mean
- Re-scale the variance into unit variance

Dropout

Goal: Regularize a neural net to prevent overfitting.
Idea:
- During training: with some probability p, randomly set neurons to zero (turn off)
- During testing: Use all the neurons for computation

Activation(Non-linearity)

Summary:

Modern deep learning modules can be included into a GNN layer for better performance
Designing novel GNN layers is still an active research frontier!
Suggested resources: You can explore diverse GNN designs or try out your own ideas in GraphGym

Stacking Layers of a GNN

How to construct a Graph Neural Network?
The standard way: Stack GNN layers sequentially

The Over-smoothing Problem

The Issue of stacking many GNN layers
- GNN suffers from the over-smoothing problem
The over-smoothing problem: all the node embeddings converge to the same value
- This is bad because we want to use node embeddings to differentiate nodes
Why does the over-smoothing problem happen?

Receptive Field of a GNN

Receptive field: the set of nodes that determine the embedding of a node of interest
- In a k-layer GNN, each node has a receptive field of k-hop neighborhood

Receptive field overlap for two nodes
- The shared neighbors quickly grows when we increase the number of hops (num of GNN layers)

We can explain over-smoothing via the notion of receptive field
- We knew the embedding of a node is determined by its receptive field
  - If two nodes have highly-overlapped receptive fields, then their embeddings are highly similar
Stack many GNN layers -> nodes will have highlyoverlapped receptive fields -> node embeddings will be highly similar -> suffer from the over-smoothing problem
Next: how do we overcome over-smoothing problem?

Design GNN Layer Connectivity

What do we learn from the over-smoothing problem?
Lesson 1: Be cautious when adding GNN layers
- Unlike neural networks in other domains (CNN for image classification), adding more GNN layers do not always help
- Step 1: Analyze the necessary receptive field to solve your problem. E.g., by computing the diameter of the graph
- Step 2: Set number of GNN layers L to be a bit more than the receptive field we like. Do not set L to be unnecessarily large!
Question: How to enhance the expressive power of a GNN, if the number of GNN layers is small?

Expressive Power for Shallow GNNS

How to make a shallow GNN more expressive?
Solution 1: Increase the expressive power within each GNN layer
- In our previous examples, each transformation or aggregation function only include one linear layer
- We can make aggregation / transformation become a deep neural network!

Solution 2: Add layers that do not pass messages
- A GNN does not necessarily only contain GNN layers
  - E.g., we can add MLP layers (applied to each node) before and after GNN layers, as pre-process layers and post-process layers
- Pre-processing layers: Important when encoding node features is necessary.
- E.g., when nodes represent images/text
- Post-processing layers: Important when reasoning / transformation over node embeddings are needed
- E.g., graph classification, knowledge graphs
- In practice, adding these layers works great

What if my problem still requires many GNN layers?

Lesson 2: Add skip connections in GNNs
- Observation from over-smoothing: Node embeddings in earlier GNN layers can sometimes better differentiate nodes
- Solution: We can increase the impact of earlier layers on the final node embeddings, by adding shortcuts in GNN

He et al. Deep Residual Learning for Image Recognition, CVPR 2015

Idea of Skip Connections

Why do skip connections work?

Veit et al. Residual Networks Behave Like Ensembles of Relatively Shallow Networks, ArXiv 2016

Example: GCN with Skip Connections

Other Options of Skip Connections

Xu et al. Representation learning on graphs with jumping knowledge networks, ICML 2018

GNN Augmentation(增强) and Training

Graph Augmentation for GNNS

Our assumption so far has been

Raw input graph = computational graph

Reasons for breaking this assumption

Features:
- The input graph lacks features
Graph structure:
- The graph is too sparse(稀疏的) -> inefficient message passing
- The graph is too dense -> message passing is too costly
- The graph is too large -> cannot fit the computational graph into a GPU
It’s unlikely that the input graph happens to be the optimal computation graph for embeddings

Graph Augmentation Approaches

Graph Feature augmentation
- The input graph lacks features -> feature augmentation
Graph Structure augmentation
- The graph is too sparse -> Add virtual nodes / edges
- The graph is too dense -> Sample neighbors when doing message passing
- The graph is too large -> Sample subgraphs to compute embeddings
Will cover later in lecture: Scaling up GNNs

Why do we need feature augmentation?

(1) Input graph does not have node features

This is common when we only have the adj. matrix ¡ Standard approaches:

a) Assign constant values to node

b) Assign unique IDs to nodes

These IDs are converted into one-hot vectors

Feature augmentation: constant vs. one-hot

J. You, J. Gomes-Selman, R. Ying, J. Leskovec. Identity-aware Graph Neural Networks, AAAI 2021

(2) Certain structures are hard to learn by GNN

Example: Cycle count feature:

Can GNN learn the length of a cycle that v1 resides in?

Unfortunately, no

v1 cannot differentiate which graph it resides in

Because all the nodes in the graph have degree of 2
The computational graphs will be the same binary tree

Solution:

We can use cycle count as augmented node features

Other commonly used augmented features:

Node degree
Clustering coefficient
PageRank
Centrality
…

Add Virtual Nodes/ Edges

Motivation: Augment sparse graphs

(1) Add virtual edges

Common approach: Connect 2-hop neighbors via virtual edges
Intuition: Instead of using adj. matrix A for GNN computation, use A + A^2,

Use cases: Bipartite graphs

Author-to-papers (they authored)
2-hop virtual edges make an author-author collaboration graph

(2) Add virtual nodes

The virtual node will connect to all the nodes in the graph
- Suppose in a sparse graph, two nodes have shortest path distance of 10
- After adding the virtual node, all the nodes will have a distance of two
  - Node A – Virtual node – Node B
Benefits: Greatly improves message passing in sparse graphs

Node Neighborhood Sampling

Hamilton et al. Inductive Representation Learning on Large Graphs, NeurIPS 2017

Previously:

All the nodes are used for message passing

New idea: (Randomly) sample a node’s neighborhood for message passing

For example, we can randomly choose 2 neighbors to pass messages in a given layer

Only nodes B and D will pass messages to A

In the next layer when we compute the embeddings, we can sample different neighbors

Only nodes C and D will pass messages to A

In expectation, we get embeddings similar to the case where all the neighbors are used

Benefits: Greatly reduces computational cost
- Allows for scaling to large graphs (more about this later)
And in practice it works great!

Prediction with GNNS

GNN Training Pipeline

Pipeline

(1) Different prediction heads:

Node-level tasks
Edge-level tasks
Graph-level tasks

GNN Prediction Heads

Idea: Different task levels require different prediction heads

Node-level

Node-level prediction: We can directly make prediction using node embeddings!

Edae-level

Edge-level prediction: Make prediction using pairs of node embeddings

(1) Concatenation + Linear

(2) Dot product

Graph-level

Graph-level prediction: Make prediction using all the node embeddings in our graph

K. Xu, W. Hu, J. Leskovec, S. Jegelka. How Powerful Are Graph Neural Networks, ICLR 2019

Issue of Global Pooling

Issue: Global pooling over a (large) graph will lose information

Hierarchical Global Pooling

A solution: Let’s aggregate all the node embeddings hierarchically(等级的)

Ying et al. Hierarchical Graph Representation Learning with Differentiable Pooling, NeurIPS 2018

DiffPool idea:

Hierarchically pool node embedding

Leverage 2 independent GNNs at each level

GNN A: Compute node embeddings
GNN B: Compute the cluster that a node belongs to

GNNs A and B at each level can be executed in parallel

For each Pooling layer

Use clustering assignments from GNN B to aggregate node embeddings generated by GNN A
Create a single new node for each cluster, maintaining edges between clusters to generated a new pooled network

Jointly train GNN A and GNN B

Training Graph Neural Networks

Pipeline

(2) Where does ground-truth come from?

Supervised labels
Unsupervised signals

Supervised Vs Unsupervised

Supervised learning on graphs
- Labels come from external sources
  - E.g., predict drug likeness of a molecular graph
Unsupervised learning on graphs
- Signals come from graphs themselves
  - E.g., link prediction: predict if two nodes are connected
Sometimes the differences are blurry
- We still have “supervision” in unsupervised learning
  - E.g., train a GNN to predict node clustering coefficient
- An alternative name for “unsupervised” is “self-supervised”

Supervised Labels on Graphs

Supervised labels come from the specific use cases. For example:

Advice: Reduce your task to node / edge / graph labels, since they are easy to work with

E.g., we knew some nodes form a cluster. We can treat the cluster that a node belongs to as a node label

Unsupervised Signals on Graphs

The problem: sometimes we only have a graph, without any external labels
The solution: “self-supervised learning”, we can find supervision signals within the graph.
- For example, we can let GNN predict the following:

These tasks do not require any external labels!

Pipeline

(3) How do we compute the final loss?

Classification loss
Regression loss

Settings for GNN Training

Classification or Rearession

Classification LOSS

As discussed in lecture 6, cross entropy (CE) is a very common loss function in classification

Regression Loss

For regression tasks we often use Mean Squared Error (MSE) a.k.a. L2 loss

Pipeline

(4) How do we measure the success of a GNN?

Accuracy
ROC AUC

Evaluation Metrics: Rearession

We use standard evaluation metrics for GNN
- (Content below can be found in any ML course)
- In practice we will use sklearn for implementation
- Suppose we make predictions for N data points
Evaluate regression tasks on graphs:

Evaluation Metrics: Classification

Evaluate classification tasks on graphs:

Metrics for Binary Classification

ROC Curve: Captures the tradeoff in TPR and FPR as the classification threshold is varied for a binary classifier.

ROC AUC: Area under the ROC Curve.

Intuition: The probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one

Setting-up GNN Prediction Tasks

Pipeline

(5) How do we split our dataset into train / validation / test set?

Dataset Split: Fixed/ Random Split

Fixed split: We will split our dataset once
- Training set: used for optimizing GNN parameters
- Validation set: develop model/hyperparameters
- Test set: held out until we report final performance
A concern: sometimes we cannot guarantee that the test set will really be held out
Random split: we will randomly split our dataset into training / validation / test
- We report average performance over different random seeds

Why Splitting Graphs is Special

Suppose we want to split an image dataset

Image classification: Each data point is an image
Here data points are independent
- Image 5 will not affect our prediction on image 1

Splitting a graph dataset is different!

Node classification: Each data point is a node
Here data points are NOT independent
- Node 5 will affect our prediction on node 1, because it will participate in message passing -> affect node 1’s embedding

What are our options?

Solution 1 (Transductive(传导的) setting): The input graph can be observed in all the dataset splits (training, validation and test set).

We will only split the (node) labels

At training time, we compute embeddings using the entire graph, and train using node 1&2’s labels
At validation time, we compute embeddings using the entire graph, and evaluate on node 3&4’s labels

Solution 2 (Inductive(归纳法的) setting): We break the edges between splits to get multiple graphs

Now we have 3 graphs that are independent. Node 5 will not affect our prediction on node 1 any more
At training time, we compute embeddings using the graph over node 1&2, and train using node 1&2’s labels
At validation time, we compute embeddings using the graph over node 3&4, and evaluate on node 3&4’s labels

Transductive setting: training / validation / test sets are on the same graph

The dataset consists of one graph
The entire graph can be observed in all dataset splits, we only split the labels
Only applicable to node / edge prediction tasks

Inductive setting: training / validation / test sets are on different graphs

The dataset consists of multiple graphs
Each split can only observe the graph(s) within the split. A successful model should generalize to unseen graphs
Applicable to node / edge / graph tasks

Graph Classification

Only the inductive setting is well defined for graph classification

Because we have to test on unseen graphs
Suppose we have a dataset of 5 graphs. Each split will contain independent graph(s)

Link Prediction

Goal of link prediction: predict missing edges

Setting up link prediction is tricky:

Link prediction is an unsupervised / self-supervised task. We need to create the labels and dataset splits on our own
Concretely, we need to hide some edges from the GNN and the let the GNN predict if the edges exist

For link prediction, we will split edges twice

Step 1: Assign 2 types of edges in the original graph

Message edges: Used for GNN message passing
Supervision edges: Use for computing objectives
After step 1:
- Only message edges will remain in the graph
- Supervision edges are used as supervision for edge predictions made by the model, will not be fed into GNN!

Step 2: Split edges into train / validation / test

Option 1: Inductive link prediction split
- Suppose we have a dataset of 3 graphs. Each inductive split will contain an independent grap
- Suppose we have a dataset of 3 graphs. Each inductive split will contain an independent graph
- In train or val or test set, each graph will have 2 types of edges: message edges + supervision edges § Supervision edges are not the input to GNN

Option 2: Transductive link prediction split:
- This is the default setting when people talk about link prediction
- Suppose we have a dataset of 1 graph
- By definition of “transductive”, the entire graph can be observed in all dataset splits
  - But since edges are both part of graph structure and the supervision, we need to hold out validation / test edges
  - To train the training set, we further need to hold out supervision edges for the training set
- Next: we will show the exact settings

Summary: Transductive link prediction split

Note: Link prediction settings are tricky and complex. You may find papers do link prediction differently.
Luckily, we have full support in PyG and GraphGym

GNN Training Pipeline

Implementation resources:

DeepSNAP provides core modules for this pipeline
GraphGym further implements the full pipeline to facilitate GNN design

Summary

GNN Layer:
- Transformation + Aggregation
- Classic GNN layers: GCN, GraphSAGE, GAT
Layer connectivity:
- The over-smoothing problem
- Solution: skip connections
Graph Augmentation:
- Feature augmentation
- Structure augmentation
Learning Objectives
- The full training pipeline of a GNN

When Things Dont Go As Planned

General Tips

Data preprocessing is important:
- Node attributes can vary a lot!
  - E.g. probability ranges (0,1), but some inputs could have much larger range, say (−1000, 1000)
- Use normalization
Optimizer:
- ADAM is relatively robust to learning rate
Activation function
- ReLU activation function often works well
- Other alternatives: LeakyReLU, SWISH, rational activation
- No activation function at your output layer:
Include bias term in every layer
Embedding dimensions:
- 32, 64 and 128 are often good starting points

Debugging Deep Networks

Debug issues: Loss/accuracy not converging during training
- Check pipeline (e.g. in PyTorch we need zero_grad)
- Adjust hyperparameters such as learning rate
- Pay attention to weight parameter initialization
Important for model development:
- Overfit on (part of) training data:
  - With a small training dataset, loss should be essentially close to 0, with an expressive neural network
  - If neural network cannot overfit a single data point, something is wrong
- Scrutinize loss function!
- Scrutinize visualizations!

Setting-up GNN Prediction Tasks

How Expressive are Graph Neural Networks?

Theory of GNNS

How powerful are GNNs?

Many GNN models have been proposed (e.g., GCN, GAT, GraphSAGE, design space).
What is the expressive power (ability to distinguish different graph structures) of these GNN models?
How to design a maximally expressive GNN model?

Background: Many GNN Models

Many GNN models have been proposed:

GCN, GraphSAGE, GAT, Design Space etc.

GNN Model Example

GCN (mean-pool) [Kipf and Welling ICLR 2017]

GraphSAGE (max-pool) [Hamilton et al. NeurIPS 2017]

Note: Node Colors

We use node same/different colors to represent nodes with same/different features.
- For example, the graph below assumes all the nodes share the same feature.

Key question: How well can a GNN distinguish different graph structures?

Local Neighborhood Structures

We specifically consider local neighborhood structures around each node in a graph.

Example: Nodes 1 and 5 have different neighborhood structures because they have different node degrees.

We specifically consider local neighborhood structures around each node in a graph. 1 2 3 5 4

Example: Nodes 1 and 4 both have the same node degree of 2. However, they still have different neighborhood structures because their neighbors have different node degrees

Node 1 has neighbors of degrees 2 and 3. Node 4 has neighbors of degrees 1 and 3.

We specifically consider local neighborhood structures around each node in a graph. 1 2 3 5 4

Example: Nodes 1 and 2 have the same neighborhood structure because they are symmetric within the graph.

Node 1 has neighbors of degrees 2 and 3. Node 2 has neighbors of degrees 2 and 3. And even if we go a step deeper to 2nd hop neighbors, both nodes have the same degrees (Node 4 of degree 2)

Key question: Can GNN node embeddings distinguish different node’s local neighborhood structures?
- If so, when? If not, when will a GNN fail?
Next: We need to understand how a GNN captures local neighborhood structures.
- Key concept: Computational graph

Computational Graph

In each layer, a GNN aggregates neighboring node embeddings.

A GNN generates node embeddings through a computational graph defined by the neighborhood.

Ex: Node 1’s computational graph (2-layer GNN)

Ex: Nodes 1 and 2’s computational graphs.

Ex: Nodes 1 and 2’s computational graphs.
But GNN only sees node features (not IDs):

A GNN will generate the same embedding for nodes 1 and 2 because:

Computational graphs are the same.
Node features (colors) are identical.

Note: GNN does not care about node ids, it just aggregates features vectors of different nodes.

In general, different local neighborhoods define different computational graphs

Computational graphs are identical to rooted subtree structures around each node.

GNN‘s node embeddings capture rooted subtree structures.

Most expressive GNN maps different rooted subtrees into different node embeddings (represented by different colors).

Recall: Iniective Function

How Expressive is a GNN?

Most expressive GNN should map subtrees to the node embeddings injectively.

Key observation: Subtrees of the same depth can be recursively characterized from the leaf nodes to the root nodes.

If each step of GNN’s aggregation can fully retain the neighboring information, the generated node embeddings can distinguish different rooted subtrees.

In other words, most expressive GNN would use an injective neighbor aggregation function at each step.

Maps different neighbors to different embeddings.

Summary so far

To generate a node embedding, GNNs use a computational graph corresponding to a subtree rooted around each node.

GNN can fully distinguish different subtree structures if every step of its neighbor aggregation is injective.

Designing the Most Powerful Graph Neural Network

Expressive Power of GNNS

Key observation: Expressive power of GNNs can be characterized by that of neighbor aggregation functions they use.
- A more expressive aggregation function leads to a more expressive a GNN.
- Injective aggregation function leads to the most expressive GNN.
Next:
- Theoretically analyze expressive power of aggregation functions.

Neighbor Aggregation

Observation: Neighbor aggregation can be abstracted as a function over a multi-set (a set with repeating elements).

Next: We analyze aggregation functions of two popular GNN models

GCN (mean-pool) [Kipf & Welling ICLR 2017]
- Take element-wise mean, followed by linear function and ReLU activation, i.e., max(0, x).
- Theorem [Xu et al. ICLR 2019]
  - GCN’s aggregation function cannot distinguish different multi-sets with the same color proportion.

For simplicity, we assume node colors are represented by one-hot encoding.

Example) If there are two distinct colors:

This assumption is sufficient to illustrate how GCN fails.

Failure case illustration

GraphSAGE (max-pool) [Hamilton et al. NeurIPS 2017]
- Apply an MLP, then take element-wise max.
- Theorem [Xu et al. ICLR 2019]
  - GraphSAGE’s aggregation function cannot distinguish different multi-sets with the same set of distinct colors.

Failure case illustration

Summary So Far

We analyzed the expressive power of GNNs.

Main takeaways:

Expressive power of GNNs can be characterized by that of the neighbor aggregation function.
Neighbor aggregation is a function over multi-sets (sets with repeating elements)
GCN and GraphSAGE’s aggregation functions fail to distinguish some basic multi-sets; hence not injective.
Therefore, GCN and GraphSAGE are not maximally powerful GNNs.

Designing Most Expressive GNNS

Our goal: Design maximally powerful GNNs in the class of message-passing GNNs.
This can be achieved by designing injective neighbor aggregation function over multi-sets.
Here, we design a neural network that can model injective multiset function.

Injective Multi-set Function

Universal Approximation Theorem

Injective Multi-set Function

We have arrived at a neural network that can model any injective multi-set function.

In practice, MLP hidden dimensionality of 100 to 500 is sufficient.

Most Expressive GNN

Graph Isomorphism Network (GIN) [Xu et al. ICLR 2019]

Apply an MLP, element-wise sum, followed by another MLP

Theorem [Xu et al. ICLR 2019]
- GIN‘s neighbor aggregation function is injective.
No failure cases!
GIN is THE most expressive GNN in the class of message-passing GNNs!

Full Model of GIN

So far: We have described the neighbor aggregation part of GIN.

We now describe the full model of GIN by relating it to WL graph kernel (traditional way of obtaining graph-level features).

We will see how GIN is a “neural network” version of the WL graph kernel.

Relation to WL Graph Kernel

Recall: Color refinement algorithm in WL kernel.

Example of color refinement given two graphs

Process continues until a stable coloring is reached
Two graphs are considered isomorphic if they have the same set of colors.

The Complete GIN Mode

GIN uses a neural network to model the injective HASH function.

Specifically, we will model the injective function over the tuple:

Theorem (Xu et al. ICLR 2019)

Any injective function over the tuple

GIN and WL Graph Kernel

GIN can be understood as differentiable neural version of the WL graph Kernel:

Advantages of GIN over the WL graph kernel are:

Node embeddings are low-dimensional; hence, they can capture the fine-grained similarity of different nodes.
Parameters of the update function can be learned for the downstream tasks.

Expressive Power of GIN

Because of the relation between GIN and the WL graph kernel, their expressive is exactly the same.
- If two graphs can be distinguished by GIN, they can be also distinguished by the WL kernel, and vice versa.
How powerful is this?
- WL kernel has been both theoretically and empirically shown to distinguish most of the real-world graphs [Cai et al. 1992].
- Hence, GIN is also powerful enough to distinguish most of the real graphs!

Summary of the Lecture

We design a neural network that can model injective multi-set function.
We use the neural network for neighbor aggregation function and arrive at GIN—the most expressive GNN model.
The key is to use element-wise sum pooling, instead of mean-/max-pooling.
GIN is closely related to the WL graph kernel.
Both GIN and WL graph kernel can distinguish most of the real graphs!

The Power of Pooling

Can expressive power of GNNs be improved?

There are basic graph structures that existing GNN framework cannot distinguish, such as difference in cycles.

GNNs’ expressive power can be improved to resolve the above problem. [You et al. AAAI 2021, Li et al. NeurIPS 2020]

Summary

GNNs and connection to bijective functions on sets.

Most powerful GNN is equivalent to WL graph isomorphism test.

GIN is the most powerful GNN.

Sum aggregator is more powerful than mean is more powerful than max.

Heterogeneous Graphs and Knowledge Graph Embeddings

Heterogeneous(异构) Graphs and Relational GCN (RGCN)

Heterogeneous(异构) Graphs

A heterogeneous graph is defined as

Relational GCN

Schlichtkrull et al., Modeling Relational Data with Graph Convolutional Networks, ESWC 2018

We will extend GCN to handle heterogeneous graphs with multiple edge/relation types

We start with a directed graph with one relation

How do we run GCN and update the representation of the target node A on this graph?

What if the graph has multiple relation types?

What if the graph has multiple relation types?

Use different neural network weights for different relation types

Definition

RGCN: Scalability

Block Diagonal Matrices

Key insight: make the weights sparse

Key insight: Share weights across different relations!

Example: Entity/Node Classification

Goal: Predict the label of a given node

Example Link Prediction

Summary of RGCN

Relational GCN, a graph neural network for heterogeneous graphs
Can perform entity classification as well as link prediction tasks.
Ideas can easily be extended into RGNN (RGraphSAGE, RGAT, etc.)

Knowledge Graphs KG Completion with Embeddings

Knowledge Graphs( KG)

Knowledge in graph form:

Capture entities, types, and relationships

Nodes are entities
Nodes are labeled with their types
Edges between two nodes capture relationships between entities
KG is an example of a heterogeneous graph

Example: Bibliographic Networks

Node types: paper, title, author, conference, year
Relation types: pubWhere, pubYear, hasTitle, hasAuthor, cite

Example: Bio Knowledge Graphs

Node types: drug, disease, adverse event, protein, pathways
Relation types: has_func, causes, assoc, treats, is_a

Knowledge Graphs in Practice

Examples of knowledge graphs
Google Knowledge Graph
Amazon Product Graph
Facebook Graph API
IBM Watson
Microsoft Satori
Project Hanover/Literome
LinkedIn Knowledge Graph
Yandex Object Answer

Applications of Knowledge Graphs

Serving information
Question answering and conversation agents

Knowledge Graph Datasets

Publicly available KGs:
- FreeBase, Wikidata, Dbpedia, YAGO, NELL, etc.
Common characteristics:
- Massive: millions of nodes and edges
- Incomplete: many true edges are missing

Example: Freebase

Knowledge Graph Completion TransE, TransR, DistMul, ComplEx

KG Completion Task

Given an enormous KG, can we complete the KG?
For a given (head, relation), we predict missing tails.
- (Note this is slightly different from link prediction task)

KG Representation

TransE

Bordes, et al., Translating embeddings for modeling multi-relational data, NeurIPS 2013

Connectivity Patterns in KG

Relations in a heterogeneous KG have different properties

Can we categorize these relation patterns?

Are KG embedding methods (e.g., TransE) expressive enough to model these patterns?

Relation Patterns

Antisymmetric Relations in Transe

Inverse Relations in TransE

Composition in Transe

Limitation Symmetric Relations

Limitation: 1-to-n Relations

Knowledge Graph Completion : TransR

TransR

Lin, et al., Learning entity and relation embeddings for knowledge graph completion,AAAI 2015

TransE models translation of any relation in the same embedding space.

Can we design a new space for each relation and do translation in relation-specific space?

Symmetric Relations in TransR

1-to-n Relations in TransR

Inverse Relations in TransR

Composition Relations in TransR

High-level intuition: TransR models a triple with linear functions, they are chainable

Knowledge Graph Completion DistMult

New Idea: Bilinear Modeling

Yang et al, Embedding Entities and Relations for Learning and Inference in Knowledge Bases, ICLR 2015

DistMult

1-to-n Relations in DistMult

Symmetric Relations in DistMult

Limitation: Antisymmetric Relations

Limitation: Inverse Relations

Limitation: Composition Relations

Knowledge Graph Completion ComplEx

ComplEx

Trouillon et al, Complex Embeddings for Simple Link Prediction, ICML 2016

Based on Distmult, ComplEx embeds entities and relations in Complex vector space

Antisymmetric Relations in Complex

Symmetric Relations in Complex

Inverse Relations in ComplEx

Composition and 1-to-n

Expressiveness of All Models

Properties and expressive power of different KG completion methods:

KG Embeddings in Practice

Sun et al, RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space, ICLR 2019

Different KGs may have drastically different relation patterns!
There is not a general embedding that works for all KGs, use the table to select models
Try TransE for a quick run if the target KG does not have much symmetric relations
Then use more expressive models, e.g., ComplEx, RotatE (TransE in Complex space)

Summary

Link prediction / Graph completion is one of the prominent tasks on knowledge graphs
Introduce TransE / TransR / DistMult / ComplEx models with different embedding space and expressiveness

Reasoning in Knowledge Graphs using Embeddings

Example KG: Biomedicine

Predictive Queries on KG

Can we do multi-hop reasoning, i.e., answer complex queries on an incomplete, massive KG?

Predictive One-hop Queries

We can formulate knowledge graph completion problems as answering one-hop queries.

Path Queries

Generalize one-hop queries to path queries by adding more relations on the path.

Query plan of path queries is a chain.

Question: “What proteins are associated with adverse events caused by Fulvestrant?”

Query: (e:Fulvestrant, (r:Causes, r:Assoc)) Given a KG, how to answer a path query?

Traversing Knowledge Graphs

We answer path queries by traversing the KG: “What proteins are associated with adverse events caused by Fulvestrant?”
Query: (e:Fulvestrant, (r:Causes, r:Assoc))

However, KGS are incomplete

Answering queries seems easy: Just traverse the graph.
But KGs are incomplete and unknown:
- Many relations between entities are missing or are incomplete
  - For example, we lack all the biomedical knowledge
  - Enumerating all the facts takes non-trivial time and cost, we cannot hope that KGs will ever be fully complete
Due to KG incompleteness, one is not able to identify all the answer entities

Can we first do KG completion and then traverse the completed (probabilistic) KG?

Task: Predictive Oueries

We need a way to answer path-based queries over an incomplete knowledge graph.

We want our approach to implicitly impute and account for the incomplete KG.

Task: Predictive queries

Want to be able to answer arbitrary queries while implicitly imputing for the missing information
Generalization of the link prediction task

Answering Predictive Queries on Knowledge Graphs

General Idea

Map queries into embedding space. Learn to reason in that space

Embed query into a single point in the Euclidean space: answer nodes are close to the query.
Query2Box: Embed query into a hyper-rectangle (box) in the Euclidean space: answer nodes are enclosed in the box.

[Embedding Logical Queries on Knowledge Graphs. Hamilton, et al., NeurIPS 2018]

[Query2box: Reasoning over Knowledge Graphs in Vector Space Using Box Embeddings. Ren, et al., ICLR 2020]

Idea: Traversing KG in Vector Space

Guu, et al., Traversing knowledge graphs in vector space, EMNLP 2015

Key idea: Embed queries!

Traversing KG in Vector Space

The embedding process only involves vector addition, independent of # entities in the KG!

Embed path queries in vector space.

Question: “What proteins are associated with adverse events caused by Fulvestrant?”
Query: (e:Fulvestrant, (r:Causes , r:Assoc))

Insights:

We can train TransE to optimize knowledge graph completion objective (Lecture 10)
Since TransE can naturally handle compositional relations, it can handle path queries by translating in the latent space for multiple hops using addition of relation embeddings.
For TransR / DistMult / ComplEx, since they cannot handle compositional relations, they cannot be easily extended to handle path queries.

Conjunctive Queries

Can we answer more complex queries with logic conjunction operation?

Conjunctive Queries: “What are drugs that cause Short of Breath and treat diseases associated with protein ESR2?” ((e:ESR2, (r:Assoc, r:TreatedBy)), (e:Short of Breath, (r:CausedBy))

How do we answer the question by KG traversal?

Traverse KG from anchor nodes: ESR2 and Short of Breath:

How can we use embeddings to implicitly impute the missing (ESR2, Assoc, Breast Cancer)?
Intuition: ESR2 interacts with both BRCA1 and ESR1. Both proteins are associated with breast cancer

Traversing KG in Vector Space

“What are drugs that cause Short of Breath and treat diseases associated with protein ESR2?”

((e:ESR2, (r:Assoc, r:TreatedBy)), (e:Short of Breath, (r:CausedBy))

Each intermediate node represents a set of entities, how do we represent it? How do we define the intersection operation in the latent space?

Qvery2box: Reasoning over KGS Using Box Embeddings

Conjunctive Queries

How can we answer more complex queries with logical conjunction operation?

(1) Each intermediate node represents a set of entities; how do we represent it?

(2) How do we define the intersection operation in the latent space?

Box Embeddings

Ren et al., Query2box: Reasoning over Knowledge Graphs in Vector Space Using Box Embeddings, ICLR 2020.

Embed queries with hyper-rectangles (boxes)

Key Insight: Intersection(交点)

Intersection of boxes is well-defined!
When we traverse the KG to find the answers, each step produces a set of reachable entities.
How can we better model these sets?
- Boxes are a powerful abstraction, as we can project the center and control the offset to model the set of entities enclosed in the box

Embed with Box Embedding

Projection(预测) Operator

Embed with Box Embedding

How do we take intersection of boxes?

Intersection Operator

Use box intersection operator

Entity-to-box Distance

Extending to Union Operation

Can we embed complex queries with union? E.g.: “What drug can treat breast cancer or lung cancer?”
Conjunctive queries + disjunction is called Existential Positive First-order (EPFO) queries. We’ll refer to them as AND-OR queries.
Can we also design a disjunction operator and embed AND-OR queries in low-dimensional vector space?

Embedding AND-OR Queries

Can we embed AND-OR queries in a low-dimensional vector space?
No! Intuition: Allowing union over arbitrary queries requires high-dimensional embeddings!
Example:

Example 2:

Can we embed AND-OR queries in low-dimensional vector space?

Since we cannot embed AND-OR queries in lowdimensional space, can we still handle them?

Key idea: take all unions out and only do union at the last step!

Disjunctive Normal Form

Any AND-OR query can be transformed into equivalent DNF, i.e., disjunction of conjunctive queries.

Distance Between q and an Entity

How to Train Query2box

Training Overview

How to achieve a query, its answers, its negative answers from the KG to train the parameters?

How to split the KG for query answering?

Training

Query Generation from Templates

Generate queries from multiple query templates:

How can we generate a complex query?
We start with a query template
Query template can be viewed as an abstraction of the query
We generate a query by instantiating every variable with a concrete entity and relation from the KG
- E.g., instantiate Anchor1 with ESR2 (a node on KG)
- E.g., instantiate Rel1 with Assoc (an edge on KG)
How to instantiate query template given a KG?

How to instantiate a query template given a KG?

Example of Query2box

Visualization

What do box embeddings actually learn? Example: “List male instrumentalists who play string instruments”
We use t-SNE to reduce the embedding space to a 2-dimensional space, in order to visualize the query results

Embedding Space

Summary

We introduce answering predictive queries on large knowledge graphs.

The key idea is to embed queries by navigating the embedding space!

We embed the query by composing learned operators

Embedding of the query is close to its answers in the embedding space

Fast Neural Subgraph Matching and Counting

Subgraphs

Subgraphs are the building blocks of networks:

They have the power to characterize and discriminate networks

Building Blocks of Networks

Subgraphs and Motifs

Definition

Two ways to formalize “network building blocks”

Alternate terminology: “induced subgraph”

Alternate terminology: “non-induced subgraph” or just “subgraph”

The best definition depends on the domain! Examples:

Chemistry: Node-induced (functional groups)
Knowledge graphs: Often edge-induced (focus is on edges representing logical relations)

Graph Isomorphism(同构关系)

Graph isomorphism problem: Check whether two graphs are identical:

Case Example of Subgraphs

All non-isomorphic, connected, undirected graphs of size 4

All non-isomorphic, connected, directed graphs of size 3

Network Motifs

Network motifs: “recurring, significant patterns of interconnections”

How to define a network motif:

Pattern: Small (node-induced) subgraph
Recurring: Found many times, i.e., with high frequency
Significant: More frequent than expected, i.e., in randomly generated graphs?
How to define frequency? How to define random graphs?

Motifs: Induced Subaraphs

Why Do We Need Motifs?

Subgraph Frequency

What if the dataset contains multiple graphs, and we want to compute frequency of subgraphs in the dataset?

Solution: Treat the dataset as a giant graph G_T with disconnected components corresponding to individual graphs

Defining Motif Significance

To define significance, we need to have a null-model (i.e., point of comparison).

Key idea: Subgraphs that occur in a real network much more often than in a random network have functional significance.

Defining Random Graphs

New Model: Configuration Mode

Alternative for Spokes: Switching

Intuition: Motifs are overrepresented in a network when compared to random graphs:

Step 1: Count motifs in the given graph (G_real)
Step 2: Generate random graphs with similar statistics (e.g. number of nodes, edges, degree sequence), and count motifs in the random graphs
Step 3: Use statistical measures to evaluate how significant is each motif
- Use Z-score

Z-score for Statistical Sianificance

For each subgraph:
- z-score metric is capable of classifying the subgraph “significance”:
  - Negative values indicate under-representation
  - Positive values indicate over-representation
We create a network significance profile:
- A feature vector with values for all subgraph types
Next: Compare profiles of different graphs with random graphs:
- Regulatory network (gene regulation)
- Neuronal network (synaptic connections)
- World Wide Web (hyperlinks between pages)
- Social network (friendships)
- Language networks (word adjacency)

Summary: Detecting Motifs(主题)

Variations on the Motif Concept

Extensions:
- Directed and undirected
- Colored and uncolored
- Temporal and static motifs
Variations on the concept:
- Different frequency concepts
- Different significance metrics
- Under-Representation (anti-motifs)
- Different null models
Subgraphs and motifs are the building blocks of graphs
- Subgraph isomorphism and counting are NP-hard
Understanding which motifs are frequent or significant in a dataset gives insight into the unique characteristics of that domain
Use random graphs as null model to evaluate the significance of motif via Z-score

Neural Subgraph Matching

Given:

Large target graph (can be disconnected)
Query graph (connected)

Decide:

Is a query graph a subgraph in the target graph?

Isomorphism as an ML Task

Large target graph (can be disconnected)
Query graph (has to be connected)
Use GNN to predict subgraph isomorphism:
Intuition: Exploit the geometric shape of embedding space to capture the properties of subgraph isomorphism

Task Setup

Consider a binary prediction: Return True if query is isomorphic to a subgraph of the target graph, else return False

Finding node correspondences between Q and T is another challenging problem, which will not be covered in this lecture.

Overview of the Approach

Neural Architecture for Subgraphs

(1) We are going to work with node-anchored definitions:

(2) We are going to work with node-anchored neighborhoods:

Why Anchor?

Order Embedding Space

Why Order Embedding Space?

Subgraph isomorphism relationship can be nicely encoded in order embedding space

All properties have their counter-parts in the order embedding space

Order Constraint

We use a GNN to learn to embed neighborhoods and preserve the order embedding structure
What loss function should we use, so that the learned order embedding reflects the subgraph relationship?
We design loss functions based on the order constraint:
- Order constraint specifies the ideal order embedding property that reflects subgraph relationships

We specify the order constraint to ensure that the subgraph properties are preserved in the order embedding space

Loss Function

GNN Embeddings are learned by minimizing a max-margin loss

Training Neural Subgraph Matching

Training Details

How many training examples to sample?
- At every iteration, we sample new training pairs
- Benefit: Every iteration, the model sees different subgraph examples
- Improves performance and avoids overfitting – since there are exponential number of possible subgraphs to sample from
How deep is the BFS sampling?
- A hyper-parameter that trades off runtime and performance
- Usually use 3-5, depending on size of the dataset

Subgraph Predictions on New Graphs

Summary: Neural Subgraph Matching

Neural subgraph matching uses a machine learning-based approach to learn the NP-hard problem of subgraph isomorphism
- Given query and target graph, it embeds both graphs into an order embedding space
- Using these embeddings, it then computes E(G_q, G_t) to determine whether query is a subgraph of the target
Embedding graphs within an order embedding space allows subgraph isomorphism to be efficiently represented and tested by the relative positions of graph embeddings

Finding Frequent Subgraphs

Intro

Generally, finding the most frequent size-k motifs requires solving two challenges:

1) Enumerating all size-k connected subgraphs
2) Counting #(occurrences of each subgraph type)

Why is it Hard?

Just knowing if a certain subgraph exists in a graph, is a hard computational problem!
- Subgraph isomorphism is NP-complete
Computation time grows exponentially as the size of the subgraphs increases
- Feasible motif size for traditional methods is relatively small (3 to 7)

Solution with Representation Learning

Finding frequent subgraph patterns is computationally hard
- Combinatorial explosion of number of possible patterns
- Counting subgraph frequency is NP-hard
Representation learning can tackle these challenges:
- Combinatorial explosion -> organize the search space
- Subgraph isomorphism -> prediction using GNN

Representation learning can tackle these challenges:

Counting #(occurrences of each subgraph type)
- Solution: Use GNN to “predict” the frequency of the subgraph.
Enumerating all size-k connected subgraphs
- Solution: Don’t enumerate subgraphs but construct a size-k subgraph incrementally
  - Note: We are only interested in high frequency subgraphs

Problem Setup: Frequent Motif Mining

SpMiner: Overview

SPMiner: A neural model to identify frequent motifs

Key Idea

Motif Frequency Estimation

Spminer Search Procedure

Termination: Upon reaching a desired motif size, take the subgraph of the target graph induced by S.

Results: Small Motifs

Ground-truth: Find most frequent 10 motifs in dataset by brute-force exact enumeration (expensive)
Question: Can the model identify frequent motifs?
Result: The model identifies 9 and 8 of the top 10 motifs, respectively.

Experiments: Large motifs

Question: How do the frequencies of the identified motif compare?
Result: SPMiner identifies motifs that appear 10-100x more frequently than the baselines

Summary

Subgraphs and motifs are important concepts that provide insights into the structure of graphs. Their frequency can be used as features for nodes/graphs.
We covered neural approaches to prediction subgraph isomorphism relationship.
Order embeddings have desirable properties and can be used to encode subgraph relations
Neural embedding-guided search in order embedding space can enable ML model to identify motifs much more frequent than existing methods

GNNS for Recommender Systems

Recommender Systems Task and Evaluation

Preliminary of Recommendation

Information Explosion in the era of Internet

10K+ movies in Netflix
12M products in Amazon
70M+ music tracks in Spotify
10B+ videos on YouTube
200B+ pins (images) in Pinterest

Personalized recommendation (i.e., suggesting a small number of interesting items for each user) is critical for users to effectively explore the content of their interest.

Recommender System as a Graph

Recommender system can be naturally modeled as a bipartite graph

A graph with two node types: users and items.
Edges connect users and items
- Indicates user-item interaction (e.g., click, purchase, review etc.)
- Often associated with timestamp (timing of the interaction).

Recommendation Task

Given
- Past user-item interactions
Task
- Predict new items each user will interact in the future.
- Can be cast as link prediction problem.
  - Predict new user-item interaction edges given the past edges.

Top-k Recommendation

Evaluation Metric: Recall@k(1)

Recommender Systems Embedding-based Models

Notation

Score Function

Training Objective

Embedding-based models have three kinds of parameters:

Training objective: Optimize the model parameters to achieve high recall@K on seen (i.e., training) user-item interactions

We hope this objective would lead to high recall@K on unseen (i.e., test) interactions.

Surrogate(代理) Loss Functions

The vanilla training objective (training recall@K) is not differentiable.
- Cannot apply efficient gradient-based optimization.
Two surrogate loss functions are widely-used to enable efficient gradient-based optimization.
- Binary loss
- Bayesian Personalized Ranking (BPR) loss
Surrogate losses are differentiable and should align well with the original training objective.

Binary Loss

Binary loss pushes the scores of positive edges higher than those of negative edges.

This aligns with the training recall metric since positive edges need to be recalled.

Issue:

In the binary loss, the scores of ALL positive edges are pushed higher than those of ALL negative edges.
This would unnecessarily penalize model predictions even if the training recall metric is perfect.
Why? (example in the next slide)

Key insight: The binary loss is non-personalized in the sense that the positive/negative edges are considered across ALL users at once.
However, the recall metric is inherently personalized (defined for each user).
- The non-personalized binary loss is overly-stringent for the personalized recall metric.

Desirable Surrogate Loss

Lesson learned: Surrogate loss function should be defined in a personalized manner.

For each user, we want the scores of positive items to be higher than those of the negative items
We do not care about the score ordering across users.

Bayesian Personalized Ranking (BPR) loss achieves this!

Bayesian Personalized Ranking (BPR) loss is a personalized surrogate loss that aligns better with the recall@K metric.

Summary

We have introduced
- Recall@K as a metric for personalized recommendation
- Embedding-based models
  - Three kinds of parameters to learn
    - user encoder to generate user embeddings
    - item encoder to generate item embeddings
    - score function to predict the user-item interaction likelihood.
- Surrogate loss functions to achieve the high recall metric.
Embedding-based models have achieved SoTA in recommender systems.
- Why do they work so well?

Why Embedding Models Work?

Underlying idea:
Collaborative filtering
- Recommend items for a user by collecting preferences of many other similar users.
- Similar users tend to prefer similar items.
Key question: How to capture similarity between users/items?

Embedding-based models can capture similarity of users/items!

Low-dimensional embeddings cannot simply memorize all user-item interaction data.
Embeddings are forced to capture similarity between users/items to fit the data.
This allows the models to make effective prediction on unseen user-item interactions.

This Lecture: GNNS for Recsys

Neural Graph Collaborative Filtering (NGCF) [Wang et al. 2019], LightGCN [He et al. 2020]

Improve the conventional collaborative filtering models (i.e., matrix factorization) by explicitly modeling graph structure using GNNs.

PinSAGE [Ying et al. 2018]

Use GNNs to generate high-quality embeddings by simultaneously capturing rich node attributes (e.g., images) and the graph structure.

Neural Graph Collaborative Filtering

Conventional Collaborative Filtering

Limitations of MF

The model itself does not explicitly capture graph structure

The graph structure is only implicitly captured in the training objective.

Only the first-order graph structure (i.e., edges) is captured in the training objective.

High-order graph structure (e.g., k-hop paths between two nodes) is not explicitly captured.

Motivation

We want a model that…
- explicitly captures graph structure (beyond implicitly through the training objective)
- captures high-order graph structure (beyond the first-order edge connectivity structure)
GNNs are a natural approach to achieve both!
- Neural Graph Collaborative Filtering (NGCF) [Wang et al. 2019]
- LightGCN [He et al. 2020]
  - A simplified and improved version of NGCF

NGCF: Overview

Neural Graph Collaborative Filtering (NGCF) explicitly incorporates high-order graph structure when generating user/item embeddings.

Key idea: Use a GNN to generate graph-aware user/item embeddings.

Framework

Given: User-item bipartite graph.
NGCF framework:
- Prepare shallow learnable embedding for each node.
- Use multi-layer GNNs to propagate embeddings along the bipartite graph.
  - High-order graph structure is captured.
- Final embeddings are explicitly graph-aware!
Two kinds of learnable params are jointly learned:
- Shallow user/item embeddings
- GNN’s parameters

Initial Node Embeddings

Neighbor Aggregation

Final Embeddings and Score Function

NGCF: Summary

Conventional collaborative filtering uses shallow user/item embeddings.
- The embeddings do not explicitly model graph structure.
- The training objective does not model high-order graph structure.
NGCF uses a GNN to propagate the shallow embeddings.
- The embeddings are explicitly aware of high-order graph structure.

Lightgcn

Motivation

Recall: NGCF jointly learns two kinds of parameters:
- Shallow user/item embeddings
- GNN’s parameters
Observation: Shallow learnable embeddings are already quite expressive.

Can we simplify the GNN used in NGCF (e.g., remove its learnable parameters)?
- Answer: Yes!
- Bonus: Simplification improves the recommendation performance!
Overview of the idea:
- Adjacency matrix for a bipartite graph
- Matrix formulation of GCN
- Simplification of GCN by removing non-linearity
  - Related: SGC for scalable GNN [Wu et al. 2019]

Adjacency and Embedding Matrices

Adjacency matrix of a (undirected) bipartite graph.

Shallow embedding matrix

Matrix Formulation of GCN

Simplifying GCN

Multi-scale Diffusion

LightGCN: Model Overview

Given:

Adjacency matrix A
Initial learnable embedding matrix E

Average the embedding matrices at different scales

Score function:

LightGCN: Intuition

Question: Why does the simple diffusion propagation work well?

Answer: The diffusion directly encourages the embeddings of similar users/items to be similar.

Similar users share many common neighbors (items) and are expected to have similar future preferences (interact with similar items)

Lightgcn and GCN/C&S

LIGHTGCN and MF: Comparison

Both LightGCN and Matrix Factorization (MF) learn a unique embedding for each user/item.
The difference is that
- MF directly uses the shallow user/item embeddings for scoring.
- LightGCN uses the diffused user/item embeddings for scoring.
LightGCN performs better than MF but are also more computationally expensive due to the additional diffusion step.
- The final embedding of a user/item is obtained by aggregating embeddings of its multi-hop neighboring nodes.

Summary

LightGCN simplifies NGCF by removing the learnable parameters of GNNs.
Learnable parameters are all in the shallow input node embeddings.
- Diffusion propagation only involves matrix-vector multiplication.
- The simplification leads to better empirical performance than NGCF

PINSAGE

Motivation

P2P recommendation

PINSAGE: Pin Embedding

Unifies visual, textual, and graph information.
The largest industry deployment of a Graph Convolutional Networks
Huge Adoption across Pinterest
Works for fresh content and is available in a few seconds after pin creation

Graph Convolutional Neural Networks for Web-Scale Recommender Systems

PinSage graph convolutional network:

Goal: Generate embeddings for nodes in a large-scale Pinterest graph containing billions of objects
Key Idea: Borrow information from nearby nodes
- E.g., bed rail Pin might look like a garden fence, but gates and beds are rarely adjacent in the graph
Pin embeddings are essential to various tasks like recommendation of Pins, classification, ranking
Services like “Related Pins”, “Search”, “Shopping”, “Ads”

Harnessing Pins and Boards

PINSAGE: Graph Neural Network

Graph has tens of billions of nodes and edges
Further resolves embeddings across the Pinterest graph

PINSAGE: Methods for Scaling Up

In addition to the GNN model, the PinSAGE paper introduces several methods to scale the GNN to a billion-scale recommender system (e.g., Pinterest).

Shared negative samples across users in a mini-batch
Hard negative samples
Curriculum learning
Mini-batch training of GNNs on a large-graph (to be covered in the future lecture)

PINSAGE Model

Training Data

1+B repin pairs:

From Related Pins surface
Capture semantic relatedness
Goal: Embed such pairs to be “neighbors”

Example positive training pairs (Q,X):

Shared Negative Samples

Curriculum Learning

Key insight: It is effective to make the negative samples gradually harder in the process of training.
At n-th epoch, we add n − 1 hard negative items.
- #(Hard negatives) gradually increases in the process of training.
The model will gradually learn to make finer-grained predictions

Idea: use harder and harder negative samples

Include more and more hard negative samples for each epoch

Hard Negatives

Challenge: Industrial recsys needs to make extremely fine-grained predictions.

#Total items: Up to billions.
#Items to recommend for each user: 10 to 100.

Issue: The shared negative items are randomly sampled from all items

Most of them are “easy negatives”, i.e., a model does not need to be fine-grained to distinguish them from positive items.

We need a way to sample “hard negatives” to force the model to be fine-grained

For each user node, the hard negatives are item nodes that are close (but not connected) to the user node in the graph.
Hard negatives for user u ∈ U are obtained as follows:
- Compute personalized page rank (PPR) for user u.
- Sort items in the descending order of their PPR scores.
- Randomly sample item nodes that are ranked high but not too high, e.g., 2000th —5000th .
  - Item nodes that are close but not too close (connected) to the user node.
The hard negatives for each user are used in addition to the shared negatives.

PINSAGE: Negative Sampling

(q, p) positive pairs are given but various methods to sample negatives to form (q, p, n)
Distance Weighted Sampling (Wu et al., 2017) - Sample negatives so that query-negative distance distribution is approx U[0.5, 1.4]

Fine-grained Object Similarity

Compare against Prod

PINSAGE: Summary

PinSAGE uses GNNs to generate high-quality user/item embeddings that capture both the rich node attributes and graph structure.
The PinSAGE model is effectively trained using sophisticated negative sampling strategies.
PinSAGE is successfully deployed at Pinterest, a billion-scale image content recommendation service.
- Uncovered in this lecture: How to scale up GNNs to large-scale graphs. Will be covered in a later lecture

杰克成

https://jackhcc.github.io/posts/Lesson-CS224W-Machine-Learning-with-Graphs.html

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源杰克成 !

Lesson

Python-学习资料

Python-学习资料汇总【Archived】

2021-11-13 Python

Python

DL专栏18-Normalization

Normalization方法

2021-11-11 Deep Learning

Normalization

CS224W-Machine Learning with Graphs

Introduction

Machine Learning with Graphs

Why is Graph Deep Learning Hard?

Deep Learning in Graphs

Representation Learning

Course Outline

Applications of Graph ML

Different Types of Tasks

Classic Graph ML Tasks

Node-level ML Tasks

Protein Folding

Alphafold: Impact

Edge-level ML Tasks

Recommender Systems

Pinsage: Graph-based Recommender

Drug Side Effects

Biomedical Graph Link Prediction

Subgraph-level ML Tasks

Traffic Prediction

Graph-level ML Tasks

Drug Discovery

Deep Learning for Antibiotic Discovery

Molecule Generation/Optimization

Physics Simulation

Simulation Learning Framework

Choice of Graph Representation

Components of a Network

Choosing a Proper Representation

How do you define a graph

Directed vs Undirected Graphs

Heterogeneous Graphs

Node Degrees

Bipartite Graph

FoldedProjected Bipartite Graphs

Representing Graphs: Adjacency Matrix

Networks are Sparse Graphs

Representing Graphs: Edge list

Representing Graphs: Adjacency list

Node and Edge Attributes

More Types of Graphs

Connectivity of Undirected Graphs

Connectivity: Example

Connectivity of Directed Graphs

Traditional Methods for Machine Learning in Graphs

Traditional Methods for Machine Learning in Graphs

Traditional ML Pipeline

Feature Desian

Node-level Tasks and Features

Node-level Features: Overview

Node Degree

Node Centrality

Eigenvector centrality

Betweenness centrality

Closeness centrality

Clustering Coefficient

Graphlets

Induced Subgraph Isomorphism

Node-level Feature: Summary

Link Prediction Task and Features

Recap

Link Prediction via Proximity

Link-level Features: Overview

Distance-based Features

Local Neighborhood Overlap

Global Neighborhood Overlap

Intuition: Powers of Adj Matrices

Summary

Graph-level Features and Graph Kernels

Graph-level Features

Backaround Kernel Methods

Overview

Graph Kernel: Key Idea

Graphlet Features

Graphlet Kernel

Weisfeiler-lehman Kernel

Summary

Node Embeddings

Why Embedding?

Node Embeddings Encoder and Decoder