
[CS224w] Colab 1 - Node Embeddings

by 궁금한 준이 2023. 3. 6.

This is a Colab programming assignment based on lecture slides 1–3:

  • 1. Introduction
  • 2. Feature Engineering for Machine Learning in Graphs
  • 3. Node Embeddings

Dataset

We will use a graph dataset called Zachary's karate club network. It is built into NetworkX, so it can be loaded with one line.

import networkx as nx

G = nx.karate_club_graph()
print(type(G))           # <class 'networkx.classes.graph.Graph'>
print(G.is_directed())   # False
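
Each node also carries a 'club' attribute ('Mr. Hi' or 'Officer'), which we will use later to color the embedding plot:

# Each node has a 'club' attribute; node 0 belongs to 'Mr. Hi'
print(G.nodes[0])                                # {'club': 'Mr. Hi'}
print(G.number_of_nodes(), G.number_of_edges())  # 34 78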

We can draw the graph. The drawing changes on every run, but of course it is the same graph.

nx.draw(G, with_labels=True)

(Figure: the Karate Club Network)

Question 1. Average degree of network

Since this is an undirected graph, recall the definition of average node degree (for undirected graphs) from lecture slide 1:

\[ \bar{k} = \cfrac{2E}{N} \]

We simply plug the $E$ and $N$ passed in as function parameters into this formula.
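
Plugging in the karate club numbers ($N = 34$, $E = 78$) as a quick sanity check:

\[ \bar{k} = \cfrac{2 \cdot 78}{34} \approx 4.59, \]

which rounds to 5.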

def average_degree(num_edges, num_nodes):
  # TODO: Implement this function that takes number of edges
  # and number of nodes, and returns the average node degree of 
  # the graph. Round the result to nearest integer (for example 
  # 3.3 will be rounded to 3 and 3.7 will be rounded to 4)

  avg_degree = 0

  ############# Your code here ############
  avg_degree = round(2 * num_edges / num_nodes)
  #########################################

  return avg_degree

num_edges = G.number_of_edges()
num_nodes = G.number_of_nodes()
avg_degree = average_degree(num_edges, num_nodes)
print("Average degree of karate club network is {}".format(avg_degree))

# Average degree of karate club network is 5

 

Question 2. Average clustering coefficient of network

In lecture slide 2, the clustering coefficient is the measure of how connected node $v$'s neighboring nodes are to each other:

\[ e_v = \cfrac{\text{# edges among neighboring nodes}}{\dbinom{k_v}{2}} \in [0, 1] \]

 

Wikipedia's definition is

\[ C_i = \cfrac{1}{k_i(k_i - 1)} \sum_{j, k}A_{ij}A_{jk}A_{ki}, \ \text{where } A \text{ is the adjacency matrix} \]

 

Accordingly, following the NetworkX documentation, the average clustering coefficient is

\[ C = \cfrac{1}{n}\sum_{v \in G}c_v, \ \text{where } n \text{ is the number of nodes in } G \]

 

NetworkX can compute this directly with its built-in average_clustering function.

https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.cluster.average_clustering.html#average-clustering

def average_clustering_coefficient(G):
  # TODO: Implement this function that takes a nx.Graph
  # and returns the average clustering coefficient. Round 
  # the result to 2 decimal places (for example 3.333 will
  # be rounded to 3.33 and 3.7571 will be rounded to 3.76)

  avg_cluster_coef = 0

  ############# Your code here ############
  ## Note: 
  ## 1: Please use the appropriate NetworkX clustering function
  avg_cluster_coef = nx.average_clustering(G)
  avg_cluster_coef = round(avg_cluster_coef, 2)
  #########################################

  return avg_cluster_coef

avg_cluster_coef = average_clustering_coefficient(G)
print("Average clustering coefficient of karate club network is {}".format(avg_cluster_coef))

# Average clustering coefficient of karate club network is 0.57
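
As a sanity check, the same value can be computed directly from the adjacency matrix using the Wikipedia formula above. A minimal numpy sketch: $(A^3)_{ii}$ counts each triangle through node $i$ twice, which is exactly the numerator $\sum_{j,k} A_{ij}A_{jk}A_{ki}$.

import numpy as np

A = nx.to_numpy_array(G)
deg = A.sum(axis=1)
triangles = np.diag(A @ A @ A)  # (A^3)_{ii} = 2 * (number of triangles through i)
# Nodes with degree < 2 get clustering 0, matching NetworkX's convention
local_c = np.divide(triangles, deg * (deg - 1),
                    out=np.zeros_like(A.sum(axis=1)), where=deg > 1)
print(round(local_c.mean(), 2))  # 0.57, matching nx.average_clustering(G)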

 

Question 3. PageRank of network

(In a web-page graph) PageRank is one measure of a node's importance. Let $r_i$ be the importance of page (node) $i$ and $d_i$ its number of links (equivalently, its degree); each link then casts a vote of $r_i / d_i$. The importance $r_j$ of page $j$ is

\[ r_j = \sum_{i \to j}\cfrac{r_i}{d_i} \]

 

PageRank is the probability distribution of a random surfer reaching each page by following links. With probability $\beta$ the surfer follows a link, and with probability $1 - \beta$ it teleports to a uniformly random page:

\[ r_j = \sum_{i \to j} \beta\cfrac{r_i}{d_i} + (1 - \beta)\cfrac{1}{N} \]

def one_iter_pagerank(G, beta, r0, node_id):
  # TODO: Implement this function that takes a nx.Graph, beta, r0 and node id.
  # The return value r1 is the one-iteration PageRank value for the input node.
  # Please round r1 to 2 decimal places.

  r1 = 0

  ############# Your code here ############
  ## Note: 
  ## 1: You should not use nx.pagerank
  for node in G.neighbors(node_id):
    r1 += beta * r0 / G.degree[node]
  r1 += (1 - beta) / G.number_of_nodes()
  r1 = round(r1, 2)  # round to 2 decimal places, as the TODO asks
  #########################################

  return r1

beta = 0.8
r0 = 1 / G.number_of_nodes()
node = 0
r1 = one_iter_pagerank(G, beta, r0, node)
print("The PageRank value for node 0 after one iteration is {}".format(r1))

# The PageRank value for node 0 after one iteration is 0.13
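
Iterating the same update until convergence approaches the true PageRank; a small sketch to check against nx.pagerank, which takes the same damping factor as alpha:

N = G.number_of_nodes()
r = {v: 1 / N for v in G.nodes()}  # start from the uniform distribution
for _ in range(100):
  r = {v: sum(beta * r[u] / G.degree[u] for u in G.neighbors(v)) + (1 - beta) / N
       for v in G.nodes()}
print(round(r[0], 4), round(nx.pagerank(G, alpha=beta)[0], 4))  # the two should (nearly) agree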

 

Question 4. (raw) Closeness centrality of network

The (raw) closeness centrality is defined as $c(v) = \frac{1}{\sum_{u \neq v}\text{shortest path length between } u \text{ and } v}$.

NetworkX has a function that computes closeness centrality, but unlike the definition above, it is normalized:

\[ C(u) = \cfrac{n-1}{\sum_{v=1}^{n-1}d(v, u)}, \text{where } d(v, u) \text{ is the shortest-path distance between } u \text{ and } v \]

https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.closeness_centrality.html#networkx.algorithms.centrality.closeness_centrality

def closeness_centrality(G, node=5):
  # TODO: Implement the function that calculates closeness centrality 
  # for a node in karate club network. G is the input karate club 
  # network and node is the node id in the graph. Please round the 
  # closeness centrality result to 2 decimal places.

  closeness = 0

  ############# Your code here ############
  ## Note:
  ## 1: You can use networkx closeness centrality function.
  ## 2: Notice that networkx closeness centrality returns the normalized 
  ## closeness directly, which is different from the raw (unnormalized) 
  ## one that we learned in the lecture.
  closeness = nx.closeness_centrality(G, u=node)
  n_connected = len(nx.node_connected_component(G, node))
  closeness /= (n_connected - 1)
  closeness = round(closeness, 2)
  #########################################

  return closeness

node = 5
closeness = closeness_centrality(G, node=node)
print("The node 5 has closeness centrality {}".format(closeness))

# The node 5 has closeness centrality 0.01
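
The raw value can also be computed straight from the definition, without un-normalizing, as a small cross-check:

dist = nx.shortest_path_length(G, source=5)  # shortest-path lengths from node 5
raw = 1 / sum(dist.values())                 # d(5, 5) = 0, so it does not affect the sum
print(round(raw, 2))                         # 0.01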

 

Question 5. Edge List of graph using PyTorch

import torch

def graph_to_edge_list(G):
  # TODO: Implement the function that returns the edge list of
  # an nx.Graph. The returned edge_list should be a list of tuples
  # where each tuple is a tuple representing an edge connected 
  # by two nodes.

  edge_list = []

  ############# Your code here ############
  edge_list = list(G.edges)

  #########################################

  return edge_list

def edge_list_to_tensor(edge_list):
  # TODO: Implement the function that transforms the edge_list to
  # tensor. The input edge_list is a list of tuples and the resulting
  # tensor should have the shape [2 x len(edge_list)].

  edge_index = torch.tensor([])

  ############# Your code here ############
  edge_index = torch.tensor(edge_list).T
  #########################################

  return edge_index

pos_edge_list = graph_to_edge_list(G)
pos_edge_index = edge_list_to_tensor(pos_edge_list)
print("The pos_edge_index tensor has shape {}".format(pos_edge_index.shape))
print("The pos_edge_index tensor has sum value {}".format(torch.sum(pos_edge_index)))

# The pos_edge_index tensor has shape torch.Size([2, 78])
# The pos_edge_index tensor has sum value 2535
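
torch.tensor infers int64 from the Python integer tuples, which is the dtype edge_index tensors are conventionally expected to have (e.g., in PyTorch Geometric):

print(pos_edge_index.dtype)  # torch.int64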

 

Question 6. Sample negative edges of network

A negative edge is an edge that does not exist in the graph.

Since the given graph is undirected, note that if $(u, v)$ is an edge, then $(v, u)$ is the same edge, so neither orientation can be a negative edge.
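
Before sampling, it is worth counting how many negative edges exist: every unordered node pair that is not a positive edge, with self-loops excluded.

n = G.number_of_nodes()
num_candidates = n * (n - 1) // 2 - G.number_of_edges()
print(num_candidates)  # 561 - 78 = 483 possible negative edges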

import random

def sample_negative_edges(G, num_neg_samples):
  # TODO: Implement the function that returns a list of negative edges.
  # The number of sampled negative edges is num_neg_samples. You do not
  # need to consider the corner case when the number of possible negative edges
  # is less than num_neg_samples. It should be ok as long as your implementation 
  # works on the karate club network. In this implementation, self loops should 
  # not be considered as either a positive or negative edge. Also, notice that 
  # the karate club network is an undirected graph, if (0, 1) is a positive 
  # edge, do you think (1, 0) can be a negative one?

  neg_edge_list = []

  ############# Your code here ############
  for i in G.nodes():
    for j in G.nodes():
      if i >= j: continue            # visit each unordered pair once; also skips self-loops
      if G.has_edge(i, j): continue  # skip positive edges
      neg_edge_list.append((i, j))

  neg_edge_list = random.sample(neg_edge_list, num_neg_samples)
  #########################################

  return neg_edge_list

# Sample 78 negative edges
neg_edge_list = sample_negative_edges(G, len(pos_edge_list))

# Transform the negative edge list to tensor
neg_edge_index = edge_list_to_tensor(neg_edge_list)
print("The neg_edge_index tensor has shape {}".format(neg_edge_index.shape))

# Which of following edges can be negative ones?
edge_1 = (7, 1)
edge_2 = (1, 33)
edge_3 = (33, 22)
edge_4 = (0, 4)
edge_5 = (4, 2)

############# Your code here ############
## Note:
## 1: For each of the 5 edges, print whether it can be negative edge
for (u, v) in [edge_1, edge_2, edge_3, edge_4, edge_5]:
  if (u, v) not in pos_edge_list and (v, u) not in pos_edge_list:
    print(f'{(u, v)} is a negative edge of G')
  else:
    print(f'{(u, v)} is not a negative edge of G')
#########################################

# The neg_edge_index tensor has shape torch.Size([2, 78])
# (7, 1) is not a negative edge of G
# (1, 33) is a negative edge of G
# (33, 22) is not a negative edge of G
# (0, 4) is not a negative edge of G
# (4, 2) is a negative edge of G

Question 7. Train the embedding

Create and visualize the node embedding vectors.

import torch.nn as nn

# Please do not change / reset the random seed
torch.manual_seed(1)

def create_node_emb(num_node=34, embedding_dim=16):
  # TODO: Implement this function that will create the node embedding matrix.
  # A torch.nn.Embedding layer will be returned. You do not need to change 
  # the values of num_node and embedding_dim. The weight matrix of returned 
  # layer should be initialized under uniform distribution. 

  emb = None

  ############# Your code here ############
  emb = nn.Embedding(num_embeddings=num_node, embedding_dim=embedding_dim)
  emb.weight.data = torch.rand(num_node, embedding_dim)
  #########################################

  return emb

emb = create_node_emb()
ids = torch.LongTensor([0, 3])

# Print the embedding layer
print("Embedding: {}".format(emb))

# An example that gets the embeddings for node 0 and 3
print(emb(ids))
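
Since the weights were re-initialized with torch.rand, they should all lie in $[0, 1)$; a quick check:

print(emb.weight.data.min().item() >= 0.0)  # True
print(emb.weight.data.max().item() < 1.0)   # True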


from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def visualize_emb(emb):
  X = emb.weight.data.numpy()
  pca = PCA(n_components=2)
  components = pca.fit_transform(X)
  plt.figure(figsize=(6, 6))
  club1_x = []
  club1_y = []
  club2_x = []
  club2_y = []
  for node in G.nodes(data=True):
    if node[1]['club'] == 'Mr. Hi':
      club1_x.append(components[node[0]][0])
      club1_y.append(components[node[0]][1])
    else:
      club2_x.append(components[node[0]][0])
      club2_y.append(components[node[0]][1])
  plt.scatter(club1_x, club1_y, color="red", label="Mr. Hi")
  plt.scatter(club2_x, club2_y, color="blue", label="Officer")
  plt.legend()
  plt.show()

# Visualize the initial random embedding
visualize_emb(emb)

(Figure: the initial embedding space, projected to 2D with PCA)

from torch.optim import SGD

def accuracy(pred, label):
  # TODO: Implement the accuracy function. This function takes the 
  # pred tensor (the resulting tensor after sigmoid) and the label 
  # tensor (torch.LongTensor). Predicted value greater than 0.5 will 
  # be classified as label 1. Else it will be classified as label 0.
  # The returned accuracy should be rounded to 4 decimal places. 
  # For example, accuracy 0.82956 will be rounded to 0.8296.

  accu = 0.0

  ############# Your code here ############
  count = torch.sum(torch.round(pred) == label)
  accu = count / pred.shape[0]
  accu = round(accu.item(), 4)
  #########################################

  return accu
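
A toy check of accuracy(); the values here are made up for illustration:

pred = torch.tensor([0.9, 0.2, 0.7, 0.4])   # hypothetical sigmoid outputs
label = torch.tensor([1.0, 0.0, 0.0, 0.0])  # hypothetical labels
print(accuracy(pred, label))                # 0.75 (3 of 4 correct)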

def train(emb, loss_fn, sigmoid, train_label, train_edge):
  # TODO: Train the embedding layer here. You can also change epochs and 
  # learning rate. In general, you need to implement: 
  # (1) Get the embeddings of the nodes in train_edge
  # (2) Dot product the embeddings between each node pair
  # (3) Feed the dot product result into sigmoid
  # (4) Feed the sigmoid output into the loss_fn
  # (5) Print both loss and accuracy of each epoch 
  # (6) Update the embeddings using the loss and optimizer 
  # (as a sanity check, the loss should decrease during training)

  epochs = 500
  learning_rate = 0.1

  optimizer = SGD(emb.parameters(), lr=learning_rate, momentum=0.9)

  for i in range(epochs):
    ############# Your code here ############
    optimizer.zero_grad()

    z = emb(train_edge) # (1)
    sim = torch.sum(z[0] * z[1], dim=-1) # (2)
    pred = sigmoid(sim) # (3) use the sigmoid module passed in
    loss = loss_fn(pred, train_label) # (4)
    print(f'Epoch: {i}, Loss: {loss:.4f}, Accuracy: {accuracy(pred, train_label)}') # (5)
    loss.backward() # (6)
    optimizer.step() # (6)
    #########################################

loss_fn = nn.BCELoss()
sigmoid = nn.Sigmoid()

print(pos_edge_index.shape)

# Generate the positive and negative labels
pos_label = torch.ones(pos_edge_index.shape[1], )
neg_label = torch.zeros(neg_edge_index.shape[1], )

# Concat positive and negative labels into one tensor
train_label = torch.cat([pos_label, neg_label], dim=0)

# Concat positive and negative edges into one tensor
# Since the network is very small, we do not split the edges into val/test sets
train_edge = torch.cat([pos_edge_index, neg_edge_index], dim=1)
print(train_edge.shape)

emb = create_node_emb()
train(emb, loss_fn, sigmoid, train_label, train_edge)
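
After training, we can call the same visualization again to see how the embedding space changed:

visualize_emb(emb)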

(Figure: the embedding space after training)

After training finishes, visualizing the embeddings again shows that the two classes are well separated (split top and bottom).
