【Intro】Introduction to the Cora dataset

https://graphsandnetworks.com/the-cora-dataset/

Graph Convolutional Network (GCN) on the CORA citation dataset — StellarGraph 1.0.0rc1 documentation

pytorch-GAT/The Annotated GAT (Cora).ipynb at main · gordicaleksa/pytorch-GAT · GitHub

The Cora dataset

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

In graph learning, this dataset is the equivalent of MNIST.

import pandas as pd

node_df = pd.read_csv('./data/nodes.csv')
node_df.head()
   Unnamed: 0   nodeId labels                 subject                                           features
0           0    31336  Paper         Neural_Networks  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1           1  1061127  Paper           Rule_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
2           2  1106406  Paper  Reinforcement_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3           3    13195  Paper  Reinforcement_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4           4    37879  Paper   Probabilistic_Methods  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
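
To sanity-check the numbers quoted earlier (2708 publications, 7 classes) against this nodes.csv, one can run something like:

print(len(node_df))                   # expect 2708 publications
print(node_df['subject'].nunique())   # expect 7 classes
print(node_df['subject'].value_counts())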
edge_df = pd.read_csv('./data/edges.csv')
edge_df.head()
   Unnamed: 0  sourceNodeId  targetNodeId relationshipType
0           0            35          1033            CITES
1           1            35        103482            CITES
2           2            35        103515            CITES
3           3            35       1050679            CITES
4           4            35       1103960            CITES
edge_df = pd.read_csv('./data/edges.csv', names=["target", "source"])
edge_df["label"] = "cites"

edge_df.sample(frac=0.5).head(5)

Careful: this names=["target", "source"] idiom is taken from the StellarGraph tutorial, where the raw cora.cites file has exactly two columns. Our edges.csv has four, so pandas pushes the two extra leading columns into the index and the literal CITES strings land in the source column, which is why the sample looks odd:

                 target source  label
563.0   23541    130539  CITES  cites
2766.0  35061     32083  CITES  cites
4040.0  1107808  116512  CITES  cites
134.0   59454       335  CITES  cites
1023.0  45841     24064  CITES  cites
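
A read that actually matches this 4-column edges.csv would look something like the sketch below (column names taken from the head() output above):

edge_df = pd.read_csv('./data/edges.csv', index_col=0)
edge_df = edge_df.rename(columns={'sourceNodeId': 'source', 'targetNodeId': 'target'})
edge_df['label'] = edge_df.pop('relationshipType').str.lower()  # 'CITES' -> 'cites'
print(edge_df.head())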

https://graphsandnetworks.com/the-cora-dataset/

Following this guide, download the dataset. The rest of this post follows the annotated GAT notebook:

https://github.com/gordicaleksa/pytorch-GAT/blob/main/The%20Annotated%20GAT%20(Cora).ipynb

It turns out that combining the idea of attention with the pre-existing Graph Convolutional Network (GCN) was a great move 🤓 — GAT is the second most-cited paper in the GNN literature (as of the time that notebook was written).

The whole idea came from CNNs. Convolutional neural networks solved all kinds of computer vision tasks and kicked off the huge deep learning boom, so some people decided to transfer the idea to graphs. The basic problem is that while images lie on a regular grid (which you can also view as a graph) and therefore have a precise notion of ordering, graphs enjoy no such nice property: the number of neighbors varies from node to node, and the neighbors have no natural order.

So how to define a kernel on a graph becomes a problem. We cannot simply fix the kernel size at 3×3, because a node may have very few neighbors or very many.

Two main lines of thought come into play here:

  • spectral methods (all of which exploit the eigenbasis of the graph Laplacian in some way)
    — reportedly rooted in graph signal processing; worth reading up on sometime
  • spatial methods

A high-level explanation of spatial methods

Assuming we have the neighbors' feature vectors, we can do the following (a minimal sketch follows the list):

  1. transform them somehow (perhaps with a linear projection),
  2. aggregate them somehow (perhaps weighting them with attention coefficients -> GAT),
  3. update the current node's feature vector by (somehow) combining its (transformed) feature vector with the aggregated neighborhood representation.
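
To make these three steps concrete, here is a minimal sketch of one spatial message-passing layer with plain mean aggregation — all names and shapes are made up for illustration, and GAT replaces the mean with attention-weighted sums:

import torch

def spatial_gnn_layer(node_features, adjacency_list, W_transform, W_update):
    # 1. transform: linear projection of every node's feature vector
    transformed = node_features @ W_transform                    # shape (N, F_out)
    updated = []
    for node, neighbors in adjacency_list.items():
        # 2. aggregate: a simple mean over the neighbors' transformed vectors
        neighborhood = transformed[neighbors].mean(dim=0)        # shape (F_out,)
        # 3. update: combine the node's own vector with the aggregate
        combined = torch.cat([transformed[node], neighborhood])  # shape (2 * F_out,)
        updated.append(torch.tanh(combined @ W_update))
    return torch.stack(updated)                                  # shape (N, F_out)

# toy usage: 4 nodes, 3 input features, 2 output features
x = torch.randn(4, 3)
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
out = spatial_gnn_layer(x, adj, torch.randn(3, 2), torch.randn(4, 2))
print(out.shape)  # torch.Size([4, 2])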

Imports & loading the data

# I always like to structure my imports into Python's native libs,
# stuff I installed via conda/pip and local file imports (but we don't have those here)

import pickle

# Visualization related imports
import matplotlib.pyplot as plt
import networkx as nx
import igraph as ig

# Main computation libraries
import scipy.sparse as sp
import numpy as np

# Deep learning related imports
import torch
"""
    Contains constants needed for data loading and visualization.

"""

import os
import enum


# Supported datasets - only Cora in this notebook
class DatasetType(enum.Enum):
    CORA = 0

    
# Networkx is not precisely made with drawing as its main feature but I experimented with it a bit
class GraphVisualizationTool(enum.Enum):
    NETWORKX = 0
    IGRAPH = 1


# We'll be dumping and reading the data from this directory
DATA_DIR_PATH = os.path.join(os.getcwd(), 'data')
CORA_PATH = os.path.join(DATA_DIR_PATH, 'cora')  # this is checked-in no need to make a directory

#
# Cora specific constants
#

# Thomas Kipf et al. first used this split in GCN paper and later Petar Veličković et al. in GAT paper
CORA_TRAIN_RANGE = [0, 140]  # we're using the first 140 nodes as the training nodes
CORA_VAL_RANGE = [140, 140+500]
CORA_TEST_RANGE = [1708, 1708+1000]
CORA_NUM_INPUT_FEATURES = 1433
CORA_NUM_CLASSES = 7

# Used whenever we need to visualize points from different classes (t-SNE, CORA visualization)
cora_label_to_color_map = {0: "red", 1: "blue", 2: "green", 3: "orange", 4: "yellow", 5: "pink", 6: "gray"}

The data lives in the data folder next to this notebook, under a cora subfolder, e.g. GNN_test_project/data/cora/node_features.csr.

The first 140 nodes are used as training nodes, 500 nodes for validation, and 1000 nodes for testing.

There are 1433 input features per node, and the nodes fall into 7 classes.

To make visualization easier, each class is assigned its own color here.

Understanding the dataset

Transductive — assume we have a single graph (e.g. Cora) and split some of the nodes (not whole graphs) into train/validation/test sets. At training time only the labels of the training nodes are used. However, during the forward pass, by the very nature of spatial GNNs, feature vectors get aggregated from neighbors, and some of those neighbors may belong to the validation or even the test set! The point is: we are not using their label information, but we are using their structural information and their features.

Inductive — if you have a computer vision or NLP background you're probably more familiar with this setting: there is a set of training graphs, a separate set of validation graphs and, of course, a separate set of test graphs.
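
In code, transductive training boils down to running the forward pass on the whole graph while masking the loss down to the training nodes. A hedged sketch, with a plain linear layer standing in for a real GNN:

import torch

N, F_in, num_classes = 2708, 1433, 7               # Cora-sized toy tensors
node_features = torch.randn(N, F_in)                # random stand-in features
node_labels = torch.randint(0, num_classes, (N,))   # random stand-in labels
train_indices = torch.arange(0, 140)                # the transductive train split

model = torch.nn.Linear(F_in, num_classes)          # placeholder for a real GNN
all_scores = model(node_features)                   # forward pass sees every node
loss = torch.nn.functional.cross_entropy(
    all_scores[train_indices],                      # ...but only the train nodes
    node_labels[train_indices])                     # contribute to the loss
loss.backward()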

pickle.load(file)

pickle — Python object serialization — Python 3.12.3 documentation

pickle.dump和pickle.load-CSDN博客

Reads from file and reconstructs the original Python object.

with open(path, 'rb') as file

python - What's the difference between open('filepath', 'rb') and open(rb'filepath')? There's some encoding difference between them - Stack Overflow

https://www.quora.com/What-does-opening-a-file-rb-in-Python-mean

請問with open() as f 的語法意思為何? open()內參數何時使用'wb'、'rb'? - Cupoy

The r prefix marks a string literal as raw (which in that particular case does nothing), while b marks it as binary, meaning the resulting object is a bytes object rather than a str.

In short: the file is read in and comes back as bytes.
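
A one-line contrast between the two modes (nodes.csv is just used as an example file here):

with open('./data/nodes.csv', 'r') as f:
    print(type(f.read()))   # <class 'str'>

with open('./data/nodes.csv', 'rb') as f:
    print(type(f.read()))   # <class 'bytes'>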

with open(path, 'wb') as file

Similarly, this opens the file for (binary) writing.

pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)

Serializes a Python object and saves it to a file.

loading/saving Pickle files:
# First let's define these simple functions for loading/saving Pickle files - we need them for Cora

# All Cora data is stored as pickle
def pickle_read(path):
    with open(path, 'rb') as file:
        data = pickle.load(file)

    return data

def pickle_save(path, data):
    with open(path, 'wb') as file:
        pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)

Loading the data

node_features_csr = pickle_read(os.path.join(CORA_PATH, 'node_features.csr'))
node_labels_npy = pickle_read(os.path.join(CORA_PATH, 'node_labels.npy'))
adjacency_list_dict = pickle_read(os.path.join(CORA_PATH, 'adjacency_list.dict'))
This gives us three objects:
1. the node features
2. the node labels
3. the adjacency list (for each of the N nodes: all of its neighboring nodes)
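
A quick type/shape check on the three objects (the expected shapes come from the comments in load_graph_data below):

print(type(node_features_csr), node_features_csr.shape)     # scipy CSR sparse matrix, (2708, 1433)
print(type(node_labels_npy), node_labels_npy.shape)         # numpy array of class ids
print(type(adjacency_list_dict), len(adjacency_list_dict))  # dict: node id -> list of neighbors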
load_graph_data
# We'll pass the training config dictionary a bit later
def load_graph_data(training_config, device):
    dataset_name = training_config['dataset_name'].lower()
    should_visualize = training_config['should_visualize']

    if dataset_name == DatasetType.CORA.name.lower():

        # shape = (N, FIN), where N is the number of nodes and FIN is the number of input features
        node_features_csr = pickle_read(os.path.join(CORA_PATH, 'node_features.csr'))
        # shape = (N, 1)
        node_labels_npy = pickle_read(os.path.join(CORA_PATH, 'node_labels.npy'))
        # shape = (N, number of neighboring nodes) <- this is a dictionary not a matrix!
        adjacency_list_dict = pickle_read(os.path.join(CORA_PATH, 'adjacency_list.dict'))

        # Normalize the features (helps with training)
        node_features_csr = normalize_features_sparse(node_features_csr)
        num_of_nodes = len(node_labels_npy)

        # shape = (2, E), where E is the number of edges, and 2 for source and target nodes. Basically edge index
        # contains tuples of the format S->T, e.g. 0->3 means that node with id 0 points to a node with id 3.
        topology = build_edge_index(adjacency_list_dict, num_of_nodes, add_self_edges=True)

        # Note: topology is just a fancy way of naming the graph structure data 
        # (aside from edge index it could be in the form of an adjacency matrix)

        if should_visualize:  # network analysis and graph drawing
            plot_in_out_degree_distributions(topology, num_of_nodes, dataset_name)  # we'll define these in a second
            visualize_graph(topology, node_labels_npy, dataset_name)

        # Convert to dense PyTorch tensors

        # Needs to be long int type because later functions like PyTorch's index_select expect it
        topology = torch.tensor(topology, dtype=torch.long, device=device)
        node_labels = torch.tensor(node_labels_npy, dtype=torch.long, device=device)  # Cross entropy expects a long int
        node_features = torch.tensor(node_features_csr.todense(), device=device)

        # Indices that help us extract nodes that belong to the train/val and test splits
        train_indices = torch.arange(CORA_TRAIN_RANGE[0], CORA_TRAIN_RANGE[1], dtype=torch.long, device=device)
        val_indices = torch.arange(CORA_VAL_RANGE[0], CORA_VAL_RANGE[1], dtype=torch.long, device=device)
        test_indices = torch.arange(CORA_TEST_RANGE[0], CORA_TEST_RANGE[1], dtype=torch.long, device=device)

        return node_features, node_labels, topology, train_indices, val_indices, test_indices
    else:
        raise Exception(f'{dataset_name} not yet supported.')

Read the node feature data and normalize the features (this helps training).

Read the adjacency list to obtain the edge connectivity.

The labels come in as a numpy.ndarray, the features as a scipy CSR sparse matrix (see below), and the adjacency list as a plain dict.

normalize features sparse 
def normalize_features_sparse(node_features_sparse):
    assert sp.issparse(node_features_sparse), f'Expected a sparse matrix, got {node_features_sparse}.'

    # Instead of dividing (like in normalize_features_dense()) we do multiplication with inverse sum of features.
    # Modern hardware (GPUs, TPUs, ASICs) is optimized for fast matrix multiplications! ^^ (* >> /)
    # shape = (N, FIN) -> (N, 1), where N number of nodes and FIN number of input features
    node_features_sum = np.array(node_features_sparse.sum(-1))  # sum features for every node feature vector

    # Make an inverse (remember * by 1/x is better (faster) than / by x)
    # shape = (N, 1) -> (N)
    node_features_inv_sum = np.power(node_features_sum, -1).squeeze()

    # Again certain sums will be 0 so 1/0 will give us inf so we replace those by 1 which is a neutral element for mul
    node_features_inv_sum[np.isinf(node_features_inv_sum)] = 1.

    # Create a diagonal matrix whose values on the diagonal come from node_features_inv_sum
    diagonal_inv_features_sum_matrix = sp.diags(node_features_inv_sum)

    # We return the normalized features.
    return diagonal_inv_features_sum_matrix.dot(node_features_sparse)

Normalize the features so that each node's feature vector sums to 1.

 node_features_sum = np.array(node_features_sparse.sum(-1))

scipy.sparse.csr_matrix.sum — SciPy v1.13.1 Manual

python - How to get sum of each row and sum of each column in Scipy sparse matrices (csr_matrix and csc_matrix)? - Stack Overflow

python对矩阵某行求和_python – 对scipy.sparse.csr_matrix中的行求和-CSDN博客

python - Convert Pandas dataframe to Sparse Numpy Matrix directly - Stack Overflow

Here node_features_sparse has type scipy.sparse._csr.csr_matrix.

import pandas as pd

df = pd.DataFrame({
    'w_0': [1, 0, 1, 0, 1, 0, 1, 0],
    'w_1': [0, 0, 0, 0, 1, 0, 1, 0],
    'w_2': [1, 1, 1, 1, 1, 1, 1, 1],
    'w_4': [0, 1, 0, 1, 1, 0, 1, 1]
})
sp.csr_matrix(df.values).sum(-1)

"""
Output:
matrix([[2],
        [2],
        [2],
        [2],
        [4],
        [1],
        [4],
        [2]])
"""


import pandas as pd

df = pd.DataFrame({
    'w_0': [1., 0., 1., 0., 1., 0., 1., 0.],
    'w_1': [0., 0., 0., 0., 1., 0., 1., 0.],
    'w_2': [1., 1., 1., 1., 1., 1., 1., 1.],
    'w_4': [0., 1., 0., 1., 1., 0., 1., 1.]
})
df = sp.csr_matrix(df.values)
df_sum = np.array(df.sum(-1))
df.toarray(), df_sum, np.power(df_sum, -1), np.power(df_sum, -1).squeeze()

"""
Output:
(array([[1., 0., 1., 0.],
        [0., 0., 1., 1.],
        [1., 0., 1., 0.],
        [0., 0., 1., 1.],
        [1., 1., 1., 1.],
        [0., 0., 1., 0.],
        [1., 1., 1., 1.],
        [0., 0., 1., 1.]]),
 array([[2.],
        [2.],
        [2.],
        [2.],
        [4.],
        [1.],
        [4.],
        [2.]]),
 array([[0.5 ],
        [0.5 ],
        [0.5 ],
        [0.5 ],
        [0.25],
        [1.  ],
        [0.25],
        [0.5 ]]),
 array([0.5 , 0.5 , 0.5 , 0.5 , 0.25, 1.  , 0.25, 0.5 ]))
"""

Note that the sums must be floats here: NumPy raises an error ("Integers to negative integer powers are not allowed") if you raise an integer array to the power -1, as demonstrated below.
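
A quick demonstration of what goes wrong with integers:

import numpy as np

int_sums = np.array([[2], [4]])
try:
    np.power(int_sums, -1)
except ValueError as e:
    print(e)   # Integers to negative integer powers are not allowed.

print(np.power(int_sums.astype(np.float64), -1))   # [[0.5 ], [0.25]]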

import pandas as pd

df = pd.DataFrame({
    'w_0': [1., 0., 1., 0., 1., 0., 1., 0.],
    'w_1': [0., 0., 0., 0., 1., 0., 1., 0.],
    'w_2': [1., 1., 1., 1., 1., 1., 1., 1.],
    'w_4': [0., 1., 0., 1., 1., 0., 1., 1.]
})
df = sp.csr_matrix(df.values)
df_sum = np.array(df.sum(-1))
df_inv_sum = np.power(df_sum, -1).squeeze()
# Some sums may be 0, so 1/0 would give inf; replace those with 1
df_inv_sum[np.isinf(df_inv_sum)] = 1.
df_inv_sum, sp.diags(df_inv_sum).toarray()

"""
Output:
(array([0.5 , 0.5 , 0.5 , 0.5 , 0.25, 1.  , 0.25, 0.5 ]),
 array([[0.5 , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
        [0.  , 0.5 , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.5 , 0.  , 0.  , 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , 0.5 , 0.  , 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , 0.  , 0.25, 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , 0.  , 0.  , 1.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.25, 0.  ],
        [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.5 ]]))
"""
import pandas as pd

df = pd.DataFrame({
    'w_0': [1., 0., 1., 0., 1., 0., 1., 0.],
    'w_1': [0., 0., 0., 0., 1., 0., 1., 0.],
    'w_2': [1., 1., 1., 1., 1., 1., 1., 1.],
    'w_4': [0., 1., 0., 1., 1., 0., 1., 1.]
})
df = sp.csr_matrix(df.values)
df_sum = np.array(df.sum(-1))
df_inv_sum = np.power(df_sum, -1).squeeze()
# Some sums may be 0, so 1/0 would give inf; replace those with 1
df_inv_sum[np.isinf(df_inv_sum)] = 1.
diag_df_sum_matrix = sp.diags(df_inv_sum)
diag_df_sum_matrix.dot(df).toarray()

"""
Output:
array([[0.5 , 0.  , 0.5 , 0.  ],
       [0.  , 0.  , 0.5 , 0.5 ],
       [0.5 , 0.  , 0.5 , 0.  ],
       [0.  , 0.  , 0.5 , 0.5 ],
       [0.25, 0.25, 0.25, 0.25],
       [0.  , 0.  , 1.  , 0.  ],
       [0.25, 0.25, 0.25, 0.25],
       [0.  , 0.  , 0.5 , 0.5 ]])
"""

Matrix multiplication: multiplying by the diagonal matrix of inverse row sums is what actually divides each row of the feature matrix by its sum.
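
The diagonal trick is just a vectorized row-wise division; a quick equivalence check on a made-up matrix:

import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix(np.array([[1., 0., 1.], [0., 2., 2.]]))
row_sums = np.array(X.sum(-1)).squeeze()
normalized = sp.diags(1.0 / row_sums).dot(X)
# same result as dividing each row by its sum directly:
print(np.allclose(normalized.toarray(), X.toarray() / row_sums[:, None]))  # True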

build edge index
def build_edge_index(adjacency_list_dict, num_of_nodes, add_self_edges=True):
    source_nodes_ids, target_nodes_ids = [], []
    seen_edges = set()

    for src_node, neighboring_nodes in adjacency_list_dict.items():
        for trg_node in neighboring_nodes:
            # if this edge hasn't been seen so far we add it to the edge index (coalescing - removing duplicates)
            if (src_node, trg_node) not in seen_edges:  # it'd be easy to explicitly remove self-edges (Cora has none..)
                source_nodes_ids.append(src_node)
                target_nodes_ids.append(trg_node)

                seen_edges.add((src_node, trg_node))

    if add_self_edges:
        source_nodes_ids.extend(np.arange(num_of_nodes))
        target_nodes_ids.extend(np.arange(num_of_nodes))

    # shape = (2, E), where E is the number of edges in the graph
    edge_index = np.row_stack((source_nodes_ids, target_nodes_ids))

    return edge_index

Recording the edge information.

adj_list_dict = {
    0: [1, 2, 5],
    1: [0, 2, 3, 4],
    2: [0, 1, 5],
    3: [1, 4, 5],
    4: [1, 3, 6],
    5: [0, 2, 3, 6 ,7],
    6: [4, 5, 7],
    7: [5, 6]
}
num_of_nodes = 8
source_nodes_ids, target_nodes_ids = [], []
seen_edges = set()
for src_node, neighboring_nodes in adj_list_dict.items():
    for trg_node in neighboring_nodes:
        # if this edge hasn't been seen so far we add it to the edge index (coalescing - removing duplicates)
        if (src_node, trg_node) not in seen_edges:  # it'd be easy to explicitly remove self-edges (Cora has none..)
            source_nodes_ids.append(src_node)
            target_nodes_ids.append(trg_node)

            seen_edges.add((src_node, trg_node))
print(pd.DataFrame([source_nodes_ids, target_nodes_ids]))
source_nodes_ids.extend(np.arange(num_of_nodes))
target_nodes_ids.extend(np.arange(num_of_nodes))
print(pd.DataFrame([source_nodes_ids, target_nodes_ids]))

"""
Output:
   0   1   2   3   4   5   6   7   8   9   ...  16  17  18  19  20  21  22  \
0   0   0   0   1   1   1   1   2   2   2  ...   5   5   5   5   5   6   6   
1   1   2   5   0   2   3   4   0   1   5  ...   0   2   3   6   7   4   5   

   23  24  25  
0   6   7   7  
1   7   5   6  

[2 rows x 26 columns]
   0   1   2   3   4   5   6   7   8   9   ...  24  25  26  27  28  29  30  \
0   0   0   0   1   1   1   1   2   2   2  ...   7   7   0   1   2   3   4   
1   1   2   5   0   2   3   4   0   1   5  ...   5   6   0   1   2   3   4   

   31  32  33  
0   5   6   7  
1   5   6   7  

[2 rows x 34 columns]
"""

As you can see, this fills in the node connectivity information and adds self-loops.

adj_list_dict = {
    0: [1, 2, 5],
    1: [0, 2, 3, 4],
    2: [0, 1, 5],
    3: [1, 4, 5],
    4: [1, 3, 6],
    5: [0, 2, 3, 6 ,7],
    6: [4, 5, 7],
    7: [5, 6]
}
num_of_nodes = 8
source_nodes_ids, target_nodes_ids = [], []
seen_edges = set()
for src_node, neighboring_nodes in adj_list_dict.items():
    for trg_node in neighboring_nodes:
        # if this edge hasn't been seen so far we add it to the edge index (coalescing - removing duplicates)
        if (src_node, trg_node) not in seen_edges:  # it'd be easy to explicitly remove self-edges (Cora has none..)
            source_nodes_ids.append(src_node)
            target_nodes_ids.append(trg_node)

            seen_edges.add((src_node, trg_node))
source_nodes_ids.extend(np.arange(num_of_nodes))
target_nodes_ids.extend(np.arange(num_of_nodes))
np.row_stack((source_nodes_ids, target_nodes_ids))

"""
Output:
array([[0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6,
        6, 6, 7, 7, 0, 1, 2, 3, 4, 5, 6, 7],
       [1, 2, 5, 0, 2, 3, 4, 0, 1, 5, 1, 4, 5, 1, 3, 6, 0, 2, 3, 6, 7, 4,
        5, 7, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7]])
"""

Stack the source and target node lists together into a single (2, E) array.
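
This (2, E) layout is exactly what calls like PyTorch's index_select (mentioned in load_graph_data above) consume; a small sketch on a made-up 3-node graph:

import torch

edge_index = torch.tensor([[0, 0, 1],    # source node ids
                           [1, 2, 2]],   # target node ids
                          dtype=torch.long)
node_features = torch.randn(3, 4)        # 3 nodes, 4 features each

# gather the features of all source / target nodes, one call per row
source_features = node_features.index_select(0, edge_index[0])  # shape (E, 4)
target_features = node_features.index_select(0, edge_index[1])  # shape (E, 4)
print(source_features.shape, target_features.shape)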

To keep the Python interpreter from complaining, we first define dummy versions of the plotting functions.

# Let's just define dummy visualization functions for now - just to stop Python interpreter from complaining!
# We'll define them in a moment, properly, I swear.

def plot_in_out_degree_distributions(*args, **kwargs):  # accept any args so the real call signature works
    pass

def visualize_graph(*args, **kwargs):
    pass

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # checking whether you have a GPU

config = {
    'dataset_name': DatasetType.CORA.name,
    'should_visualize': False
}

node_features, node_labels, edge_index, train_indices, val_indices, test_indices = load_graph_data(config, device)

print(node_features.shape, node_features.dtype)
print(node_labels.shape, node_labels.dtype)
print(edge_index.shape, edge_index.dtype)
print(train_indices.shape, train_indices.dtype)
print(val_indices.shape, val_indices.dtype)
print(test_indices.shape, test_indices.dtype)

Visualizing the dataset

Count how many times each node appears as a source node (its out-degree) and as a target node (its in-degree).

def plot_in_out_degree_distributions(edge_index, num_of_nodes, dataset_name):
    """
        Note: It would be easy to do various kinds of powerful network analysis using igraph/networkx, etc.
        I chose to explicitly calculate only the node degree statistics here, but you can go much further if needed and
        calculate the graph diameter, number of triangles and many other concepts from the network analysis field.

    """
    if isinstance(edge_index, torch.Tensor):
        edge_index = edge_index.cpu().numpy()
        
    assert isinstance(edge_index, np.ndarray), f'Expected NumPy array got {type(edge_index)}.'

    # Store each node's input and output degree (they're the same for undirected graphs such as Cora)
    in_degrees = np.zeros(num_of_nodes, dtype=np.int_)
    out_degrees = np.zeros(num_of_nodes, dtype=np.int_)

    # Edge index shape = (2, E), the first row contains the source nodes, the second one target/sink nodes
    # Note on terminology: source nodes point to target/sink nodes
    num_of_edges = edge_index.shape[1]
    for cnt in range(num_of_edges):
        source_node_id = edge_index[0, cnt]
        target_node_id = edge_index[1, cnt]

        out_degrees[source_node_id] += 1  # source node points towards some other node -> increment its out degree
        in_degrees[target_node_id] += 1  # similarly here

    hist = np.zeros(np.max(out_degrees) + 1)
    for out_degree in out_degrees:
        hist[out_degree] += 1

    fig = plt.figure(figsize=(12,8), dpi=100)  # otherwise plots are really small in Jupyter Notebook
    fig.subplots_adjust(hspace=0.6)

    plt.subplot(311)
    plt.plot(in_degrees, color='red')
    plt.xlabel('node id'); plt.ylabel('in-degree count'); plt.title('Input degree for different node ids')

    plt.subplot(312)
    plt.plot(out_degrees, color='green')
    plt.xlabel('node id'); plt.ylabel('out-degree count'); plt.title('Out degree for different node ids')

    plt.subplot(313)
    plt.plot(hist, color='blue')
    plt.xlabel('node degree')
    plt.ylabel('# nodes for a given out-degree') 
    plt.title(f'Node out-degree distribution for {dataset_name} dataset')
    plt.xticks(np.arange(0, len(hist), 5.0))

    plt.grid(True)
    plt.show()
in_degrees = np.zeros(num_of_nodes, dtype=np.int_)
out_degrees = np.zeros(num_of_nodes, dtype=np.int_)
in_degrees, out_degrees

"""
Output:
(array([0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0]))
"""

This line may need tweaking depending on your NumPy version: np.int (which the author's original environment still accepted) was deprecated in NumPy 1.20 and removed in 1.24, so np.int_ (or np.int64) is used instead.

in_degrees = np.zeros(num_of_nodes, dtype=np.int_)
out_degrees = np.zeros(num_of_nodes, dtype=np.int_)

num_of_edges = edge_index.shape[1]
for cnt in range(num_of_edges):
    source_node_id = edge_index[0, cnt]
    target_node_id = edge_index[1, cnt]

    out_degrees[source_node_id] += 1  # source node points towards some other node -> increment its out degree
    in_degrees[target_node_id] += 1  # similarly here

in_degrees, out_degrees

"""
Output:
(array([4, 5, 4, 4, 4, 6, 4, 3]), array([4, 5, 4, 4, 4, 6, 4, 3]))
"""

The rest is just plotting.

The resulting plots for Cora show that:

  • The top two plots are identical, because we treat Cora as an undirected graph (even though it would naturally be modeled as a directed one).
  • Certain nodes have a huge number of edges (the spikes in the middle), while most nodes have far fewer.
  • The third plot visualizes this nicely as a histogram: most nodes have only 2-5 edges (hence the peak on the far left).
"""
Check out this blog for available graph visualization tools:
    https://towardsdatascience.com/large-graph-visualization-tools-and-approaches-2b8758a1cd59

Basically depending on how big your graph is there may be better drawing tools than igraph.

Note: I unfortunately had to flatten this function since igraph is having some problems with Jupyter Notebook,
we'll only call it here so it's fine!

"""

dataset_name = config['dataset_name']
visualization_tool=GraphVisualizationTool.IGRAPH

if isinstance(edge_index, torch.Tensor):
    edge_index_np = edge_index.cpu().numpy()

if isinstance(node_labels, torch.Tensor):
    node_labels_np = node_labels.cpu().numpy()

num_of_nodes = len(node_labels_np)
edge_index_tuples = list(zip(edge_index_np[0, :], edge_index_np[1, :]))  # igraph requires this format

# Construct the igraph graph
ig_graph = ig.Graph()
ig_graph.add_vertices(num_of_nodes)
ig_graph.add_edges(edge_index_tuples)

# Prepare the visualization settings dictionary
visual_style = {}

# Defines the size of the plot and margins
# go berserk here try (3000, 3000) it looks amazing in Jupyter!!! (you'll have to adjust the vertex_size though!)
visual_style["bbox"] = (700, 700)
visual_style["margin"] = 5

# I've chosen the edge thickness such that it's proportional to the number of shortest paths (geodesics)
# that go through a certain edge in our graph (edge_betweenness function, a simple ad hoc heuristic)

# line1: I use log otherwise some edges will be too thick and others not visible at all
# edge_betweenness returns < 1 for certain edges, that's why I use clip, as log would be negative for those edges
# line2: Normalize so that the thickest edge is 1 otherwise edges appear too thick on the chart
# line3: The idea here is to make the strongest edge stay stronger than others, 6 just worked, don't dwell on it

edge_weights_raw = np.clip(np.log(np.asarray(ig_graph.edge_betweenness())+1e-16), a_min=0, a_max=None)
edge_weights_raw_normalized = edge_weights_raw / np.max(edge_weights_raw)
edge_weights = [w**6 for w in edge_weights_raw_normalized]
visual_style["edge_width"] = edge_weights

# A simple heuristic for vertex size. Size ~ (degree / 4) (it gave nice results I tried log and sqrt as well)
visual_style["vertex_size"] = [deg / 4 for deg in ig_graph.degree()]

# This is the only part that's Cora specific as Cora has 7 labels
if dataset_name.lower() == DatasetType.CORA.name.lower():
    visual_style["vertex_color"] = [cora_label_to_color_map[label] for label in node_labels_np]
else:
    print('Feel free to add custom color scheme for your specific dataset. Using igraph default coloring.')

# Set the layout - the way the graph is presented on a 2D chart. Graph drawing is a subfield for itself!
# I used "Kamada Kawai" a force-directed method, this family of methods are based on physical system simulation.
# (layout_drl also gave nice results for Cora)
visual_style["layout"] = ig_graph.layout_kamada_kawai()

print('Plotting results ... (it may take couple of seconds).')
ig.plot(ig_graph, **visual_style)

# This website has got some awesome visualizations check it out:
# http://networkrepository.com/graphvis.php?d=./data/gsm50/labeled/cora.edges

----------------------------------------------------------------------------------------------

OK, that wraps up our first acquaintance with the Cora dataset. To finish, here is the visualization pipeline applied to the toy dataset we used to explain the code above:

# Visualization related imports
import matplotlib.pyplot as plt
import networkx as nx
import igraph as ig

# Main computation libraries
import scipy.sparse as sp
import numpy as np
import pandas as pd

# Deep learning related imports (torch is used further down)
import torch

# Node features
node_df = pd.DataFrame({
    'w_0': [1., 0., 1., 0., 1., 0., 1., 0.],
    'w_1': [0., 0., 0., 0., 1., 0., 1., 0.],
    'w_2': [1., 1., 1., 1., 1., 1., 1., 1.],
    'w_4': [0., 1., 0., 1., 1., 0., 1., 1.]
})
num_of_nodes = 8
node_df = sp.csr_matrix(node_df.values)
node_df_sum = np.array(node_df.sum(-1))
node_df_inv_sum = np.power(node_df_sum, -1).squeeze()
# Some sums may be 0, so 1/0 would give inf; replace those with 1
node_df_inv_sum[np.isinf(node_df_inv_sum)] = 1.
diag_node_df_sum_matrix = sp.diags(node_df_inv_sum)
node_features_norm = diag_node_df_sum_matrix.dot(node_df).toarray()  # normalized features (not used below)
# Edges
adj_list_dict = {
    0: [1, 2, 5],
    1: [0, 2, 3, 4],
    2: [0, 1, 5],
    3: [1, 4, 5],
    4: [1, 3, 6],
    5: [0, 2, 3, 6 ,7],
    6: [4, 5, 7],
    7: [5, 6]
}

source_nodes_ids, target_nodes_ids = [], []
seen_edges = set()
for src_node, neighboring_nodes in adj_list_dict.items():
    for trg_node in neighboring_nodes:
        # if this edge hasn't been seen so far we add it to the edge index (coalescing - removing duplicates)
        if (src_node, trg_node) not in seen_edges:  # it'd be easy to explicitly remove self-edges (Cora has none..)
            source_nodes_ids.append(src_node)
            target_nodes_ids.append(trg_node)

            seen_edges.add((src_node, trg_node))
source_nodes_ids.extend(np.arange(num_of_nodes))
target_nodes_ids.extend(np.arange(num_of_nodes))
edge_index = np.row_stack((source_nodes_ids, target_nodes_ids))

in_degrees = np.zeros(num_of_nodes, dtype=np.int_)
out_degrees = np.zeros(num_of_nodes, dtype=np.int_)

num_of_edges = edge_index.shape[1]
for cnt in range(num_of_edges):
    source_node_id = edge_index[0, cnt]
    target_node_id = edge_index[1, cnt]

    out_degrees[source_node_id] += 1  # source node points towards some other node -> increment its out degree
    in_degrees[target_node_id] += 1  # similarly here

hist = np.zeros(np.max(out_degrees) + 1)
for out_degree in out_degrees:
    hist[out_degree] += 1

fig = plt.figure(figsize=(12,8), dpi=100)  # otherwise plots are really small in Jupyter Notebook
fig.subplots_adjust(hspace=0.6)

plt.subplot(311)
plt.plot(in_degrees, color='red')
plt.xlabel('node id'); plt.ylabel('in-degree count'); plt.title('Input degree for different node ids')

plt.subplot(312)
plt.plot(out_degrees, color='green')
plt.xlabel('node id'); plt.ylabel('out-degree count'); plt.title('Out degree for different node ids')

plt.subplot(313)
plt.plot(hist, color='blue')
plt.xlabel('node degree')
plt.ylabel('# nodes for a given out-degree') 
plt.title('Node out-degree distribution for the toy dataset')
plt.xticks(np.arange(0, len(hist), 5.0))

plt.grid(True)
plt.show()

label_to_color_map = {0: "red", 1: "blue"}
edge_index = torch.from_numpy(edge_index)
node_labels = np.array([0,0,0,1,1,0,1,1])
node_labels = torch.from_numpy(node_labels)
edge_index_np = edge_index.cpu().numpy()
node_labels_np = node_labels.cpu().numpy()
print(type(node_labels), len(node_labels))
num_of_nodes = len(node_labels_np)
edge_index_tuples = list(zip(edge_index_np[0, :], edge_index_np[1, :]))

# Construct the igraph graph
ig_graph = ig.Graph()
ig_graph.add_vertices(num_of_nodes)
ig_graph.add_edges(edge_index_tuples)

# Prepare the visualization settings dictionary
visual_style = {}

# Defines the size of the plot and margins
visual_style["bbox"] = (400, 400)
visual_style["margin"] = 20

edge_weights_raw = np.clip(np.log(np.asarray(ig_graph.edge_betweenness())+1e-16), a_min=0, a_max=None)
edge_weights_raw_normalized = edge_weights_raw / np.max(edge_weights_raw)
edge_weights = [w**6 for w in edge_weights_raw_normalized]
visual_style["edge_width"] = edge_weights
visual_style["vertex_color"] = [label_to_color_map[label] for label in node_labels_np]
visual_style["layout"] = ig_graph.layout_kamada_kawai()
ig.plot(ig_graph, **visual_style)
