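Cora dataset description
Cora is a widely used citation-network dataset for research on paper classification and graph analysis.
It consists of 2708 machine-learning papers. Each paper has a unique ID and is assigned to exactly one of 7 classes: Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning and Theory.
Besides the citation links between papers, Cora provides a bag-of-words representation of each paper, which serves as the node feature: a 0/1 vector with one entry per vocabulary word. A row usually contains several 1s, so it is a multi-hot vector, not a one-hot vector, and it records only whether a word occurs in the paper, not how often.
Cora is commonly used to train and evaluate graph neural networks, on tasks such as paper classification, citation-network analysis and node embedding.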
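To make the feature format concrete, here is a minimal sketch in plain Python. The five-word vocabulary and the toy paper are made up for illustration (they are not Cora data), and normalize_row is a hypothetical helper that mimics the kind of row normalization T.NormalizeFeatures() applies in the loading code.

```python
def multi_hot(words, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in `words`."""
    present = set(words)
    return [1.0 if w in present else 0.0 for w in vocabulary]


def normalize_row(row):
    """Divide a row by its sum so the entries add up to 1 (all-zero rows stay zero)."""
    total = sum(row)
    return [v / total for v in row] if total > 0 else list(row)


# Toy vocabulary and paper, assumed for illustration only.
vocabulary = ["graph", "kernel", "network", "policy", "reward"]
paper = ["graph", "network", "graph", "policy"]

features = multi_hot(paper, vocabulary)
print(features)                 # [1.0, 0.0, 1.0, 1.0, 0.0]
print(normalize_row(features))  # the three non-zero entries share the mass equally
```

Note that "graph" occurs twice in the toy paper but still contributes a single 1, which is the difference between a presence vector and a word-count vector.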
Printing the Cora statistics
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

dataset = Planetoid("./tmp/Cora", name="Cora", transform=T.NormalizeFeatures())
data = dataset[0]  # Cora consists of a single graph
num_nodes = data.num_nodes
# For num. edges see:
# - https://github.com/pyg-team/pytorch_geometric/issues/343
# - https://github.com/pyg-team/pytorch_geometric/issues/852
# Every undirected edge is stored in both directions, hence the division by 2.
num_edges = data.num_edges // 2
# The masks are boolean tensors; int() turns the counts into plain numbers.
train_len = int(data.train_mask.sum())
val_len = int(data.val_mask.sum())
test_len = int(data.test_mask.sum())
other_len = num_nodes - train_len - val_len - test_len
print(f"Dataset: {dataset.name}")
print(f"Num. nodes: {num_nodes} (train={train_len}, val={val_len}, test={test_len}, other={other_len})")
print(f"Num. edges: {num_edges}")
print(f"Num. node features: {dataset.num_node_features}")
print(f"Num. classes: {dataset.num_classes}")
print(f"Dataset len.: {len(dataset)}")
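GCN: principle and implementation
Convolution theorem: the convolution of two signals equals the inverse Fourier transform of the product of their Fourier transforms,
$$f * g = F^{-1}\big(F(f) \cdot F(g)\big)$$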
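The convolution theorem can be checked numerically. The sketch below is a plain-Python illustration with made-up signals: it uses a naive discrete Fourier transform, for which the theorem holds for circular convolution. The helper names dft and circular_convolution are hypothetical.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def circular_convolution(f, g):
    """Direct definition: (f * g)[t] = sum_m f[m] * g[(t - m) mod n]."""
    n = len(f)
    return [sum(f[m] * g[(t - m) % n] for m in range(n)) for t in range(n)]


# Toy signals, assumed for illustration only.
f = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 0.0, -1.0, 2.0]

direct = circular_convolution(f, g)
product = [a * b for a, b in zip(dft(f), dft(g))]      # F(f) . F(g)
spectral = [z.real for z in dft(product, inverse=True)]  # F^{-1}(...)

print(direct)
print([round(v, 6) for v in spectral])
```

Both printed lists agree up to floating-point error, which is exactly the statement of the theorem.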
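Given a graph signal x and a convolution kernel g, the graph convolution is defined in the spectral domain, with the eigenvector matrix U of the graph Laplacian playing the role of the Fourier basis. The first two steps are exact; the last step is the first-order approximation that gives the GCN propagation rule:
$$x * g = U(U^T x \odot U^T g) = U(U^T x \odot g_{\theta}) \approx \widetilde D^{-0.5}\widetilde A\widetilde D^{-0.5} X \Theta$$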
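Here A is the adjacency matrix of the graph and D is its degree matrix, with
$$\widetilde D = D + \gamma I, \quad \widetilde A = A + \gamma I$$
that is, a self-loop of weight $\gamma$ is added to every node in order to shrink the eigenvalues $\lambda$ of the Laplace matrix.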
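The step from the spectral form to the normalized adjacency matrix is an approximation, not an identity. The sketch below follows the usual first-order argument from the GCN literature, with $\gamma = 1$. The symbols $L$, $\Lambda$, $\theta_0$ and $\theta_1$ do not appear in the text above and are introduced here only for the derivation.

```latex
\begin{aligned}
L &= I - D^{-0.5} A D^{-0.5} = U \Lambda U^T
  && \text{(normalized Laplacian and its eigendecomposition)} \\
x * g &= U (U^T x \odot g_\theta) = U\, g_\theta(\Lambda)\, U^T x \\
  &\approx \theta_0\, x + \theta_1 (L - I)\, x
  && \text{(first-order expansion, assuming } \lambda_{\max} \approx 2) \\
  &= \theta_0\, x - \theta_1 D^{-0.5} A D^{-0.5} x \\
  &= \theta \left( I + D^{-0.5} A D^{-0.5} \right) x
  && (\theta = \theta_0 = -\theta_1) \\
  &\to \theta\, \widetilde D^{-0.5} \widetilde A \widetilde D^{-0.5} x
  && \text{(renormalization: } \widetilde A = A + I,\ \widetilde D = D + I)
\end{aligned}
```

Stacking several input features as the columns of $X$ and several filters as the columns of $\Theta$ then gives $\widetilde D^{-0.5}\widetilde A\widetilde D^{-0.5} X \Theta$.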
1. Computation of $\widetilde D^{-0.5}\widetilde A\widetilde D^{-0.5}$
import torch
from torch_geometric.utils import add_remaining_self_loops, scatter
from torch_geometric.utils.num_nodes import maybe_num_nodes

def gcn_norm(edge_index, edge_weight=None, num_nodes=None,
             add_self_loops=True, flow="source_to_target", dtype=None):
    fill_value = 1.  # weight gamma of every added self-loop
    num_nodes = maybe_num_nodes(edge_index, num_nodes)
    if add_self_loops:  # add self-loops
        edge_index, edge_weight = add_remaining_self_loops(
            edge_index, edge_weight, fill_value, num_nodes)
    if edge_weight is None:  # unweighted graph: every edge gets weight 1
        edge_weight = torch.ones((edge_index.size(1), ), dtype=dtype,
                                 device=edge_index.device)
    row, col = edge_index[0], edge_index[1]
    idx = col  # accumulate degrees at the target node of each edge
    deg = scatter(edge_weight, idx, dim=0, dim_size=num_nodes, reduce='sum')
    deg_inv_sqrt = deg.pow_(-0.5)
    deg_inv_sqrt.masked_fill_(deg_inv_sqrt == float('inf'), 0)
    edge_weight = deg_inv_sqrt[row] * edge_weight * deg_inv_sqrt[col]
    return edge_index, edge_weight
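Code explanation
edge_index, edge_weight = add_remaining_self_loops(edge_index, edge_weight, fill_value, num_nodes)
:
adds a self-loop of weight $\gamma$ = fill_value = 1 to every node that does not already have one, i.e. $\widetilde D = D + \gamma I,\ \widetilde A = A + \gamma I$;
deg = scatter(edge_weight, idx, dim=0, dim_size=num_nodes, reduce='sum')
:
sums edge_weight over all edges that share the same target node idx = edge_index[1]. The result deg is a vector of length num_nodes holding the node degrees, i.e. the diagonal entries of the degree matrix $\widetilde D$;
deg_inv_sqrt = deg.pow_(-0.5)
:
computes the diagonal of $\widetilde D^{-0.5}$ element-wise;
deg_inv_sqrt.masked_fill_(deg_inv_sqrt == float('inf'), 0)
:
a node with degree 0 (possible only when no self-loops are added) gives $0^{-0.5} = \infty$, which is replaced by 0;
edge_weight = deg_inv_sqrt[row] * edge_weight * deg_inv_sqrt[col]
:
scales the weight of every edge by the inverse square roots of the degrees of its two endpoints. The function returns edge_index together with this normalized edge_weight, which holds the nonzero entries of $\widetilde D^{-0.5}\widetilde A\widetilde D^{-0.5}$.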
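The steps above can be traced by hand on a tiny graph. The following is a plain-Python sketch, not the PyG implementation: gcn_norm_lists is a hypothetical helper that works on an unweighted edge list without duplicate edges, and the 3-node path graph is made up for illustration.

```python
def gcn_norm_lists(edges, num_nodes, fill_value=1.0):
    """Mirror the normalization steps on a list of (source, target) edges."""
    # Step 1: add a self-loop to every node that lacks one.
    has_loop = {s for s, t in edges if s == t}
    loops = [(i, i) for i in range(num_nodes) if i not in has_loop]
    all_edges = list(edges) + loops
    weights = [1.0] * len(edges) + [fill_value] * len(loops)

    # Step 2: degree of each node = sum of weights of edges pointing at it.
    deg = [0.0] * num_nodes
    for (_, target), w in zip(all_edges, weights):
        deg[target] += w

    # Step 3: d^{-0.5}, with isolated nodes mapped to 0 instead of infinity.
    deg_inv_sqrt = [d ** -0.5 if d > 0 else 0.0 for d in deg]

    # Step 4: scale every edge by both endpoint factors.
    return {
        (s, t): deg_inv_sqrt[s] * w * deg_inv_sqrt[t]
        for (s, t), w in zip(all_edges, weights)
    }


# Path graph 0 - 1 - 2, stored in both directions (toy example).
path_edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
norm = gcn_norm_lists(path_edges, num_nodes=3)
for edge, w in sorted(norm.items()):
    print(edge, round(w, 4))
```

With self-loops the degrees are 2, 3 and 2, so the self-loop of node 0 gets weight 1/2, the self-loop of node 1 gets 1/3, and the edge (0, 1) gets 1/sqrt(2 * 3).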
2. PairNorm
3. Implementation of GCNConv (trimmed and adapted from torch_geometric.nn.GCNConv)
import torch
from torch import Tensor
from torch.nn import Parameter
from torch_geometric.nn.conv import MessagePassing
from torch_geometric.nn.dense.linear import Linear
from torch_geometric.nn.inits import zeros
from torch_geometric.typing import Adj, OptTensor, SparseTensor
from torch_geometric.utils import spmm

class myGCNConv2(MessagePassing):
    def __init__(self, in_channels: int, out_channels: int,
                 add_self_loops: bool = True, bias: bool = True):
        super().__init__(aggr='add')  # sum the messages arriving at each node
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.add_self_loops = add_self_loops
        self.lin = Linear(in_channels, out_channels, bias=False,
                          weight_initializer='glorot')
        if bias:
            self.bias = Parameter(torch.empty(out_channels))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        super().reset_parameters()
        self.lin.reset_parameters()  # weight matrix Theta
        zeros(self.bias)  # bias term

    def forward(self, x: Tensor, edge_index: Adj,
                edge_weight: OptTensor = None) -> Tensor:
        # Normalized edge weights: entries of D~^{-0.5} A~ D~^{-0.5}
        edge_index, edge_weight = gcn_norm(  # yapf: disable
            edge_index, edge_weight, x.size(self.node_dim),
            self.add_self_loops, self.flow, x.dtype)
        x = self.lin(x)
        # propagate_type: (x: Tensor, edge_weight: OptTensor)
        out = self.propagate(edge_index, x=x, edge_weight=edge_weight,
                             size=None)
        if self.bias is not None:
            out = out + self.bias
        return out

    def message(self, x_j: Tensor, edge_weight: OptTensor) -> Tensor:
        # Weight each neighbour feature x_j by its normalized edge weight
        return x_j if edge_weight is None else edge_weight.view(-1, 1) * x_j

    def message_and_aggregate(self, adj_t: SparseTensor, x: Tensor) -> Tensor:
        # Fused path used when the adjacency is passed as a sparse matrix
        return spmm(adj_t, x, reduce=self.aggr)
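Code explanation
x = self.lin(x)
:
$X' = X\Theta,\ X \in R^{n \times d_1},\ \Theta \in R^{d_1 \times d_2}$: a linear map that projects the node features from $d_1$ to $d_2$ dimensions, reducing the dimension of X;
out = self.propagate(edge_index, x=x, edge_weight=edge_weight, size=None)
:
$$out = A'X' = \widetilde D^{-\frac 1 2}\widetilde A \widetilde D^{-\frac 1 2} X \Theta$$
propagate aggregates the transformed node vectors $\{x_1', \dots, x_n'\}$ into the output: row i of out is the sum of the rows $x_j'$ of the neighbours j of node i (including i itself through its self-loop), each weighted by the normalized edge weight.
message and message_and_aggregate are the hooks that MessagePassing.propagate calls.
In tests, deleting them lowered the validation accuracy, so they are kept. One plausible reason is that message is where the normalized edge_weight is applied: without this override, the default message returns x_j unweighted.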
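What propagate does with the message hook can be sketched in a few lines of plain Python. This is an illustration of the message-then-sum pattern, not PyG code: the propagate function below is a hypothetical stand-in, and the edge weights are arbitrary illustrative values instead of GCN-normalized ones.

```python
def propagate(edges, edge_weight, x):
    """Sum weighted messages x[j] over edges (j -> i) into out[i]."""
    num_nodes, dim = len(x), len(x[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for (source, target), w in zip(edges, edge_weight):
        message = [w * v for v in x[source]]  # message(): edge_weight * x_j
        for k in range(dim):                  # aggregation: sum at the target
            out[target][k] += message[k]
    return out


# Path graph 0 - 1 - 2 with self-loops (toy example).
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 0), (1, 1), (2, 2)]
edge_weight = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
x = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

out = propagate(edges, edge_weight, x)
print(out)  # [[1.0, 1.0], [2.5, 4.0], [4.0, 5.0]]
```

Node 1, for example, receives half of node 0's features, half of node 2's features and its own features through the self-loop. Dropping the multiplication by w in the message step would turn this into an unweighted neighbour sum.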
4. Implementation of Net (GCN)
import torch
from torch import Tensor
from torch_geometric.nn import PairNorm

class GCN(torch.nn.Module):
    def __init__(
        self,
        num_node_features: int,
        num_classes: int,
        hidden_dim: int = 16,
        dropout_rate: float = 0.5,
    ) -> None:
        super().__init__()
        self.dropout1 = torch.nn.Dropout(dropout_rate)
        self.conv1 = myGCNConv2(num_node_features,
                                hidden_dim, add_self_loops=True)
        self.relu = torch.nn.ReLU(inplace=True)
        self.dropout2 = torch.nn.Dropout(dropout_rate)
        self.conv2 = myGCNConv2(hidden_dim, num_classes, add_self_loops=True)
        self.pn = PairNorm()

    def forward(self, x: Tensor, edge_index: Tensor) -> torch.Tensor:
        x = self.pn(x)                 # normalize the input features
        x = self.dropout1(x)
        x = self.conv1(x, edge_index)  # first GCN layer
        x = self.relu(x)
        x = self.dropout2(x)
        x = self.conv2(x, edge_index)  # second GCN layer, outputs class logits
        return x
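Code explanation
x = self.pn(x)
:
applies PairNorm to x: the features are centred across nodes and rescaled to a common scale s, which balances the feature norms of the different nodes. In this experiment the effect was not noticeable.
Two GCN convolution layers are used, with a ReLU activation in between; dropout is applied to reduce overfitting.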
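To make the PairNorm step above concrete, here is a plain-Python sketch of its centre-and-scale form. It is an assumption about what the layer computes in its default configuration, not a copy of the torch_geometric code: pair_norm, scale and eps are hypothetical names and values, and the feature matrix is made up.

```python
import math


def pair_norm(x, scale=1.0, eps=1e-5):
    """Centre the rows of x, then rescale them to a common overall scale."""
    n, dim = len(x), len(x[0])
    # Centre: subtract the mean feature vector taken over all nodes.
    mean = [sum(row[k] for row in x) / n for k in range(dim)]
    centred = [[row[k] - mean[k] for k in range(dim)] for row in x]
    # Scale: divide by the root of the mean squared row norm.
    mean_sq_norm = sum(v * v for row in centred for v in row) / n
    factor = scale / math.sqrt(eps + mean_sq_norm)
    return [[factor * v for v in row] for row in centred]


# Toy node features: 3 nodes, 2 features each (illustration only).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
y = pair_norm(x)
for row in y:
    print([round(v, 4) for v in row])
```

After the transform every feature column has zero mean, and the mean squared row norm is close to scale squared, regardless of how large the input features were.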
Manual implementation of DropEdge
- idea
- First collapse the directed edge list into undirected edges and store them in single_edge_index. Each undirected edge is stored exactly once in the set single_edge, as a (min, max) node pair so that both directions map to the same entry; single_edge is then converted into the undirected edge index (a 2-row tensor).
# single_edge_index
import torch

single_edge = set()
for s, t in dataset[0].edge_index.t().tolist():
    # (s, t) and (t, s) map to the same canonical pair
    single_edge.add((min(s, t), max(s, t)))
# Shape [2, num_undirected_edges]
single_edge_index = torch.tensor(sorted(single_edge)).t().contiguous()
- Then discard a dropout_rate fraction of the undirected edges and convert the rest back into a directed edge index
import torch

def drop_edge(single_edge_index, dropout_rate):
    # Number of undirected edges to drop
    num_edges = single_edge_index.shape[1]
    num_drop = int(num_edges * dropout_rate)
    # Shuffle the edge positions and keep everything after the first num_drop
    remain_indices = torch.randperm(num_edges)[num_drop:]
    remain_single_edges = single_edge_index[:, remain_indices]
    # Add the reverse direction of every kept edge
    reverse_edges = torch.stack([remain_single_edges[1], remain_single_edges[0]], dim=0)
    remain_edges = torch.cat([remain_single_edges, reverse_edges], dim=1)
    return remain_edges
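The two steps can be checked on a toy graph without PyTorch. The sketch below mirrors the same logic with plain lists. undirected_edges and drop_edge_lists are hypothetical helper names, and the 4-node cycle is made up for illustration.

```python
import random


def undirected_edges(directed_edges):
    """Keep each undirected edge once, as a sorted (min, max) pair."""
    return sorted({(min(s, t), max(s, t)) for s, t in directed_edges})


def drop_edge_lists(single_edges, dropout_rate, rng):
    """Drop a fraction of the undirected edges, then restore both directions."""
    num_drop = int(len(single_edges) * dropout_rate)
    shuffled = list(single_edges)
    rng.shuffle(shuffled)
    kept = shuffled[num_drop:]
    return kept + [(t, s) for s, t in kept]


# 4-node cycle, stored in both directions (toy example).
directed = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (0, 3), (3, 0)]
single = undirected_edges(directed)
remaining = drop_edge_lists(single, dropout_rate=0.5, rng=random.Random(0))

print(single)          # [(0, 1), (0, 3), (1, 2), (2, 3)]
print(len(remaining))  # 4: two undirected edges kept, each in both directions
```

Dropping whole undirected edges before restoring the reverse direction keeps the remaining graph symmetric. In DropEdge-style training the random subset is typically re-drawn every epoch, so a natural use is to call drop_edge once per training step and pass the result to the model in place of the full edge index.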