Learning from Graph Data using Keras and TensorFlow (complete Python implementation)

Many kinds of data encountered in practice can be represented as graphs: citation networks, social networks (follower graphs, friend networks, ...), biological networks, and so on.

Features extracted from a graph can improve the performance of predictive models by exploiting the information flow between neighboring nodes. Representing graph data is not straightforward, however, because most machine learning (ML) models expect fixed-size or linear inputs, which is not the case for graphs.

In this post, we will explore some ways of handling generic graphs in order to do node classification based on graph representations learned directly from the data.

Dataset:

The Cora citation network dataset (https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz) will serve as the basis for the implementation and experiments in this post. Each node represents a scientific paper, and an edge between two nodes represents a citation relationship between the two papers.

Each node is represented by a set of binary features (a bag of words) together with the set of edges linking it to other nodes.

The dataset has 2,708 nodes, each assigned to one of seven classes, and the network has 5,429 links. Each node is also described by binary word features indicating the presence of the corresponding word; in total, each node has 1,433 binary (sparse) features. Below we use only 140 samples for training and the rest for validation/testing.
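The scripts below read a preprocessed version of Cora split into train/val/test JSON files. Those files are not shown in this post, but from the way the code accesses them, each record carries a node id, its feature vector, and its label. A hypothetical record consistent with the keys the code reads ('node', 'features', 'label') would look roughly like this; the id and label values are illustrative assumptions:

# Hypothetical record, matching the keys the scripts below read
# ('node', 'features', 'label'); the actual files are not shown in this post.
import json

record = {
    "node": "31336",                     # paper id (assumed format)
    "features": [0, 1, 0] + [0] * 1430,  # 1433 binary bag-of-words flags
    "label": "Neural_Networks",          # one of the seven Cora classes
}
print(json.dumps(record)[:60])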

Problem setup:

Problem: assign a class label to each node in the graph, given only a small number of training samples.

Intuition/hypothesis: nodes that are close to each other in the graph are more likely to share the same label.

Solution: find a way to extract features from the graph that help classify new nodes.

Baseline model:

A simple baseline model

We first experiment with the simplest possible model: one that learns to predict node classes using only the binary features, discarding all graph information.

This model is a fully connected neural network that takes the binary features as input and outputs class probabilities for each node. The Python implementation follows:

import json

import numpy as np
from keras.callbacks import EarlyStopping
from keras.layers import Input, Dense, Dropout
from keras.models import Model
from keras.regularizers import l2, l1
from sklearn.metrics import accuracy_score

dataset = "cora"
# cora Accuracy test : 0.5328820116054158


def get_features_only_model(n_features, n_classes):
    # Fully connected network mapping the binary bag-of-words features
    # directly to class probabilities, with no graph information.
    in_ = Input((n_features,))
    x = Dense(10, activation="relu", kernel_regularizer=l1(0.001))(in_)
    x = Dropout(0.5)(x)
    x = Dense(n_classes, activation="softmax")(x)
    model = Model(in_, x)
    model.compile(loss="sparse_categorical_crossentropy", metrics=['acc'], optimizer="adam")
    model.summary()
    return model


train_samples = json.load(open("../input/%s/%s.train.json" % (dataset, dataset), 'r'))
val_samples = json.load(open("../input/%s/%s.val.json" % (dataset, dataset), 'r'))
test_samples = json.load(open("../input/%s/%s.test.json" % (dataset, dataset), 'r'))
edges_lines = json.load(open("../input/%s/%s.graph.json" % (dataset, dataset), 'r'))
class_int_mapping, node_int_mapping, node_int_class_mapping, node_class_mapping = \
    json.load(open("../input/%s/%s.mappings.json" % (dataset, dataset), 'r'))

X_train = [x['features'] for x in train_samples]
Y_train = [class_int_mapping[x['label']] for x in train_samples]
X_val = [x['features'] for x in val_samples]
Y_val = [class_int_mapping[x['label']] for x in val_samples]
X_test = [x['features'] for x in test_samples]
Y_test = [class_int_mapping[x['label']] for x in test_samples]

X_train, Y_train = np.array(X_train), np.array(Y_train)[:, np.newaxis]
X_test, Y_test = np.array(X_test), np.array(Y_test)[:, np.newaxis]
X_val, Y_val = np.array(X_val), np.array(Y_val)[:, np.newaxis]

model = get_features_only_model(n_features=int(X_train.shape[1]), n_classes=int(max(Y_train) + 1))
early = EarlyStopping(monitor="val_acc", patience=10, restore_best_weights=True)
model.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=400, verbose=2, callbacks=[early])

Y_test_pred = model.predict(X_test)
Y_test_pred = Y_test_pred.argmax(axis=-1).ravel()
print("Accuracy test : %s" % (accuracy_score(Y_test, Y_test_pred)))

Baseline model accuracy: 53.28%

This is the initial accuracy that we will try to improve by adding graph-based features.

Adding graph features:

One way to automatically learn graph features is to embed each node into a vector by training a network on the auxiliary task of predicting the inverse of the shortest path length between two input nodes, as illustrated below:

Learning an embedding vector for each node
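To make the pretraining target concrete, here is a minimal sketch on a toy graph (not the Cora data): the label for a pair of nodes is the inverse of their shortest path length, with a large fallback length of 100 for unreachable pairs, exactly as computed in the full script below.

# Minimal sketch of the pair-regression target on a toy 5-node graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3)])
G.add_node(4)  # isolated node, unreachable from the rest

spl = dict(nx.all_pairs_shortest_path_length(G))

def pair_target(i, j):
    # 1.0 for adjacent nodes, decaying with distance, ~0 when unreachable
    return 1 / max(spl[i].get(j, 100), 1)

print(pair_target(0, 1))  # 1.0
print(pair_target(0, 3))  # 0.333...
print(pair_target(0, 4))  # 0.01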

The next step is to use the pretrained node embedding as input to the classification model. We also add an extra input: the average binary features of the neighboring nodes, where neighbors are determined by distance in the learned embedding space.
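As a standalone illustration of that extra input, the sketch below averages the binary features of each node's nearest neighbors in embedding space. It uses random stand-in matrices where the full script uses the pretrained embeddings and the real features, but the NearestNeighbors computation is the same.

# Sketch: average neighbor features in embedding space.
# W and F are random stand-ins for the learned embeddings and node features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 8))           # stand-in embedding matrix
F = rng.integers(0, 2, size=(20, 30))  # stand-in binary feature matrix

nn = NearestNeighbors(n_neighbors=5).fit(W)
neigh_idx = nn.kneighbors(W, return_distance=False)

# Column 0 is the node itself (distance zero), so we skip it.
neigh_features = F[neigh_idx[:, 1:]].mean(axis=1)
print(neigh_features.shape)  # (20, 30)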

The resulting classification network is shown in the following figure:

Using pretrained embeddings for node classification

The full Python code follows:

import json

import numpy as np
from keras.callbacks import EarlyStopping
from keras.layers import Input, Dense, Dropout, Embedding, Flatten, Multiply, Concatenate
from keras.models import Model
from keras.regularizers import l2, l1
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from random import sample, choice
from sklearn.neighbors import NearestNeighbors
import networkx as nx

dataset = "cora"
# cora Accuracy test : 0.7306576402321083
batch_size = 32
n_neighbors = 5

train_samples = json.load(open("../input/%s/%s.train.json" % (dataset, dataset), 'r'))
val_samples = json.load(open("../input/%s/%s.val.json" % (dataset, dataset), 'r'))
test_samples = json.load(open("../input/%s/%s.test.json" % (dataset, dataset), 'r'))
edges_lines = json.load(open("../input/%s/%s.graph.json" % (dataset, dataset), 'r'))
class_int_mapping, node_int_mapping, node_int_class_mapping, node_class_mapping = \
    json.load(open("../input/%s/%s.mappings.json" % (dataset, dataset), 'r'))
node_int_features_mapping = {node_int_mapping[k['node']]: np.array(k['features'])
                             for k in train_samples + val_samples + test_samples}

# Build the citation graph and precompute all pairwise shortest path lengths.
G = nx.Graph()
for node, int_node in node_int_mapping.items():
    G.add_node(int_node)
for edge in edges_lines:
    G.add_edge(node_int_mapping[edge[0]], node_int_mapping[edge[1]])
spl = dict(nx.all_pairs_shortest_path_length(G))


def get_graph_embedding_model(n_nodes):
    # Learns one 100-d embedding per node by regressing the inverse
    # shortest path length between pairs of nodes.
    in_1 = Input((1,))
    in_2 = Input((1,))
    emb = Embedding(n_nodes, 100, name="node1")
    x1 = emb(in_1)
    x2 = emb(in_2)
    x1 = Flatten()(x1)
    x1 = Dropout(0.1)(x1)
    x2 = Flatten()(x2)
    x2 = Dropout(0.1)(x2)
    x = Multiply()([x1, x2])
    x = Dropout(0.1)(x)
    x = Dense(1, activation="linear", name="spl")(x)
    model = Model([in_1, in_2], x)
    model.compile(loss="mae", optimizer="adam")
    model.summary()
    return model


def get_features_graph_model(n_features, n_classes, n_nodes):
    # Classifier combining the (frozen) pretrained node embedding, the
    # node's own binary features, and the averaged neighbor features.
    in_1 = Input((n_features,))
    in_2 = Input((n_features,))
    in_3 = Input((1,))
    emb = Embedding(n_nodes, 100, name="node1", trainable=False)
    x1 = emb(in_3)
    x1 = Flatten()(x1)
    d = Dense(10, kernel_regularizer=l2(0.0005))
    x2 = d(in_1)
    x3 = d(in_2)
    x = Concatenate()([x1, x2, x3])
    x = Dropout(0.5)(x)
    x = Dense(n_classes, activation="softmax", kernel_regularizer=l1(0.001))(x)
    model = Model([in_1, in_2, in_3], x)
    model.compile(loss="sparse_categorical_crossentropy", metrics=['acc'], optimizer="adam")
    model.summary()
    return model


def gen(list_edges, node_int_mapping, batch_size=batch_size):
    # Node-pair generator: half true edges, half random pairs. The target
    # is 1 / shortest-path-length (with 100 as fallback when unreachable).
    while True:
        positive_samples = sample(list_edges, batch_size // 2)
        positive_samples = [[node_int_mapping[x[0]], node_int_mapping[x[1]]] for x in positive_samples]
        negative_samples = [[choice(range(len(node_int_mapping))), choice(range(len(node_int_mapping)))]
                            for _ in range(batch_size // 2)]
        samples = positive_samples + negative_samples
        X1 = [x[0] for x in samples]
        X2 = [x[1] for x in samples]
        labels = [1 / max(spl[x[0]].get(x[1], 100), 1) for x in samples]
        yield [np.array(X1), np.array(X2)], np.array(labels)


# Pretrain the node embeddings on the pair-regression task.
train, test = train_test_split(edges_lines, test_size=0.05)
model_g = get_graph_embedding_model(len(node_int_mapping))
early = EarlyStopping(monitor="val_loss", patience=50, restore_best_weights=True)
model_g.fit_generator(gen(train, node_int_mapping), validation_data=gen(test, node_int_mapping),
                      epochs=500, verbose=2, callbacks=[early], steps_per_epoch=1000, validation_steps=100)
model_g.save_weights("graph_model_1.h5")
model_g.load_weights("graph_model_1.h5")

# Average the binary features of each node's nearest neighbors
# in the learned embedding space.
W = model_g.layers[2].get_weights()[0]
nearest_neighbors = NearestNeighbors(n_neighbors=n_neighbors)
nearest_neighbors.fit(W)
W_pred = nearest_neighbors.kneighbors(W, return_distance=False)
n_features = len(train_samples[0]['features'])
neigh_features = {}
for i, neighs in enumerate(W_pred):
    v = np.zeros((n_features,))
    cnt = 0.0001
    for n in neighs[1:]:  # skip the node itself
        v += node_int_features_mapping[n]
        cnt += 1
    neigh_features[i] = v / cnt

X_train = [x['features'] for x in train_samples]
X_G_train = [node_int_mapping[x['node']] for x in train_samples]
X_train_2 = [neigh_features[node_int_mapping[x['node']]] for x in train_samples]
Y_train = [class_int_mapping[x['label']] for x in train_samples]
X_val = [x['features'] for x in val_samples]
X_G_val = [node_int_mapping[x['node']] for x in val_samples]
X_val_2 = [neigh_features[node_int_mapping[x['node']]] for x in val_samples]
Y_val = [class_int_mapping[x['label']] for x in val_samples]
X_test = [x['features'] for x in test_samples]
X_G_test = [node_int_mapping[x['node']] for x in test_samples]
X_test_2 = [neigh_features[node_int_mapping[x['node']]] for x in test_samples]
Y_test = [class_int_mapping[x['label']] for x in test_samples]

X_train, X_train_2, X_G_train, Y_train = np.array(X_train), np.array(X_train_2), \
    np.array(X_G_train)[:, np.newaxis], np.array(Y_train)[:, np.newaxis]
X_val, X_val_2, X_G_val, Y_val = np.array(X_val), np.array(X_val_2), \
    np.array(X_G_val)[:, np.newaxis], np.array(Y_val)[:, np.newaxis]
X_test, X_test_2, X_G_test, Y_test = np.array(X_test), np.array(X_test_2), \
    np.array(X_G_test)[:, np.newaxis], np.array(Y_test)[:, np.newaxis]

# Train the classifier on top of the frozen pretrained embedding.
model = get_features_graph_model(n_features=int(X_train.shape[1]), n_classes=int(max(Y_train) + 1),
                                 n_nodes=len(node_int_mapping))
model.load_weights("graph_model_1.h5", by_name=True)
early = EarlyStopping(monitor="val_acc", patience=5000, restore_best_weights=True)
model.fit([X_train, X_train_2, X_G_train], Y_train, validation_data=([X_val, X_val_2, X_G_val], Y_val),
          epochs=1000, verbose=2, callbacks=[early])

Y_test_pred = model.predict([X_test, X_test_2, X_G_test])
Y_test_pred = Y_test_pred.argmax(axis=-1).ravel()
print("Accuracy test : %s" % (accuracy_score(Y_test, Y_test_pred)))

Graph embedding classification model accuracy: 73.06%

We can see that adding the learned graph features as input to the classification model significantly improves classification accuracy compared with the baseline model, from 53.28% to 73.06%.

Improving graph feature learning:

We can improve on the previous model by pushing the pretraining further: feed the binary features into the node-embedding network as well, then reuse both the pretrained weights of the binary-feature layer and the node embedding vectors in the classifier. The result is a model that relies on a more useful representation of the binary features, one informed by the graph structure.
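The hand-off between the two training stages relies on Keras matching layers by name: the pretraining network and the classifier both name their shared layers "node1" and "features", so load_weights(..., by_name=True) copies exactly those weights and leaves every other layer freshly initialized. A minimal sketch of that mechanism (toy layer sizes, not the actual models):

# Sketch of name-based weight transfer between two different architectures.
from keras.layers import Input, Dense
from keras.models import Model

src_in = Input((8,))
src_hidden = Dense(4, name="features")(src_in)  # the layer we want to reuse
pretrained = Model(src_in, Dense(1)(src_hidden))
pretrained.save_weights("pretrained.h5")

dst_in = Input((8,))
shared = Dense(4, name="features")(dst_in)      # same name and shape
classifier = Model(dst_in, Dense(3, activation="softmax")(shared))

# Only the layer named "features" receives the pretrained weights;
# the softmax head keeps its fresh initialization.
classifier.load_weights("pretrained.h5", by_name=True)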

The full Python implementation follows:

import json

import numpy as np
from keras.callbacks import EarlyStopping
from keras.layers import Input, Dense, Dropout, Embedding, Flatten, Multiply, Concatenate, LeakyReLU
from keras.models import Model
from keras.regularizers import l2, l1
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from random import sample, choice
from sklearn.neighbors import NearestNeighbors
import networkx as nx

dataset = "cora"
# Accuracy test : 0.7635396518375241
n_features = 1433
batch_size = 32
n_neighbors = 5

train_samples = json.load(open("../input/%s/%s.train.json" % (dataset, dataset), 'r'))
val_samples = json.load(open("../input/%s/%s.val.json" % (dataset, dataset), 'r'))
test_samples = json.load(open("../input/%s/%s.test.json" % (dataset, dataset), 'r'))
edges_lines = json.load(open("../input/%s/%s.graph.json" % (dataset, dataset), 'r'))
class_int_mapping, node_int_mapping, node_int_class_mapping, node_class_mapping = \
    json.load(open("../input/%s/%s.mappings.json" % (dataset, dataset), 'r'))
node_int_features_mapping = {node_int_mapping[k['node']]: np.array(k['features'])
                             for k in train_samples + val_samples + test_samples}

# Build the citation graph and precompute all pairwise shortest path lengths.
G = nx.Graph()
for node, int_node in node_int_mapping.items():
    G.add_node(int_node)
for edge in edges_lines:
    G.add_edge(node_int_mapping[edge[0]], node_int_mapping[edge[1]])
spl = dict(nx.all_pairs_shortest_path_length(G))


def get_graph_embedding_model(n_nodes, n_features):
    # Same pair regression as before, but the binary features of both nodes
    # now pass through a shared Dense layer (named "features") whose
    # pretrained weights will be reused by the classifier.
    in_1 = Input((1,))
    in_2 = Input((1,))
    in_3 = Input((n_features,))
    in_4 = Input((n_features,))
    emb = Embedding(n_nodes, 50, name="node1")
    x1 = emb(in_1)
    x2 = emb(in_2)
    x1 = Flatten()(x1)
    x1 = Dropout(0.02)(x1)
    x2 = Flatten()(x2)
    x2 = Dropout(0.02)(x2)
    x = Multiply()([x1, x2])
    d = Dense(10, kernel_regularizer=l2(0.0005), name="features")
    x1_ = d(in_3)
    x2_ = d(in_4)
    x_ = Multiply()([x1_, x2_])
    x = Concatenate()([x, x_])
    x = Dropout(0.02)(x)
    x = Dense(1, activation="linear", name="spl")(x)
    model = Model([in_1, in_2, in_3, in_4], x)
    model.compile(loss="mae", optimizer="adam")
    model.summary()
    return model


def get_features_graph_model(n_features, n_classes, n_nodes):
    # Classifier reusing both pretrained layers by name: the node
    # embedding ("node1", frozen) and the feature projection ("features").
    in_1 = Input((n_features,))
    in_2 = Input((n_features,))
    in_3 = Input((1,))
    emb = Embedding(n_nodes, 50, name="node1", trainable=False)
    x1 = emb(in_3)
    x1 = Flatten()(x1)
    d = Dense(10, name="features", kernel_regularizer=l2(0.01))
    x2 = d(in_1)
    x3 = d(in_2)
    x = Concatenate()([x1, x2, x3])
    x = Dropout(0.5)(x)
    x = Dense(n_classes, activation="softmax", kernel_regularizer=l2(0.01))(x)
    model = Model([in_1, in_2, in_3], x)
    model.compile(loss="sparse_categorical_crossentropy", metrics=['acc'], optimizer="adam")
    model.summary()
    return model


def gen(list_edges, node_int_mapping, batch_size=batch_size):
    # Node-pair generator: half true edges, half random pairs. It now also
    # yields the binary features of both nodes for the "features" branch.
    while True:
        positive_samples = sample(list_edges, batch_size // 2)
        positive_samples = [[node_int_mapping[x[0]], node_int_mapping[x[1]]] for x in positive_samples]
        negative_samples = [[choice(range(len(node_int_mapping))), choice(range(len(node_int_mapping)))]
                            for _ in range(batch_size // 2)]
        samples = positive_samples + negative_samples
        X1 = [x[0] for x in samples]
        X2 = [x[1] for x in samples]
        X1_ = [node_int_features_mapping[j] for j in X1]
        X2_ = [node_int_features_mapping[j] for j in X2]
        labels = [1 / max(spl[x[0]].get(x[1], 100), 1) for x in samples]
        yield [np.array(X1), np.array(X2), np.array(X1_), np.array(X2_)], np.array(labels)


# Pretrain the embeddings together with the shared feature projection.
train, test = train_test_split(edges_lines, test_size=0.05)
model_g = get_graph_embedding_model(len(node_int_mapping), n_features=n_features)
early = EarlyStopping(monitor="val_loss", patience=50, restore_best_weights=True)
model_g.fit_generator(gen(train, node_int_mapping), validation_data=gen(test, node_int_mapping),
                      epochs=400, verbose=2, callbacks=[early], steps_per_epoch=1000, validation_steps=100)
model_g.save_weights("graph_model_2.h5")
model_g.load_weights("graph_model_2.h5")

# Average the binary features of each node's nearest neighbors
# in the learned embedding space.
W = model_g.layers[2].get_weights()[0]
nearest_neighbors = NearestNeighbors(n_neighbors=n_neighbors)
nearest_neighbors.fit(W)
W_pred = nearest_neighbors.kneighbors(W, return_distance=False)
n_features = len(train_samples[0]['features'])
neigh_features = {}
for i, neighs in enumerate(W_pred):
    v = np.zeros((n_features,))
    cnt = 0.0001
    for n in neighs[1:]:  # skip the node itself
        v += node_int_features_mapping[n]
        cnt += 1
    neigh_features[i] = v / cnt

X_train = [x['features'] for x in train_samples]
X_G_train = [node_int_mapping[x['node']] for x in train_samples]
X_train_2 = [neigh_features[node_int_mapping[x['node']]] for x in train_samples]
Y_train = [class_int_mapping[x['label']] for x in train_samples]
X_val = [x['features'] for x in val_samples]
X_G_val = [node_int_mapping[x['node']] for x in val_samples]
X_val_2 = [neigh_features[node_int_mapping[x['node']]] for x in val_samples]
Y_val = [class_int_mapping[x['label']] for x in val_samples]
X_test = [x['features'] for x in test_samples]
X_G_test = [node_int_mapping[x['node']] for x in test_samples]
X_test_2 = [neigh_features[node_int_mapping[x['node']]] for x in test_samples]
Y_test = [class_int_mapping[x['label']] for x in test_samples]

X_train, X_train_2, X_G_train, Y_train = np.array(X_train), np.array(X_train_2), \
    np.array(X_G_train)[:, np.newaxis], np.array(Y_train)[:, np.newaxis]
X_val, X_val_2, X_G_val, Y_val = np.array(X_val), np.array(X_val_2), \
    np.array(X_G_val)[:, np.newaxis], np.array(Y_val)[:, np.newaxis]
X_test, X_test_2, X_G_test, Y_test = np.array(X_test), np.array(X_test_2), \
    np.array(X_G_test)[:, np.newaxis], np.array(Y_test)[:, np.newaxis]

# Train the classifier, initializing the shared layers from the pretrained model.
model = get_features_graph_model(n_features=n_features, n_classes=int(max(Y_train) + 1),
                                 n_nodes=len(node_int_mapping))
model.load_weights("graph_model_2.h5", by_name=True)
early = EarlyStopping(monitor="val_acc", patience=10000, restore_best_weights=True)
model.fit([X_train, X_train_2, X_G_train], Y_train, validation_data=([X_val, X_val_2, X_G_val], Y_val),
          epochs=5000, verbose=2, callbacks=[early])

Y_test_pred = model.predict([X_test, X_test_2, X_G_test])
Y_test_pred = Y_test_pred.argmax(axis=-1).ravel()
print("Accuracy test : %s" % (accuracy_score(Y_test, Y_test_pred)))

Improved graph embedding classification model accuracy: 76.35%

This extra refinement adds a few percentage points of accuracy compared with the previous approach.

Conclusion:

In this post, we saw that we can learn useful representations from graph-structured data and then use those representations to improve the generalization performance of a node classification model from 53.28% to 76.35%.