Spectral clustering a graph in Python

2024-03-03

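I'd like to cluster a graph in Python using spectral clustering.

Spectral clustering is a more general technique that can be applied not only to graphs but also to images or any other kind of data; however, it is considered an exceptional graph clustering technique. Sadly, I can't find any examples online of spectral clustering on graphs in Python.

Scikit-learn documents two spectral clustering methods: SpectralClustering http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html and spectral_clustering http://scikit-learn.org/stable/modules/generated/sklearn.cluster.spectral_clustering.html, which do not appear to be aliases.
Both methods mention that they can be used on graphs, but neither gives specific instructions. Nor does the user guide http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering. I've asked the developers for such an example https://github.com/scikit-learn/scikit-learn/issues/9481, but they are overworked and haven't gotten to it yet.
A good network to document this against is the Karate Club network https://en.wikipedia.org/wiki/Zachary%27s_karate_club. It is included as a method in networkx https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.generators.social.karate_club_graph.html.

I'd love some direction on how to go about this. If someone can help me figure it out, I can add the documentation to scikit-learn.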

Notes:

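A question much like this one has already been asked on this site https://stackoverflow.com/questions/23684746/spectral-clustering-using-scikit-learn-on-graph-generated-through-networkx.

I don't have much experience with spectral clustering and am just going by the docs (skip to the end for the results!):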

Code:

import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)

# Get your mentioned graph
G = nx.karate_club_graph()

# Get ground-truth: club-labels -> transform to 0/1 np-array
#     (possible overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])

# Get adjacency-matrix as numpy-array
# (nx.to_numpy_matrix was removed in NetworkX 3.0; to_numpy_array is the replacement)
adj_mat = nx.to_numpy_array(G)

print('ground truth')
print(gt)

# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)

# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))

# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828

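General idea:

Introduction to the data and task, from here http://ptrckprry.com/course/ssd/lecture/community.html:

The nodes in the graph represent the 34 members of a college karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the karate club, and it split into two factions: one led by "Mr. Hi", and one led by "John A". It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.

Using sklearn and spectral clustering to tackle this:

If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.

This http://www.dccia.ua.es/~sco/Spectral/Lesson3_Cuts.pdf describes normalized graph cuts as:

Find two disjoint partitions A and B of the vertices V of a graph, so that A ∪ B = V and A ∩ B = ∅

Given a similarity measure w(i,j) between two vertices (e.g. identity when they are connected) a cut value (and its normalized version) is defined as: cut(A, B) = SUM u in A, v in B: w(u, v)

...

We seek the minimization of disassociation between the groups A and B and the maximization of the association within each group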
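To make the cut definition a bit more concrete, here is a minimal NumPy sketch on a toy graph (two triangles joined by one edge, which is my own example and not from the slides). The quote elides the normalized version, so I'm assuming the usual Shi-Malik form, Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B), where vol is the sum of degrees in a group. The helper names `ncut` and `fiedler_bipartition` are made up for illustration, and the sign-of-the-second-eigenvector split is only the textbook relaxation, not necessarily what sklearn does internally.

```python
import numpy as np

def ncut(W, mask):
    """Normalized cut of the bipartition (mask, ~mask) of a symmetric affinity matrix W.

    Assumes the Shi-Malik definition: cut/vol(A) + cut/vol(B).
    """
    cut = W[mask][:, ~mask].sum()   # total weight of edges crossing the partition
    vol_a = W[mask].sum()           # sum of degrees of the vertices in A
    vol_b = W[~mask].sum()          # sum of degrees of the vertices in B
    return cut / vol_a + cut / vol_b

def fiedler_bipartition(W):
    """Split vertices by the sign of the second eigenvector of the normalized Laplacian.

    Assumes W is symmetric and has no isolated vertices (all degrees > 0).
    """
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)      # eigenvalues come back in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]    # rescale to the random-walk eigenvector
    return fiedler >= 0

# Toy graph: triangle (0, 1, 2) and triangle (3, 4, 5), joined by the edge 2-3
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

mask = fiedler_bipartition(W)
print(mask)            # one triangle is True, the other False (orientation is arbitrary)
print(ncut(W, mask))   # cut = 1, vol(A) = vol(B) = 7, so 2/7

# A deliberately bad split for comparison: vertex 0 alone against everyone else
lonely = np.array([True, False, False, False, False, False])
print(ncut(W, lonely))
```

The same two helpers should work on the karate club adjacency matrix as well, since all they need is a symmetric non-negative matrix.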

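Sounds alright. So we create the adjacency matrix (nx.to_numpy_matrix(G), or nx.to_numpy_array(G) on NetworkX 3.0 and later, where to_numpy_matrix no longer exists) and set the param affinity to precomputed (as our adjacency matrix is our precomputed similarity measure).

Alternatively, using precomputed, a user-provided affinity matrix can be used.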
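Regarding the two entries in the docs: as far as I can tell, the lower-case `spectral_clustering` function takes the affinity matrix directly and returns the labels, so no `affinity='precomputed'` is needed there. A minimal sketch, assuming a reasonably recent scikit-learn and NetworkX; thresholding the matrix to 0/1 is my own choice, so that every edge counts equally whether or not the installed NetworkX attaches edge weights to this graph.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import spectral_clustering

G = nx.karate_club_graph()

# Binary adjacency matrix: 1.0 where an edge exists, 0.0 otherwise
adj = (nx.to_numpy_array(G) > 0).astype(float)

# The function form: affinity matrix in, one label per node out
labels = spectral_clustering(
    adj,
    n_clusters=2,
    assign_labels='discretize',
    random_state=0,
)
print(labels)
```

I have not pinned an expected output here, since the exact labels (and which cluster gets called 0) may vary between library versions.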

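Edit: While unfamiliar with this, I looked for parameters to tune and found assign_labels http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering.fit:

The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.

So, trying the less sensitive approach: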

sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351

This matches the ground truth quite well!
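To put a number on that, here is a small sketch that compares the two label vectors printed above, allowing for the fact that cluster ids are arbitrary (0 and 1 may be swapped). The helper `best_agreement` is a made-up name, and it only covers the two-cluster case.

```python
import numpy as np

# Copied from the 'ground truth' and 'spectral clustering' lines of the second output
gt = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
               0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
pred = np.array([0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
                 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

def best_agreement(a, b):
    """Fraction of matching labels under the better of the two possible 0/1 relabelings."""
    same = np.mean(a == b)
    return max(same, 1.0 - same)

mismatched = np.flatnonzero(gt != pred)
print(best_agreement(gt, pred))   # 32 of 34 nodes agree, about 0.94
print(mismatched)                 # [2 8]
```

So with assign_labels='discretize', only nodes 2 and 8 end up in the other faction compared to the recorded split.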

