Use pairwise_distances
Compute the distance and subtract it from 1 to get the similarity score:
from sklearn.metrics.pairwise import pairwise_distances
1 - pairwise_distances(df.T.to_numpy(), metric='jaccard')
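Explanation:
In newer versions of scikit-learn, jaccard_score is defined in line with the Jaccard similarity coefficient as given on Wikipedia (https://en.wikipedia.org/wiki/Jaccard_index):
J = M11 / (M01 + M10 + M11)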
where
- M11 represents the total number of attributes where A and B both have a value of 1.
- M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
- M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
- M00 represents the total number of attributes where A and B both have a value of 0.
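To make the definition concrete, here is a minimal pure-Python sketch that counts the cells and applies J = M11 / (M01 + M10 + M11). The function name jaccard_from_counts and the toy vectors are made up for illustration and are not part of scikit-learn; returning 0.0 when neither vector contains a 1 is an assumed convention.

```python
def jaccard_from_counts(a, b):
    """Jaccard similarity of two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    pairs = list(zip(a, b))
    m11 = sum(1 for x, y in pairs if x == 1 and y == 1)
    m01 = sum(1 for x, y in pairs if x == 0 and y == 1)
    m10 = sum(1 for x, y in pairs if x == 1 and y == 0)
    denom = m01 + m10 + m11
    # M00 never enters the formula. If neither vector has a 1 the ratio
    # is 0/0; returning 0.0 here is an assumed convention.
    return m11 / denom if denom else 0.0


# Toy vectors (made up): M11 = 2, M10 = 1, M01 = 1, M00 = 1
a = [1, 1, 0, 0, 1]
b = [1, 0, 1, 0, 1]
print(jaccard_from_counts(a, b))  # 0.5
```

Note that the rows where both values are 0 (M00) do not affect the score, which is what distinguishes it from an accuracy-style measure.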
Let's create a sample dataset to check whether the results match:
from pandas import DataFrame, crosstab
from numpy.random import default_rng
rng = default_rng(0)
# Create a dataframe of 40 rows and 5 columns (named A, B, C, D, E)
# Each cell in the DataFrame is either 0 or 1 with 50% probability
df = DataFrame(rng.binomial(1, 0.5, size=(40, 5)), columns=list('ABCDE'))
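This yields the following cross-tabulation for columns A and B (counts as given in the comments below):
       B=0  B=1
A=0     10    7
A=1     14    9
By definition, the Jaccard similarity score is: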
M00 = (df['A'].eq(0) & df['B'].eq(0)).sum() # 10
M01 = (df['A'].eq(0) & df['B'].eq(1)).sum() # 7
M10 = (df['A'].eq(1) & df['B'].eq(0)).sum() # 14
M11 = (df['A'].eq(1) & df['B'].eq(1)).sum() # 9
print(M11 / (M01 + M10 + M11)) # 0.3
This is the same value you get from jaccard_score:
from sklearn.metrics import jaccard_score
print(jaccard_score(df['A'], df['B'])) # 0.3
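The problem with the jaccard_score function is that it is not vectorized: you have to loop over all pairs of columns to compute each similarity score. To avoid that, you can use the vectorized distance version. However, since it returns a "distance" rather than a "similarity", you need to subtract the result from 1: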
from sklearn.metrics.pairwise import pairwise_distances
print(1 - pairwise_distances(df.T.to_numpy(), metric='jaccard'))
# [[1. 0.3 0.45714286 0.34285714 0.46666667]
# [0.3 1. 0.29411765 0.33333333 0.23333333]
# [0.45714286 0.29411765 1. 0.40540541 0.44117647]
# [0.34285714 0.33333333 0.40540541 1. 0.36363636]
# [0.46666667 0.23333333 0.44117647 0.36363636 1. ]]
Alternatively, you can convert it back to a DataFrame:
jac_sim = 1 - pairwise_distances(df.T.to_numpy(), metric='jaccard')
# Reuse jac_sim instead of recomputing the distances
jac_sim_df = DataFrame(jac_sim, index=df.columns, columns=df.columns)
print(jac_sim_df)
# A B C D E
# A 1.000000 0.300000 0.457143 0.342857 0.466667
# B 0.300000 1.000000 0.294118 0.333333 0.233333
# C 0.457143 0.294118 1.000000 0.405405 0.441176
# D 0.342857 0.333333 0.405405 1.000000 0.363636
# E 0.466667 0.233333 0.441176 0.363636 1.000000
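As a cross-check on the pairwise_distances result, the same pairwise similarities can be computed with plain NumPy matrix products. This is only a sketch: the helper name jaccard_similarity_matrix is hypothetical, and it assumes a strictly 0/1 array in which every column contains at least one 1 (otherwise the division is 0/0).

```python
import numpy as np


def jaccard_similarity_matrix(X):
    """Pairwise Jaccard similarity between the columns of a 0/1 array X."""
    X = np.asarray(X, dtype=np.int64)
    # intersection[i, j] = rows where columns i and j are both 1 (M11)
    intersection = X.T @ X
    # ones[i] = number of 1s in column i
    ones = X.sum(axis=0)
    # union[i, j] = M01 + M10 + M11 = ones[i] + ones[j] - M11
    union = ones[:, None] + ones[None, :] - intersection
    return intersection / union


# Toy data (made up for illustration): 4 rows, 3 columns
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
S = jaccard_similarity_matrix(X)
print(S)
```

Applied to the example data, jaccard_similarity_matrix(df.to_numpy()) should agree with 1 - pairwise_distances(df.T.to_numpy(), metric='jaccard'), since the Jaccard distance is (M01 + M10) / (M01 + M10 + M11).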
Note: In the previous version of this answer, the calculations used the hamming metric with pairwise_distances because in earlier versions of scikit-learn, jaccard_score was calculated similarly to the accuracy score (i.e. (M00 + M11) / (M00 + M01 + M10 + M11)). That is no longer the case, so the answer was updated to use the jaccard metric instead of hamming.