How to compute Jaccard similarity from a pandas DataFrame


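I have a DataFrame with shape (1510, 1399). The columns represent products, and the rows hold the values (0 or 1) that users assigned to a given product. How can I compute the jaccard_similarity_score?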


data_ibs = pd.DataFrame(index=data_g.columns, columns=data_g.columns)

for i in range(len(data_ibs.columns)):
    # Loop through the columns for each column
    for j in range(len(data_ibs.columns)):
        ...

Use pairwise_distances to compute the distance, then subtract that distance from 1 to get the similarity score:

from sklearn.metrics.pairwise import pairwise_distances
1 - pairwise_distances(df.T.to_numpy(), metric='jaccard')
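
To see what that one-liner returns, here is a minimal sketch on a tiny hand-made frame. The frame `toy` and its column names are made up for illustration, and the cast to bool is my own addition so that the Jaccard metric receives binary data directly:

```python
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical toy data: 4 users (rows) x 3 products (columns), values 0/1
toy = pd.DataFrame(
    {"p1": [1, 1, 0, 0], "p2": [1, 0, 1, 0], "p3": [1, 1, 0, 0]}
)

# Transpose so that each product becomes one row, cast to bool,
# compute pairwise Jaccard distances, and turn distance into similarity
toy_sim = 1 - pairwise_distances(toy.T.to_numpy(dtype=bool), metric="jaccard")
toy_sim_df = pd.DataFrame(toy_sim, index=toy.columns, columns=toy.columns)
print(toy_sim_df)
```

Here p1 and p3 are identical, so their similarity is 1, while p1 and p2 share one user out of the three who picked either product, giving 1/3.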


In newer versions of scikit-learn, jaccard_score is defined in line with the Jaccard similarity coefficient on Wikipedia (https://en.wikipedia.org/wiki/Jaccard_index):

J = M11 / (M01 + M10 + M11)

where:


  • M11 represents the total number of attributes where A and B both have a value of 1.
  • M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
  • M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
  • M00 represents the total number of attributes where A and B both have a value of 0.


from pandas import DataFrame, crosstab
from numpy.random import default_rng
rng = default_rng(0)

# Create a dataframe of 40 rows and 5 columns (named A, B, C, D, E)
# Each cell in the DataFrame is either 0 or 1 with 50% probability
df = DataFrame(rng.binomial(1, 0.5, size=(40, 5)), columns=list('ABCDE'))

For columns A and B, this gives the following cross-tabulation:

A/B     0     1
0      10     7
1      14     9
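
The table above can be produced with the crosstab function that the import line already pulls in. The following is a sketch of how that might look, assuming the same seed and frame construction as the snippet above; the variable name `ct` is mine:

```python
from numpy.random import default_rng
from pandas import DataFrame, crosstab

# Rebuild the random 0/1 frame from the answer (seed 0, 40 rows, columns A to E)
rng = default_rng(0)
df = DataFrame(rng.binomial(1, 0.5, size=(40, 5)), columns=list("ABCDE"))

# Rows are the values of A, columns are the values of B;
# cell (a, b) counts the rows where A == a and B == b
ct = crosstab(df["A"], df["B"])
print(ct)

# The four cells map onto M00, M01, M10 and M11 from the definition
M00, M01 = ct.loc[0, 0], ct.loc[0, 1]
M10, M11 = ct.loc[1, 0], ct.loc[1, 1]
print(M11 / (M01 + M10 + M11))
```

Reading the counts off the cross-tabulation is equivalent to the boolean-mask sums, just with the four cases laid out in one table.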

By definition, the Jaccard similarity score is:

M00 = (df['A'].eq(0) & df['B'].eq(0)).sum()  # 10
M01 = (df['A'].eq(0) & df['B'].eq(1)).sum()  # 7
M10 = (df['A'].eq(1) & df['B'].eq(0)).sum()  # 14
M11 = (df['A'].eq(1) & df['B'].eq(1)).sum()  # 9

print(M11 / (M01 + M10 + M11))  # 0.3


# scikit-learn's jaccard_score gives the same result
from sklearn.metrics import jaccard_score
print(jaccard_score(df['A'], df['B']))  # 0.3

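The problem with the jaccard_score function is that it is not vectorized: you have to loop over all the columns to compute the similarity score for each pair of columns. To avoid that, you can use the vectorized distance version instead. Because it returns a "distance" rather than a "similarity", you need to subtract the result from 1: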

from sklearn.metrics.pairwise import pairwise_distances
print(1 - pairwise_distances(df.T.to_numpy(), metric='jaccard'))

# [[1.         0.3        0.45714286 0.34285714 0.46666667]
#  [0.3        1.         0.29411765 0.33333333 0.23333333]
#  [0.45714286 0.29411765 1.         0.40540541 0.44117647]
#  [0.34285714 0.33333333 0.40540541 1.         0.36363636]
#  [0.46666667 0.23333333 0.44117647 0.36363636 1.        ]]
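
For intuition about why a vectorized version is possible at all, the same quantity can be written with plain matrix operations. This is a different technique from the pairwise_distances call used in this answer, shown only as a sketch on hypothetical toy data; it assumes no column is entirely zero, since that would give 0 / 0:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 users (rows) x 3 products (columns)
toy = pd.DataFrame(
    {"p1": [1, 1, 0, 0], "p2": [1, 0, 1, 0], "p3": [1, 1, 0, 0]}
)
X = toy.to_numpy(dtype=np.int64)

# intersection[i, j] = number of rows where columns i and j are both 1 (M11)
intersection = X.T @ X

# ones[i] = number of 1s in column i; union = |A| + |B| - |A and B|,
# which equals M01 + M10 + M11
ones = X.sum(axis=0)
union = ones[:, None] + ones[None, :] - intersection

# Assumes every column contains at least one 1, so union is never 0
np_sim = pd.DataFrame(intersection / union, index=toy.columns, columns=toy.columns)
print(np_sim)
```

A single matrix product yields all the M11 counts at once, which is what removes the need for an explicit double loop over columns.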

Optionally, you can convert the result back to a DataFrame:

jac_sim = 1 - pairwise_distances(df.T.to_numpy(), metric='jaccard')
jac_sim_df = DataFrame(jac_sim, index=df.columns, columns=df.columns)
print(jac_sim_df)

#           A         B         C         D         E
#  A  1.000000  0.300000  0.457143  0.342857  0.466667
#  B  0.300000  1.000000  0.294118  0.333333  0.233333
#  C  0.457143  0.294118  1.000000  0.405405  0.441176
#  D  0.342857  0.333333  0.405405  1.000000  0.363636
#  E  0.466667  0.233333  0.441176  0.363636  1.000000
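
As a sanity check, the loop-based calculation with jaccard_score can be compared against the vectorized one. This is a sketch under the same seed and frame construction as above; the names `loop_sim` and `vec_sim` are mine, and the cast to bool is an addition so that the metric receives binary input directly:

```python
from itertools import combinations

import numpy as np
from numpy.random import default_rng
from pandas import DataFrame
from sklearn.metrics import jaccard_score
from sklearn.metrics.pairwise import pairwise_distances

rng = default_rng(0)
df = DataFrame(rng.binomial(1, 0.5, size=(40, 5)), columns=list("ABCDE"))

# Loop version: one jaccard_score call per unordered pair of columns.
# The diagonal starts at 1.0 because a column is identical to itself.
loop_sim = DataFrame(1.0, index=df.columns, columns=df.columns)
for a, b in combinations(df.columns, 2):
    score = jaccard_score(df[a], df[b])
    loop_sim.loc[a, b] = score
    loop_sim.loc[b, a] = score  # the measure is symmetric

# Vectorized version: all pairs in a single call
vec_sim = 1 - pairwise_distances(df.T.to_numpy(dtype=bool), metric="jaccard")

# Compare the two results element by element
print(np.allclose(loop_sim.to_numpy(), vec_sim))
```

With 5 columns the loop makes only 10 calls, but for the 1399 columns in the question it would make close to a million, which is where the vectorized call pays off.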

Note: A previous version of this answer used the hamming metric with pairwise_distances, because in earlier versions of scikit-learn jaccard_score was calculated similarly to the accuracy score, i.e. (M00 + M11) / (M00 + M01 + M10 + M11). That is no longer the case, so the answer was updated to use the jaccard metric instead of hamming.
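
The difference between the two metrics mentioned in the note can be seen on a single hand-made pair of vectors. The vectors below are hypothetical and chosen so that the counts are easy to verify by eye:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical pair of binary vectors: M11 = 1, M10 = 1, M01 = 1, M00 = 2
a = np.array([1, 1, 0, 0, 0], dtype=bool)
b = np.array([1, 0, 1, 0, 0], dtype=bool)
X = np.vstack([a, b])

# 1 - hamming counts matching positions, including shared zeros:
# (M00 + M11) / (M00 + M01 + M10 + M11) = 3 / 5
hamming_sim = 1 - pairwise_distances(X, metric="hamming")[0, 1]

# 1 - jaccard ignores shared zeros: M11 / (M01 + M10 + M11) = 1 / 3
jaccard_sim = 1 - pairwise_distances(X, metric="jaccard")[0, 1]

print(hamming_sim, jaccard_sim)
```

Shared zeros raise the hamming-based value but leave the Jaccard value unchanged, so the two diverge most on sparse data where most entries are 0.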


