Use SHAP Loss Values to Debug/Monitor Your Model
Responsible AI has been a very hot topic in recent years. Accountability and explainability have become necessary components of machine learning models, particularly when the models make decisions that impact people's lives, such as in medical diagnosis or financial services. This is a very large topic, and a lot of ongoing work is dedicated to its various aspects; you can find more resources on it in [1]. In this post, I will focus on SHAP (SHapley Additive exPlanations), one of the most popular explainability packages, thanks to its versatility (local/global explainability; model-specific/model-agnostic) and its solid theoretical foundation in game theory. You can find many posts and tutorials on how SHAP helps you understand how your ML model works, i.e., how each of your features contributes to the model prediction. In this post, however, I will talk about SHAP loss values, which many people may be less familiar with. I will walk through some key concepts with an example, and I will also share some of my thoughts.
To begin with, you may want to check the example provided by the SHAP package. There are two important notes:
- The shap loss values show you how each feature contributes to the logloss value, relative to the expected value. (Note: in this post, when I say loss value, it refers to logloss, since we will look at a classification problem.)
- You should use the "interventional" method for the calculation of SHAP loss values.
Essentially, this means that when integrating out the absent features, you should use the marginal distribution instead of the conditional distribution. The marginal distribution is approximated by assigning the absent features values drawn from the background dataset.
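As a toy illustration of what "interventional" means, the sketch below (my own helper, not the SHAP API) overwrites the absent features with rows from a background dataset and averages the model output, which is exactly how the marginal distribution is approximated:

```python
import numpy as np

def interventional_expectation(predict, x, present_idx, background):
    # Tile the instance once per background row, then overwrite the absent
    # features with the background values (marginal distribution), which
    # deliberately breaks any correlation with the present features.
    samples = np.tile(x, (background.shape[0], 1))
    absent_idx = [i for i in range(x.shape[0]) if i not in present_idx]
    samples[:, absent_idx] = background[:, absent_idx]
    return predict(samples).mean()

# Toy check with a linear model f(x) = x0 + x1:
f = lambda X: X[:, 0] + X[:, 1]
bg = np.array([[0.0, 10.0], [0.0, 30.0]])
x = np.array([5.0, 1.0])
# Feature 1 is absent, so the expectation is x0 + mean(bg[:, 1]) = 5 + 20 = 25
print(interventional_expectation(f, x, present_idx=[0], background=bg))  # 25.0
```

Note how the instance's own value of feature 1 (here 1.0) never enters the result; only the background values do.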
The choice between "interventional" (i.e., marginal distribution) and "tree_path_dependent" (i.e., conditional distribution) is an important nuance (see the docstring in the SHAP package) and is worth further discussion. But I don't want to confuse you at the very beginning. For now, you just need to know that in common practice TreeShap calculates shap values very fast because it takes advantage of the conditional distribution given by the tree structure of the model, but the use of the conditional distribution can introduce a causality problem[2].
Train an XGBoost Classifier
The example in this post is modified from the tutorial example in the SHAP package, and you can find the full code and notebook here. I first trained an XGBoost classifier. The dataset uses 12 features to predict whether a person makes over $50K a year.
['Age', 'Workclass', 'Education-Num', 'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital Gain', 'Capital Loss', 'Hours per week', 'Country']
You can use the SHAP package to calculate the shap values. The force plot gives you local explainability, showing how the features contribute to the model prediction for an instance of interest (Fig. 1). The summary plot gives global explainability (Fig. 2). You can check Part 1 in the Jupyter Notebook. There is nothing new there beyond the common use of SHAP, so I will leave the details to you and jump to Part 2, shap values for the model loss.
Explain the Log-Loss of the Model
Now the contribution to the model loss is of more interest, so we need to calculate shap loss values. In some sense, this is similar to residual analysis. The code snippet is as follows. Note that you need to:
- provide background data, since we use the "interventional" approach. The computational cost can be high, so you should provide background data of a reasonable size (here I use 100);
- set model_output to "log_loss".
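The snippet below calls a helper `subsample_data`, which is defined in the notebook rather than in this post. One plausible implementation of a stratified subsampler (my own sketch, not the notebook's exact code) is:

```python
import numpy as np
import pandas as pd

def subsample_data(X, y, n_samples=100, random_state=0):
    # Draw a background sample stratified by the target, so that both classes
    # appear in roughly their original proportions.
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    idx = []
    for label in np.unique(y):
        label_idx = np.where(y == label)[0]
        n_label = max(1, round(n_samples * len(label_idx) / len(y)))
        idx.extend(rng.choice(label_idx, size=n_label, replace=False))
    return X.iloc[idx] if hasattr(X, "iloc") else np.asarray(X)[idx]
```

It returns roughly `n_samples` rows, with class proportions matching those of `y`.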
# subsample to provide the background data (stratified by the target variable)
X_subsample = subsample_data(X, y)
explainer_bg_100 = shap.TreeExplainer(model, X_subsample,
                                      feature_perturbation="interventional",
                                      model_output="log_loss")
shap_values_logloss_all = explainer_bg_100.shap_values(X, y)
Force Plot
Now the force plot for a data instance has a similar interpretation as in Fig. 1, but in terms of log loss instead of prediction. A successful prediction (ground truth True, predicted True) is given in Fig. 3, and a wrong prediction (ground truth True, predicted False) in Fig. 4. You can see how the features in blue try to reduce the logloss from the base value, while those in red increase it. It is noteworthy that the base values (expected values) of the model loss depend on the label (True/False), so the base value is a function rather than a single number. The expected values are calculated by first setting all the data labels to True (or False) and then computing the average log loss; you can find more details in the notebook. I am not sure whether there is a particular reason for calculating the base values this way, but after all, the base values only serve as a reference, so I think it should not matter very much.
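As a sketch of how such a base value could be computed (my own helper, not the SHAP implementation), you can fix every label to True (or False) and average the log loss of the model's predicted probabilities over the background data:

```python
import numpy as np

def expected_logloss(predict_proba, X_background, label):
    # Pretend every background instance has the same label (True or False),
    # then average the log loss of the model's predicted probabilities.
    p = np.clip(predict_proba(X_background), 1e-7, 1 - 1e-7)
    y = float(label)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

# A toy model that always predicts P(positive) = 0.8:
predict = lambda X: np.full(len(X), 0.8)
X_bg = np.zeros((4, 2))
print(expected_logloss(predict, X_bg, True))   # -log(0.8) ≈ 0.223
print(expected_logloss(predict, X_bg, False))  # -log(0.2) ≈ 1.609
```

This makes explicit why the base value is a function of the label: the same predicted probabilities yield two different expected losses.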
Summary Plot
Similarly, we have the summary plot for the model logloss (Fig. 5). It tells you how the features contribute to the model logloss (the calculation is based on the absolute mean). A feature with a large contribution contributes a lot to the model loss, whether by increasing the logloss for some data instances or reducing it for others. Therefore, this summary plot should be broadly consistent with the top features ranked by shap values in Fig. 2, but we can see the ranking orders differ a bit. While "Relationship" remains at the top, the order of "Age", "Education-Num", "Capital Gain", "Hours per week", and "Occupation" is different. And "Capital Gain" has a relatively larger contribution in Fig. 5 than in Fig. 2. This suggests that "Capital Gain" plays an important role in reducing the log loss, while, relatively speaking, it may not be that important for making the prediction compared to "Relationship". Note that the summary plot in Fig. 5 should be interpreted with caution: since the bar plot is calculated from the absolute mean, both the logloss-reducing and logloss-increasing effects count toward a feature's importance. In plain language, a large (absolute) contribution does not necessarily mean a feature is a "good" feature.
Of course, you can use the scatter summary plot instead of the bar summary plot to see the detailed distribution and dive deeper for model debugging (i.e., improving model performance). Another way I investigate it is to decompose the shap loss values into a negative component (Fig. 6) and a positive component (Fig. 7). In terms of model debugging, you want each feature to achieve a more negative value and a smaller positive value, since you wish all the features to reduce the final model logloss.
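This decomposition can be sketched as follows (a simple NumPy helper of my own, not part of SHAP): for each feature, average only the negative values (loss-reducing) and only the positive values (loss-increasing) separately.

```python
import numpy as np

def decompose_shap_loss(shap_loss_values):
    # Split per-feature shap loss values into the mean negative part
    # (loss-reducing) and the mean positive part (loss-increasing).
    v = np.asarray(shap_loss_values)
    neg = np.where(v < 0, v, 0.0).mean(axis=0)
    pos = np.where(v > 0, v, 0.0).mean(axis=0)
    return neg, pos

# Two instances, two features:
vals = np.array([[-0.2,  0.1],
                 [ 0.4, -0.3]])
neg, pos = decompose_shap_loss(vals)
print(neg)  # [-0.1  -0.15]
print(pos)  # [0.2   0.05]
```

The absolute-mean bar in Fig. 5 equals `|neg| + pos` per feature, which is why it cannot distinguish "good" from "bad" contributions on its own.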
Monitoring Plot
Now we come to the most interesting part: using the shap loss values to monitor your model. Model drift and data drift are real-world problems in which your model deteriorates and produces unreliable/inaccurate predictions. They usually happen silently, and the root cause is very hard to identify. In a recent paper[3], the SHAP authors use shap loss values to monitor model health. The idea is very appealing, and I wish to explore it further. Note that the API is available but seems to be under ongoing development.
First we need to calculate the shap loss values for the training data and the test data. In a monitoring context, you would calculate shap loss values for datasets from different time snapshots. You may recall that we already did this at the beginning of this section, but note that there we used background data sampled from the entire dataset. For the purpose of monitoring, it makes more sense to calculate the shap loss values for the training and test datasets separately, using background data drawn from each respective dataset. The code snippets are as follows:
# shap loss values for training data
X_train_subsample = subsample_data(X=X_train, y=y_train)
explainer_train_bg_100 = shap.TreeExplainer(model, X_train_subsample,
                                            feature_perturbation="interventional",
                                            model_output="log_loss")
shap_values_logloss_train = explainer_train_bg_100.shap_values(X_train, y_train)

# shap loss values for test data
X_test_subsample = subsample_data(X=X_test, y=y_test)
explainer_test_bg_100 = shap.TreeExplainer(model, X_test_subsample,
                                           feature_perturbation="interventional",
                                           model_output="log_loss")
shap_values_logloss_test = explainer_test_bg_100.shap_values(X_test, y_test)
The monitoring plots for the top features are shown in Fig. 8. First, all the data instances are ordered by index, and here we assume the index indicates the evolution of time (from left to right along the axis). In this toy example we don't have data from different time snapshots, so we simply treat the training data as the current data and the test data as the future data we would like to monitor.
There are some important points for understanding these monitoring plots, based on the current implementation in the SHAP package. To see whether the shap loss values are consistent over time, a t-test is repeatedly conducted to compare two data samples. The current implementation splits the data in increments of 50 data points: the first t-test compares data[0:50] to data[50:], the second compares data[0:100] to data[100:], and so on. A t-test fails if the p-value is smaller than 0.05/n_features; in other words, it uses a 95% confidence level with a Bonferroni correction. Wherever a t-test fails, a vertical dashed line marks the location. Somewhat surprisingly, the monitoring plots show inconsistent shap loss values for ["Relationship", "Education-Num", "Capital Gain"], and this happens right when we enter the time snapshot of the test data (Fig. 8).
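The splitting-and-testing logic can be approximated as follows (a simplified reading of the implementation, not the SHAP code itself):

```python
import numpy as np
from scipy import stats

def monitoring_ttests(values, n_features, increment=50, alpha=0.05):
    # At each split point, compare the shap loss values before and after with
    # a t-test; flag splits whose p-value falls below the Bonferroni-corrected
    # threshold alpha / n_features.
    flagged = []
    for i in range(increment, len(values), increment):
        _, p = stats.ttest_ind(values[:i], values[i:])
        if p < alpha / n_features:
            flagged.append(i)
    return flagged

rng = np.random.RandomState(0)
# Stable shap loss values for 500 points, then a clear shift afterwards:
drifted = np.concatenate([rng.normal(0.0, 1.0, 500),
                          rng.normal(2.0, 1.0, 200)])
print(monitoring_ttests(drifted, n_features=12))
```

On this synthetic series the split at index 500, where the distribution shifts, is flagged; note that splits well before the shift can be flagged too, which hints at the sensitivity discussed below.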
The reason for using an increment of 50 data points is not very clear to me. In this example, we know [0:26048] is the training data and [-6513:] is the test data, so I modified the increment to 6500 to see whether it would give a different result. But the monitoring plots still show the same inconsistency (i.e., t-test failure) when it comes to the test data (Fig. 9).
Finally, I think it's a good idea to run the t-test directly on the training data and the test data. This verifies the conclusion again: the shap loss values are inconsistent between the training dataset and the test dataset.
# t-test for top features (assume equal variance)
t-test for feature: Relationship , p value: 2.9102249320497517e-06
t-test for feature: Age , p value: 0.22246187841821208
t-test for feature: Education-Num , p value: 4.169244713493427e-06
t-test for feature: Capital Gain , p value: 1.0471308847541212e-27

# t-test for top features (unequal variance, i.e., Welch's t-test)
t-test for feature: Relationship , p value: 1.427849321056383e-05
t-test for feature: Age , p value: 0.2367209506867293
t-test for feature: Education-Num , p value: 3.3161498092593535e-06
t-test for feature: Capital Gain , p value: 1.697971581168647e-24
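Such a direct comparison boils down to a two-sample t-test per feature. The sketch below uses synthetic stand-in arrays, since the real per-feature values would come from the `shap_values_logloss_train`/`shap_values_logloss_test` arrays computed above:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for one feature's shap loss values on train/test;
# the real arrays come from explainer.shap_values(...) as shown earlier.
rng = np.random.RandomState(1)
train_vals = rng.normal(0.00, 0.05, 26048)
test_vals = rng.normal(0.01, 0.05, 6513)

_, p_student = stats.ttest_ind(train_vals, test_vals)                 # equal variance
_, p_welch = stats.ttest_ind(train_vals, test_vals, equal_var=False)  # Welch's t-test
print(p_student, p_welch)
```

Even a small mean shift (0.01 here) is detected with near-zero p-values at these sample sizes, another hint at how sensitive the test is.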
The inconsistency of shap loss values between the training data and the test data is actually very unexpected, and can be troublesome. Remember that we simply used a train/test split of the entire dataset, so there is good reason to believe the two should be consistent, both in data distribution and in shap loss value contributions. By any means, this is just a simple experiment, and more investigation is needed to draw any firm conclusion. But I think there may be reasons why the SHAP package labels the monitoring functionality as preliminary, for example:
- the use of an increment of 50 data points looks arbitrary to me;
- the t-test looks very sensitive and can give many false alarms.
Another interesting point of discussion is the use of background data. Note that for the monitoring plots, the shap loss values on the training and test datasets are calculated using different background data (subsamples from the training dataset and the test dataset, respectively). Since the "interventional" approach to calculating shap loss values is very expensive, I only tried subsampled background data of 100 instances, which could yield high-variance shap loss values. Perhaps a larger background dataset would reduce the variance and make the shap loss values consistent in the monitoring plots. Indeed, when I used the same background data for both (subsamples from the entire dataset), there was no inconsistency in the monitoring plot. So how you choose the background data matters a lot!
Conclusions and Discussions
I hope this post gives you a useful introduction to shap loss values. You can better debug your ML models by investigating them, and they can also be a useful approach to monitoring your ML models for model drift and data drift, which is still a very big challenge in the community. But note the limitation: in order to use shap loss values for monitoring, you need the ground truth for newly arriving data, which is usually only available after a certain period. Also, unfortunately, this functionality is still under development, and the appropriateness of the t-test needs to be further justified.
Last but not least, calculating shap values (TreeShap) with the marginal distribution or the conditional distribution can give different results (see the equations). The use of the conditional distribution introduces a causality problem, while the marginal distribution provides unlikely data points to the model[4]. There seems to be no consensus about which one to use; it depends on the scenario[2,5]. This paper[6] has some interesting comments on the topic, which I would like to quote here:
In general, whether or not users should present their models with inputs that don’t belong to the original training distribution is a subject of ongoing debate.
…
This problem fits into a larger discussion about whether your attribution method should be "true to the model" or "true to the data", which has been discussed in several recent articles.
Thank you for your time, and don't hesitate to leave comments and start a discussion!
All the plots in this post are created by the author by using the SHAP package. Please kindly let me know if you think any of your work is not properly cited.
[1] Introduction to Responsible Machine Learning
[2] Janzing, D., Minorics, L., & Blöbaum, P. (2019). Feature relevance quantification in explainable AI: A causality problem. https://arxiv.org/abs/1910.13413
[3] Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N. and Lee, S.I. (2020). From local explanations to global understanding with explainable AI for trees. Nature machine intelligence, 2(1), 2522–5839.
[4] https://christophm.github.io/interpretable-ml-book/shap.html
[5] Sundararajan, M., & Najmi, A. (2019). The many Shapley values for model explanation. arXiv preprint arXiv:1908.08474.
[6] Sturmfels, P., Lundberg, S., & Lee, S. I. (2020). Visualizing the impact of feature attribution baselines. Distill, 5(1), e22.
Translated from: https://towardsdatascience.com/use-shap-loss-values-to-debug-monitor-your-model-83f7808af40f