How to map the coefficients of a logistic regression model to feature names in PySpark

2024-04-15

I built a logistic regression model using the pipeline flow described in the Databricks documentation: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html

The features (numeric and string) are encoded with OneHotEncoderEstimator and then transformed with a standard scaler.

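I would like to know how to map the weights (coefficients) obtained from the logistic regression back to the feature names in the original DataFrame.

In other words, how do I find out which feature each weight or coefficient produced by the model corresponds to?

Thanks.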

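I tried to extract the features from lrModel.schema, which gives a list of StructField objects describing the columns.

I then tried to map the features taken from the schema to the weights, but without success.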

from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="scaledFeatures", maxIter=10)

# Train model with Training Data

lrModel = lr.fit(trainingData)

predictions = lrModel.transform(trainingData)

LRschema = predictions.schema

Expected result: a list of tuples of the form (feature weight, feature name).


This is not a direct output of LogisticRegression, but it can be obtained with the following function, which I use:

from pyspark.sql import SparkSession

def ExtractFeatureCoeficient(model, dataset, excludedCols=None):
    """Pair each model coefficient with a column name of the transformed dataset."""
    spark = SparkSession.builder.getOrCreate()
    test = model.transform(dataset)
    weights = model.coefficients
    print('These are the model weights: \n', weights)
    # Convert NumPy floats to Python floats, each wrapped in a 1-tuple
    weights = [(float(w),) for w in weights]
    if excludedCols is None:
        excludedCols = ['y', 'classWeights', 'features', 'label', 'rawPrediction', 'probability', 'prediction']
    feature_col = [f for f in test.schema.names if f not in excludedCols]
    if len(weights) != len(feature_col):
        # Fail loudly; the original fell through to an undefined variable here
        raise ValueError('Coefficients do not match the remaining features in the model; '
                         'please check the field list with model.transform(dataset).schema.names')
    return spark.createDataFrame(list(zip(weights, feature_col)), schema=["Coeficients", "FeatureName"])

results = ExtractFeatureCoeficient(lrModel, trainingData)

results.show()

This produces a Spark DataFrame with the following fields:

+--------------------+--------------------+
|         Coeficients|         FeatureName|
+--------------------+--------------------+
|[0.15834847825223...|    name            |
|               [0.0]|  lat               |
+--------------------+--------------------+
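Note that the function above pairs coefficients with DataFrame column names, so it only lines up when every input column contributes exactly one coefficient. After one-hot encoding, a single column usually expands into several vector slots, and the length check fails. One common workaround is to read the per-slot names from the ML attribute metadata that VectorAssembler is expected to attach to its output column. The sketch below is only that, a sketch: the helper names `names_from_ml_attr` and `pair_coefficients` and the example metadata are made up for illustration, and the column name "features" is an assumption about the pipeline. It is also assumed here that the scaled column may not carry this metadata, so the assembler's output column is used instead.

```python
def names_from_ml_attr(ml_attr):
    """Return feature names ordered by vector index from an 'ml_attr' metadata dict."""
    indexed = []
    # Attributes are grouped by type, e.g. "numeric", "binary", "nominal".
    for attrs_of_type in ml_attr.get("attrs", {}).values():
        for attr in attrs_of_type:
            # Fall back to a positional name if an attribute has no name.
            name = attr.get("name", "feature_{}".format(attr["idx"]))
            indexed.append((attr["idx"], name))
    return [name for _, name in sorted(indexed)]


def pair_coefficients(coefficients, ml_attr):
    """Return a list of (weight, feature name) tuples in vector order."""
    names = names_from_ml_attr(ml_attr)
    weights = [float(w) for w in coefficients]
    if len(weights) != len(names):
        raise ValueError(
            "Got {} coefficients but {} named attributes".format(len(weights), len(names))
        )
    return list(zip(weights, names))


# Hypothetical metadata, shaped like the "ml_attr" entry of an assembled
# vector column (one numeric feature plus a two-slot one-hot encoded feature).
example_ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "age"}],
        "binary": [
            {"idx": 1, "name": "jobclassVec_admin"},
            {"idx": 2, "name": "jobclassVec_technician"},
        ],
    },
    "num_attrs": 3,
}

pairs = pair_coefficients([0.5, -1.2, 0.0], example_ml_attr)
print(pairs)

# With Spark, assuming "features" is the VectorAssembler output column:
# ml_attr = predictions.schema["features"].metadata["ml_attr"]
# pairs = pair_coefficients(lrModel.coefficients, ml_attr)
```

Because the names are sorted by their index in the vector, the result follows the same order as the coefficients, which gives the (feature weight, feature name) tuples asked for in the question.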

Alternatively, you can fit a GLM (generalized linear model) as follows:

from pyspark.ml.regression import GeneralizedLinearRegression

# Binomial family with a logit link is equivalent to logistic regression
glm = GeneralizedLinearRegression(family="binomial", link="logit", featuresCol="features", labelCol="label", maxIter=1000, regParam=0.8, weightCol="classWeights")

# Train the model
model = glm.fit(trainingData)

# Then get the summary of the fitted model
summary = model.summary
print(summary)

This produces the output:

Coefficients:
        Feature       Estimate Std Error  T Value P Value
    (Intercept)       -1.3079    0.0705 -18.5549  0.0000
    name               0.1248    0.0158   7.9129  0.0000
    lat                0.0239    0.0209   1.1455  0.2520