This is not a direct output of LogisticRegression, but it can be obtained with the following function, which I use:
def ExtractFeatureCoeficient(model, dataset, excludedCols=None):
    # Assumes `sqlContext` already exists in the session (e.g. the pyspark shell).
    test = model.transform(dataset)
    weights = model.coefficients
    print('This is model weights: \n', weights)
    # Convert numpy floats to Python floats, each wrapped in a 1-tuple.
    weights = [(float(w),) for w in weights]
    if excludedCols is None:
        # Default: drop the label, weight and model output columns.
        excludedCols = ['y', 'classWeights', 'features', 'label', 'rawPrediction', 'probability', 'prediction']
    feature_col = [f for f in test.schema.names if f not in excludedCols]
    if len(weights) != len(feature_col):
        # Raise instead of returning an undefined weightsDF.
        raise ValueError('Coefficients do not match the remaining features in the model; please check the field list with model.transform(dataset).schema.names')
    return sqlContext.createDataFrame(list(zip(weights, feature_col)), schema=["Coeficients", "FeatureName"])

results = ExtractFeatureCoeficient(lr_model, trainingData)
results.show()
This produces a Spark DataFrame with the following fields:
+--------------------+--------------------+
| Coeficients| FeatureName|
+--------------------+--------------------+
|[0.15834847825223...| name |
| [0.0]| lat |
+--------------------+--------------------+
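The function above matches coefficients to columns by position, which only works when every remaining column contributes exactly one entry to the feature vector. As a hedged alternative, the names can be read from the metadata of the assembled features column. The sketch below is plain Python and assumes the metadata has the `ml_attr` layout that VectorAssembler typically writes; the helper names and the sample dictionary are hypothetical.

```python
def feature_names_from_metadata(metadata):
    """Return feature names ordered by their index in the feature vector."""
    attrs = metadata["ml_attr"]["attrs"]
    indexed = []
    # Attributes are grouped by type, e.g. "numeric", "binary", "nominal".
    for group in attrs.values():
        for attr in group:
            indexed.append((attr["idx"], attr["name"]))
    return [name for _, name in sorted(indexed)]


def pair_coefficients(coefficients, metadata):
    """Pair each coefficient with the feature name at the same vector index."""
    names = feature_names_from_metadata(metadata)
    if len(names) != len(coefficients):
        raise ValueError("number of coefficients does not match number of features")
    return list(zip(names, (float(c) for c in coefficients)))


# Hypothetical metadata; the values are made up for illustration only.
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [
                {"idx": 1, "name": "lat"},
                {"idx": 0, "name": "name"},
            ]
        },
        "num_attrs": 2,
    }
}

pairs = pair_coefficients([0.158, 0.0], example_metadata)
print(pairs)
```

With a fitted model, the metadata would presumably come from `model.transform(dataset).schema["features"].metadata` and the coefficients from `model.coefficients`; check that the metadata is actually populated in your pipeline before relying on it.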
Alternatively, you can fit a GLM model as follows:
from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family="binomial", link="logit", featuresCol="features", labelCol="label", maxIter=1000, regParam=0.8, weightCol="classWeights")
# Train the model.
model = glm.fit(trainingData)
# Then get the summary of the fitted model:
summary = model.summary
print(summary)
This produces the following output:
Coefficients:
Feature      Estimate  Std Error   T Value  P Value
(Intercept)   -1.3079     0.0705  -18.5549   0.0000
name           0.1248     0.0158    7.9129   0.0000
lat            0.0239     0.0209    1.1455   0.2520
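To read this table, it helps to see how the last two columns relate to the first two. The sketch below assumes the T Value is the estimate divided by its standard error and that the P Value is two-sided against a standard normal reference, which reproduces the 0.2520 shown for `lat`. The function name is hypothetical, and the inputs are the rounded figures printed above, so the last digits will differ slightly.

```python
import math


def wald_statistics(estimate, std_error):
    """Return (t_value, two_sided_p_value) under a standard normal reference."""
    t_value = estimate / std_error
    # Two-sided tail probability: 2 * (1 - Phi(|t|)) == erfc(|t| / sqrt(2)).
    p_value = math.erfc(abs(t_value) / math.sqrt(2.0))
    return t_value, p_value


# Rounded (estimate, std error) pairs copied from the summary output above.
rows = {
    "(Intercept)": (-1.3079, 0.0705),
    "name": (0.1248, 0.0158),
    "lat": (0.0239, 0.0209),
}

for feature, (estimate, std_error) in rows.items():
    t_value, p_value = wald_statistics(estimate, std_error)
    print(f"{feature:<12} t={t_value:9.4f}  p={p_value:.4f}")
```

Read this way, `name` has a coefficient many standard errors away from zero, while `lat` is not distinguishable from zero at the usual 5% level.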