This is unlikely to be a bug. You haven't provided the code needed to reproduce the problem (https://stackoverflow.com/help/mcve), but most likely you're using Spark 2.0 with the ML transformers and comparing the wrong entities.
Let's illustrate that with an example. First, some simple data:
```python
from pyspark.ml.feature import OneHotEncoder

row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0, )]).toDF(["x"])
).first()
```
Now let's import the different vector classes:
```python
from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
```
and run a quick check:
```python
isinstance(row.features, MLLibVector)
## False

isinstance(row.features, MLVector)
## True
```
As you can see, what we have is a `pyspark.ml.linalg.Vector`, not a `pyspark.mllib.linalg.Vector`, and the former is not compatible with the old API:
```python
LabeledPoint(0.0, row.features)

## TypeError                                 Traceback (most recent call last)
## ...
## TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
```
You can convert an ML object to an MLlib one:
```python
from pyspark.ml import linalg as ml_linalg

def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))

LabeledPoint(0, as_mllib(row.features))
## LabeledPoint(0.0, (1,[],[]))
```
or simply:
```python
LabeledPoint(0, MLLibVectors.fromML(row.features))
## LabeledPoint(0.0, (1,[],[]))
```
But in general, you should avoid situations like this whenever possible.