I am trying to convert a PySpark DataFrame column of DenseVector values into an array column, but I keep getting an error.
data = [(Vectors.dense([8.0, 1.0, 3.0, 2.0, 5.0]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
I tried defining a UDF that calls toArray():
to_array = udf(lambda x: x.toArray(), ArrayType(FloatType()))
df = df.withColumn('features', to_array('features'))
However, when I run df.collect(), I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 17.0 failed 4 times,
most recent failure: Lost task 1.3 in stage 17.0 (TID 100, 10.139.64.6, executor 0):
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict
(for numpy.core.multiarray._reconstruct)
Any ideas on how I can achieve this?