I want to replace all n/a values in the dataframe below with unknown. The value can live in a scalar column or in a complex nested column. If it is a StructField column I can loop over the columns and replace n/a using withColumn. But I would like this to work in a generic way, regardless of the column's type, because I don't want to list the column names explicitly (my real dataset has about 100 of them).
import spark.implicits._

case class Bar(x: Int, y: String, z: String)
case class Foo(id: Int, name: String, status: String, bar: Seq[Bar])

val df = spark.sparkContext.parallelize(Seq(
  Foo(123, "Amy", "Active", Seq(Bar(1, "first", "n/a"))),
  Foo(234, "Rick", "n/a", Seq(Bar(2, "second", "fifth"), Bar(22, "second", "n/a"))),
  Foo(567, "Tom", "null", Seq(Bar(3, "second", "sixth")))
)).toDF

df.printSchema
df.show(20, false)
Result:
+---+----+------+---------------------------------------+
|id |name|status|bar |
+---+----+------+---------------------------------------+
|123|Amy |Active|[[1, first, n/a]] |
|234|Rick|n/a |[[2, second, fifth], [22, second, n/a]]|
|567|Tom |null |[[3, second, sixth]] |
+---+----+------+---------------------------------------+
Expected output:
+---+----+----------+---------------------------------------------------+
|id |name|status |bar |
+---+----+----------+---------------------------------------------------+
|123|Amy |Active |[[1, first, unknown]] |
|234|Rick|unknown |[[2, second, fifth], [22, second, unknown]] |
|567|Tom |null |[[3, second, sixth]] |
+---+----+----------+---------------------------------------------------+
Any suggestions on how to do this?
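One schema-agnostic sketch (an assumption on my part, not a vetted solution): serialize each row to JSON with Dataset.toJSON, do a plain string replacement on the quoted token, and parse the rows back using the original schema. This reaches every field no matter how deeply nested, at the cost of a serialization round trip, and a naive text replacement could also match legitimate data that happens to contain "n/a" as a full string value.

```scala
import spark.implicits._

// Round-trip through JSON so nesting depth doesn't matter:
// every "n/a" string value becomes "unknown", then the rows
// are re-parsed with the original schema to keep the types.
val fixed = spark.read
  .schema(df.schema)
  .json(df.toJSON.map(_.replace("\"n/a\"", "\"unknown\"")))

fixed.show(20, false)
```

Replacing the exact quoted token `"n/a"` (including the quotes) limits the substitution to whole string values, so numeric fields and substrings like "n/another" are left untouched.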