A join on array_contains() works; afterwards you only need to group the result by player.
Let's start with two datasets, one for the players and one for the guitars:
val player = Seq(("Eric Clapton", Array(1,5)), ("Paco de Lucia", Array(1,2)), ("Jimi Hendrix", Array(3))).toDF("player", "guitars")
val guitar = Seq((1, "Gibson", "SG", "Electric"), (2, "Faustino Conde", "Media Luna", "Acoustic"), (3, "Pulsebeatguitars", "Spider", "Electric"), (4, "Yamaha", "FG800", "Acoustic"), (5, "Fender", "Stratocaster", "Electric")).toDF("guitarId", "make", "model", "type")
+-------------+-------+
|       player|guitars|
+-------------+-------+
| Eric Clapton| [1, 5]|
|Paco de Lucia| [1, 2]|
| Jimi Hendrix|    [3]|
+-------------+-------+
+--------+----------------+------------+--------+
|guitarId|            make|       model|    type|
+--------+----------------+------------+--------+
|       1|          Gibson|          SG|Electric|
|       2|  Faustino Conde|  Media Luna|Acoustic|
|       3|Pulsebeatguitars|      Spider|Electric|
|       4|          Yamaha|       FG800|Acoustic|
|       5|          Fender|Stratocaster|Electric|
+--------+----------------+------------+--------+
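The snippets in this answer read as if they were typed into spark-shell, where a session named spark and its implicits are already in scope. Outside the shell, toDF and the 'column symbol syntax need a few extra lines. The following is a minimal sketch of that setup, written as script-style statements (in a compiled application they would go inside a main method); the application name and the local master are assumptions chosen for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; the app name and master are illustrative.
val spark = SparkSession.builder()
  .appName("guitar-join")
  .master("local[*]")
  .getOrCreate()

// Brings toDF and the 'column symbol syntax into scope.
import spark.implicits._
// Provides struct, expr and collect_list, which the later snippets use.
import org.apache.spark.sql.functions._

// Same player dataset as above: one row per player, guitar ids as an array column.
val player = Seq(
  ("Eric Clapton", Array(1, 5)),
  ("Paco de Lucia", Array(1, 2)),
  ("Jimi Hendrix", Array(3))
).toDF("player", "guitars")

// Check that guitars was inferred as an array of integers.
player.printSchema()
player.show()
```

With these imports in place, the guitar dataset and the remaining snippets can be pasted in unchanged.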
To make the grouping a bit easier, the idea is to combine the three columns of the guitar dataset into a single struct before the join:
val guitar2 = guitar.withColumn("guitar", struct('make, 'model, 'type))
After the join, we group the result by player and get the expected result:
player.join(guitar2, expr("array_contains(guitars, guitarId)"))
  .groupBy("player")
  .agg(collect_list('guitar))
  .show(false)
prints
+-------------+----------------------------------------------------------------+
|player       |collect_list(guitar)                                            |
+-------------+----------------------------------------------------------------+
|Jimi Hendrix |[[Pulsebeatguitars, Spider, Electric]]                          |
|Eric Clapton |[[Gibson, SG, Electric], [Fender, Stratocaster, Electric]]      |
|Paco de Lucia|[[Gibson, SG, Electric], [Faustino Conde, Media Luna, Acoustic]]|
+-------------+----------------------------------------------------------------+
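To make it clearer what the join condition and the grouping compute, here is a rough model of the same logic in plain Scala collections, without Spark. It is only a sketch of the semantics, not how Spark executes the query. The names GuitarJoinModel, Player and Guitar are hypothetical, and the type column is called kind here because type is a reserved word in Scala.

```scala
object GuitarJoinModel {
  // Hypothetical case classes mirroring the two datasets.
  final case class Player(name: String, guitars: Seq[Int])
  final case class Guitar(guitarId: Int, make: String, model: String, kind: String)

  val players: Seq[Player] = Seq(
    Player("Eric Clapton", Seq(1, 5)),
    Player("Paco de Lucia", Seq(1, 2)),
    Player("Jimi Hendrix", Seq(3))
  )

  val guitars: Seq[Guitar] = Seq(
    Guitar(1, "Gibson", "SG", "Electric"),
    Guitar(2, "Faustino Conde", "Media Luna", "Acoustic"),
    Guitar(3, "Pulsebeatguitars", "Spider", "Electric"),
    Guitar(4, "Yamaha", "FG800", "Acoustic"),
    Guitar(5, "Fender", "Stratocaster", "Electric")
  )

  // Inner join: keep each (player, guitar) pair whose guitarId occurs in the
  // player's array, mirroring array_contains(guitars, guitarId).
  val joined: Seq[(String, Guitar)] =
    for {
      p <- players
      g <- guitars
      if p.guitars.contains(g.guitarId)
    } yield (p.name, g)

  // Group by player name and collect the matching guitars,
  // mirroring groupBy("player").agg(collect_list(...)).
  val grouped: Map[String, Seq[Guitar]] =
    joined.groupBy(_._1).map { case (name, pairs) => name -> pairs.map(_._2) }

  def main(args: Array[String]): Unit =
    grouped.foreach { case (name, gs) =>
      println(s"$name -> ${gs.map(g => s"${g.make} ${g.model}").mkString(", ")}")
    }
}
```

Because this is an inner join, the Yamaha FG800 (guitarId 4) does not appear in the result: no player lists it. Note also that Spark does not guarantee the order of the elements produced by collect_list, whereas this model keeps the guitars in their input order.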