Hi all!
Is it possible that, under certain circumstances, Spark creates duplicate
rows when doing multiple joins?
What I did:
buse.count
res0: Long = 20554365
buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .count
res1: Long = 20554365
buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .join(crnb.alias("crnb"), $"bdef._c9" === $"crnb._c4")
  .count
res2: Long = 20554365
buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .join(crnb.alias("crnb"), $"bdef._c9" === $"crnb._c4")
  .join(wreg.alias("wreg"), $"crnb._c1" === $"wreg._c5")
  .count
res3: Long = 21633023
For context: buse and crnb are tables with a 1:1 relationship. In the last
join I again expected 20554365, but suddenly there are duplicate rows:
1078658 extra records (21633023 - 20554365). wreg._c5 is a unique key, so
the join should not produce more records:
wreg.groupBy($"_c5").agg(count($"_c2") as "cnt").filter($"cnt">1).show
+---+---+
|_c5|cnt|
+---+---+
+---+---+
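(Side note on the check itself: count on a specific column skips rows where
that column is null, so in theory a duplicated _c5 whose _c2 is null would
not show up above. A variant of the same check that counts every row per
key, regardless of nulls, would be:)

wreg.groupBy($"_c5").count.filter($"count" > 1).show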
When doing a distinct on the 4-way join I get the expected number of
records:
buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .join(crnb.alias("crnb"), $"bdef._c9" === $"crnb._c4")
  .join(wreg.alias("wreg"), $"crnb._c1" === $"wreg._c5")
  .distinct.count
res10: Long = 20554365
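To look at the duplicated rows themselves, something like this should work
(just a sketch; exceptAll needs Spark 2.4+):

val joined = buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .join(crnb.alias("crnb"), $"bdef._c9" === $"crnb._c4")
  .join(wreg.alias("wreg"), $"crnb._c1" === $"wreg._c5")

// every row that occurs more than once, listed (multiplicity - 1) times
joined.exceptAll(joined.distinct).show(false)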
This (in my opinion) means that Spark is creating duplicate rows, although
it shouldn't. Or am I missing something?
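To illustrate the premise I am relying on, a toy example (made-up data, run
in spark-shell): with a unique key on the right-hand side, an inner
equi-join can drop left rows, but never multiply them:

val left  = Seq((1, "a"), (2, "b"), (2, "c")).toDF("k", "v")
val right = Seq((1, "x"), (2, "y")).toDF("k", "w")  // k is unique in right
left.join(right, "k").count  // 3 == left.count, never more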
Best, Rico.