Hi all!
Is it possible that, under certain circumstances, Spark creates duplicate
rows when doing multiple joins?
What I did:
buse.count
res0: Long = 20554365
buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .count
res1: Long = 20554365
buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .join(crnb.alias("crnb"), $"bdef._c9" === $"crnb._c4")
  .count
res2: Long = 20554365
buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .join(crnb.alias("crnb"), $"bdef._c9" === $"crnb._c4")
  .join(wreg.alias("wreg"), $"crnb._c1" === $"wreg._c5")
  .count
res3: Long = 21633023
For context: buse and crnb are tables with a 1:1 relationship. In the last
join I again expected 20554365, but suddenly there are duplicate rows:
1078658 extra records (21633023 - 20554365). wreg._c5 is a unique key, so
the join should not produce more records:
wreg.groupBy($"_c5").agg(count($"_c2") as "cnt").filter($"cnt">1).show
+---+---+
|_c5|cnt|
+---+---+
+---+---+
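(Side note on the check itself: count on a specific column skips rows where
that column is null, so in theory a duplicated _c5 whose _c2 is null would
not show up above. A variant of the same check that counts every row per
key, regardless of nulls, would be:)

wreg.groupBy($"_c5").count.filter($"count" > 1).show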
When doing a distinct on the 4-way join I get the expected number of
records:
buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .join(crnb.alias("crnb"), $"bdef._c9" === $"crnb._c4")
  .join(wreg.alias("wreg"), $"crnb._c1" === $"wreg._c5")
  .distinct.count
res10: Long = 20554365
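To look at the duplicated rows themselves, something like this should work
(just a sketch; exceptAll needs Spark 2.4+):

val joined = buse.alias("buse")
  .join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4")
  .join(crnb.alias("crnb"), $"bdef._c9" === $"crnb._c4")
  .join(wreg.alias("wreg"), $"crnb._c1" === $"wreg._c5")

// every row that occurs more than once, listed (multiplicity - 1) times
joined.exceptAll(joined.distinct).show(false)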
This (in my opinion) means that Spark is creating duplicate rows, although
it shouldn't. Or am I missing something?
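To illustrate the premise I am relying on, a toy example (made-up data, run
in spark-shell): with a unique key on the right-hand side, an inner
equi-join can drop left rows, but never multiply them:

val left  = Seq((1, "a"), (2, "b"), (2, "c")).toDF("k", "v")
val right = Seq((1, "x"), (2, "y")).toDF("k", "w")  // k is unique in right
left.join(right, "k").count  // 3 == left.count, never more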
Best, Rico.