Starting from Spark 2, these kind of operation are implemented in left anti
join, instead of using RDD operation directly.
Same issue also on sqlContext.
scala> spark.version
res25: String = 2.0.2
spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
== Physical Plan ==
*HashAggregate(keys=[], functions=[], output=[])
+- Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[], output=[])
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
:- Scan ExistingRDD[]
+- BroadcastExchange IdentityBroadcastMode
+- Scan ExistingRDD[]
This arguably means a bug. But my guess is liking the logic of comparing NULL =
NULL, should it return true or false, causing this kind of confusion.
Yong
________________________________
From: Ravindra <[email protected]>
Sent: Friday, March 17, 2017 4:30 AM
To: [email protected]
Subject: Spark 2.0.2 -
hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()
Can someone please explain why
println ( " Empty count " +
hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()
prints - Empty count 1
This was not the case in Spark 1.5.2... I am upgrading to spark 2.0.2 and found
this. This causes my tests to fail. Is there another way to check full equality
of 2 dataframes.
Thanks,
Ravindra.