Hello,
Here’s a behavior that I find strange. Filtering with a single zero-selectivity predicate is much quicker than filtering with multiple predicates, even when the same zero-selectivity predicate comes first.
For example, with Google’s English One Million 1-grams dataset (Spark 2.2, whole-stage codegen enabled, local mode):
    val df = spark.read.parquet("../../1-grams.parquet")

    def metric[T](f: => T): T = {
      val t0 = System.nanoTime()
      val res = f
      println("Executed in " + (System.nanoTime() - t0) / 1000000 + "ms")
      res
    }

    df.count()
    // res11: Long = 261823186

    metric { df.filter('gram === "does not exist").count() }
    // Executed in 1794ms
    // res13: Long = 0

    metric { df.filter('gram === "does not exist" && 'year > 0 && 'times > 0 && 'books > 0).count() }
    // Executed in 4233ms
    // res15: Long = 0
In the generated code, the behavior is exactly the same: when the first predicate fails, the loop continues from the same point in both cases. Why does this time difference exist?
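For what it’s worth, my expectation comes from the fact that `&&` short-circuits in plain Scala, so when the first predicate is false for every row, the later predicates should never even be evaluated. A minimal sketch of that intuition outside Spark (names like `ShortCircuitDemo` and `later` are mine, purely for illustration):

    object ShortCircuitDemo {
      // Returns (matching rows, number of times the later predicates ran).
      def run(): (Int, Int) = {
        var laterEvals = 0
        def later(x: Int): Boolean = { laterEvals += 1; x > 0 }

        val rows = Seq(1, 2, 3, 4, 5)
        // The first predicate is false for every row, so && short-circuits
        // and the later predicates are never evaluated.
        val hits = rows.count(r => (r == -1) && later(r) && later(r))
        (hits, laterEvals)
      }

      def main(args: Array[String]): Unit = {
        val (hits, laterEvals) = run()
        println(s"hits=$hits, laterEvals=$laterEvals") // hits=0, laterEvals=0
      }
    }

If the generated whole-stage code short-circuits the same way, I would expect the extra predicates to cost nothing on this data.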
Thanks