Hello,

Here’s a behavior that I find strange. Filtering with a single zero-selectivity predicate is much quicker than filtering with multiple predicates that start with the same zero-selectivity predicate.

For example, on Google’s English One Million 1-grams dataset (Spark 2.2, wholeStageCodegen enabled, local mode):

    val df = spark.read.parquet("../../1-grams.parquet")

    def metric[T](f: => T): T = {
      val t0 = System.nanoTime()
      val res = f
      println("Executed in " + (System.nanoTime() - t0) / 1000000 + "ms")
      res
    }

    df.count()
    // res11: Long = 261823186

    metric { df.filter('gram === "does not exist").count() }
    // Executed in 1794ms
    // res13: Long = 0

    metric { df.filter('gram === "does not exist" && 'year > 0 && 'times > 0 && 'books > 0).count() }
    // Executed in 4233ms
    // res15: Long = 0

In the generated code, the behavior looks exactly the same in both cases: when the first predicate fails, the loop continues from the same point. Why does this time difference exist?
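To illustrate what I mean by "continues the loop at the same point", here is a minimal plain-Scala sketch of the short-circuit structure I would expect (not the actual codegen output; the tuple data and field names are made up for illustration). With `&&`, once the first predicate is false, the remaining comparisons should never be evaluated, so both filters should do the same work on rows that fail the first check:

```scala
object ShortCircuitSketch extends App {
  // Hypothetical rows: (gram, year, times, books)
  val rows = Seq(("foo", 1900, 1, 1), ("bar", 1901, 2, 3))

  // Single zero-selectivity predicate.
  val single = rows.count { case (gram, _, _, _) =>
    gram == "does not exist"
  }

  // Same predicate first, followed by others; && short-circuits,
  // so year/times/books are never compared when gram mismatches.
  val multi = rows.count { case (gram, year, times, books) =>
    gram == "does not exist" && year > 0 && times > 0 && books > 0
  }

  assert(single == 0 && multi == 0) // both filters reject every row
  println(s"single=$single multi=$multi")
}
```

Given that the later predicates are dead code for every row here, I would naively expect both counts to take roughly the same time.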

Thanks

