Hello,
Here’s a behavior that I find strange. Filtering with a single zero-selectivity predicate is much quicker than filtering with multiple predicates, even when the same zero-selectivity predicate comes first.
For example, with Google’s English One Million 1-grams dataset (Spark 2.2, whole-stage codegen enabled, local mode):
    val df = spark.read.parquet("../../1-grams.parquet")

    def metric[T](f: => T): T = {
      val t0 = System.nanoTime()
      val res = f
      println("Executed in " + (System.nanoTime() - t0) / 1000000 + "ms")
      res
    }

    df.count()
    // res11: Long = 261823186

    metric { df.filter('gram === "does not exist").count() }
    // Executed in 1794ms
    // res13: Long = 0

    metric { df.filter('gram === "does not exist" && 'year > 0 && 'times > 0 && 'books > 0).count() }
    // Executed in 4233ms
    // res15: Long = 0
In the generated code, the behavior is exactly the same: when the first predicate fails, the loop continues from the same point in both cases. Why does this time difference exist?
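For what it’s worth, my expectation comes from the fact that `&&` short-circuits in plain Scala, so when the first predicate is false for every row, the later predicates should never even be evaluated. A minimal sketch of that intuition outside Spark (names like `ShortCircuitDemo` and `later` are mine, purely for illustration):

    object ShortCircuitDemo {
      // Returns (matching rows, number of times the later predicates ran).
      def run(): (Int, Int) = {
        var laterEvals = 0
        def later(x: Int): Boolean = { laterEvals += 1; x > 0 }

        val rows = Seq(1, 2, 3, 4, 5)
        // The first predicate is false for every row, so && short-circuits
        // and the later predicates are never evaluated.
        val hits = rows.count(r => (r == -1) && later(r) && later(r))
        (hits, laterEvals)
      }

      def main(args: Array[String]): Unit = {
        val (hits, laterEvals) = run()
        println(s"hits=$hits, laterEvals=$laterEvals") // hits=0, laterEvals=0
      }
    }

If the generated whole-stage code short-circuits the same way, I would expect the extra predicates to cost nothing on this data.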
Thanks