I’ve opened a PR on this: https://github.com/apache/spark/pull/5509
On April 14, 2015 at 11:57:34 AM, Yijie Shen (henry.yijies...@gmail.com) wrote:

Hi,

Suppose I have a table t(id: String, event: String) saved as a Parquet file, with the directory hierarchy:

    hdfs://path/to/data/root/dt=2015-01-01/hr=00

After partition discovery, the resulting schema should be (id: String, event: String, dt: String, hr: Int).

If I have a query like:

    df.select($"id").filter(event match).filter($"dt" > "2015-01-01").filter($"hr" > 13)

then in the current implementation, after (dt > 2015-01-01 && hr > 13) is used to prune partitions, those two filters remain in the execution plan, so every row returned from Parquet has the two fields dt and hr appended just so the filters can be re-evaluated. I think this is unnecessary work; we could rewrite execution.Filter's predicate and eliminate them.

What's your opinion? Is this a general concern, or just a requirement specific to my job? If it's general, I would love to discuss the implementation further. If it's specific, I'll just make my own workaround :)

Best Regards!
Yijie Shen
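
For concreteness, a minimal sketch of the scenario above against the Spark 1.3-era API (the path, the existing SparkContext `sc`, and the `$"event" === "click"` predicate are placeholders, not from the original mail; the filters are applied before the projection so the columns resolve):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

    import sqlContext.implicits._

    // Partition discovery lifts dt and hr out of the directory layout,
    // yielding the schema (id: String, event: String, dt: String, hr: Int).
    val df = sqlContext.parquetFile("hdfs://path/to/data/root")

    val result = df
      .filter($"event" === "click")      // placeholder for the real event predicate
      .filter($"dt" > "2015-01-01")
      .filter($"hr" > 13)
      .select($"id")

    // explain() shows dt > "2015-01-01" and hr > 13 surviving as a Filter
    // above the Parquet scan, even though they already pruned the partitions.
    result.explain()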