I’ve opened a PR on this: https://github.com/apache/spark/pull/5509
On April 14, 2015 at 11:57:34 AM, Yijie Shen (henry.yijies...@gmail.com) wrote:

Hi,

Suppose I have a table t(id: String, event: String) saved as a Parquet file, with the directory hierarchy:

    hdfs://path/to/data/root/dt=2015-01-01/hr=00

After partition discovery, the resulting schema should be (id: String, event: String, dt: String, hr: Int).

If I have a query like:

    df.select($"id").filter(event match).filter($"dt" > "2015-01-01").filter($"hr" > 13)

then in the current implementation, after (dt > 2015-01-01 && hr > 13) is used to prune partitions, those two filters remain in the execution plan, so every row returned from Parquet has the two fields dt and hr appended just so the filters can be re-evaluated. I think this is unnecessary work; we could rewrite execution.Filter's predicate and eliminate them.

What's your opinion? Is this a general concern, or just a requirement specific to my job? If it's general, I would love to discuss the implementation further. If it's specific, I'll just make my own workaround :)

Best Regards!
Yijie Shen
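
For concreteness, a minimal sketch of the scenario above against the Spark 1.3-era API (the path, the existing SparkContext `sc`, and the `$"event" === "click"` predicate are placeholders, not from the original mail; the filters are applied before the projection so the columns resolve):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

    import sqlContext.implicits._

    // Partition discovery lifts dt and hr out of the directory layout,
    // yielding the schema (id: String, event: String, dt: String, hr: Int).
    val df = sqlContext.parquetFile("hdfs://path/to/data/root")

    val result = df
      .filter($"event" === "click")      // placeholder for the real event predicate
      .filter($"dt" > "2015-01-01")
      .filter($"hr" > 13)
      .select($"id")

    // explain() shows dt > "2015-01-01" and hr > 13 surviving as a Filter
    // above the Parquet scan, even though they already pruned the partitions.
    result.explain()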