In v2, it is up to the data source to tell Spark that a pushed filter is
satisfied, by returning the pushed filters that Spark should run. You can
indicate that a filter is handled by the source by not returning it to
Spark. You can also show that a filter is used by the source by including
it in the pushed filters that the source reports.
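To make that contract concrete, here is a minimal sketch of a reader that
behaves this way, assuming the Spark 2.4-era DataSourceV2 interface
SupportsPushDownFilters (the reader class, its schema and the choice of
which filters it can handle are made up for illustration):

import java.util.Collections

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

class ExampleReader extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  // This (hypothetical) source only handles simple comparisons.
  private def supported(f: Filter): Boolean = f match {
    case _: EqualTo | _: GreaterThan => true
    case _                           => false
  }

  // Spark passes in all candidate filters; the return value is the set of
  // filters Spark must still evaluate itself. Anything not returned is
  // treated as fully handled by the source.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (handled, residual) = filters.partition(supported)
    pushed = handled
    residual
  }

  // Reported back to Spark so it knows (and can display) what was pushed.
  override def pushedFilters(): Array[Filter] = pushed

  // The rest of the reader, stubbed only so the sketch is complete.
  override def readSchema(): StructType = new StructType().add("id", "long")
  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
    Collections.emptyList[InputPartition[InternalRow]]()
}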
Hi,
Thank you for responding to this thread. I'm really interested in this
discussion.
My original idea might be the same as what Alessandro said: introducing a
mechanism by which Spark can communicate with the data source and get
metadata that shows whether pushdown is supported or not.
I'm wondering if it wi
I think you are generally right, but there are so many different scenarios
that it might not always be the best option. Consider, for instance, a "fast"
network between a single data source and Spark, lots of data, and an
"expensive" (low-selectivity) expression, as Wenchen suggested.
In such a
It is not a question of lying or of trust. Some or all filters may not be
supported by a data source, and some might only be applied under certain
environmental conditions (e.g. enough memory).
It is much more expensive to communicate between Spark and a data source
which filters have actually been applied.
Expressions/functions can be expensive, and I do think Spark should trust
the data source and not re-apply pushed filters. If the data source lies,
many things can go wrong...
On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke wrote:
> Well even if it has to apply it again, if pushdown is activated then it
> wil
Well, even if Spark has to apply the filter again, if pushdown is activated
that costs much less than having Spark check whether the filter has been
applied or not. Applying the filter is negligible; what pushdown really
avoids, if the file format implements it, is I/O cost (for reading) as well
as the cost of converting the data into Spark's internal representation.
Hello,
that's an interesting question, but after Frank's reply I am a bit puzzled.
If there is no control over the pushdown status, how can Spark guarantee
the correctness of the final query?
Consider a filter pushed down to the data source: either Spark has to know
whether it has been applied, or it has to re-apply the filter itself to be
safe.
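One way to see why correctness can still hold: with the
SupportsPushDownFilters contract mentioned in this thread, a source that
cannot guarantee a filter simply returns it from pushFilters, and Spark
evaluates it again after the scan. A minimal sketch of that conservative
behaviour, assuming the same Spark 2.4-era interface (the trait name is
made up):

import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters

trait BestEffortPushdown extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  // Hand every filter to the source so it can prune I/O where it is able to,
  // but also return every filter, so Spark re-evaluates them after the scan
  // and correctness never depends on what the source actually did.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    filters
  }

  override def pushedFilters(): Array[Filter] = pushed
}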
BTW, even for json a pushdown can make sense, to avoid data unnecessarily
ending up in Spark (which would cause unnecessary overhead).
In the DataSource V2 API you need to implement the SupportsPushDownFilters
interface.
> On 08.12.2018 at 10:50, Noritaka Sekiyama wrote:
>
> Hi,
>
> I'm a support
It was already available before DataSourceV2, but I think it might have been
an internal/semi-official API (e.g. json has been an internal datasource for
some time now). The filters were provided to the datasource, but you would
never know whether the datasource has indeed leveraged them or whether, for
other reasons, they have been ignored.
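For what it's worth, you can at least see which filters were handed to a
source by looking at the physical plan. A small, hypothetical example (the
path and column name are made up, and the exact PushedFilters output depends
on the Spark version and the source):

import org.apache.spark.sql.SparkSession

object PushdownExplain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pushdown-demo")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical dataset; any Parquet data with a "status" column would do.
    val df = spark.read.parquet("/tmp/events").filter($"status" === "ok")

    // The scan in the physical plan typically contains a line such as
    //   PushedFilters: [IsNotNull(status), EqualTo(status,ok)]
    // listing the filters that were handed to the source; whether the source
    // actually used them is another matter.
    df.explain(true)

    spark.stop()
  }
}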