In v2, it is up to the data source to tell Spark that a pushed filter is
satisfied, by returning the pushed filters that Spark should run. You can
indicate that a filter is handled by the source by not returning it to
Spark. You can also show that a filter is used by the source by including
it in the pushed filters that the source reports.
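To make that contract concrete, here is a minimal sketch of a reader that
behaves this way, assuming the Spark 2.4-era DataSourceV2 interface
SupportsPushDownFilters (the reader class, its schema and the choice of
which filters it can handle are made up for illustration):

import java.util.Collections

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

class ExampleReader extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  // This (hypothetical) source only handles simple comparisons.
  private def supported(f: Filter): Boolean = f match {
    case _: EqualTo | _: GreaterThan => true
    case _                           => false
  }

  // Spark passes in all candidate filters; the return value is the set of
  // filters Spark must still evaluate itself. Anything not returned is
  // treated as fully handled by the source.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (handled, residual) = filters.partition(supported)
    pushed = handled
    residual
  }

  // Reported back to Spark so it knows (and can display) what was pushed.
  override def pushedFilters(): Array[Filter] = pushed

  // The rest of the reader, stubbed only so the sketch is complete.
  override def readSchema(): StructType = new StructType().add("id", "long")
  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
    Collections.emptyList[InputPartition[InternalRow]]()
}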
Hi,
Thank you for responding to this thread. I'm really interested in this
discussion.
My original idea might be the same as what Alessandro said: introducing a
mechanism by which Spark can communicate with the data source and get
metadata that shows whether pushdown is supported or not.
I'm wondering if it wi
I think you are generally right, but there are so many different scenarios
that it might not always be the best option. Consider, for instance, a "fast"
network between a single data source and Spark, lots of data, and an
"expensive" (low-selectivity) expression, as Wenchen suggested.
In such a
It is not a question of lying or of trust. Some or all filters may not be
supported by a data source, and some might only be applied under certain
environmental conditions (e.g. enough memory).
It is much more expensive to communicate between Spark and a data source
which filters have actually been applied.
Expressions/functions can be expensive, and I do think Spark should trust
the data source and not re-apply pushed filters. If the data source lies,
many things can go wrong...
On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke wrote:
> Well even if it has to apply it again, if pushdown is activated then it
> wil
Well, even if Spark has to apply the filter again, if pushdown is activated
that costs much less than having Spark check whether the filter has been
applied or not. Applying the filter is negligible; what pushdown really
avoids, if the file format implements it, is I/O cost (for reading) as well
as the cost of converting the data into Spark's internal representation.
Hello,
that's an interesting question, but after Frank's reply I am a bit puzzled.
If there is no control over the pushdown status, how can Spark guarantee
the correctness of the final query?
Consider a filter pushed down to the data source: either Spark has to know
whether it has been applied, or it has to re-apply the filter itself to be
safe.
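One way to see why correctness can still hold: with the
SupportsPushDownFilters contract mentioned in this thread, a source that
cannot guarantee a filter simply returns it from pushFilters, and Spark
evaluates it again after the scan. A minimal sketch of that conservative
behaviour, assuming the same Spark 2.4-era interface (the trait name is
made up):

import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters

trait BestEffortPushdown extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  // Hand every filter to the source so it can prune I/O where it is able to,
  // but also return every filter, so Spark re-evaluates them after the scan
  // and correctness never depends on what the source actually did.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    filters
  }

  override def pushedFilters(): Array[Filter] = pushed
}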
BTW, even for json a pushdown can make sense, to avoid data unnecessarily
ending up in Spark (which would cause unnecessary overhead).
In the DataSource V2 API you need to implement the SupportsPushDownFilters
interface.
> On 08.12.2018 at 10:50, Noritaka Sekiyama wrote:
>
> Hi,
>
> I'm a support
It was already available before DataSourceV2, but I think it might have been
an internal/semi-official API (e.g. json has been an internal datasource for
some time now). The filters were provided to the datasource, but you would
never know whether the datasource has indeed leveraged them or whether, for
other reasons, they have been ignored.
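For what it's worth, you can at least see which filters were handed to a
source by looking at the physical plan. A small, hypothetical example (the
path and column name are made up, and the exact PushedFilters output depends
on the Spark version and the source):

import org.apache.spark.sql.SparkSession

object PushdownExplain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pushdown-demo")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical dataset; any Parquet data with a "status" column would do.
    val df = spark.read.parquet("/tmp/events").filter($"status" === "ok")

    // The scan in the physical plan typically contains a line such as
    //   PushedFilters: [IsNotNull(status), EqualTo(status,ok)]
    // listing the filters that were handed to the source; whether the source
    // actually used them is another matter.
    df.explain(true)

    spark.stop()
  }
}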