Hello,
that's an interesting question, but after Frank's reply I am a bit puzzled.
If there is no control over the pushdown status, how can Spark guarantee the
correctness of the final query? Consider a filter pushed down to the data
source: either Spark has to know whether it has been applied or not, or it
has to re-apply the filter itself to be safe.
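For concreteness, this is roughly the contract in question: a minimal sketch
assuming the Spark 3.x SupportsPushDownFilters interface, with a made-up
source that can only evaluate equality filters (the class name and that
limitation are illustrative, not from this thread):

    import org.apache.spark.sql.connector.read.SupportsPushDownFilters
    import org.apache.spark.sql.sources.{EqualTo, Filter}

    class EqualityOnlyScanBuilder extends SupportsPushDownFilters {
      private var pushed: Array[Filter] = Array.empty

      // Spark hands the source all candidate filters; whatever this method
      // RETURNS is what Spark still evaluates itself after the scan, so
      // correctness does not rest on trusting the source.
      override def pushFilters(filters: Array[Filter]): Array[Filter] = {
        val (supported, unsupported) =
          filters.partition(_.isInstanceOf[EqualTo])
        pushed = supported
        unsupported
      }

      // What the source claims it pushed; surfaces as PushedFilters in
      // explain() output.
      override def pushedFilters(): Array[Filter] = pushed

      override def build() = ??? // Scan construction omitted in this sketch
    }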
Well, even if Spark has to apply the filter again: with pushdown activated,
that costs much less than making Spark find out whether the filter has been
applied or not. Applying the filter again is negligible; what pushdown really
avoids, if the file format implements it, is the I/O cost of reading, as well
as the cost of converting data from the file format into Spark's internal
representation.

Evaluating expressions/functions can be expensive, and I do think Spark
should trust the data source and not re-apply pushed filters. If the data
source lies, many things can go wrong...
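As an aside, the built-in Parquet source shows both behaviors at once:
filters are pushed down for row-group pruning, yet Spark keeps a Filter node
because that pruning is only coarse. An illustrative snippet, assuming an
active SparkSession named spark (the path is hypothetical; the plan text is
abridged and varies by version):

    val df = spark.read.parquet("/tmp/events")
    df.filter(df("id") > 5).explain()
    // == Physical Plan == (abridged)
    // +- Filter (isnotnull(id) AND (id > 5))
    //    +- FileScan parquet [...]
    //       PushedFilters: [IsNotNull(id), GreaterThan(id,5)]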
On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke wrote:
> Well, even if Spark has to apply the filter again: with pushdown activated,
> that costs much less than making Spark find out whether the filter has been
> applied or not. [...]
It is not about lying or trust. Some or all filters may not be supported by
a data source, and some might only be applied under certain environmental
conditions (e.g. enough memory). It is much more expensive to communicate
between Spark and a data source which filters have actually been applied
than for Spark to simply re-apply them.
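That cheap way out exists in the API itself. A minimal sketch, again assuming
the Spark 3.x SupportsPushDownFilters interface (class name hypothetical):
a source that cannot promise a filter will be applied simply returns every
filter as still-to-be-evaluated and keeps them only as advisory pruning
hints.

    import org.apache.spark.sql.connector.read.SupportsPushDownFilters
    import org.apache.spark.sql.sources.Filter

    class BestEffortScanBuilder extends SupportsPushDownFilters {
      private var hints: Array[Filter] = Array.empty

      // Keep every filter as a best-effort pruning hint, but hand them all
      // back: Spark re-applies each one, so skipping a filter at run time
      // (low memory, etc.) can never hurt correctness.
      override def pushFilters(filters: Array[Filter]): Array[Filter] = {
        hints = filters
        filters
      }

      override def pushedFilters(): Array[Filter] = hints

      override def build() = ??? // Scan construction omitted in this sketch
    }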
Gitter is cool and convenient.
I think this has come up before, and the issue is really that it adds yet
another channel people would have to follow to get 100% of the discussion
about the project. I don't believe the project would bless an official chat
channel, but anyone can of course run an unofficial one.