And how are we doing here on integer pushdowns? If someone does e.g. CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000" without the cast?
On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer <russell.spit...@gmail.com> wrote: > Thanks! That's exactly what I was hoping for! Thanks for finding the Jira > for me! > > On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan <cloud0...@gmail.com> wrote: > >> I think this is not a problem in 3.0 anymore, see >> https://issues.apache.org/jira/browse/SPARK-27638 >> >> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer < >> russell.spit...@gmail.com> wrote: >> >>> I've just run into this issue again with another user and I feel like >>> most folks here have seen some flavor of this at some point. >>> >>> The user registers a Datasource with a column of type Date (or some non >>> string) then performs a query that looks like. >>> >>> *SELECT * from Source WHERE date_col > '2020-08-03'* >>> >>> Seeing that the predicate literal here is a String, Spark needs to make >>> a change so that the DataSource column will be of the same type (Date), >>> so it places a "Cast" on the Datasource column so our plan ends up >>> looking like. >>> >>> Cast(date_col as String) > '2020-08-03' >>> >>> Since the Datasource Strategies can't handle a push down of the "Cast" >>> function we lose the predicate pushdown we could >>> have had. This can change a Job from a single partition lookup into a >>> full scan leading to a very confusing situation for >>> the end user. I also wonder about the relative cost here since we could >>> be avoiding doing X casts and instead just do a single >>> one on the predicate, in addition we could be doing the cast at the >>> Analysis phase and cut the run short before any work even >>> starts rather than doing a perhaps meaningless comparison between a date >>> and a non-date string. >>> >>> I think we should seriously consider whether in cases like this we >>> should attempt to cast the literal rather than casting the >>> source column. >>> >>> Please let me know if anyone has thoughts on this, or has some previous >>> Jiras I could dig into if it's been discussed before, >>> Russ >>> >> -- Bart Samwel bart.sam...@databricks.com