Mich,

I am curious as well about how Spark handles casts for the different types of filters.

For example, the conversion happens implicitly for the 'EqualTo' filter:

scala> sqlContext.sql("SELECT * from events WHERE `registration` =
> '2015-05-28'").explain()
>
> 16/04/15 11:44:15 INFO ParseDriver: Parsing command: SELECT * from events
> WHERE `registration` = '2015-05-28'
>
> 16/04/15 11:44:15 INFO ParseDriver: Parse Completed
>
> 16/04/15 11:44:15 INFO SolrRelation: Constructed SolrQuery:
> q=*:*&rows=1000&collection=demo&fq=registration:"2015-05-28T07:00:00.000Z"
>
> == Physical Plan ==
>
> Filter (registration#17 = 1432796400000000)
> +- Scan 
> com.lucidworks.spark.SolrRelation@483f70bf[method#0,author#1,location#2,song#3,timestamp#4,auth#5,page#6,userAgent#7,lastName#8,firstName#9,id#10,itemInSession#11L,artist#12,status#13L,sessionId#14L,length#15,_version_#16L,registration#17,userId#18L,title#19,gender#20,level#21]
> PushedFilters: [EqualTo(registration,2015-05-28 00:00:00.0)]


whereas for the other filters, 'GreaterThan', 'GreaterThanOrEqual', 'LessThan' and
'LessThanOrEqual', the values are cast to String:

scala> sqlContext.sql("SELECT * from events WHERE `registration` >=
> '2015-05-28 00.00.00'").explain()
> 16/04/15 11:45:13 INFO ParseDriver: Parsing command: SELECT * from events
> WHERE `registration` >= '2015-05-28 00.00.00'
> 16/04/15 11:45:13 INFO ParseDriver: Parse Completed
> == Physical Plan ==
> Filter (cast(registration#17 as string) >= 2015-05-28 00.00.00)
> +- Scan com.lucidworks.spark.SolrRelation@483f70bf
> [method#0,author#1,location#2,song#3,timestamp#4,auth#5,page#6,userAgent#7,lastName#8,firstName#9,id#10,itemInSession#11L,artist#12,status#13L,sessionId#14L,length#15,_version_#16L,registration#17,userId#18L,title#19,gender#20,level#21]
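
As an experiment (I have not verified whether this actually gets pushed down), I
wonder if comparing against an explicitly typed timestamp literal would avoid the
string cast, since the literal should be folded to a Timestamp and the filter
could then arrive as a plain GreaterThanOrEqual:

scala> sqlContext.sql("SELECT * from events WHERE `registration` >= CAST('2015-05-28 00:00:00' AS TIMESTAMP)").explain()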



How does Hive handle date/timestamp queries? Can you share any docs
regarding this?

I found this in the Hive docs:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-PartitionBasedQueries

SELECT page_views.*
> FROM page_views
> WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'


In the above SQL query, is there any implicit casting to String, or does
Hive try to convert the value to a Date/Timestamp and then do the comparison?

Thanks,

On Fri, Apr 15, 2016 at 1:53 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Takeshi,
>
> I did check it. I believe you are referring to this statement:
>
> "This is likely because we cast this expression weirdly to be compatible
> with Hive. Specifically I think this turns into, CAST(c_date AS STRING)
> >= "2016-01-01", and we don't push down casts down into data sources.
>
> The reason for casting this way is because users currently expect the
> following to work c_date >= "2016". "
>
> There are two issues here:
>
>    1. The CAST expression is not pushed down
>    2. I still don't trust the string comparison of dates. It may or may
>    not work. I recall this as an issue in Hive
>
> '2012-11-23' is not a DATE; it is a string. In general, from my
> experience, one should not try to compare a DATE with a string. The results
> will depend on several factors, some related to the tool, some related to
> the session.
>
> For example, the following will convert a date in string format into a DATE
> type in Hive:
>
>
> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(TransactionDate,'dd/MM/yyyy'),'yyyy-MM-dd'))
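>
> For example (a sketch, with a hypothetical table sales whose string column
> TransactionDate holds values like '23/11/2012'; these functions work the
> same through a HiveContext in Spark):
>
> sqlContext.sql("SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(TransactionDate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS tx_date FROM sales")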
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 April 2016 at 06:58, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Hi, Mich
>>
>> Did you check the URL Josh referred to? The cast for string comparisons
>> is needed for accepting `c_date >= "2016"`.
>>
>> // maropu
>>
>>
>> On Fri, Apr 15, 2016 at 10:30 AM, Hyukjin Kwon <gurwls...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> String comparison itself is pushed down fine, but the problem is dealing
>>> with Cast.
>>>
>>>
>>> It was pushed down before, but it was reverted
>>> (https://github.com/apache/spark/pull/8049).
>>>
>>> Several fixes were tried (e.g. https://github.com/apache/spark/pull/11005),
>>> but none of them made it in.
>>>
>>>
>>> To cut it short, it is not being pushed down because it is unsafe to
>>> resolve the cast (e.g. long to integer).
>>>
>>> As a workaround, the implementation of the Solr data source could be
>>> changed to one based on CatalystScan, which takes all the filters (rough
>>> sketch below).
>>>
>>> However, CatalystScan is not designed to be binary compatible across
>>> releases, although it looks like some consider it stable now, as mentioned
>>> here: https://github.com/apache/spark/pull/10750#issuecomment-175400704.
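>>>
>>> A minimal sketch of what I mean (not the actual spark-solr code; the
>>> actual translation of the expressions is omitted):
>>>
>>> import org.apache.spark.rdd.RDD
>>> import org.apache.spark.sql.{Row, SQLContext}
>>> import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
>>> import org.apache.spark.sql.sources.{BaseRelation, CatalystScan}
>>> import org.apache.spark.sql.types.StructType
>>>
>>> class SolrCatalystRelation(override val sqlContext: SQLContext,
>>>                            override val schema: StructType)
>>>   extends BaseRelation with CatalystScan {
>>>
>>>   // CatalystScan hands over the raw Catalyst expressions, Cast included,
>>>   // so GreaterThanOrEqual(Cast(attr, StringType), Literal(...)) can be
>>>   // matched here and translated into a Solr fq clause.
>>>   override def buildScan(requiredColumns: Seq[Attribute],
>>>                          filters: Seq[Expression]): RDD[Row] = {
>>>     // translating `filters` into Solr queries would go here (sketch only)
>>>     sqlContext.sparkContext.emptyRDD[Row]
>>>   }
>>> }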
>>>
>>>
>>> Thanks!
>>>
>>>
>>> 2016-04-15 3:30 GMT+09:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>>
>>>> Hi Josh,
>>>>
>>>> Can you please clarify whether date comparisons as two strings work at
>>>> all?
>>>>
>>>> I was under the impression that with string comparison only the first
>>>> characters are compared?
>>>>
>>>> Thanks
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn:
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 April 2016 at 19:26, Josh Rosen <joshro...@databricks.com> wrote:
>>>>
>>>>> AFAIK this is not being pushed down because it involves an implicit
>>>>> cast and we currently don't push casts into data sources or scans; see
>>>>> https://github.com/databricks/spark-redshift/issues/155 for a
>>>>> possibly-related discussion.
>>>>>
>>>>> On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Are you comparing strings here or timestamps?
>>>>>>
>>>>>> Filter ((cast(registration#37 as string) >= 2015-05-28) &&
>>>>>> (cast(registration#37 as string) <= 2015-05-29))
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn:
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 April 2016 at 18:04, Kiran Chitturi <
>>>>>> kiran.chitt...@lucidworks.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Timestamp range filter queries in SQL are not getting pushed down to
>>>>>>> the PrunedFilteredScan instances. The filtering is happening at the 
>>>>>>> Spark
>>>>>>> layer.
>>>>>>>
>>>>>>> The physical plan for timestamp range queries does not show the pushed
>>>>>>> filters, whereas range queries on other types work fine and their
>>>>>>> physical plans show the pushed filters.
>>>>>>>
>>>>>>> Please see below for code and examples.
>>>>>>>
>>>>>>> *Example:*
>>>>>>>
>>>>>>> *1.* Range filter queries on Timestamp types
>>>>>>>
>>>>>>>    *code: *
>>>>>>>
>>>>>>>> sqlContext.sql("SELECT * from events WHERE `registration` >=
>>>>>>>> '2015-05-28' AND `registration` <= '2015-05-29' ")
>>>>>>>
>>>>>>>    *Full example*:
>>>>>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>>>>>>> *    plan*:
>>>>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql
>>>>>>>
>>>>>>> *2. * Range filter queries on Long types
>>>>>>>
>>>>>>>     *code*:
>>>>>>>
>>>>>>>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
>>>>>>>> `length` <= '1000'")
>>>>>>>
>>>>>>>     *Full example*:
>>>>>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>>>>>>>     *plan*:
>>>>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql
>>>>>>>
>>>>>>> The SolrRelation class we use extends
>>>>>>> <https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/SolrRelation.scala#L37>
>>>>>>> the PrunedFilteredScan.
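>>>>>>>
>>>>>>> For reference, the PrunedFilteredScan contract (filters arrive already
>>>>>>> translated into org.apache.spark.sql.sources.Filter objects):
>>>>>>>
>>>>>>> trait PrunedFilteredScan {
>>>>>>>   def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
>>>>>>> }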
>>>>>>>
>>>>>>> Since Solr supports date ranges, I would like for the timestamp
>>>>>>> filters to be pushed down to the Solr query.
>>>>>>>
>>>>>>> Are there limitations on the types of filters that are passed down
>>>>>>> for Timestamp types?
>>>>>>> Is there something I should do in my code to fix this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> --
>>>>>>> Kiran Chitturi
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
Kiran Chitturi
