Ok. I just verified that this is the case with a little test:

WHERE (a = 'v1' and b = 'v2')                 ->  2 filters pushed down to PrunedFilteredScan
WHERE (a = 'v1' and b = 'v2') or (a = 'v3')   ->  0 filters pushed down to PrunedFilteredScan
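
For reference, this was the shape of the test -- a minimal sketch rather than
the exact code (FilterLoggingRelation is a made-up name, and the real check
ran against our Accumulo-backed relation, using the 1.3/master sources API):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

// Relation that does nothing but report which filters the planner pushed down.
case class FilterLoggingRelation(sqlContext: SQLContext) extends BaseRelation
    with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("a", StringType, nullable = true),
    StructField("b", StringType, nullable = true)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // With the OR query above, this prints an empty list of filters.
    println("pushed-down filters: " + filters.mkString(", "))
    sqlContext.sparkContext.emptyRDD[Row]
  }
}

// Register it and run the two queries:
// sqlContext.baseRelationToDataFrame(FilterLoggingRelation(sqlContext)).registerTempTable("t")
// sqlContext.sql("SELECT a FROM t WHERE (a = 'v1' and b = 'v2') or (a = 'v3')").collect()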

On Fri, Feb 13, 2015 at 12:28 AM, Corey Nolet <cjno...@gmail.com> wrote:

> Michael,
>
> I haven't been paying close attention to the JIRA tickets for
> PrunedFilteredScan, but I noticed some weird behavior around the filters
> being applied when OR expressions were used in the WHERE clause. From what
> I was seeing, it looks like the "start" and "end" ranges you are proposing
> to place in the WHERE clause may never be pushed down to the
> PrunedFilteredScan if there's an OR expression in there, like:
> (start > "2014-12-01" and end < "2015-02-12") or (....). I haven't written
> a unit test for this case yet, but I did file SPARK-5296 because of the
> behavior I was seeing. I'm requiring a time range in the services I'm
> writing because without it, the full Accumulo table would be scanned, and
> that's no good.
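>
> If I move the services over to PrunedFilteredScan, the guard I have in mind
> is roughly this (just a sketch -- the helper name and the 'start'/'end'
> column names are illustrative):
>
> import org.apache.spark.sql.sources.{Filter, GreaterThan, LessThan}
>
> // Fail fast instead of silently scanning the whole Accumulo table.
> def requireTimeRange(filters: Array[Filter]): Unit = {
>   val hasStart = filters.exists {
>     case GreaterThan("start", _) => true
>     case _ => false
>   }
>   val hasEnd = filters.exists {
>     case LessThan("end", _) => true
>     case _ => false
>   }
>   require(hasStart && hasEnd,
>     "query must constrain both 'start' and 'end' to avoid a full table scan")
> }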
>
> Are there any plans to make CatalystScan public in the near future
> (possibly once Spark SQL reaches the proposed stability in 1.3)?
>
>
> On Fri, Feb 13, 2015 at 12:14 AM, Michael Armbrust <mich...@databricks.com
> > wrote:
>
>> Hi Corey,
>>
>> I would not recommend using CatalystScan for this. It's lower level and
>> not stable across releases.
>>
>> You should be able to do what you want with PrunedFilteredScan
>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L155>,
>> though.  The filters
>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/filters.scala>
>> that get pushed down to it are already normalized, so you can easily look
>> for range predicates on the start/end columns you care about.
>>
>> // collectFirst both matches the filter and extracts the parsed date.
>> val start = filters.collectFirst {
>>   case GreaterThan("start", startDate: String) => DateTime.parse(startDate).toDate
>> }.getOrElse(<min possible start date>)
>> val end = filters.collectFirst {
>>   case LessThan("end", endDate: String) => DateTime.parse(endDate).toDate
>> }.getOrElse(<max possible date>)
>>
>> ...
>>
>> Filters are advisory, so you can ignore ones that aren't start/end.
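>>
>> Packaged up, the extraction might look something like this (illustrative
>> only -- timeRange is just a made-up helper name, and you may also want to
>> match the *OrEqual variants since either form can show up after
>> normalization):
>>
>> import java.util.Date
>> import org.apache.spark.sql.sources._
>> import org.joda.time.DateTime
>>
>> // Pull the pushed-down time range out of the filters; everything else can
>> // be ignored here because Spark re-applies all the filters after the scan.
>> def timeRange(filters: Array[Filter]): (Date, Date) = {
>>   val start = filters.collectFirst {
>>     case GreaterThan("start", s: String)        => DateTime.parse(s).toDate
>>     case GreaterThanOrEqual("start", s: String) => DateTime.parse(s).toDate
>>   }.getOrElse(new Date(0L))                      // no lower bound pushed down
>>   val end = filters.collectFirst {
>>     case LessThan("end", e: String)             => DateTime.parse(e).toDate
>>     case LessThanOrEqual("end", e: String)      => DateTime.parse(e).toDate
>>   }.getOrElse(new Date(Long.MaxValue))           // no upper bound pushed down
>>   (start, end)
>> }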
>>
>> Michael
>>
>> On Thu, Feb 12, 2015 at 8:32 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>
>>> I have a temporal data set that I'd like to be able to query using
>>> Spark SQL. The dataset is actually in Accumulo and I've already written a
>>> CatalystScan implementation and RelationProvider[1] to register with the
>>> SQLContext so that I can apply my SQL statements.
>>>
>>> With my current implementation, the start and stop time ranges are set
>>> on the RelationProvider (so ultimately they become a per-table setting).
>>> I'd much rather be able to register the table without the time ranges and
>>> just specify them through the SQL query string itself (perhaps an expression
>>> in the WHERE clause?)
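>>>
>>> In other words, the usage I'm after is roughly this (a sketch -- the table
>>> name, option keys, and provider string are illustrative):
>>>
>>> // Register the Accumulo-backed table once, with no time range baked in.
>>> sqlContext.sql(
>>>   """CREATE TEMPORARY TABLE events
>>>     |USING org.calrissian.accumulorecipes.spark.sql.EventStoreCatalyst
>>>     |OPTIONS (instance 'accumulo', zookeepers 'localhost:2181', user 'root', pass 'secret')
>>>   """.stripMargin)
>>>
>>> // ...and supply the time range through the query itself.
>>> val results = sqlContext.sql(
>>>   "SELECT * FROM events WHERE start > '2014-12-01' AND end < '2015-02-12'")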
>>>
>>>
>>> [1]
>>> https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStoreCatalyst.scala
>>>
>>
>>
>
