Hi Corey,

I would not recommend using CatalystScan for this. It's lower level and not
stable across releases.

You should be able to do what you want with PrunedFilteredScan
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L155>,
though.  The filters
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/filters.scala>
that it pushes down are already normalized, so you can easily look for range
predicates on the start/end columns you care about.

import org.apache.spark.sql.sources.{GreaterThan, LessThan}
import org.joda.time.DateTime

val start = filters.collectFirst {
  case GreaterThan("start", startDate: String) =>
    DateTime.parse(startDate).toDate
}.getOrElse(<min possible start date>)
val end = filters.collectFirst {
  case LessThan("end", endDate: String) =>
    DateTime.parse(endDate).toDate
}.getOrElse(<max possible date>)

...

Filters are advisory, so you can ignore any that aren't on start/end; Spark
will still apply the ones you ignore to the rows you return.
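
For context, here's a rough sketch of how that extraction might sit inside a
PrunedFilteredScan relation. EventRelation and scanStore are placeholder
names for your Accumulo-backed pieces, and the Date(0) / Date(Long.MaxValue)
fallbacks are just stand-ins for the widest possible range:

import java.util.Date
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types.StructType
import org.joda.time.DateTime

// Placeholder relation; wire schema and scanStore to your event store.
class EventRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = ???  // your event schema

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Pull the pushed-down range predicates, falling back to the widest
    // range when nothing was pushed down.
    val start = filters.collectFirst {
      case GreaterThan("start", s: String) => DateTime.parse(s).toDate
    }.getOrElse(new Date(0L))             // stand-in minimum
    val end = filters.collectFirst {
      case LessThan("end", e: String) => DateTime.parse(e).toDate
    }.getOrElse(new Date(Long.MaxValue))  // stand-in maximum

    // Scan [start, end]; filters we ignore here are still applied by
    // Spark on the rows we return.
    scanStore(start, end, requiredColumns)
  }

  // Placeholder for the actual Accumulo scan.
  private def scanStore(start: Date, end: Date,
                        cols: Array[String]): RDD[Row] = ???
}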

Michael

On Thu, Feb 12, 2015 at 8:32 PM, Corey Nolet <cjno...@gmail.com> wrote:

> I have a temporal data set in which I'd like to be able to query using
> Spark SQL. The dataset is actually in Accumulo and I've already written a
> CatalystScan implementation and RelationProvider[1] to register with the
> SQLContext so that I can apply my SQL statements.
>
> With my current implementation, the start and stop time ranges are set on
> the RelationProvider (so ultimately they become a per-table setting). I'd
> much rather be able to register the table without the time ranges and just
> specify them through the SQL query string itself (perhaps an expression in
> the WHERE clause?)
>
>
> [1]
> https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStoreCatalyst.scala
>
