Michael, I haven't been paying close attention to the JIRA tickets for PrunedFilteredScan, but I noticed some weird behavior around how the filters get applied when OR expressions are used in the WHERE clause. From what I was seeing, it looks like the "start" and "end" range predicates you're proposing to place in the WHERE clause may never be pushed down to the PrunedFilteredScan at all if there's an OR expression in there, like: (start > "2014-12-01" and end < "2015-02-12") or (....). I haven't written a unit test for this case yet, but I did file SPARK-5296 because of the behavior I was seeing. I'm requiring a time range in the services I'm writing because without it the full Accumulo table would be scanned, and that's no good.
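
In case it helps illustrate what I mean, here's a rough, untested sketch of the guard I have in mind for buildScan, written against the master/1.3-style sources API. The "start"/"end" column names, the joda-time parsing, and the EventTimeRelation/scanEventStore names are placeholders from my own setup, not anything in Spark:

    import java.util.Date
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, LessThan, PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType
    import org.joda.time.DateTime

    // Placeholder relation; the real schema and scan come from my event store.
    class EventTimeRelation(val sqlContext: SQLContext, override val schema: StructType)
      extends BaseRelation with PrunedFilteredScan {

      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        // With an OR in the WHERE clause, these predicates may never show up in `filters`.
        val start = filters.collectFirst {
          case GreaterThan("start", s: String) => DateTime.parse(s).toDate
        }
        val end = filters.collectFirst {
          case LessThan("end", e: String) => DateTime.parse(e).toDate
        }

        (start, end) match {
          case (Some(s), Some(e)) => scanEventStore(s, e, requiredColumns)
          case _ => sys.error("A start/end time range is required; refusing a full table scan")
        }
      }

      // Placeholder for the actual Accumulo scan.
      private def scanEventStore(start: Date, end: Date, cols: Array[String]): RDD[Row] =
        sqlContext.sparkContext.emptyRDD[Row]
    }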
Are there any plans to make CatalystScan public in the near future (possibly once Spark SQL reaches the proposed stability in 1.3)?

On Fri, Feb 13, 2015 at 12:14 AM, Michael Armbrust <mich...@databricks.com> wrote:

> Hi Corey,
>
> I would not recommend using the CatalystScan for this. It's lower level,
> and not stable across releases.
>
> You should be able to do what you want with PrunedFilteredScan
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L155>,
> though. The filters
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/filters.scala>
> that it pushes down are already normalized, so you can easily look for range
> predicates with the start/end columns you care about.
>
> val start = filters.find {
>   case GreaterThan("start", startDate: String) =>
>     DateTime.parse(startDate).toDate
> }.getOrElse(<min possible start date>)
>
> val end = filters.find {
>   case LessThan("end", endDate: String) => DateTime.parse(endDate).toDate
> }.getOrElse(<max possible date>)
>
> ...
>
> Filters are advisory, so you can ignore ones that aren't start/end.
>
> Michael
>
> On Thu, Feb 12, 2015 at 8:32 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
>> I have a temporal data set which I'd like to be able to query using
>> Spark SQL. The dataset is actually in Accumulo, and I've already written a
>> CatalystScan implementation and a RelationProvider [1] to register with the
>> SQLContext so that I can run my SQL statements.
>>
>> With my current implementation, the start and stop time ranges are set on
>> the RelationProvider (so ultimately they become a per-table setting). I'd
>> much rather be able to register the table without the time ranges and just
>> specify them through the SQL query string itself (perhaps an expression in
>> the WHERE clause?)
>>
>> [1]
>> https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStoreCatalyst.scala
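
For the archives: the snippet quoted above is pseudocode (find expects a Filter => Boolean predicate, and <min possible start date>/<max possible date> are placeholders), so here's roughly how I'd spell the same idea so it compiles. The default dates are just stand-ins:

    import java.util.Date
    import org.apache.spark.sql.sources.{Filter, GreaterThan, LessThan}
    import org.joda.time.DateTime

    // collectFirst skips any filters that aren't the start/end range predicates,
    // which is fine since pushed-down filters are advisory anyway.
    def timeRange(filters: Array[Filter]): (Date, Date) = {
      val start = filters.collectFirst {
        case GreaterThan("start", s: String) => DateTime.parse(s).toDate
      }.getOrElse(new Date(0L))               // stand-in for "min possible start date"
      val end = filters.collectFirst {
        case LessThan("end", e: String) => DateTime.parse(e).toDate
      }.getOrElse(new Date(Long.MaxValue))    // stand-in for "max possible date"
      (start, end)
    }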