Ok. I just verified that this is the case with a little test:

WHERE (a = 'v1' and b = 'v2')
  PrunedFilteredScan passes down 2 filters

WHERE (a = 'v1' and b = 'v2') or (a = 'v3')
  PrunedFilteredScan passes down 0 filters
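A minimal relation along these lines is enough to reproduce it (just a sketch, not the exact test code: the FilterLoggingRelation name and the two-column schema are made up for illustration, and the imports assume the 1.3-style org.apache.spark.sql.types package):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative relation that only reports which filters reach buildScan.
class FilterLoggingRelation(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("a", StringType), StructField("b", StringType)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // WHERE (a = 'v1' and b = 'v2') prints two EqualTo filters here;
    // adding "or (a = 'v3')" to the clause prints none.
    println(s"pushed-down filters: ${filters.mkString(", ")}")
    sqlContext.sparkContext.emptyRDD[Row]
  }
}

Register it through a RelationProvider (or just build the relation and register a temp table) and the two WHERE clauses above give the 2-filter / 0-filter behavior.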
On Fri, Feb 13, 2015 at 12:28 AM, Corey Nolet <cjno...@gmail.com> wrote:

> Michael,
>
> I haven't been paying close attention to the JIRA tickets for
> PrunedFilteredScan, but I noticed some weird behavior around the filters
> being applied when OR expressions were used in the WHERE clause. From what
> I was seeing, it looks like it could be possible that the "start" and "end"
> ranges you are proposing to place in the WHERE clause could actually never
> be pushed down to the PrunedFilteredScan if there's an OR expression in
> there, like: (start > "2014-12-01" and end < "2015-02-12") or (....). I
> haven't done a unit test for this case yet, but I did file SPARK-5296
> because of the behavior I was seeing. I'm requiring a time range in the
> services I'm writing because without it, the full Accumulo table would be
> scanned, and that's no good.
>
> Are there any plans to make CatalystScan public in the near future
> (possibly once Spark SQL reaches the proposed stability in 1.3)?
>
>
> On Fri, Feb 13, 2015 at 12:14 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Hi Corey,
>>
>> I would not recommend using the CatalystScan for this. It's lower level
>> and not stable across releases.
>>
>> You should be able to do what you want with PrunedFilteredScan
>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L155>,
>> though. The filters
>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/filters.scala>
>> that it pushes down are already normalized, so you can easily look for
>> range predicates with the start/end columns you care about.
>>
>> val start = filters.find {
>>   case GreaterThan("start", startDate: String) =>
>>     DateTime.parse(startDate).toDate
>> }.getOrElse(<min possible start date>)
>> val end = filters.find {
>>   case LessThan("end", endDate: String) => DateTime.parse(endDate).toDate
>> }.getOrElse(<max possible date>)
>>
>> ...
>>
>> Filters are advisory, so you can ignore ones that aren't start/end.
>>
>> Michael
>>
>> On Thu, Feb 12, 2015 at 8:32 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>
>>> I have a temporal data set that I'd like to be able to query using
>>> Spark SQL. The dataset is actually in Accumulo, and I've already written
>>> a CatalystScan implementation and RelationProvider [1] to register with
>>> the SQLContext so that I can apply my SQL statements.
>>>
>>> With my current implementation, the start and stop time ranges are set
>>> on the RelationProvider (so ultimately they become a per-table setting).
>>> I'd much rather be able to register the table without the time ranges
>>> and just specify them through the SQL query string itself (perhaps as an
>>> expression in the WHERE clause?)
>>>
>>> [1]
>>> https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStoreCatalyst.scala
>>>
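One note for anyone copying the snippet quoted above: filters.find expects a Filter => Boolean, so it won't compile with a partial function that returns a Date. collectFirst expresses the same idea and simply skips the non-matching (advisory) filters. A compiling sketch of the same approach (the timeRange helper and the minStart/maxEnd defaults are made-up stand-ins for the "<min possible start date>" / "<max possible date>" placeholders, and it assumes the same Joda-Time DateTime as the snippet):

import java.util.Date
import org.apache.spark.sql.sources.{Filter, GreaterThan, LessThan}
import org.joda.time.DateTime

// Made-up open-ended defaults used when a bound wasn't pushed down.
val minStart = new Date(0L)
val maxEnd = new Date(Long.MaxValue)

// Pull a (start, end) range out of the pushed-down filters; anything that
// isn't a start/end bound is ignored, since filters are advisory.
def timeRange(filters: Array[Filter]): (Date, Date) = {
  val start = filters.collectFirst {
    case GreaterThan("start", startDate: String) => DateTime.parse(startDate).toDate
  }.getOrElse(minStart)

  val end = filters.collectFirst {
    case LessThan("end", endDate: String) => DateTime.parse(endDate).toDate
  }.getOrElse(maxEnd)

  (start, end)
}

Calling timeRange(filters) inside buildScan would then give the range to turn into Accumulo scan bounds, with everything else left for Spark to re-apply.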