I see now. It optimizes the selection semantics so that fewer things need to be included just to do a count(). Very nice. I did a collect() instead of a count() just to see what would happen, and it looks like all of the select fields were propagated down as expected. Thanks.
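For reference, here is a minimal sketch of the kind of relation one can use to watch this happen. InspectingRelation and its two-column schema are made up for illustration, and the imports assume the org.apache.spark.sql.sources API as it looks in later releases (package paths moved between 1.2 and 1.3):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation that just logs what Spark pushes down.
case class InspectingRelation(sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override val schema: StructType = StructType(
    StructField("key1", StringType) ::
    StructField("key2", StringType) :: Nil)

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // A count() needs no column values, so requiredColumns can be empty;
    // a collect() of "SELECT key1, key2 ..." should list both here.
    println(s"required columns: ${requiredColumns.mkString(", ")}")
    println(s"pushed filters:   ${filters.mkString(", ")}")
    sqlContext.sparkContext.emptyRDD[Row]
  }
}

Registering a table backed by this relation and running the queries from the thread below prints the pruned column list and pushed filters for each.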
On Sat, Jan 17, 2015 at 4:29 PM, Michael Armbrust <mich...@databricks.com> wrote:

> How are you running your test here? Are you perhaps doing a .count()?
>
> On Sat, Jan 17, 2015 at 12:54 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
>> Michael,
>>
>> What I'm seeing (in Spark 1.2.0) is that the required columns being
>> pushed down to the DataRelation are not the product of the SELECT clause
>> but rather just the columns explicitly included in the WHERE clause.
>>
>> Examples from my testing:
>>
>> SELECT * FROM myTable --> The required columns are empty.
>> SELECT key1 FROM myTable --> The required columns are empty.
>> SELECT * FROM myTable WHERE key1 = 'val1' --> The required columns
>> contain key1.
>> SELECT key1, key2 FROM myTable WHERE key1 = 'val1' --> The required
>> columns contain key1.
>> SELECT key1, key2 FROM myTable WHERE key1 = 'val1' AND key2 = 'val2' -->
>> The required columns contain key1, key2.
>>
>> I created SPARK-5296 for the OR predicate to be pushed down in some
>> capacity.
>>
>> On Sat, Jan 17, 2015 at 3:38 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>>> 1) The fields in the SELECT clause are not pushed down to the predicate
>>>> pushdown API. I have many optimizations that allow fields to be filtered
>>>> out before the resulting object is serialized on the Accumulo tablet
>>>> server. How can I get the selection information from the execution plan?
>>>> I'm a little hesitant to implement the data relation that allows me to see
>>>> the logical plan because it's noted in the comments that it could change
>>>> without warning.
>>>
>>> I'm not sure I understand. The list of required columns should be
>>> pushed down to the data source. Are you looking for something more
>>> complicated?
>>>
>>>> 2) I'm surprised to find that the predicate pushdown filters get
>>>> completely removed when I do anything more complex in a WHERE clause other
>>>> than simple AND statements. Using an OR statement caused the filter array
>>>> that was passed into PrunedFilteredScan to be empty.
>>>
>>> This was just an initial cut at the set of predicates to push down. We
>>> can add Or. Mind opening a JIRA?
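A note on what SPARK-5296 asks for: once compound predicates are pushed down, a disjunction would presumably arrive in buildScan's filter array as a single tree rather than being dropped. The Or and EqualTo case classes below follow the shape Spark's org.apache.spark.sql.sources filter algebra later took, so treat this as an assumption against 1.2.0:

import org.apache.spark.sql.sources.{EqualTo, Filter, Or}

// Assuming Or is added: WHERE key1 = 'val1' OR key2 = 'val2' would reach
// buildScan as one filter tree instead of an empty array.
val pushed: Filter = Or(EqualTo("key1", "val1"), EqualTo("key2", "val2"))

Pushed filters are advisory either way: a source that can't evaluate one can simply ignore it, since Spark SQL re-applies every predicate to the rows the scan returns, so pushdown is an optimization rather than a correctness requirement.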