Yes, the problem is, I've turned the flag on. One possible reason for this is, the parquet file supports "predicate pushdown" by setting statistical min/max value of each column on parquet blocks. If in my test, the "groupID=10113000" is scattered in all parquet blocks, then the predicate pushdown fails.
But, I'm not quite sure about that. I don't know whether there is any other reason that can lead to this. On Wed, Jan 7, 2015 at 10:14 PM, Cody Koeninger <c...@koeninger.org> wrote: > But Xuelin already posted in the original message that the code was using > > SET spark.sql.parquet.filterPushdown=true > > On Wed, Jan 7, 2015 at 12:42 AM, Daniel Haviv <danielru...@gmail.com> > wrote: > >> Quoting Michael: >> Predicate push down into the input format is turned off by default >> because there is a bug in the current parquet library that null pointers >> when there are full row groups that are null. >> >> https://issues.apache.org/jira/browse/SPARK-4258 >> >> You can turn it on if you want: >> http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration >> >> Daniel >> >> On 7 בינו׳ 2015, at 08:18, Xuelin Cao <xuelin...@yahoo.com.INVALID> >> wrote: >> >> >> Hi, >> >> I'm testing parquet file format, and the predicate pushdown is a >> very useful feature for us. >> >> However, it looks like the predicate push down doesn't work after >> I set >> sqlContext.sql("SET spark.sql.parquet.filterPushdown=true") >> >> Here is my sql: >> *sqlContext.sql("select adId, adTitle from ad where >> groupId=10113000").collect* >> >> Then, I checked the amount of input data on the WEB UI. But the >> amount of input data is ALWAYS 80.2M regardless whether I turn the >> spark.sql.parquet.filterPushdown >> flag on or off. >> >> I'm not sure, if there is anything that I must do when *generating >> *the parquet file in order to make the predicate pushdown available. >> (Like ORC file, when creating the ORC file, I need to explicitly sort the >> field that will be used for predicate pushdown) >> >> Anyone have any idea? >> >> And, anyone knows the internal mechanism for parquet predicate >> pushdown? >> >> Thanks >> >> >> >> >