What version of Spark are you running?
> On Jan 17, 2017, at 8:42 PM, Raju Bairishetti <r...@apache.org> wrote:
>
> describe dummy;
> OK
> sample    string
> year      string
> month     string
> # Partition Information
> # col_name    data_type    comment
> year      string
> month     string
>
> val df = sqlContext.sql("select count(1) from rajub.dummy where year='2017'")
> df: org.apache.spark.sql.DataFrame = [_c0: bigint]
>
> scala> df.explain
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#3070L])
> +- TungstenExchange SinglePartition, None
>    +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#3076L])
>       +- Scan ParquetRelation: rajub.dummy[] InputPaths: maprfs:/user/rajub/dummy/sample/year=2016/month=10, maprfs:/user/rajub/dummy/sample/year=2016/month=11, maprfs:/user/rajub/dummy/sample/year=2016/month=9, maprfs:/user/rajub/dummy/sample/year=2017/month=10, maprfs:/user/rajub/dummy/sample/year=2017/month=11, maprfs:/user/rajub/dummy/sample/year=2017/month=9
>
> On Wed, Jan 18, 2017 at 12:25 PM, Michael Allman <mich...@videoamp.com> wrote:
> Can you paste the actual query plan here, please?
>
>> On Jan 17, 2017, at 7:38 PM, Raju Bairishetti <r...@apache.org> wrote:
>>
>> On Wed, Jan 18, 2017 at 11:13 AM, Michael Allman <mich...@videoamp.com> wrote:
>> What is the physical query plan after you set spark.sql.hive.convertMetastoreParquet to true?
>>
>> The physical plan contains all the partition locations.
>>
>> Michael
>>
>>> On Jan 17, 2017, at 6:51 PM, Raju Bairishetti <r...@apache.org> wrote:
>>>
>>> Thanks Michael for the response.
>>>
>>> On Wed, Jan 18, 2017 at 2:45 AM, Michael Allman <mich...@videoamp.com> wrote:
>>> Hi Raju,
>>>
>>> I'm sorry this isn't working for you. I helped author this functionality and will try my best to help.
>>>
>>> First, I'm curious why you set spark.sql.hive.convertMetastoreParquet to false?
>>>
>>> I had set it as suggested in SPARK-6910 and the corresponding pull requests. It did not work for me without setting the spark.sql.hive.convertMetastoreParquet property.
>>>
>>> Can you link specifically to the jira issue or spark pr you referred to? The first thing I would try is setting spark.sql.hive.convertMetastoreParquet to true. Setting that to false might also explain why you're getting parquet decode errors. If you're writing your table data with Spark's parquet file writer and reading with Hive's parquet file reader, there may be an incompatibility accounting for the decode errors you're seeing.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-6910. My main motivation is to avoid fetching all the partitions. We reverted the spark.sql.hive.convertMetastoreParquet setting to true due to the decoding errors. After reverting it, Spark is fetching all partitions from the table.
>>>
>>> Can you reply with your table's Hive metastore schema, including partition schema?
>>>
>>> col1 string
>>> col2 string
>>> year int
>>> month int
>>> day int
>>> hour int
>>> # Partition Information
>>> # col_name    data_type    comment
>>> year int
>>> month int
>>> day int
>>> hour int
>>> venture string
>>>
>>> Where are the table's files located?
>>>
>>> In hadoop, under some user directory.
>>>
>>> If you do a "show partitions <dbname>.<tablename>" in the spark-sql shell, does it show the partitions you expect to see? If not, run "msck repair table <dbname>.<tablename>".
>>>
>>> Yes, it is listing the partitions.
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>>> On Jan 17, 2017, at 12:02 AM, Raju Bairishetti <r...@apache.org> wrote:
>>>>
>>>> Had a high-level look into the code. It seems the getHiveQlPartitions method in HiveMetastoreCatalog is getting called irrespective of the metastorePartitionPruning conf value.
>>>>
>>>> It should not fetch all partitions if we set metastorePartitionPruning to true (the default value for this is false):
>>>>
>>>> def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] = {
>>>>   val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) {
>>>>     table.getPartitions(predicates)
>>>>   } else {
>>>>     allPartitions
>>>>   }
>>>>   ...
>>>>
>>>> def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] =
>>>>   client.getPartitionsByFilter(this, predicates)
>>>>
>>>> lazy val allPartitions = table.getAllPartitions
>>>>
>>>> But somehow getAllPartitions is getting called even after setting metastorePartitionPruning to true. Am I missing something or looking at the wrong place?
>>>>
>>>> On Tue, Jan 17, 2017 at 4:01 PM, Raju Bairishetti <r...@apache.org> wrote:
>>>> Hello,
>>>>
>>>> Spark SQL is generating a query plan with all partitions' information even though we apply filters on partitions in the query. Due to this, the Spark driver/Hive metastore is hitting OOM, as each table has lots of partitions.
>>>>
>>>> We can confirm from the hive audit logs that it tries to fetch all partitions from the hive metastore:
>>>>
>>>> 2016-12-28 07:18:33,749 INFO [pool-4-thread-184]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x cmd=get_partitions : db=xxxx tbl=xxxxx
>>>>
>>>> Configured the following parameters in the spark conf to fix the above issue (source: spark-jira & github pull requests):
>>>>
>>>> spark.sql.hive.convertMetastoreParquet false
>>>> spark.sql.hive.metastorePartitionPruning true
>>>>
>>>> plan: rdf.explain
>>>> == Physical Plan ==
>>>> HiveTableScan [rejection_reason#626], MetastoreRelation dbname, tablename, None, [(year#314 = 2016),(month#315 = 12),(day#316 = 28),(hour#317 = 2),(venture#318 = DEFAULT)]
>>>>
>>>> With these settings, the get_partitions_by_filter method is called and only the required partitions are fetched.
>>>>
>>>> But we are seeing parquet decode errors in our applications frequently after this. It looks like these decoding errors are because of changing the serde from spark-builtin to hive serde.
>>>>
>>>> I feel like fixing query plan generation in spark-sql is the right approach, instead of forcing users to use the hive serde.
>>>>
>>>> Is there any workaround/way to fix this issue? I would like to hear more thoughts on this :)
>>>>
>>>> On Mon, Jan 16, 2017 at 12:53 PM, Raju Bairishetti <r...@apache.org> wrote:
>>>> Waiting for suggestions/help on this...
>>>>
>>>> ------
>>>> Thanks,
>>>> Raju Bairishetti,
>>>> www.lazada.com
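
The branch of getHiveQlPartitions under discussion can be sketched as self-contained Scala. This is an illustrative model only, not Spark source: the names Partition, selectPartitions, and the in-memory partition list are hypothetical stand-ins for the metastore interaction.

```scala
// Illustrative sketch only -- NOT Spark internals. It models the conf check in
// HiveMetastoreCatalog.getHiveQlPartitions quoted in the thread: when
// metastorePartitionPruning is true, only partitions matching the predicate
// are fetched; otherwise every partition is loaded.
object PruningSketch {
  case class Partition(year: Int, month: Int) // hypothetical stand-in

  // Six partitions, mirroring the year=2016/2017, month=9/10/11 layout above.
  val allPartitions: Seq[Partition] =
    for { y <- 2016 to 2017; m <- Seq(9, 10, 11) } yield Partition(y, m)

  // Stand-in for client.getPartitionsByFilter: the metastore returns only
  // the partitions matching the pushed-down predicate.
  def getPartitionsByFilter(pred: Partition => Boolean): Seq[Partition] =
    allPartitions.filter(pred)

  // Mirrors the conf check in getHiveQlPartitions.
  def selectPartitions(metastorePartitionPruning: Boolean)
                      (pred: Partition => Boolean): Seq[Partition] =
    if (metastorePartitionPruning) getPartitionsByFilter(pred)
    else allPartitions // the problematic path: fetches every partition
}
```

With pruning enabled, a predicate like year = 2017 selects 3 of the 6 partitions; with it disabled, all 6 are fetched regardless of the filter, which is the OOM-inducing behavior the thread describes.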