Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

Raju Bairishetti Tue, 17 Jan 2017 18:53:27 -0800

Thanks, Michael, for the response.


On Wed, Jan 18, 2017 at 2:45 AM, Michael Allman <[email protected]>
wrote:

> Hi Raju,
>
> I'm sorry this isn't working for you. I helped author this functionality
> and will try my best to help.
>
> First, I'm curious why you set spark.sql.hive.convertMetastoreParquet to
> false?
>
I set it as suggested in SPARK-6910 and the corresponding pull requests. It
did not work for me without setting the *spark.sql.hive.convertMetastoreParquet*
property.

> Can you link specifically to the jira issue or spark pr you referred to?
> The first thing I would try is setting spark.sql.hive.convertMetastoreParquet
> to true. Setting that to false might also explain why you're getting
> parquet decode errors. If you're writing your table data with Spark's
> parquet file writer and reading with Hive's parquet file reader, there may
> be an incompatibility accounting for the decode errors you're seeing.
>
https://issues.apache.org/jira/browse/SPARK-6910 . My main motivation is
to avoid fetching all the partitions. We reverted the
spark.sql.hive.convertMetastoreParquet setting to true because of the
decoding errors. After reverting this, it is fetching all partitions from
the table.
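
For illustration, here is a minimal, self-contained sketch of the behaviour expected from the metastorePartitionPruning branch in the getHiveQlPartitions code quoted further down this thread. It is a toy model, not Spark's implementation: `Partition`, `FakeMetastore`, `PruningSketch` and `partitionsFor` are hypothetical names, and filtering client-side in the non-pruning branch is an assumption of the model. The recorded call names mirror the `cmd=` values seen in the Hive audit log.

```scala
// Toy model only: hypothetical stand-ins, not Spark or Hive classes.
final case class Partition(year: Int, month: Int, day: Int, hour: Int, venture: String)

// Fake metastore that records which call was made, much like the Hive audit log.
final class FakeMetastore(partitions: Seq[Partition]) {
  var calls: Vector[String] = Vector.empty

  def getAllPartitions: Seq[Partition] = {
    calls :+= "get_partitions"
    partitions
  }

  def getPartitionsByFilter(predicate: Partition => Boolean): Seq[Partition] = {
    calls :+= "get_partitions_by_filter"
    partitions.filter(predicate)
  }
}

object PruningSketch {
  // Same shape as the if/else in the quoted getHiveQlPartitions snippet.
  def partitionsFor(
      store: FakeMetastore,
      metastorePartitionPruning: Boolean,
      predicate: Partition => Boolean): Seq[Partition] =
    if (metastorePartitionPruning) {
      // Only the matching partitions leave the metastore.
      store.getPartitionsByFilter(predicate)
    } else {
      // Assumption of this model: everything is fetched, then filtered client-side.
      store.getAllPartitions.filter(predicate)
    }

  def main(args: Array[String]): Unit = {
    val all = for (h <- 0 until 24) yield Partition(2016, 12, 28, h, "DEFAULT")
    val wanted: Partition => Boolean = p => p.hour == 2

    val pruned = new FakeMetastore(all)
    println(partitionsFor(pruned, metastorePartitionPruning = true, wanted).size)
    println(pruned.calls)

    val unpruned = new FakeMetastore(all)
    println(partitionsFor(unpruned, metastorePartitionPruning = false, wanted).size)
    println(unpruned.calls)
  }
}
```

Both branches return the same single partition for hour = 2; only the recorded metastore call differs. That difference is what the audit log exposes, and it is the call that becomes expensive on tables with many partitions.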

> Can you reply with your table's Hive metastore schema, including partition
> schema?
>
     col1 string
     col2 string
     year int
     month int
     day int
     hour int

# Partition Information

# col_name            data_type           comment

year  int

month int

day int

hour int

venture string

>
>
> Where are the table's files located?
>
In Hadoop, under a user directory.

> If you do a "show partitions <dbname>.<tablename>" in the spark-sql shell,
> does it show the partitions you expect to see? If not, run "msck repair
> table <dbname>.<tablename>".
>
Yes, it lists the partitions.

> Cheers,
>
> Michael
>
>
> On Jan 17, 2017, at 12:02 AM, Raju Bairishetti <[email protected]> wrote:
>
> I had a high-level look at the code. It seems the getHiveQlPartitions method
> from HiveMetastoreCatalog is called irrespective of the
> metastorePartitionPruning conf value.
>
> It should not fetch all partitions if we set metastorePartitionPruning to
> true (the default value is false).
>
> def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] = {
>   val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) {
>     table.getPartitions(predicates)
>   } else {
>     allPartitions
>   }
>
> ...
>
> def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] =
>   client.getPartitionsByFilter(this, predicates)
>
> lazy val allPartitions = table.getAllPartitions
>
> But somehow getAllPartitions is still getting called even after setting
> metastorePartitionPruning to true.
>
> Am I missing something or looking in the wrong place?
>
>
> On Tue, Jan 17, 2017 at 4:01 PM, Raju Bairishetti <[email protected]> wrote:
>
>> Hello,
>>
>> Spark SQL is generating a query plan with information for all partitions,
>> even when we apply filters on partitions in the query. Because of
>> this, the Spark driver/Hive metastore is hitting OOM, as each table has
>> lots of partitions.
>>
>> We can confirm from hive audit logs that it tries to
>> *fetch all partitions* from hive metastore.
>>
>>  2016-12-28 07:18:33,749 INFO  [pool-4-thread-184]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub    ip=/x.x.x.x
>> cmd=get_partitions : db=xxxx tbl=xxxxx
>>
>>
>> We configured the following parameters in the Spark conf to fix the above
>> issue (source: Spark JIRA and GitHub pull request):
>>
>> *spark.sql.hive.convertMetastoreParquet   false*
>> *    spark.sql.hive.metastorePartitionPruning   true*
>>
>>
>> *   plan:  rdf.explain*
>> *   == Physical Plan ==*
>>        HiveTableScan [rejection_reason#626], MetastoreRelation dbname,
>> tablename, None,   [(year#314 = 2016),(month#315 = 12),(day#316 =
>> 28),(hour#317 = 2),(venture#318 = DEFAULT)]
>>
>> With these settings, the *get_partitions_by_filter* method is called and
>> fetches only the required partitions.
>>
>> But we are frequently seeing parquetDecode errors in our applications
>> after this. It looks like these decoding errors were caused by changing
>> the serde from spark-builtin to the Hive serde.
>>
>> I feel that *fixing query plan generation in spark-sql* is the
>> right approach, instead of forcing users to use the Hive serde.
>>
>> Is there any workaround/way to fix this issue? I would like to hear more
>> thoughts on this :)
>>
>>
>> On Tue, Jan 17, 2017 at 4:00 PM, Raju Bairishetti <[email protected]>
>> wrote:
>>
>>> I had a high-level look at the code. It seems the getHiveQlPartitions method
>>> from HiveMetastoreCatalog is called irrespective of the
>>> metastorePartitionPruning conf value.
>>>
>>> It should not fetch all partitions if we set metastorePartitionPruning to
>>> true (the default value is false).
>>>
>>> def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] = {
>>>   val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) {
>>>     table.getPartitions(predicates)
>>>   } else {
>>>     allPartitions
>>>   }
>>>
>>> ...
>>>
>>> def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] =
>>>   client.getPartitionsByFilter(this, predicates)
>>>
>>> lazy val allPartitions = table.getAllPartitions
>>>
>>> But somehow getAllPartitions is still getting called even after setting
>>> metastorePartitionPruning to true.
>>>
>>> Am I missing something or looking in the wrong place?
>>>
>>>
>>> On Mon, Jan 16, 2017 at 12:53 PM, Raju Bairishetti <[email protected]>
>>> wrote:
>>>
>>>> Waiting for suggestions/help on this...
>>>>
>>>> On Wed, Jan 11, 2017 at 12:14 PM, Raju Bairishetti <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Spark SQL is generating a query plan with information for all partitions,
>>>>> even when we apply filters on partitions in the query. Because of this,
>>>>> the Spark driver/Hive metastore is hitting OOM, as each table has lots
>>>>> of partitions.
>>>>>
>>>>> We can confirm from hive audit logs that it tries to *fetch all
>>>>> partitions* from hive metastore.
>>>>>
>>>>>  2016-12-28 07:18:33,749 INFO  [pool-4-thread-184]:
>>>>> HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(371)) -
>>>>> ugi=rajub    ip=/x.x.x.x   cmd=get_partitions : db=xxxx tbl=xxxxx
>>>>>
>>>>>
>>>>> We configured the following parameters in the Spark conf to fix the above
>>>>> issue (source: Spark JIRA and GitHub pull request):
>>>>>
>>>>> *spark.sql.hive.convertMetastoreParquet   false*
>>>>> *    spark.sql.hive.metastorePartitionPruning   true*
>>>>>
>>>>>
>>>>> *   plan:  rdf.explain*
>>>>> *   == Physical Plan ==*
>>>>>        HiveTableScan [rejection_reason#626], MetastoreRelation dbname,
>>>>> tablename, None,   [(year#314 = 2016),(month#315 = 12),(day#316 =
>>>>> 28),(hour#317 = 2),(venture#318 = DEFAULT)]
>>>>>
>>>>> With these settings, the *get_partitions_by_filter* method is called and
>>>>> fetches only the required partitions.
>>>>>
>>>>> But we are frequently seeing parquetDecode errors in our applications
>>>>> after this. It looks like these decoding errors were caused by
>>>>> changing the serde from spark-builtin to the Hive serde.
>>>>>
>>>>> I feel that *fixing query plan generation in spark-sql* is the
>>>>> right approach, instead of forcing users to use the Hive serde.
>>>>>
>>>>> Is there any workaround/way to fix this issue? I would like to hear
>>>>> more thoughts on this :)
>>>>>
>>>>> ------
>>>>> Thanks,
>>>>> Raju Bairishetti,
>>>>> www.lazada.com
>>>>>


-- 

------
Thanks,
Raju Bairishetti,
www.lazada.com
