[ 
https://issues.apache.org/jira/browse/HIVE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016461#comment-13016461
 ] 

Ning Zhang commented on HIVE-2082:
----------------------------------

@Edward, HIVE-1913 fixed a bug in PartitionDesc where table properties were 
returned even if partition properties were present. This patch doesn't change 
that behavior. 

What this patch changes is how PartitionDesc.properties is constructed. 
Previously, properties was constructed using part.getSchema(), which 
constructs a new Properties object for each partition. The most 
memory-consuming parts are colNames, colTypes and partStrings (see 
MetaStoreUtils.getSchema()). Since they are constructed from the table-level 
StorageDescriptor, all partitions have the same colNames, colTypes and 
partStrings, so we can use the same objects for all partitions. 

This patch introduces a new PartitionDesc constructor with an additional 
TableDesc argument. The properties object is constructed using 
part.getSchemaFromTableSchema(tblDesc.getProperties()), which first clones the 
table-level properties into the partition-level properties and then overwrites 
them with partition-specific values. Basically everything except colNames, 
colTypes and partStrings is overwritten with the partition-level values. 
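
As a rough illustration of the clone-then-overwrite idea (not the actual patch 
code), here is a minimal standalone sketch. The class name, the helper 
schemaFromTableSchema() and the property keys and values are hypothetical 
stand-ins; the sketch only relies on Properties.clone() being a shallow copy, 
so values that are not overwritten are shared with the table-level object. 

```java
import java.util.Properties;

// Hypothetical sketch; names are illustrative, not Hive's real classes.
class PartitionPropsSketch {

    // Stand-in for the idea behind part.getSchemaFromTableSchema(tblProps):
    // shallow-clone the table-level properties, then overwrite with
    // partition-specific entries.
    static Properties schemaFromTableSchema(Properties tableProps,
                                            Properties partitionOverrides) {
        // clone() copies the table, not the key/value objects, so the
        // large column strings are shared rather than duplicated.
        Properties result = (Properties) tableProps.clone();
        result.putAll(partitionOverrides);
        return result;
    }

    public static void main(String[] args) {
        Properties table = new Properties();
        // Illustrative keys and values (assumptions for this sketch).
        table.setProperty("columns", "id,name,ts");
        table.setProperty("columns.types", "int:string:bigint");
        table.setProperty("location", "/warehouse/t");

        Properties part = new Properties();
        part.setProperty("location", "/warehouse/t/ds=2011-04-01");

        Properties merged = schemaFromTableSchema(table, part);
        // The partition-specific value wins.
        System.out.println(merged.getProperty("location"));
        // The column string is the same object as in the table properties.
        System.out.println(
            merged.getProperty("columns") == table.getProperty("columns"));
    }
}
```

Each partition still gets its own Properties table in this sketch; the saving 
comes from not rebuilding the large column strings for every partition. 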

> Reduce memory consumption in preparing MapReduce job
> ----------------------------------------------------
>
>                 Key: HIVE-2082
>                 URL: https://issues.apache.org/jira/browse/HIVE-2082
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: HIVE-2082.patch, HIVE-2082.patch, HIVE-2082.patch
>
>
> The Hive client side consumes a lot of memory when the number of input partitions 
> is large. One reason is that each partition maintains a list of FieldSchema 
> which are intended to deal with schema evolution. However they are not used 
> currently and Hive uses the table level schema for all partitions. This will 
> be fixed in HIVE-2050. The memory consumption by this part will be reduced by 
> almost half (1.2GB to 700MB for 20k partitions). 
> Another large chunk of memory consumption is in the MapReduce job setup phase 
> when a PartitionDesc is created from each Partition object. A property object 
> is maintained in PartitionDesc which contains a full list of columns and 
> types. For the same reason, these should be the same as in the table level 
> schema. Also the deserializer initialization takes a large amount of memory, 
> which should be avoided. My initial testing for these optimizations cut the 
> memory consumption in half (700MB to 300MB for 20k partitions). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
