Re: Review Request 67485: HIVE-19783 Retrieve only locations in HiveMetaStore.dropPartitionsAndGetLocations

Peter Vary via Review Board Tue, 19 Jun 2018 05:25:51 -0700


> On June 13, 2018, 4:50 p.m., Vihang Karajgaonkar wrote:
> > standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
> > Line 2545 (original), 2538 (patched)
> > <https://reviews.apache.org/r/67485/diff/2/?file=2036273#file2036273line2545>
> >
> >     My concern here is that we are removing the batch processing from this 
> > method. While the memory footprint of this method has reduced since we are 
> > not retrieving the fully loaded partition objects, I am worried that it may 
> > still cause OOMs for very large tables. Do you have any test results 
> > that show this implementation is no worse than what we already 
> > have in terms of memory footprint?
> 
> Peter Vary wrote:
>     I was able to run tests against an HMS with 4G of memory, dropping 1 million 
> partitions without problems. (It was harder to create the test tables than 
> to drop them :) )
>     
>     I think the typical size of a partitionName for a 5-level partitioned 
> table is ~150 bytes, and the location is ~300 bytes (partition name plus 
> table location), which is around 500 bytes per partition. The theoretical 
> maximum is 767 bytes for partitionName and 4000 bytes for location.
>     Currently there are customers who are not able to drop tables with 100k 
> partitions. For this number, the typical location map is 50M. The theoretical 
> maximum for the map is ~500M.
>     
>     I think that for a metastore where a table contains 100k partitions, 50M of 
> memory allocation should not cause a problem. These customers often have 64G 
> of memory set for HMS.
>     
>     Also we routinely query every partition name for a table (see: 
> PartitionIterator). If we have a 5-level partitioned table, then the memory 
> pressure is in the same range as for this method, and we do not allow any 
> other query to run against this table.
>     
>     I improved the change with your idea, so from now on getPartitionLocations 
> will not return the locations that are under the table's base directory. So for 
> typical managed tables it will return null for every partition, thus the load 
> will be roughly the same as for the PartitionIterator.
>     
>     If we decide we should query the partition locations in batches, then we 
> could do it in a follow-up jira:
>     - new configuration parameter, like: 
> metastore.batch.retrieve.table.partition.location.max = 10000
>     - modify getPartitionLocations to take partitionNames as input. We will 
> have to move to getPartQueryWithParams, and we have to check how it handles 
> large numbers of partitionNames.
>     - get the partition name list when dropping partitions, and get the 
> locations in batches.
>     
>     What do you think? Is the possibility of memory problems in this case 
> worth the extra complexity and risk?
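
The size estimate quoted above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the class and method names are made up for illustration, the byte counts are the ones given in the thread, and JVM object and map overhead is ignored, so real heap usage would be higher.

```java
public final class LocationMapEstimate {
    // Limits quoted in the thread: partition name up to 767 bytes,
    // location up to 4000 bytes.
    static final long MAX_NAME_BYTES = 767;
    static final long MAX_LOCATION_BYTES = 4000;

    /** Raw payload bytes for a name -> location map, ignoring JVM overhead. */
    public static long estimateBytes(long partitions, long nameBytes, long locationBytes) {
        return partitions * (nameBytes + locationBytes);
    }

    public static void main(String[] args) {
        long partitions = 100_000;
        // Typical 5-level partitioned table: ~150 byte name, ~300 byte location.
        long typical = estimateBytes(partitions, 150, 300);
        long worst = estimateBytes(partitions, MAX_NAME_BYTES, MAX_LOCATION_BYTES);
        System.out.printf("typical: %d MB%n", typical / 1_000_000);   // typical: 45 MB
        System.out.printf("worst case: %d MB%n", worst / 1_000_000);  // worst case: 476 MB
    }
}
```

With the numbers from the thread this gives about 45 MB for the typical case and about 477 MB for the worst case, which matches the rounded 50M and ~500M figures.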


Added back the batch processing
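
For readers following the thread, the general shape of batched location retrieval might look like the sketch below. It is not the code from the patch: the class, the forEachBatch helper, and the fetchBatch callback are hypothetical stand-ins for whatever store call accepts a list of partition names, and batchSize plays the role of the configuration parameter proposed above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

public final class BatchedLocationLookup {
    /**
     * Fetches locations for partitionNames in chunks of at most batchSize and
     * hands each chunk's result to handler, so only one batch is held at a time.
     * fetchBatch is a hypothetical stand-in for the store call.
     */
    public static void forEachBatch(
            List<String> partitionNames,
            int batchSize,
            Function<List<String>, Map<String, String>> fetchBatch,
            Consumer<Map<String, String>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        for (int start = 0; start < partitionNames.size(); start += batchSize) {
            int end = Math.min(start + batchSize, partitionNames.size());
            // subList is a view, so no copy of the names is made per batch.
            handler.accept(fetchBatch.apply(partitionNames.subList(start, end)));
        }
    }

    public static void main(String[] args) {
        List<String> names = List.of("ds=1", "ds=2", "ds=3");
        forEachBatch(
            names,
            2,
            batch -> {
                // Fake lookup: builds a made-up location for each name.
                Map<String, String> locations = new HashMap<>();
                for (String name : batch) {
                    locations.put(name, "/warehouse/tbl/" + name);
                }
                return locations;
            },
            locations -> System.out.println("batch of " + locations.size()));
    }
}
```

Passing each batch to a handler instead of collecting one large map keeps the peak memory proportional to the batch size, which is the concern raised in the review.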


- Peter


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67485/#review204700
-----------------------------------------------------------


On June 19, 2018, 12:23 p.m., Peter Vary wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67485/
> -----------------------------------------------------------
> 
> (Updated June 19, 2018, 12:23 p.m.)
> 
> 
> Review request for hive, Alexander Kolbasov and Vihang Karajgaonkar.
> 
> 
> Bugs: HIVE-19783
>     https://issues.apache.org/jira/browse/HIVE-19783
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Added a new getPartitionLocations method to the RawStore interface.
> 
> Implemented getPartitionLocations in ObjectStore using JDOQL.
> Question: In CachedStore: Shall I call rawStore.getPartitionLocations 
> or reimplement it using getPartitions?
> 
> Modified dropPartitionsAndGetLocations:
> - Instead of querying all data for every partition, query only the 
> locations using the new interface method
> - Removed the partKeys parameter, which became unnecessary
> 
> 
> Diffs
> -----
> 
>   itests/hcatalog-unit/src/test/java/org/apache/hive/hcatalog/listener/DummyRawStoreFailEvent.java 8f9a03fcd1 
>   standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java e88f9a5fee 
>   standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java e99f888eef 
>   standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/RawStore.java bbbdf21d4b 
>   standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/cache/CachedStore.java 7c3588d104 
>   standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java ec9e9e2b95 
>   standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreControlledCommit.java 7c7429db15 
>   standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java e4f2a17d64 
>   standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/client/MetaStoreFactoryForTests.java 1a57df2680 
>   standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/client/TestTablesCreateDropAlterTruncate.java e1c3dcb47f 
> 
> 
> Diff: https://reviews.apache.org/r/67485/diff/4/
> 
> 
> Testing
> -------
> 
> Ran the TestTablesCreateDropAlterTruncate test (partitioned table creation 
> and drop)
> 
> 
> Thanks,
> 
> Peter Vary
> 
>
