Sourabh Goyal created HIVE-25800:
------------------------------------

             Summary: loadDynamicPartitions in Hive.java should not load all 
partitions of a table from HMS 
                 Key: HIVE-25800
                 URL: https://issues.apache.org/jira/browse/HIVE-25800
             Project: Hive
          Issue Type: Improvement
          Components: Hive
            Reporter: Sourabh Goyal
            Assignee: Sourabh Goyal


HIVE-20661 added an improvement in loadDynamicPartitions() api in Hive.java to 
not add partitions one by one in HMS. As part of that improvement, following 
code was introduced: 
{code:java}
// fetch all the partitions matching the part spec using the partition iterable
// this way the maximum batch size configuration parameter is considered
PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl, 
partSpec,
          conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(), 
300));
Iterator<Partition> iterator = partitionIterable.iterator();

// Match valid partition path to partitions
while (iterator.hasNext()) {
  Partition partition = iterator.next();
  partitionDetailsMap.entrySet().stream()
          .filter(entry -> 
entry.getValue().fullSpec.equals(partition.getSpec()))
          .findAny().ifPresent(entry -> {
            entry.getValue().partition = partition;
            entry.getValue().hasOldPartition = true;
          });
} {code}
The above code fetches all the existing partitions for a table from HMS and 
compare that dynamic partitions list to decide old and new partitions to be 
added to HMS (in batches). The call to fetch all partitions has introduced a 
performance regression for tables with large number of partitions (of the order 
of 100K). 

 

The fix is to skip fetching all partitions. Instead, in the threadPool which 
loads each partition individually,  call get_partition() to check if the 
partition already exists in HMS or not.  

This will introduce additional getPartition() call for every partition to be 
loaded dynamically but removes fetching all existing partitions for a table. 

I believe this is fine since for tables with small number of existing 
partitions in HMS - getPartitions() won't add too much overhead but for tables 
with large number of existing partitions, it will certainly avoid getting all 
partitions from HMS 

cc - [~lpinter] [~ngangam] 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to