Yao Guangdong created HIVE-25837:
------------------------------------

             Summary: Hive merge file operation may consume long time
                 Key: HIVE-25837
                 URL: https://issues.apache.org/jira/browse/HIVE-25837
             Project: Hive
          Issue Type: Improvement
          Components: Hive
    Affects Versions: All Versions
            Reporter: Yao Guangdong


  Merging files in Hive can take a very long time in some cases. This happens 
when a job produces thousands, or even tens of thousands, of small files, most 
of them only a few KB each. The current merge implementation considers only 
the target size (default 256M), so a single map task may end up merging 
thousands or even tens of thousands of small files, which takes far too long.

  To address this, we changed the code to consider not only the target size 
but also the number of files merged per map (default 1024/map). This may make 
the merged files smaller than the user's configured size, but compared with 
the time otherwise spent merging, I think users can accept that.
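
  The idea can be sketched roughly as follows. This is only an illustration 
of the grouping rule, not the actual Hive code or the patch: the class name 
MergeSplitSketch, the method groupFiles and its parameters are hypothetical. 
The 256M and 1024 values are the defaults mentioned above, with 256M taken 
here as 256 * 1024 * 1024 bytes as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: group small files into per-map merge tasks, closing a
// group when either the byte target or the file-count cap would be exceeded.
public class MergeSplitSketch {

    static List<List<Long>> groupFiles(List<Long> fileSizes,
                                       long targetBytes,
                                       int maxFilesPerTask) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0L;

        for (long size : fileSizes) {
            boolean sizeExceeded = currentBytes + size > targetBytes;
            boolean countReached = current.size() >= maxFilesPerTask;
            // Start a new group if adding this file would break either limit.
            if (!current.isEmpty() && (sizeExceeded || countReached)) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0L;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // 5000 files of 4 KB each: about 20 MB in total.
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            sizes.add(4096L);
        }
        long target = 256L * 1024 * 1024; // assumed meaning of "256M"

        // Size-only grouping (no effective cap on the file count).
        System.out.println(groupFiles(sizes, target, Integer.MAX_VALUE).size());
        // Size plus a cap of 1024 files per map.
        System.out.println(groupFiles(sizes, target, 1024).size());
    }
}
```

  With 5000 files of 4 KB each, size-only grouping puts all of them into one 
map, while the cap of 1024 files per map spreads them over 5 maps, each 
producing an output smaller than the target size.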



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
