Yao Guangdong created HIVE-25837:
------------------------------------

             Summary: Hive merge file operation may consume long time
                 Key: HIVE-25837
                 URL: https://issues.apache.org/jira/browse/HIVE-25837
             Project: Hive
          Issue Type: Improvement
          Components: Hive
    Affects Versions: All Versions
            Reporter: Yao Guangdong

Merging files in Hive can take a very long time in some cases. This happens when there are thousands, tens of thousands, or even more small files, most of them only a few KB each. The merge implementation currently considers only the target size (default 256M), so a single map may end up merging thousands or tens of thousands of small files, which takes too long.

For this case, we changed the code to consider not only the target size but also the number of files merged per map (default 1024 per map). This may make the target files smaller than the user's setting, but compared with the time spent merging files, I think users can accept it.
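
To illustrate the idea, here is a minimal sketch of grouping files under both limits. It is not the actual Hive patch: the class name `MergeGroupingSketch`, the method `group`, and the two constants are hypothetical names made up for this example, and the defaults simply mirror the numbers quoted above (256M and 1024 files per map).

```java
import java.util.ArrayList;
import java.util.List;

public class MergeGroupingSketch {

    // Hypothetical constants mirroring the numbers in this issue, not Hive settings.
    static final long TARGET_BYTES = 256L * 1024 * 1024;
    static final int MAX_FILES_PER_MAP = 1024;

    /**
     * Greedily packs files (given by size in bytes) into groups, one group per map.
     * A group is closed when adding the next file would exceed targetBytes,
     * or when it already holds maxFiles files.
     */
    static List<List<Long>> group(List<Long> fileSizes, long targetBytes, int maxFiles) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            boolean sizeFull = !current.isEmpty() && currentBytes + size > targetBytes;
            boolean countFull = current.size() >= maxFiles;
            if (sizeFull || countFull) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // 20,000 files of 4 KB each: about 78 MB in total, well under the target size.
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            sizes.add(4L * 1024);
        }
        // Size limit only: every file lands in a single map.
        System.out.println("size only: " + group(sizes, TARGET_BYTES, Integer.MAX_VALUE).size());
        // Size and count limits: the work is spread over ceil(20000 / 1024) maps.
        System.out.println("size + count: " + group(sizes, TARGET_BYTES, MAX_FILES_PER_MAP).size());
    }
}
```

With the size limit alone, the 20,000 small files in this example fall into 1 group. With the per-map file limit added they are split into 20 groups, each producing an output file of at most about 4 MB, which is the trade-off described above.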