liuxiaolong created HADOOP-16872:
------------------------------------

             Summary: Performance improvement when distcp files in large dir with -direct option
                 Key: HADOOP-16872
                 URL: https://issues.apache.org/jira/browse/HADOOP-16872
             Project: Hadoop Common
          Issue Type: Improvement
            Reporter: liuxiaolong

We use distcp with the -direct option to copy a single file between two large directories, and we found that the copy takes a few minutes. If we launch too many distcp jobs at the same time, NameNode performance degrades seriously.

    hadoop distcp -direct -skipcrccheck -update -prbugaxt -i -numListstatusThreads 1 hdfs://cluster1:8020/source/100.log hdfs://cluster2:8020/target/100.jpg

|| ||Dir path||Count||
||Source dir||hdfs://cluster1:8020/source/||100k+ files||
||Target dir||hdfs://cluster2:8020/target/||100k+ files||

Checking the code in CopyCommitter.java, we find that the function deleteAttemptTempFiles() contains this call:

    targetFS.globStatus(new Path(targetWorkPath, ".distcp.tmp." + jobId.replaceAll("job","attempt") + "*"));

This call wastes a lot of time when distcp runs between two large directories.

When distcp is used with the -direct option, it writes directly to the target file without generating a '.distcp.tmp' temp file, so there are no temp files for this cleanup to find. I therefore think deleteAttemptTempFiles() needs a check: if distcp was run with the -direct option, do nothing and return immediately.
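
The proposed check can be sketched as follows. This is a dependency-free model, not the actual Hadoop patch: the class name DirectWriteCleanupSketch, the globByPrefix() helper, the globCalls counter, and the directWrite parameter are all hypothetical, and an in-memory list of file names stands in for the target directory. Only the '.distcp.tmp.' prefix and the jobId.replaceAll("job","attempt") expression come from the code quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the proposed early return; not the real CopyCommitter. */
public class DirectWriteCleanupSketch {

    /** Counts how often the (expensive) directory scan is performed. */
    public static int globCalls = 0;

    /** Stand-in for targetFS.globStatus(...): scans every name in the listing. */
    public static List<String> globByPrefix(List<String> dirListing, String prefix) {
        globCalls++;
        List<String> matches = new ArrayList<>();
        for (String name : dirListing) {
            if (name.startsWith(prefix)) {
                matches.add(name);
            }
        }
        return matches;
    }

    /**
     * Models deleteAttemptTempFiles(): deletes this job's temp files from the
     * listing and returns how many were removed. directWrite is an assumed
     * flag representing "distcp was started with -direct".
     */
    public static int deleteAttemptTempFiles(List<String> dirListing,
                                             String jobId,
                                             boolean directWrite) {
        if (directWrite) {
            // -direct writes straight to the target file, so no
            // '.distcp.tmp.' files exist and the scan can be skipped.
            return 0;
        }
        String prefix = ".distcp.tmp." + jobId.replaceAll("job", "attempt");
        List<String> tempFiles = globByPrefix(dirListing, prefix);
        dirListing.removeAll(tempFiles);
        return tempFiles.size();
    }

    public static void main(String[] args) {
        List<String> dir = new ArrayList<>(
                List.of("100.jpg", ".distcp.tmp.attempt_1_0001_m_000000_0"));

        int removedDirect = deleteAttemptTempFiles(dir, "job_1_0001", true);
        System.out.println("direct: removed=" + removedDirect + ", scans=" + globCalls);

        int removedNormal = deleteAttemptTempFiles(dir, "job_1_0001", false);
        System.out.println("normal: removed=" + removedNormal + ", scans=" + globCalls);
    }
}
```

In the real CopyCommitter the flag would presumably be read from the job configuration rather than passed as a parameter; how that is wired up is left open here.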