MapReduce Output File Names

dam6923 Fri, 27 Jul 2018 11:50:16 -0700

Hello,

When Hive MapReduce jobs create HDFS output files, they use the format:


000000_0.gz
000000_0.gz_copy_1
000000_0.gz_copy_2
000000_0.gz_copy_3
...

This seems like it could become a long-running list of copies over
time.  In fact, the code says "leave the below loop for now until a
better approach is found."

https://github.com/apache/hive/blob/758ff449099065a84c46d63f9418201c8a6731b1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L3710

Would it be problematic to simply prefix a random number, or
timestamp, on the front of the file name to make it unique?  This
would save the code from having to loop to ask the FileSystem
(NameNode) "is copy 1 there?", "is copy 2 there?", "is copy 3 there?"
etc.
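
To make the comparison concrete, here is a rough sketch of the two
naming strategies.  It is not Hive's actual code: the class and method
names are made up for illustration, and an in-memory Set stands in for
the directory listing, so each call to exists() represents what would
be one FileSystem (NameNode) lookup.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch; names here are illustrative, not taken from Hive.
public class OutputNameSketch {
    // Counts simulated existence checks; on HDFS each one would be a NameNode call.
    static int probes = 0;

    // Stand-in for a FileSystem existence check against a directory.
    static boolean exists(Set<String> dir, String name) {
        probes++;
        return dir.contains(name);
    }

    // Probing approach: try base, then base_copy_1, base_copy_2, ... until a name is free.
    static String nextCopyName(Set<String> dir, String base) {
        String candidate = base;
        for (int i = 1; exists(dir, candidate); i++) {
            candidate = base + "_copy_" + i;
        }
        return candidate;
    }

    // Prefix approach: a random UUID makes a collision extremely unlikely, so no probing.
    static String uniqueName(String base) {
        return UUID.randomUUID() + "_" + base;
    }

    public static void main(String[] args) {
        Set<String> dir = new HashSet<>();
        // Write the "same" output file four times using the probing approach.
        for (int n = 0; n < 4; n++) {
            dir.add(nextCopyName(dir, "000000_0.gz"));
        }
        System.out.println(dir);
        System.out.println("existence checks: " + probes);
        System.out.println(uniqueName("000000_0.gz"));
    }
}
```

In this sketch the k-th write of the same file needs k existence
checks, so four writes cost 1 + 2 + 3 + 4 = 10 checks in total, while
the prefixed name needs none.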

Thanks.
