We had a Hive MR job writing to a table stored in the ORC file format. The cluster ran the fair scheduler with task preemption enabled. For one of the job's reduce tasks, task attempt 0 finished writing its output file to the final location but was killed by preemption before it could report completion. All subsequent attempts of that task then ran and failed, complaining that the output file already existed:
Caused by: java.io.IOException: File already exists:...
        ...
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:538)
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.ensureWriter(WriterImpl.java:1320)
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1337)
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:173)
        at org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:162)
        at org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:151)
        at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:1475)
        at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:88)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:688)
        ... 12 more

I think the problem happens with ORC but not with text or sequence files because of the following:

1. org.apache.hadoop.hive.ql.io.orc.WriterImpl.ensureWriter:

       rawWriter = fs.create(path, false, HDFS_BUFFER_SIZE,

2. org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter:

       final OutputStream outStream = Utilities.createCompressedStream(jc,
           fs.create(outPath), isCompressed);

3. org.apache.hadoop.io.SequenceFile.BlockCompressWriter.BlockCompressWriter:

       fs.create(name, true, bufferSize, replication, blockSize, progress),

In #2 and #3, fs.create is called with overwrite == true (#2 uses the single-argument overload, where overwrite defaults to true; #3 passes true explicitly as the second parameter). In #1, fs.create is called with overwrite == false. So it appears the fix is to change that false to true. Right?

The time window between writing the final output file and completing the task attempt is very narrow, which is why we have seen this problem only rarely. One can imagine the same problem occurring if the task attempt's JVM or machine crashed in that window instead of the attempt being preempted.

The vast majority of Hive jobs on this cluster write to text or sequence files, yet to our knowledge this problem has never occurred with either format. This gives me more confidence that the issue is ORC-specific.

Thanks.
Steven
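
P.S. For anyone who wants to see this failure mode in isolation, here is a minimal sketch of the overwrite-flag difference using the plain FileSystem API. The class name and path are hypothetical; it assumes the Hadoop client jars are on the classpath, and FileSystem.get resolves to the local file system unless fs.defaultFS points at a cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OverwriteDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/tmp/overwrite-demo/part-r-00000"); // hypothetical path

        // "Attempt 0" writes the file and is then killed; the file stays behind.
        fs.create(out, true).close();

        // Retry with overwrite == true (the text/sequence file code paths):
        // succeeds, silently replacing the leftover file.
        fs.create(out, true).close();

        // Retry with overwrite == false (the ORC WriterImpl code path):
        // throws IOException "File already exists", which is what the
        // failed task attempts reported.
        try {
          fs.create(out, false).close();
        } catch (java.io.IOException e) {
          System.out.println("retry failed: " + e.getMessage());
        }

        fs.delete(new Path("/tmp/overwrite-demo"), true); // clean up
      }
    }

Run against either the local file system or HDFS, the overwrite == false retry is the only one that fails, which matches the behavior described above.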