We had a Hive MR job writing to a table stored in the ORC file format. The
cluster ran the fair share scheduler with task preemption enabled. For one
of the job's reduce tasks, task attempt 0 finished writing its output file
to the final location but was killed by preemption before it could
complete. All subsequent attempts of that task then ran and failed,
complaining that the output file already existed:

Caused by: java.io.IOException: File already exists:...
    ...
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:538)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.ensureWriter(WriterImpl.java:1320)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1337)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:173)
    at org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:162)
    at org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:151)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:1475)
    at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:88)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:688)
    ... 12 more

I think the problem happens with ORC but not with text or sequence files
because of how each format calls fs.create:

1. org.apache.hadoop.hive.ql.io.orc.WriterImpl.ensureWriter:
      rawWriter = fs.create(path, false, HDFS_BUFFER_SIZE,

2. org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter:
    final OutputStream outStream = Utilities.createCompressedStream(jc, fs
        .create(outPath), isCompressed);

3. org.apache.hadoop.io.SequenceFile.BlockCompressWriter.BlockCompressWriter:
      fs.create(name, true, bufferSize, replication, blockSize, progress),

In #3, fs.create is called with the 2nd parameter overwrite == true, and in
#2 the short fs.create(outPath) overload is used, where overwrite defaults
to true. In #1, fs.create is called with overwrite == false. So it appears
the fix is to change that false to true. Right?
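
To illustrate the difference, here is a minimal, self-contained sketch of
the FileSystem.create overwrite semantics. The class name, the use of the
local filesystem, and the path are my own illustrative choices, not Hive
code; the point is only that a second create with overwrite == false fails
on a leftover file, where overwrite == true would succeed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateOverwriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.getLocal(new Configuration());
        Path path = new Path("/tmp/create-overwrite-demo/part-00000");

        // Simulate the output file left behind by the preempted attempt 0.
        fs.create(path, true).close();

        // overwrite == true, as in #2 (by default) and #3: succeeds.
        fs.create(path, true).close();

        // overwrite == false, as in #1: throws
        // java.io.IOException: File already exists: ...
        fs.create(path, false).close();
    }
}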

The time window between writing the final output file and completing the
task attempt is very narrow, which is why we have rarely seen this problem.
One can imagine that the same failure would occur if the task attempt's JVM
or machine crashed in that window instead of being preempted.

The vast majority of Hive jobs on this cluster write to text or sequence
files, yet to our knowledge this problem never occurred with either format.
This gives me more confidence that it is ORC-specific.

Thanks.
Steven
