HadoopOutputFormat has issues with LocalExecutionEnvironment?

Ken Krugler Tue, 25 Aug 2020 16:11:25 -0700

Hi devs,

In HadoopOutputFormat.close(), I see code that tries to rename 
<outputPath>/tmp-r-00001 to <outputPath>/1.

But when I run my Flink 1.9.2 code using a local MiniCluster, the actual 
location of the tmp-r-00001 file is:

<outputPath>/_temporary/0/task__0000_r_000001/tmp-r-00001

I think this is because the default behavior of Hadoop’s FileOutputCommitter 
(with mapreduce.fileoutputcommitter.algorithm.version == 1) is to put files in 
task-specific sub-dirs.

It then relies on a post-completion “merge paths” step, which in Hadoop is 
performed by the Application Master.

I assume that when running on a real cluster, the 
HadoopOutputFormat.finalizeGlobal() method’s call to commitJob() would do this, 
but it doesn’t seem to be happening when I run locally.

If I set the algorithm version to 2, then “merge paths” is handled by 
FileOutputCommitter immediately, and the HadoopOutputFormat code finds files in 
the expected location.
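
To make the difference concrete, here is a rough sketch of the two layouts. 
It uses only java.nio, not Hadoop's FileOutputCommitter, and the class and 
method names are made up for illustration. It also collapses Hadoop's 
per-attempt directories into a single task directory, so treat it as a 
simplified model of the behavior described above rather than the real code 
path. The setting I changed is mapreduce.fileoutputcommitter.algorithm.version.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical, simplified model of the two commit algorithms (not Hadoop code).
final class CommitLayoutSketch {

    // "Merge paths": move every entry of 'from' directly into 'to'.
    static void mergePaths(Path from, Path to) throws IOException {
        List<Path> entries;
        try (Stream<Path> stream = Files.list(from)) {
            entries = stream.collect(Collectors.toList());
        }
        for (Path entry : entries) {
            Files.move(entry, to.resolve(entry.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Task commit: the file starts in a task-specific sub-dir.
    // Version 1 leaves it there; version 2 merges it into the output dir at once.
    static void commitTask(Path output, String taskDir, String fileName, int version)
            throws IOException {
        Path dir = output.resolve("_temporary").resolve("0").resolve(taskDir);
        Files.createDirectories(dir);
        Files.createFile(dir.resolve(fileName));
        if (version == 2) {
            mergePaths(dir, output);
        }
    }

    // Job commit: merge whatever is still sitting in task sub-dirs, then clean up.
    static void commitJob(Path output) throws IOException {
        Path pending = output.resolve("_temporary").resolve("0");
        if (!Files.isDirectory(pending)) {
            return;
        }
        List<Path> taskDirs;
        try (Stream<Path> stream = Files.list(pending)) {
            taskDirs = stream.filter(Files::isDirectory).collect(Collectors.toList());
        }
        for (Path taskDir : taskDirs) {
            mergePaths(taskDir, output);
            Files.delete(taskDir);
        }
        Files.delete(pending);
        Files.delete(pending.getParent());
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("commit-sketch");
        commitTask(out, "task__0000_r_000001", "tmp-r-00001", 1);
        // With version 1 the file is not yet directly under the output dir.
        System.out.println("after task commit: " + Files.exists(out.resolve("tmp-r-00001")));
        commitJob(out);
        System.out.println("after job commit:  " + Files.exists(out.resolve("tmp-r-00001")));
    }
}
```

In this model, a rename of <outputPath>/tmp-r-00001 attempted between task 
commit and job commit only finds the file when version is 2, which matches 
what I see locally.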

Wondering if Flink should always be using version 2 of the algorithm, as that’s 
more performant when there are a lot of results (which is why it was added).

Thanks,

— Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
