Hi Ken,

Sorry for the late reply. This could be a bug in Flink. Does the issue also
occur on Flink 1.11?
Have you set a breakpoint in HadoopOutputFormat.finalizeGlobal() when running
locally to validate that this method doesn't get called?

What do you mean by "algorithm version 2"? Where can you set this? (Sorry
for the question, I'm not an expert on Hadoop's FileOutputCommitter.)

Note to others: There's a related discussion here:
https://issues.apache.org/jira/browse/FLINK-19069

Best,
Robert


On Wed, Aug 26, 2020 at 1:10 AM Ken Krugler <kkrugler_li...@transpac.com>
wrote:

> Hi devs,
>
> In HadoopOutputFormat.close(), I see code that tries to rename
> <outputPath>/tmp-r-00001 to <outputPath>/1
>
> But when I run my Flink 1.9.2 code using a local MiniCluster, the actual
> location of the tmp-r-00001 file is:
>
> <outputPath>/_temporary/0/task__0000_r_000001/tmp-r-00001
>
> I think this is because the default behavior of Hadoop’s
> FileOutputCommitter (with algorithm == 1) is to put files in task-specific
> sub-dirs.
>
> It relies on a post-completion “merge paths” action being taken by what is
> (for Hadoop) the Application Master.
>
> I assume that when running on a real cluster, the
> HadoopOutputFormat.finalizeGlobal() method’s call to commitJob() would do
> this, but it doesn’t seem to be happening when I run locally.
>
> If I set the algorithm version to 2, then “merge paths” is handled by
> FileOutputCommitter immediately, and the HadoopOutputFormat code finds
> files in the expected location.
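>
> For reference, I set it via the Hadoop property
> mapreduce.fileoutputcommitter.algorithm.version on the JobConf that I pass
> to HadoopOutputFormat, roughly like this (using the mapred API classes; the
> TextOutputFormat, key/value types, and output path are just placeholders for
> my actual job):
>
>     JobConf jobConf = new JobConf();
>     // Switch FileOutputCommitter to the "version 2" commit algorithm
>     // (available in Hadoop 2.7+).
>     jobConf.setInt("mapreduce.fileoutputcommitter.algorithm.version", 2);
>     FileOutputFormat.setOutputPath(jobConf, new Path(outputPath));
>     HadoopOutputFormat<Text, IntWritable> hadoopOF =
>         new HadoopOutputFormat<>(new TextOutputFormat<Text, IntWritable>(), jobConf);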
>
> I'm wondering whether Flink should always use version 2 of the algorithm, as
> that's more performant when there are a lot of results (which is why it was
> added).
>
> Thanks,
>
> — Ken
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
