On 4 November 2016 at 17:09:25, Josh (jof...@gmail.com) wrote:
> Thanks, I didn't know about the -z flag!
>  
> I haven't been able to get it to work though (using yarn-cluster, with a
> zookeeper root configured to /flink in my flink-conf.yaml)
>  
> I can see my job directory in ZK under
> /flink/application_1477475694024_0015 and I've tried a few ways to restore
> the job:
>  
> ./bin/flink run -m yarn-cluster -yz /application_1477475694024_0015 ....
> ./bin/flink run -m yarn-cluster -yz application_1477475694024_0015 ....
> ./bin/flink run -m yarn-cluster -yz /flink/application_1477475694024_0015/
> ....
> ./bin/flink run -m yarn-cluster -yz /flink/application_1477475694024_0015
> ....
>  
> The job starts from scratch each time, without restored state.
>  
> Am I doing something wrong? I've also tried with -z instead of -yz but I'm
> using yarn-cluster to run a single job, so I think it should be -yz.

Can you please check the JobManager logs of the initial job that you want to 
resume and look for a line like this:


Using '.../flink/application_...' as Zookeeper namespace.

Now you need to set the part after 'flink/' as the namespace, probably 
"application_1477475694024_0015" (from your last message).

The flag should be just -z. You can also set it in the Flink config file:

high-availability.cluster-id: application_1477475694024_0015
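
For example, with that cluster ID the resume command should look roughly like 
this (same jar and program arguments as before):

./bin/flink run -m yarn-cluster -z application_1477475694024_0015 ....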

Does this help?

---

There is also a new feature in Flink 1.2 that allows you to persist every 
checkpoint externally. The feature is already merged, but its configuration will 
still be adjusted (https://github.com/apache/flink/pull/2752).

Currently you can configure it by specifying a checkpoint directory manually 
via:

state.checkpoints.dir: hdfs:///flink/checkpoints

In the CheckpointConfig, you enable it via:

CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
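
For completeness, here is a minimal self-contained sketch of how the two parts 
fit together in a job (Flink 1.2 APIs; the class name, checkpoint interval and 
job name are just placeholders):

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExternalizedCheckpointsExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Regular checkpointing still needs to be enabled (interval in ms).
        env.enableCheckpointing(10000);

        // Retain the externalized checkpoint when the job is cancelled,
        // so it can be used to resume the job later.
        CheckpointConfig config = env.getCheckpointConfig();
        config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... add sources, transformations and sinks here ...

        env.execute("externalized-checkpoints-example");
    }
}

With that in place, the directory configured via state.checkpoints.dir holds the 
externalized checkpoint metadata.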


– Ufuk




