OK, I ran another test. I launched two identical jobs, one after the other, on YARN (without the long-running session). I then killed a job manager; both jobs ran into problems and resumed their work after a few seconds. The problem is that the first job restored the state of the second job and vice versa.
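Roughly, these were the steps (the jar name, memory settings and ids below are placeholders, not the exact values I used):

    # submit the same streaming job twice, one after the other, each as its
    # own per-job YARN cluster (each "flink run" kept running in its own shell)
    flink run -m yarn-cluster -yn 2 -yjm 1024 -ytm 1024 counter-job.jar
    flink run -m yarn-cluster -yn 2 -yjm 1024 -ytm 1024 counter-job.jar

    # find the two applications and the node hosting one of the JobManagers
    yarn application -list

    # on that node, kill the JobManager (ApplicationMaster) process
    kill -9 <jobmanager-pid>

    # collect the logs afterwards (log aggregation permitting)
    yarn logs -applicationId <application-id>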
Here are the logs: https://gist.github.com/chobeat/38f8eee753aeaca51acc

At line 141 of the first job's log and at line 131 of the second job's log, I killed the job manager. As you can see, the first stopped at 48 and resumed at 39, while the second stopped at 38 and resumed at 48.

I hope there's something wrong with my configuration, because otherwise this really looks like a bug.

Thanks in advance,

Simone

2016-03-16 18:55 GMT+01:00 Simone Robutti <simone.robu...@radicalbit.io>:
> Actually the test was intended for a single job. The fact that there are more jobs is unexpected, and it will be the first thing to verify. Given these problems, we will go for deeper tests with multiple jobs.
>
> The logs are collected with "yarn logs", but log aggregation is not properly configured, so I wouldn't rely too much on that. Before doing the tests tomorrow I will clear all the existing logs, just to be sure.
>
> 2016-03-16 18:19 GMT+01:00 Ufuk Celebi <u...@apache.org>:
>> OK, so you are submitting multiple jobs, but you submit them with -m yarn-cluster and therefore expect them to start separate YARN clusters. Makes sense, and I would expect the same.
>>
>> I think you can check in the client logs printed to stdout which cluster the job is submitted to.
>>
>> PS: The logs you have shared are out of order. How did you gather them? Do you have an idea why they are out of order? Maybe something is mixed up in the way we gather the logs, and we only think that something is wrong because of this.
>>
>> On Wed, Mar 16, 2016 at 6:11 PM, Simone Robutti <simone.robu...@radicalbit.io> wrote:
>> > I didn't resubmit the job. Also, the jobs are submitted one by one with -m yarn-cluster, not with a long-running YARN session, so I don't really know if they could mix up.
>> >
>> > I will repeat the test with a cleaned state, because we saw that killing the job with "yarn application -kill" left the "flink run" process alive, so that may be the problem. We only noticed it a few minutes ago.
>> >
>> > If the problem persists, I will eventually come back with a full log.
>> >
>> > Thanks for now,
>> >
>> > Simone
>> >
>> > 2016-03-16 18:04 GMT+01:00 Ufuk Celebi <u...@apache.org>:
>> >> Hey Simone,
>> >>
>> >> from the logs it looks like multiple jobs have been submitted to the cluster, not just one. The different files correspond to different jobs recovering. The filtered logs show three jobs running/recovering (with IDs 10d8ccae6e87ac56bf763caf4bc4742f, 124f29322f9026ac1b35435d5de9f625, 7f280b38065eaa6335f5c3de4fc82547).
>> >>
>> >> Did you manually re-submit the job after killing a job manager?
>> >>
>> >> Regarding the counts, it can happen that they are rolled back to a previous consistent state if the checkpoint was not completed yet (including the write to ZooKeeper). In that case the job state will be rolled back to an earlier consistent state.
>> >>
>> >> Can you please share the complete job manager logs of your program? The most helpful thing would be to have a log for each started job manager container. I don't know if that is easily possible.
>> >>
>> >> – Ufuk
>> >>
>> >> On Wed, Mar 16, 2016 at 4:12 PM, Simone Robutti <simone.robu...@radicalbit.io> wrote:
>> >> > This is the log filtered to check messages from ZooKeeperCompletedCheckpointStore:
>> >> >
>> >> > https://gist.github.com/chobeat/0222b31b87df3fa46a23
>> >> >
>> >> > It looks like it finds only one checkpoint, but I'm not sure whether the different hashes and IDs of the checkpoints are meaningful or not.
>> >> >
>> >> > 2016-03-16 15:33 GMT+01:00 Ufuk Celebi <u...@apache.org>:
>> >> >> Can you please have a look into the JobManager log file and report which checkpoints are restored? You should see messages from ZooKeeperCompletedCheckpointStore like:
>> >> >> - Found X checkpoints in ZooKeeper
>> >> >> - Initialized with X. Removing all older checkpoints
>> >> >>
>> >> >> You can share the complete job manager log file as well if you like.
>> >> >>
>> >> >> – Ufuk
>> >> >>
>> >> >> On Wed, Mar 16, 2016 at 2:50 PM, Simone Robutti <simone.robu...@radicalbit.io> wrote:
>> >> >> > Hello,
>> >> >> >
>> >> >> > I'm testing the checkpointing functionality with HDFS as a backend.
>> >> >> >
>> >> >> > From what I can see, it uses different checkpoint files and resumes the computation from different points, not from the latest available one. This is unexpected behaviour to me.
>> >> >> >
>> >> >> > Every second, for every worker, I log a counter that is increased by 1 at each step.
>> >> >> >
>> >> >> > So for example on node-1 the count goes up to 5, then I kill a job manager or task manager and it resumes from 5 or 4, and that's OK. The next time I kill a job manager the count is at 15 and it resumes at 14 or 15. Sometimes it may happen that after a third kill the work resumes at 4 or 5, as if the checkpoint resumed the second time wasn't there.
>> >> >> >
>> >> >> > Once I even saw it jump forward: the first kill was at 10 and it resumed at 9, the second kill was at 70 and it resumed at 9, the third kill was at 15 but it resumed at 69, as if it resumed from the second kill's checkpoint.
>> >> >> >
>> >> >> > This is clearly inconsistent.
>> >> >> >
>> >> >> > Also, in the logs I can find that sometimes it uses a checkpoint file different from the previous, consistent resume.
>> >> >> >
>> >> >> > What am I doing wrong? Is it a known bug?
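
P.S. For reference, these are the configuration keys that seem relevant to the recovery behaviour above. This is only a sketch based on the Flink 1.0 documentation as I remember it, with placeholder values rather than my actual settings, so please double-check the key names against your version:

    # flink-conf.yaml (placeholder values)
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: zk-host:2181
    # As far as I understand, every cluster pointing at the same quorum uses
    # this same root path unless it is overridden per cluster.
    recovery.zookeeper.path.root: /flink
    recovery.zookeeper.storageDir: hdfs:///flink/recovery
    state.backend: filesystem
    state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
    # lets YARN restart the ApplicationMaster (JobManager) after it is killed
    yarn.application-attempts: 10

If the two per-job clusters really do share the same ZooKeeper root path, that might explain how one job could pick up the other job's checkpoints, but I can't confirm that from the logs alone.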