Hi Yun,

Yes, the job’s status changed to RUNNING pretty quickly after the failure (~1 min).

As soon as the status changed to RUNNING, the first checkpoint kicked off,
and it took 30 minutes. I need exactly-once because I maintain some
aggregation metrics. Do you know what is different between the first
checkpoint and the checkpoints after it? (They are fairly quick after that.)

Here are the sizes of my checkpoints (I configured Flink to keep the 5
latest checkpoints; see the sketch after the list):
449M    chk-1626
775M    chk-1627
486M    chk-1628
7.8G    chk-1629
7.5G    chk-1630
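
For reference, a minimal sketch of that retention setting, assuming I set
it via the standard flink-conf.yaml option (only the relevant line shown):

    # flink-conf.yaml: keep the 5 most recent completed checkpoints
    state.checkpoints.num-retained: 5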

I don't know why the sizes are so different.
The checkpoint metrics look good: besides the spike in the first
checkpoint, everything looks fine.

@Vino: Yes, I can try switching the log level to DEBUG to see if I get any
more information.
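
A minimal sketch of that change, assuming the stock conf/log4j.properties
shipped with Flink (the appender name after the comma may differ in my
setup):

    # conf/log4j.properties: raise the root logger from INFO to DEBUG
    log4j.rootLogger=DEBUG, file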


On Thu, Sep 6, 2018 at 7:09 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Trung,
>
> Can you provide more information to help pin down the problem? For
> example, the size of the state generated by a checkpoint and more log
> information; you can try switching the log level to DEBUG.
>
> Thanks, vino.
>
> Yun Tang <myas...@live.com> wrote on Thursday, September 6, 2018 at 7:42 PM:
>
>> Hi Kien
>>
>> From your description, your job had already started to execute a
>> checkpoint after the job failover, which means your job was in RUNNING
>> status. From my point of view, the actual recovery time is the time the
>> job spends in the status transitions RESTARTING->CREATED->RUNNING[1].
>> Your trouble sounds more like the first checkpoint after the failover
>> taking a long time to complete. AFAIK, this is probably because your job
>> is heavily backpressured after the failover and the checkpoint mode is
>> exactly-once: some operators need to receive the checkpoint barriers from
>> all of their inputs before they can trigger the checkpoint. You can watch
>> your checkpoint alignment time metrics to verify the root cause, and if
>> you do not need the exactly-once guarantee, you can change the checkpoint
>> mode to at-least-once[2]; a sketch follows below.
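>>
>> A minimal sketch of that switch, assuming the Java DataStream API (the
>> 60-second checkpoint interval is only an example value, not your actual
>> setting):
>>
>>     import org.apache.flink.streaming.api.CheckpointingMode;
>>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>>     StreamExecutionEnvironment env =
>>         StreamExecutionEnvironment.getExecutionEnvironment();
>>     // At-least-once skips barrier alignment: input channels are not
>>     // blocked while waiting for the remaining barriers, so a checkpoint
>>     // can complete faster under backpressure, at the cost of possible
>>     // duplicates in state after recovery.
>>     env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);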
>>
>> Best
>> Yun Tang
>>
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html#jobmanager-data-structures
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html#exactly-once-vs-at-least-once
>>
>>
>> ------------------------------
>> *From:* trung kien <kient...@gmail.com>
>> *Sent:* Thursday, September 6, 2018 18:50
>> *To:* user@flink.apache.org
>> *Subject:* Flink failure recovery takes a very long time
>>
>> Hi all,
>>
>> I am trying to test failure recovery of a Flink job when a JM or TM goes
>> down.
>> Our target is to have the job restart automatically and return to a
>> normal condition in any case.
>>
>> However, what I am seeing is very strange, and I hope someone here can
>> help me understand it.
>>
>> When the JM or TM went down, I saw the job being restarted, but as soon
>> as it restarted it started working on a checkpoint that usually took 30+
>> minutes to finish (in the normal case a checkpoint only takes 1-2
>> minutes). As soon as that checkpoint finished, the job was back to a
>> normal condition.
>>
>> I'm using 1.4.2, but I am seeing a similar thing on 1.6.0 as well.
>>
>> Could anyone please help explain this behavior? We really want to reduce
>> the recovery time, but we cannot seem to find any document that describes
>> the recovery process in detail.
>>
>> Any help is really appreciated.
>>
>>
>> --
>> Thanks
>> Kien
>>
--
Thanks
Kien
