Hi Yun,

Yes, the job's status changes to Running pretty fast after a failure (~1 min). As soon as the status changes to Running, the first checkpoint kicks off, and it took 30 minutes. I need exactly-once semantics because I maintain some aggregation metrics. Do you know what the difference is between the first checkpoint and the checkpoints after it? (They are fairly quick after that.)
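For context, my checkpoint setup looks roughly like this (a simplified sketch, not my exact job code; the 2-minute interval here is a placeholder):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Exactly-once mode: each operator waits until it has received checkpoint
    // barriers from all of its input channels before snapshotting, so heavy
    // backpressure can stall the checkpoint.
    env.enableCheckpointing(120000L, CheckpointingMode.EXACTLY_ONCE);

    // At-least-once would skip the barrier alignment, but I don't think
    // I can use it because of the aggregations:
    // env.enableCheckpointing(120000L, CheckpointingMode.AT_LEAST_ONCE);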
Here are the sizes of my checkpoints (I configured Flink to keep the 5 latest checkpoints):

449M chk-1626
775M chk-1627
486M chk-1628
7.8G chk-1629
7.5G chk-1630

I don't know why the sizes differ so much. The checkpoint metrics look good; besides the spike in the first checkpoint, everything looks fine.

@Vino: Yes, I can try switching to DEBUG to see if I get any more information.
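Concretely, I plan to raise the log level in conf/log4j.properties, roughly like this (assuming the default log4j setup that ships with Flink; scoping the logger to Flink's packages keeps the output smaller than raising the root logger):

    # conf/log4j.properties -- enable DEBUG logging for Flink classes only
    log4j.logger.org.apache.flink=DEBUG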
On Thu, Sep 6, 2018 at 7:09 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi trung,
>
> Can you provide more information to help locate the problem? For example,
> the size of the state generated by a checkpoint and more log information;
> you can try to switch the log level to DEBUG.
>
> Thanks, vino.
>
> Yun Tang <myas...@live.com> wrote on Thu, Sep 6, 2018 at 7:42 PM:
>
>> Hi Kien
>>
>> From your description, your job had already started to execute a
>> checkpoint after the job failover, which means your job was in RUNNING
>> status. From my point of view, the actual recovery time is the time the
>> job spends going through the statuses RESTARTING -> CREATED -> RUNNING [1].
>> Your trouble sounds more like the long time needed for the first
>> checkpoint to complete after the job failover. AFAIK, it's probably
>> because your job is heavily backpressured after the failover and the
>> checkpoint mode is exactly-once, so some operators need to receive all
>> the input checkpoint barriers before triggering the checkpoint. You can
>> watch your checkpoint alignment time metrics to verify the root cause,
>> and if you do not need the exactly-once guarantee, you can change the
>> checkpoint mode to at-least-once [2].
>>
>> Best
>> Yun Tang
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html#jobmanager-data-structures
>> [2] https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html#exactly-once-vs-at-least-once
>>
>> ------------------------------
>> *From:* trung kien <kient...@gmail.com>
>> *Sent:* Thursday, September 6, 2018 18:50
>> *To:* user@flink.apache.org
>> *Subject:* Flink failure recovery took a very long time
>>
>> Hi all,
>>
>> I am trying to test the failure recovery of a Flink job when a JM or TM
>> goes down. Our goal is to have the job restart automatically and return
>> to normal operation in any case.
>>
>> However, what I am seeing is very strange, and I hope someone here can
>> help me understand it.
>>
>> When the JM or TM went down, I saw the job being restarted, but as soon
>> as it restarted it started working on a checkpoint, which usually took
>> 30+ minutes to finish (in the normal case a checkpoint only takes 1-2
>> minutes). As soon as the checkpoint finished, the job went back to normal.
>>
>> I'm using 1.4.2, but I see a similar thing on 1.6.0 as well.
>>
>> Could anyone please help explain this behavior? We really want to reduce
>> the recovery time, but I can't find any document that describes the
>> recovery process in detail.
>>
>> Any help is really appreciated.
>>
>> --
>> Thanks
>> Kien

--
Thanks
Kien