To be honest, from my point of view, the current description should already
give enough explanation [1] in the "Overview Tab" section.
Latest Completed Checkpoint: The latest successfully completed checkpoint.
Latest Restore: There are two types of restore operations.
* Restore from Checkpoint: We restored from a regular periodic checkpoint.
* Restore from Savepoint: We restored from a savepoint.
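Both entries can be read directly from the checkpoints REST response. The sketch below is only illustrative: the field names mirror the documented response, but the sample dict and its paths are made up.

```python
# Minimal sketch: extract the two "latest" entries from a
# jobs/:jobid/checkpoints REST response.
def latest_entries(response):
    latest = response.get("latest", {})
    # Last successfully completed checkpoint (or None).
    completed = latest.get("completed")
    # Checkpoint/savepoint the job last restored from
    # (None if the job never restored).
    restored = latest.get("restored")
    return completed, restored

# Hypothetical response fragment; paths are placeholders.
sample = {
    "latest": {
        "completed": {"id": 7094, "is_savepoint": False,
                      "external_path": "file:/tmp/chk-7094"},
        "restored": {"id": 7093, "is_savepoint": False,
                     "external_path": "file:/tmp/chk-7093"},
    }
}

completed, restored = latest_entries(sample)
print(completed["id"], restored["id"])  # 7094 7093
```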
You could still create a JIRA issue and share your ideas there. If it is
agreed in that ticket to work on it, you can create a PR to edit
checkpoint_monitoring.md [2] and checkpoint_monitoring.zh.md [3] to update the
related documentation.
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/checkpoint_monitoring.html#overview-tab
[2]
https://github.com/apache/flink/blob/master/docs/monitoring/checkpoint_monitoring.md
[3]
https://github.com/apache/flink/blob/master/docs/monitoring/checkpoint_monitoring.zh.md
Best
Yun Tang
________________________________
From: Vijay Bhaskar <[email protected]>
Sent: Tuesday, May 26, 2020 15:18
To: Yun Tang <[email protected]>
Cc: user <[email protected]>
Subject: Re: Inconsistent checkpoint API response
Thanks Yun. How can I contribute better documentation for this by opening a
Jira?
Regards
Bhaskar
On Tue, May 26, 2020 at 12:32 PM Yun Tang <[email protected]> wrote:
Hi Bhaskar
I think I have understood your scenario now, and this is the expected behavior
in Flink.
Since you only allow your job to restore 5 times, "restored" records the
checkpoint used at the 5th recovery, and that checkpoint id stays there.
"Restored" refers to the last restored checkpoint and "completed" to the last
completed checkpoint; they are not the same thing.
The only scenario in which they carry the same id is when Flink has just
restored successfully and no new checkpoint has completed yet.
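That rule can be sketched as a small check over the "latest" section of the REST response; the ids below are illustrative, not taken from a real job:

```python
# "restored" and "completed" ids coincide only in the window right after
# a successful restore, before the next checkpoint completes.
def same_checkpoint(latest):
    restored = latest.get("restored")
    completed = latest.get("completed")
    if restored is None or completed is None:
        return False
    return restored["id"] == completed["id"]

# Just restored from chk-7093, no new checkpoint completed yet:
print(same_checkpoint({"restored": {"id": 7093},
                       "completed": {"id": 7093}}))  # True
# chk-7094 completed after the restore, so the ids diverge again:
print(same_checkpoint({"restored": {"id": 7093},
                       "completed": {"id": 7094}}))  # False
```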
Best
Yun Tang
________________________________
From: Vijay Bhaskar <[email protected]>
Sent: Tuesday, May 26, 2020 12:19
To: Yun Tang <[email protected]>
Cc: user <[email protected]>
Subject: Re: Inconsistent checkpoint API response
Hi Yun
Now I understand the issue:
"restored" always shows only the checkpoint that was used to restore the
previous state.
For every attempt < 6 (in my case the max attempts are 5, so 6 is the last
attempt), Flink HA restores the state, so "restored" and "latest" have the
same value.
If the last attempt == 6:
The Flink job has already made a few checkpoints.
Then the job failed, Flink HA gave up, and the job state was marked "FAILED".
At this point the "restored" value is the one from the 5th attempt, but
"latest" is the most recent retained checkpoint.
Shall I file a documentation-improvement Jira? I want to add more
documentation based on the above scenarios.
Regards
Bhaskar
On Tue, May 26, 2020 at 8:14 AM Yun Tang <[email protected]> wrote:
Hi Bhaskar
It seems I still do not fully understand your case (5). Your job failed 6
times and recovered from a previous checkpoint to restart again; however, you
found that the REST API gave the wrong answer.
How do you know the "restored" field is giving the wrong checkpoint file,
i.e. one that is not the latest? Have you checked the JobManager log for the
message "Restoring job xxx from latest valid checkpoint: x@xxxx" [1] to see
exactly which checkpoint was chosen for the restore?
I think a more concrete example, e.g. the expected vs. actual checkpoint to
restore from, would help tell your story.
[1]
https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250
Best
Yun Tang
________________________________
From: Vijay Bhaskar <[email protected]>
Sent: Monday, May 25, 2020 17:01
To: Yun Tang <[email protected]>
Cc: user <[email protected]>
Subject: Re: Inconsistent checkpoint API response
Thanks Yun.
Here is the problem I am facing:
I am using the jobs/:jobID/checkpoints API to recover failed jobs. We have a
remote manager that monitors the jobs, and we use the "restored" field of the
API response to get the latest checkpoint file. It gives the correct
checkpoint file in the first 4 cases but not in the 5th, where the "restored"
field gives a checkpoint file that is not the latest. When we compare it with
the checkpoint file returned by the "completed" field, both give identical
checkpoints in all cases except the 5th.
We can't use the Flink UI because of security reasons.
Regards
Bhaskar
On Mon, May 25, 2020 at 12:57 PM Yun Tang <[email protected]> wrote:
Hi Vijay
If I understand correctly, do you mean the last "restored" checkpoint is null
in the REST API after the job failed 6 times and then recovered successfully,
with several more successful checkpoints afterwards?
First of all, if your job has just recovered successfully, can you observe
the "last restored" checkpoint in the web UI?
Secondly, for how long can you not see the "restored" field after a
successful recovery?
Last but not least, I cannot see the real difference among your cases; what
is the core difference in case (5)?
From Flink's implementation, the checkpoint statistics are created without a
restored checkpoint, and the restored checkpoint is assigned once the latest
savepoint/checkpoint is restored. [1]
[1]
https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285
Best
Yun Tang
________________________________
From: Vijay Bhaskar <[email protected]>
Sent: Monday, May 25, 2020 14:20
To: user <[email protected]>
Subject: Inconsistent checkpoint API response
Hi
I am using Flink retained checkpoints along with the jobs/:jobid/checkpoints
API to retrieve the latest retained checkpoint.
Below is a sample response of the Flink checkpoints API.
My job's restart attempts are set to 5.
In the "latest" key of the checkpoint API response, the checkpoint file names
in the "restored" and "completed" values behave as follows:
1) Suppose the job failed 3 times and recovered on the 4th attempt; then both
values are the same.
2) Suppose the job failed 4 times and recovered on the 5th attempt; then both
values are the same.
3) Suppose the job failed 5 times and recovered on the 6th attempt; then both
values are the same.
4) Suppose the job failed all 6 times and was marked FAILED; then, too, both
values are the same.
5) Suppose the job failed a 6th time, after recovering from 5 attempts and
making a few checkpoints; then the two values are different.
In cases (1) through (4) I never had any issue. Only in case (5) did I have a
severe issue in production, because the checkpoint in the "restored" field
does not exist.
Please suggest.
{
"counts":{
"restored":6,
"total":3,
"in_progress":0,
"completed":3,
"failed":0
},
"summary":{
"state_size":{
"min":4879,
"max":4879,
"avg":4879
},
"end_to_end_duration":{
"min":25,
"max":130,
"avg":87
},
"alignment_buffered":{
"min":0,
"max":0,
"avg":0
}
},
"latest":{
"completed":{
"@class":"completed",
"id":7094,
"status":"COMPLETED",
"is_savepoint":false,
"trigger_timestamp":1590382502772,
"latest_ack_timestamp":1590382502902,
"state_size":4879,
"end_to_end_duration":130,
"alignment_buffered":0,
"num_subtasks":2,
"num_acknowledged_subtasks":2,
"tasks":{
},
"external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
"discarded":false
},
"savepoint":null,
"failed":null,
"restored":{
"id":7093,
"restore_timestamp":1590382478448,
"is_savepoint":false,
"external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
}
},
"history":[
{
"@class":"completed",
"id":7094,
"status":"COMPLETED",
"is_savepoint":false,
"trigger_timestamp":1590382502772,
"latest_ack_timestamp":1590382502902,
"state_size":4879,
"end_to_end_duration":130,
"alignment_buffered":0,
"num_subtasks":2,
"num_acknowledged_subtasks":2,
"tasks":{
},
"external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
"discarded":false
},
{
"@class":"completed",
"id":7093,
"status":"COMPLETED",
"is_savepoint":false,
"trigger_timestamp":1590382310195,
"latest_ack_timestamp":1590382310220,
"state_size":4879,
"end_to_end_duration":25,
"alignment_buffered":0,
"num_subtasks":2,
"num_acknowledged_subtasks":2,
"tasks":{
},
"external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
"discarded":false
},
{
"@class":"completed",
"id":7092,
"status":"COMPLETED",
"is_savepoint":false,
"trigger_timestamp":1590382190195,
"latest_ack_timestamp":1590382190303,
"state_size":4879,
"end_to_end_duration":108,
"alignment_buffered":0,
"num_subtasks":2,
"num_acknowledged_subtasks":2,
"tasks":{
},
"external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
"discarded":true
}
]
}
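For a monitoring client like the one described in this thread, a hypothetical helper that prefers "latest.completed" over "latest.restored" would sidestep case (5), since "restored" can lag behind the newest retained checkpoint. This is only a sketch against the response shape shown above, not Flink API code; the trimmed response and paths are placeholders:

```python
import json

# Hypothetical monitoring-side helper: choose the checkpoint path to
# recover from. Preferring "latest.completed" over "latest.restored"
# avoids reading a stale "restored" entry whose file may no longer exist.
def recovery_path(response):
    latest = response.get("latest") or {}
    completed = latest.get("completed")
    if completed and not completed.get("discarded", False):
        return completed["external_path"]
    restored = latest.get("restored")
    return restored["external_path"] if restored else None

# Trimmed-down version of a checkpoints API response:
resp = json.loads("""
{"latest": {
   "completed": {"id": 7094, "discarded": false,
                 "external_path": "file:/tmp/chk-7094"},
   "restored": {"id": 7093,
                "external_path": "file:/tmp/chk-7093"}}}
""")
print(recovery_path(resp))  # file:/tmp/chk-7094
```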