Thanks Yun. How can I contribute better documentation for this by opening a Jira?
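
For the documentation I was thinking of including a small example along these
lines, showing that a monitoring client which wants the newest retained
checkpoint should read "latest.completed" from jobs/:jobid/checkpoints and
treat "latest.restored" only as the checkpoint the job last restored from.
This is just a rough sketch in Python, not our actual monitor; the REST base
URL is a placeholder and the job id is taken from the response quoted below:

import json
from urllib.request import urlopen

FLINK_REST = "http://localhost:8081"          # placeholder REST endpoint
JOB_ID = "29ae7600aa4f7d53a0dc1a0a7b257c85"   # job id from the response quoted below

def latest_retained_checkpoint(base_url, job_id):
    # GET jobs/:jobid/checkpoints, same shape as the JSON quoted below
    with urlopen("%s/jobs/%s/checkpoints" % (base_url, job_id)) as resp:
        body = json.load(resp)
    latest = body["latest"]
    completed = latest.get("completed")
    if completed and not completed.get("discarded", False):
        # newest successfully completed, still retained checkpoint
        return completed["external_path"]
    restored = latest.get("restored")
    if restored:
        # only the checkpoint used for the last restore; may be older
        return restored["external_path"]
    raise RuntimeError("no checkpoint available for job " + job_id)

print(latest_retained_checkpoint(FLINK_REST, JOB_ID))
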
Regards
Bhaskar

On Tue, May 26, 2020 at 12:32 PM Yun Tang <myas...@live.com> wrote:

> Hi Bhaskar
>
> I think I have understood your scenario now. And I think this is what is
> expected in Flink.
> As you only allow your job to restore 5 times, "restored" would only
> record the checkpoint used at the 5th recovery, and that checkpoint id
> would always stay there.
>
> "Restored" is the last restored checkpoint and "completed" is the last
> completed checkpoint; they are actually not the same thing.
> The only scenario in which they have the same number is when Flink has
> just restored successfully before a new checkpoint completes.
>
> Best
> Yun Tang
>
> ------------------------------
> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
> *Sent:* Tuesday, May 26, 2020 12:19
> *To:* Yun Tang <myas...@live.com>
> *Cc:* user <user@flink.apache.org>
> *Subject:* Re: In consistent Check point API response
>
> Hi Yun
> Understood the issue now:
> "restored" always shows only the checkpoint that was used for restoring
> the previous state.
> For all attempts < 6 (in my case max attempts is 5, so 6 is the last
> attempt):
> Flink HA restores the state, so restored and latest have the same value.
> If the last attempt == 6:
> The Flink job already has a few checkpoints.
> After that the job failed, and Flink HA gave up and marked the job state
> as "FAILED".
> At this point the "restored" value is the one from the 5th attempt, but
> latest is the latest checkpoint which is retained.
>
> Shall I file a documentation improvement Jira? I want to add more
> documentation with the help of the above scenarios.
>
> Regards
> Bhaskar
>
> On Tue, May 26, 2020 at 8:14 AM Yun Tang <myas...@live.com> wrote:
>
> Hi Bhaskar
>
> It seems I still do not totally understand your case 5. Your job failed 6
> times, and recovered from the previous checkpoint to restart again.
> However, you found the REST API gave the wrong answer.
> How do you know the "restored" field is giving the wrong checkpoint file
> which is not the latest? Have you checked the JM log for the related
> message "Restoring job xxx from latest valid checkpoint: x@xxxx" [1] to
> know exactly which checkpoint was chosen to restore?
>
> I think you could give a more concrete example, e.g. the expected vs.
> actual checkpoint to restore, to tell your story.
>
> [1]
> https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250
>
> Best
> Yun Tang
> ------------------------------
> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
> *Sent:* Monday, May 25, 2020 17:01
> *To:* Yun Tang <myas...@live.com>
> *Cc:* user <user@flink.apache.org>
> *Subject:* Re: In consistent Check point API response
>
> Thanks Yun.
> Here is the problem I am facing:
>
> I am using the jobs/:jobID/checkpoints API to recover the failed job. We
> have a remote manager which monitors the jobs. We are using the "restored"
> field of the API response to get the latest checkpoint file to use. It
> gives the correct checkpoint file in all 4 cases except the 5th case,
> where the "restored" field gives a wrong checkpoint file which is not the
> latest. When we compare it with the checkpoint file returned by the
> "completed" field, both give identical checkpoints in all 4 cases except
> the 5th case.
> We can't use the Flink UI because of security reasons.
>
> Regards
> Bhaskar
>
> On Mon, May 25, 2020 at 12:57 PM Yun Tang <myas...@live.com> wrote:
>
> Hi Vijay
>
> If I understand correctly, do you mean your last "restored" checkpoint is
> null via the REST API when the job failed 6 times and then recovered
> successfully with several more successful checkpoints?
>
> First of all, if your job just recovered successfully, can you observe
> the "last restored" checkpoint in the web UI?
> Secondly, for how long can you not see the "restored" field after
> recovering successfully?
> Last but not least, I cannot see the real difference among your cases;
> what's the core difference in your case (5)?
>
> From the implementation of Flink, it will create the checkpoint
> statistics without a restored checkpoint and assign it once the latest
> savepoint/checkpoint is restored. [1]
>
> [1]
> https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285
>
> Best
> Yun Tang
>
> ------------------------------
> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
> *Sent:* Monday, May 25, 2020 14:20
> *To:* user <user@flink.apache.org>
> *Subject:* In consistent Check point API response
>
> Hi
> I am using Flink retained checkpoints along with the
> jobs/:jobid/checkpoints API for retrieving the latest retained checkpoint.
> The response of the Flink checkpoints API is pasted below.
>
> My job's restart attempts are set to 5.
> In the checkpoint API response, under the "latest" key, the checkpoint
> file names of the "restored" and "completed" values show the following
> behavior:
> 1) Suppose the job failed 3 times and recovered on the 4th attempt, then
> both values are the same.
> 2) Suppose the job failed 4 times and recovered on the 5th attempt, then
> both values are the same.
> 3) Suppose the job failed 5 times and recovered on the 6th attempt, then
> both values are the same.
> 4) Suppose the job failed all 6 times and the job is marked failed, then
> also both values are the same.
> 5) Suppose the job fails a 6th time, after recovering from 5 attempts and
> making a few checkpoints, then the two values are different.
>
> In case (1), case (2), case (3) and case (4) I never had any issue.
> Only in case (5) did I have a severe issue in my production, as the
> checkpoint in the "restored" field doesn't exist.
>
> Please suggest.
>
> {
>    "counts":{
>       "restored":6,
>       "total":3,
>       "in_progress":0,
>       "completed":3,
>       "failed":0
>    },
>    "summary":{
>       "state_size":{
>          "min":4879,
>          "max":4879,
>          "avg":4879
>       },
>       "end_to_end_duration":{
>          "min":25,
>          "max":130,
>          "avg":87
>       },
>       "alignment_buffered":{
>          "min":0,
>          "max":0,
>          "avg":0
>       }
>    },
>    "latest":{
>       "completed":{
>          "@class":"completed",
>          "id":7094,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382502772,
>          "latest_ack_timestamp":1590382502902,
>          "state_size":4879,
>          "end_to_end_duration":130,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{},
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>          "discarded":false
>       },
>       "savepoint":null,
>       "failed":null,
>       "restored":{
>          "id":7093,
>          "restore_timestamp":1590382478448,
>          "is_savepoint":false,
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
>       }
>    },
>    "history":[
>       {
>          "@class":"completed",
>          "id":7094,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382502772,
>          "latest_ack_timestamp":1590382502902,
>          "state_size":4879,
>          "end_to_end_duration":130,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{},
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>          "discarded":false
>       },
>       {
>          "@class":"completed",
>          "id":7093,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382310195,
>          "latest_ack_timestamp":1590382310220,
>          "state_size":4879,
>          "end_to_end_duration":25,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{},
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
>          "discarded":false
>       },
>       {
>          "@class":"completed",
>          "id":7092,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382190195,
>          "latest_ack_timestamp":1590382190303,
>          "state_size":4879,
>          "end_to_end_duration":108,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{},
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
>          "discarded":true
>       }
>    ]
> }