Created JIRA for it: https://issues.apache.org/jira/browse/FLINK-17966
Regards
Bhaskar

On Wed, May 27, 2020 at 1:28 PM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

> Thanks Yun. In that case it would be good to reference that
> documentation from the Flink REST API page:
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html
> where it explains the checkpoints. That way, anyone who uses the REST
> API later gets an easy reference to the checkpoint monitoring document,
> which would give them the complete picture. So I will open a Jira with
> this requirement.
>
> Regards
> Bhaskar
>
> On Wed, May 27, 2020 at 11:59 AM Yun Tang <myas...@live.com> wrote:
>
>> To be honest, from my point of view the current description should
>> already give enough explanation [1] in "Overview Tab".
>> *Latest Completed Checkpoint*: The latest successfully completed
>> checkpoints.
>> *Latest Restore*: There are two types of restore operations.
>>
>> - Restore from Checkpoint: We restored from a regular periodic
>>   checkpoint.
>> - Restore from Savepoint: We restored from a savepoint.
>>
>> You could still create a JIRA issue and give your ideas in that issue. If
>> it is agreed to work on that ticket, you can create a PR to edit
>> checkpoint_monitoring.md [2] and checkpoint_monitoring.zh.md [3] to
>> update the related documentation.
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/checkpoint_monitoring.html#overview-tab
>> [2] https://github.com/apache/flink/blob/master/docs/monitoring/checkpoint_monitoring.md
>> [3] https://github.com/apache/flink/blob/master/docs/monitoring/checkpoint_monitoring.zh.md
>>
>> Best
>> Yun Tang
>> ------------------------------
>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>> *Sent:* Tuesday, May 26, 2020 15:18
>> *To:* Yun Tang <myas...@live.com>
>> *Cc:* user <user@flink.apache.org>
>> *Subject:* Re: In consistent Check point API response
>>
>> Thanks Yun. How can I contribute better documentation for this by
>> opening a Jira issue?
>>
>> Regards
>> Bhaskar
>>
>> On Tue, May 26, 2020 at 12:32 PM Yun Tang <myas...@live.com> wrote:
>>
>> Hi Bhaskar
>>
>> I think I have understood your scenario now. And I think this is what is
>> expected in Flink.
>> As you only allow your job to restore 5 times, "restored" would
>> only record the checkpoint restored at the 5th recovery, and that
>> checkpoint id would always stay there.
>>
>> "Restored" is for the last restored checkpoint and "completed" is for the
>> last completed checkpoint; they are actually not the same thing.
>> The only scenario in which they have the same id is when Flink has just
>> restored successfully and no new checkpoint has completed yet.
>>
>> Best
>> Yun Tang
>>
>> ------------------------------
>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>> *Sent:* Tuesday, May 26, 2020 12:19
>> *To:* Yun Tang <myas...@live.com>
>> *Cc:* user <user@flink.apache.org>
>> *Subject:* Re: In consistent Check point API response
>>
>> Hi Yun
>> Understood the issue now:
>> "restored" always shows only the checkpoint that was used for restoring
>> the previous state.
>> In all attempts < 6 (in my case max attempts are 5, so 6 is the last
>> attempt), Flink HA restores the state, so "restored" and "latest" have
>> the same value.
>> If the last attempt == 6:
>> the Flink job already has a few checkpoints;
>> after that the job failed, and Flink HA gave up and marked the job state
>> as "FAILED".
>> At this point the "restored" value is the one from the 5th attempt,
>> but "latest" is the latest checkpoint that is retained.
>>
>> Shall I file a documentation improvement Jira? I want to add more
>> documentation with the help of the above scenarios.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, May 26, 2020 at 8:14 AM Yun Tang <myas...@live.com> wrote:
>>
>> Hi Bhaskar
>>
>> It seems I still do not totally understand your case 5. Your job failed 6
>> times, and recovered from the previous checkpoint to restart again.
>> However, you found that the REST API gave the wrong answer.
>> How do you know that your "restored" field is giving the wrong checkpoint
>> file, one that is not the latest? Have you checked the JM log for the
>> related message: "Restoring job xxx from latest valid checkpoint: x@xxxx"
>> [1] to know exactly which checkpoint was chosen for the restore?
>>
>> I think you could give a more concrete example, e.g. the expected and the
>> actual checkpoint to restore, to tell your story.
>>
>> [1] https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250
>>
>> Best
>> Yun Tang
>> ------------------------------
>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>> *Sent:* Monday, May 25, 2020 17:01
>> *To:* Yun Tang <myas...@live.com>
>> *Cc:* user <user@flink.apache.org>
>> *Subject:* Re: In consistent Check point API response
>>
>> Thanks Yun.
>> Here is the problem I am facing:
>>
>> I am using the jobs/:jobID/checkpoints API to recover the failed job. We
>> have a remote manager which monitors the jobs. We are using the "restored"
>> field of the API response to get the latest checkpoint file to use. It
>> gives the correct checkpoint file for all 4 cases except the 5th case,
>> where the "restored" field gives the wrong checkpoint file, which is
>> not the latest. When we compare it with the checkpoint file returned by
>> the "completed" field, both give identical checkpoints in all 4 cases,
>> except the 5th case.
>> We can't use the Flink UI for security reasons.
>>
>> Regards
>> Bhaskar
>>
>> On Mon, May 25, 2020 at 12:57 PM Yun Tang <myas...@live.com> wrote:
>>
>> Hi Vijay
>>
>> If I understand correctly, do you mean that your last "restored"
>> checkpoint is null via the REST API when the job failed 6 times and then
>> recovered successfully with several more successful checkpoints?
>>
>> First of all, if your job just recovered successfully, can you observe
>> the "last restored" checkpoint in the web UI?
>> Secondly, for how long can you not see the "restored" field after
>> recovering successfully?
>> Last but not least, I cannot see the real difference among your cases.
>> What's the core difference in your case (5)?
>>
>> From the implementation of Flink, it will create the checkpoint statistics
>> without a restored checkpoint and assign it once the latest
>> savepoint/checkpoint is restored. [1]
>>
>> [1] https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285
>>
>> Best
>> Yun Tang
>>
>> ------------------------------
>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>> *Sent:* Monday, May 25, 2020 14:20
>> *To:* user <user@flink.apache.org>
>> *Subject:* In consistent Check point API response
>>
>> Hi
>> I am using Flink retained checkpoints along with the
>> jobs/:jobid/checkpoints API for retrieving the latest retained checkpoint.
>> The response of the Flink checkpoints API is given below.
>>
>> My job's restart attempts are set to 5.
>> In the checkpoint API response, under the "latest" key, the checkpoint
>> file names in the "restored" and "completed" values behave as follows:
>> 1) Suppose the job failed 3 times and recovered the 4th time; then both
>> values are the same.
>> 2) Suppose the job failed 4 times and recovered the 5th time; then both
>> values are the same.
>> 3) Suppose the job failed 5 times and recovered the 6th time; then both
>> values are the same.
>> 4) Suppose the job failed all 6 times and was marked failed; then
>> both values are also the same.
>> 5) Suppose the job failed the 6th time, after recovering from 5 attempts
>> and making a few checkpoints; then the two values are different.
>>
>> During case (1), case (2), case (3) and case (4) I never had any issue.
>> Only in case (5) did I have a severe issue in production, as the
>> checkpoint in the "restored" field doesn't exist.
>>
>> Please advise.
>>
>> {
>>   "counts":{
>>     "restored":6,
>>     "total":3,
>>     "in_progress":0,
>>     "completed":3,
>>     "failed":0
>>   },
>>   "summary":{
>>     "state_size":{
>>       "min":4879,
>>       "max":4879,
>>       "avg":4879
>>     },
>>     "end_to_end_duration":{
>>       "min":25,
>>       "max":130,
>>       "avg":87
>>     },
>>     "alignment_buffered":{
>>       "min":0,
>>       "max":0,
>>       "avg":0
>>     }
>>   },
>>   "latest":{
>>     "completed":{
>>       "@class":"completed",
>>       "id":7094,
>>       "status":"COMPLETED",
>>       "is_savepoint":false,
>>       "trigger_timestamp":1590382502772,
>>       "latest_ack_timestamp":1590382502902,
>>       "state_size":4879,
>>       "end_to_end_duration":130,
>>       "alignment_buffered":0,
>>       "num_subtasks":2,
>>       "num_acknowledged_subtasks":2,
>>       "tasks":{},
>>       "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>>       "discarded":false
>>     },
>>     "savepoint":null,
>>     "failed":null,
>>     "restored":{
>>       "id":7093,
>>       "restore_timestamp":1590382478448,
>>       "is_savepoint":false,
>>       "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
>>     }
>>   },
>>   "history":[
>>     {
>>       "@class":"completed",
>>       "id":7094,
>>       "status":"COMPLETED",
>>       "is_savepoint":false,
>>       "trigger_timestamp":1590382502772,
>>       "latest_ack_timestamp":1590382502902,
>>       "state_size":4879,
>>       "end_to_end_duration":130,
>>       "alignment_buffered":0,
>>       "num_subtasks":2,
>>       "num_acknowledged_subtasks":2,
>>       "tasks":{},
>>       "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>>       "discarded":false
>>     },
>>     {
>>       "@class":"completed",
>>       "id":7093,
>>       "status":"COMPLETED",
>>       "is_savepoint":false,
>>       "trigger_timestamp":1590382310195,
>>       "latest_ack_timestamp":1590382310220,
>>       "state_size":4879,
>>       "end_to_end_duration":25,
>>       "alignment_buffered":0,
>>       "num_subtasks":2,
>>       "num_acknowledged_subtasks":2,
>>       "tasks":{},
>>       "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
>>       "discarded":false
>>     },
>>     {
>>       "@class":"completed",
>>       "id":7092,
>>       "status":"COMPLETED",
>>       "is_savepoint":false,
>>       "trigger_timestamp":1590382190195,
>>       "latest_ack_timestamp":1590382190303,
>>       "state_size":4879,
>>       "end_to_end_duration":108,
>>       "alignment_buffered":0,
>>       "num_subtasks":2,
>>       "num_acknowledged_subtasks":2,
>>       "tasks":{},
>>       "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
>>       "discarded":true
>>     }
>>   ]
>> }