Thanks Yun. How can I contribute better documentation for this by opening a Jira?
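
For the documentation I was thinking of including a small example along these
lines, showing that a monitoring client which wants the newest retained
checkpoint should read "latest.completed" from jobs/:jobid/checkpoints and
treat "latest.restored" only as the checkpoint the job last restored from.
This is just a rough sketch in Python, not our actual monitor; the REST base
URL is a placeholder and the job id is taken from the response quoted below:

import json
from urllib.request import urlopen

FLINK_REST = "http://localhost:8081"          # placeholder REST endpoint
JOB_ID = "29ae7600aa4f7d53a0dc1a0a7b257c85"   # job id from the response quoted below

def latest_retained_checkpoint(base_url, job_id):
    # GET jobs/:jobid/checkpoints, same shape as the JSON quoted below
    with urlopen("%s/jobs/%s/checkpoints" % (base_url, job_id)) as resp:
        body = json.load(resp)
    latest = body["latest"]
    completed = latest.get("completed")
    if completed and not completed.get("discarded", False):
        # newest successfully completed, still retained checkpoint
        return completed["external_path"]
    restored = latest.get("restored")
    if restored:
        # only the checkpoint used for the last restore; may be older
        return restored["external_path"]
    raise RuntimeError("no checkpoint available for job " + job_id)

print(latest_retained_checkpoint(FLINK_REST, JOB_ID))
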
Regards
Bhaskar

On Tue, May 26, 2020 at 12:32 PM Yun Tang <myas...@live.com> wrote:

> Hi Bhaskar
>
> I think I have understood your scenario now. And I think this is what is
> expected in Flink.
> As you only allow your job to restore 5 times, "restored" would only
> record the checkpoint used at the 5th recovery, and that checkpoint id
> would always stay there.
>
> "Restored" is the last restored checkpoint and "completed" is the last
> completed checkpoint; they are actually not the same thing.
> The only scenario in which they have the same number is when Flink has
> just restored successfully before a new checkpoint completes.
>
> Best
> Yun Tang
>
> ------------------------------
> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
> *Sent:* Tuesday, May 26, 2020 12:19
> *To:* Yun Tang <myas...@live.com>
> *Cc:* user <user@flink.apache.org>
> *Subject:* Re: In consistent Check point API response
>
> Hi Yun
> Understood the issue now:
> "restored" always shows only the checkpoint that was used for restoring
> the previous state.
> For all attempts < 6 (in my case max attempts is 5, so 6 is the last
> attempt):
> Flink HA restores the state, so restored and latest have the same value.
> If the last attempt == 6:
> The Flink job already has a few checkpoints.
> After that the job failed, and Flink HA gave up and marked the job state
> as "FAILED".
> At this point the "restored" value is the one from the 5th attempt, but
> latest is the latest checkpoint which is retained.
>
> Shall I file a documentation improvement Jira? I want to add more
> documentation with the help of the above scenarios.
>
> Regards
> Bhaskar
>
> On Tue, May 26, 2020 at 8:14 AM Yun Tang <myas...@live.com> wrote:
>
> Hi Bhaskar
>
> It seems I still do not totally understand your case 5. Your job failed 6
> times, and recovered from the previous checkpoint to restart again.
> However, you found the REST API gave the wrong answer.
> How do you know the "restored" field is giving the wrong checkpoint file
> which is not the latest? Have you checked the JM log for the related
> message "Restoring job xxx from latest valid checkpoint: x@xxxx" [1] to
> know exactly which checkpoint was chosen to restore?
>
> I think you could give a more concrete example, e.g. the expected vs.
> actual checkpoint to restore, to tell your story.
>
> [1]
> https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250
>
> Best
> Yun Tang
> ------------------------------
> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
> *Sent:* Monday, May 25, 2020 17:01
> *To:* Yun Tang <myas...@live.com>
> *Cc:* user <user@flink.apache.org>
> *Subject:* Re: In consistent Check point API response
>
> Thanks Yun.
> Here is the problem I am facing:
>
> I am using the jobs/:jobID/checkpoints API to recover the failed job. We
> have a remote manager which monitors the jobs. We are using the "restored"
> field of the API response to get the latest checkpoint file to use. It
> gives the correct checkpoint file in all 4 cases except the 5th case,
> where the "restored" field gives a wrong checkpoint file which is not the
> latest. When we compare it with the checkpoint file returned by the
> "completed" field, both give identical checkpoints in all 4 cases except
> the 5th case.
> We can't use the Flink UI because of security reasons.
>
> Regards
> Bhaskar
>
> On Mon, May 25, 2020 at 12:57 PM Yun Tang <myas...@live.com> wrote:
>
> Hi Vijay
>
> If I understand correctly, do you mean your last "restored" checkpoint is
> null via the REST API when the job failed 6 times and then recovered
> successfully with several more successful checkpoints?
>
> First of all, if your job just recovered successfully, can you observe
> the "last restored" checkpoint in the web UI?
> Secondly, for how long can you not see the "restored" field after
> recovering successfully?
> Last but not least, I cannot see the real difference among your cases;
> what's the core difference in your case (5)?
>
> From the implementation of Flink, it will create the checkpoint
> statistics without a restored checkpoint and assign it once the latest
> savepoint/checkpoint is restored. [1]
>
> [1]
> https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285
>
> Best
> Yun Tang
>
> ------------------------------
> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
> *Sent:* Monday, May 25, 2020 14:20
> *To:* user <user@flink.apache.org>
> *Subject:* In consistent Check point API response
>
> Hi
> I am using Flink retained checkpoints along with the
> jobs/:jobid/checkpoints API for retrieving the latest retained checkpoint.
> The response of the Flink checkpoints API is pasted below.
>
> My job's restart attempts are set to 5.
> In the checkpoint API response, under the "latest" key, the checkpoint
> file names of the "restored" and "completed" values show the following
> behavior:
> 1) Suppose the job failed 3 times and recovered on the 4th attempt, then
> both values are the same.
> 2) Suppose the job failed 4 times and recovered on the 5th attempt, then
> both values are the same.
> 3) Suppose the job failed 5 times and recovered on the 6th attempt, then
> both values are the same.
> 4) Suppose the job failed all 6 times and the job is marked failed, then
> also both values are the same.
> 5) Suppose the job fails a 6th time, after recovering from 5 attempts and
> making a few checkpoints, then the two values are different.
>
> In case (1), case (2), case (3) and case (4) I never had any issue.
> Only in case (5) did I have a severe issue in my production, as the
> checkpoint in the "restored" field doesn't exist.
>
> Please suggest.
>
> {
>    "counts":{
>       "restored":6,
>       "total":3,
>       "in_progress":0,
>       "completed":3,
>       "failed":0
>    },
>    "summary":{
>       "state_size":{
>          "min":4879,
>          "max":4879,
>          "avg":4879
>       },
>       "end_to_end_duration":{
>          "min":25,
>          "max":130,
>          "avg":87
>       },
>       "alignment_buffered":{
>          "min":0,
>          "max":0,
>          "avg":0
>       }
>    },
>    "latest":{
>       "completed":{
>          "@class":"completed",
>          "id":7094,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382502772,
>          "latest_ack_timestamp":1590382502902,
>          "state_size":4879,
>          "end_to_end_duration":130,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{},
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>          "discarded":false
>       },
>       "savepoint":null,
>       "failed":null,
>       "restored":{
>          "id":7093,
>          "restore_timestamp":1590382478448,
>          "is_savepoint":false,
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
>       }
>    },
>    "history":[
>       {
>          "@class":"completed",
>          "id":7094,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382502772,
>          "latest_ack_timestamp":1590382502902,
>          "state_size":4879,
>          "end_to_end_duration":130,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{},
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>          "discarded":false
>       },
>       {
>          "@class":"completed",
>          "id":7093,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382310195,
>          "latest_ack_timestamp":1590382310220,
>          "state_size":4879,
>          "end_to_end_duration":25,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{},
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
>          "discarded":false
>       },
>       {
>          "@class":"completed",
>          "id":7092,
>          "status":"COMPLETED",
>          "is_savepoint":false,
>          "trigger_timestamp":1590382190195,
>          "latest_ack_timestamp":1590382190303,
>          "state_size":4879,
>          "end_to_end_duration":108,
>          "alignment_buffered":0,
>          "num_subtasks":2,
>          "num_acknowledged_subtasks":2,
>          "tasks":{},
>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
>          "discarded":true
>       }
>    ]
> }