Re: In consistent Check point API response

Vijay Bhaskar Wed, 27 May 2020 05:01:11 -0700

Created JIRA for it: https://issues.apache.org/jira/browse/FLINK-17966


Regards
Bhaskar



On Wed, May 27, 2020 at 1:28 PM Vijay Bhaskar <bhaskar.eba...@gmail.com>
wrote:

> Thanks Yun. In that case, it would be good to reference that documentation
> from the Flink REST API page:
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html
> where it describes the checkpoint endpoints. Anyone who uses the REST API
> in the future would then have an easy pointer to the checkpoint monitoring
> document, which would give them the complete picture. I will open a Jira
> issue with this requirement.
>
> Regards
> Bhaskar
>
> On Wed, May 27, 2020 at 11:59 AM Yun Tang <myas...@live.com> wrote:
>
>> To be honest, from my point of view the current description in the
>> "Overview Tab" section [1] should already give enough explanation:
>>     *Latest Completed Checkpoint*: The latest successfully completed
>> checkpoints.
>>     *Latest Restore*: There are two types of restore operations.
>>
>>    - Restore from Checkpoint: We restored from a regular periodic
>>    checkpoint.
>>    - Restore from Savepoint: We restored from a savepoint.
>>
>>
>> You could still create a JIRA issue and describe your ideas there. If it
>> is agreed that the ticket should be worked on, you can create a PR that
>> edits checkpoint_monitoring.md [2] and checkpoint_monitoring.zh.md [3] to
>> update the related documentation.
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/checkpoint_monitoring.html#overview-tab
>> [2]
>> https://github.com/apache/flink/blob/master/docs/monitoring/checkpoint_monitoring.md
>> [3]
>> https://github.com/apache/flink/blob/master/docs/monitoring/checkpoint_monitoring.zh.md
>>
>> Best
>> Yun Tang
>> ------------------------------
>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>> *Sent:* Tuesday, May 26, 2020 15:18
>> *To:* Yun Tang <myas...@live.com>
>> *Cc:* user <user@flink.apache.org>
>> *Subject:* Re: In consistent Check point API response
>>
>> Thanks Yun. How can I contribute better documentation on this by
>> opening a Jira issue?
>>
>> Regards
>> Bhaskar
>>
>> On Tue, May 26, 2020 at 12:32 PM Yun Tang <myas...@live.com> wrote:
>>
>> Hi Bhaskar
>>
>> I think I understand your scenario now, and I think this is the expected
>> behavior in Flink.
>> As you only allow your job to restore 5 times, "restored" would only
>> record the checkpoint used at the 5th recovery, and that checkpoint id
>> would stay there from then on.
>>
>> "Restored" is the last restored checkpoint and "completed" is the last
>> completed checkpoint; they are not the same thing.
>> The only scenario in which their ids are the same is when Flink has just
>> restored successfully and no new checkpoint has completed yet.
>>
>> Best
>> Yun Tang
>>
>>
>> ------------------------------
>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>> *Sent:* Tuesday, May 26, 2020 12:19
>> *To:* Yun Tang <myas...@live.com>
>> *Cc:* user <user@flink.apache.org>
>> *Subject:* Re: In consistent Check point API response
>>
>> Hi Yun
>> I understand the issue now:
>> "restored" always shows only the checkpoint that was used for restoring
>> the previous state.
>> In all attempts < 6 (in my case the max attempts are 5, so 6 is the last
>> attempt):
>>   Flink HA restores the state, so "restored" and "latest" have the same
>>   value.
>> If the last attempt == 6:
>>   The Flink job has already taken a few checkpoints.
>>   After that the job failed, and Flink HA gave up and marked the job state
>>   as "FAILED".
>>   At this point the "restored" value is the one from the 5th attempt,
>>   but "latest" is the most recent retained checkpoint.
>>
>> Shall I file a documentation improvement Jira? I want to add more
>> documentation based on the above scenarios.
>>
>> Regards
>> Bhaskar
>>
>>
>>
>> On Tue, May 26, 2020 at 8:14 AM Yun Tang <myas...@live.com> wrote:
>>
>> Hi Bhaskar
>>
>> It seems I still do not fully understand your case 5. Your job failed 6
>> times and recovered from the previous checkpoint to restart again. However,
>> you found that the REST API gave the wrong answer.
>> How did you determine that the "restored" field is giving a wrong
>> checkpoint file that is not the latest? Have you checked the JM log for
>> the related message "Restoring job xxx from latest valid checkpoint: x@xxxx"
>> [1] to know exactly which checkpoint was chosen for the restore?
>>
>> I think you could give a more concrete example, e.g. the expected and
>> actual checkpoint to restore, to tell your story.
>>
>> [1]
>> https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250
>>
>> Best
>> Yun Tang
>> ------------------------------
>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>> *Sent:* Monday, May 25, 2020 17:01
>> *To:* Yun Tang <myas...@live.com>
>> *Cc:* user <user@flink.apache.org>
>> *Subject:* Re: In consistent Check point API response
>>
>> Thanks Yun.
>> Here is the problem I am facing:
>>
>> I am using the jobs/:jobID/checkpoints API to recover the failed job. We
>> have a remote manager which monitors the jobs. We use the "restored"
>> field of the API response to get the latest checkpoint file to use. It
>> gives the correct checkpoint file in the first 4 cases but not in the 5th
>> case, where the "restored" field gives a checkpoint file which is not the
>> latest. When we compare it with the checkpoint file returned by the
>> "completed" field, both give identical checkpoints in those 4 cases, but
>> not in the 5th case.
>> We can't use the Flink UI for security reasons.
>>
>> Regards
>> Bhaskar
>>
>> On Mon, May 25, 2020 at 12:57 PM Yun Tang <myas...@live.com> wrote:
>>
>> Hi Vijay
>>
>> If I understand correctly, do you mean that your last "restored" checkpoint
>> is null via the REST API when the job failed 6 times and then recovered
>> successfully with several more successful checkpoints?
>>
>> First of all, if your job just recovered successfully, can you observe
>> the "last restored" checkpoint in the web UI?
>> Secondly, for how long is the "restored" field missing after a
>> successful recovery?
>> Last but not least, I cannot see the real difference among your cases.
>> What is the core difference in your case (5)?
>>
>> In Flink's implementation, the checkpoint statistics are created
>> without a restored checkpoint, and it is assigned once the latest
>> savepoint/checkpoint is restored. [1]
>>
>> [1]
>> https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285
>>
>> Best
>> Yun Tang
>>
>> ------------------------------
>> *From:* Vijay Bhaskar <bhaskar.eba...@gmail.com>
>> *Sent:* Monday, May 25, 2020 14:20
>> *To:* user <user@flink.apache.org>
>> *Subject:* In consistent Check point API response
>>
>> Hi
>> I am using Flink retained checkpoints along with the
>> jobs/:jobid/checkpoints API to retrieve the latest retained checkpoint.
>> The response of the Flink checkpoints API follows below.
>>
>> My job's restart attempts are set to 5.
>> In the "latest" key of the checkpoint API response, the checkpoint file
>> names in the "restored" and "completed" values behave as follows:
>> 1) Suppose the job failed 3 times and recovered the 4th time; then both
>> values are the same.
>> 2) Suppose the job failed 4 times and recovered the 5th time; then both
>> values are the same.
>> 3) Suppose the job failed 5 times and recovered the 6th time; then both
>> values are the same.
>> 4) Suppose the job failed all 6 times and was marked failed; then
>> both values are also the same.
>> 5) Suppose the job failed the 6th time, after recovering from 5 attempts
>> and making a few checkpoints; then both values are different.
>>
>> In cases (1), (2), (3) and (4) I never had any issue.
>> Only in case (5) did I have a severe issue in production, as the
>> checkpoint in the "restored" field doesn't exist.
>>
>> Any suggestions are welcome.
>>
>>
>>
>> {
>>    "counts":{
>>       "restored":6,
>>       "total":3,
>>       "in_progress":0,
>>       "completed":3,
>>       "failed":0
>>    },
>>    "summary":{
>>       "state_size":{
>>          "min":4879,
>>          "max":4879,
>>          "avg":4879
>>       },
>>       "end_to_end_duration":{
>>          "min":25,
>>          "max":130,
>>          "avg":87
>>       },
>>       "alignment_buffered":{
>>          "min":0,
>>          "max":0,
>>          "avg":0
>>       }
>>    },
>>    "latest":{
>>       "completed":{
>>          "@class":"completed",
>>          "id":7094,
>>          "status":"COMPLETED",
>>          "is_savepoint":false,
>>          "trigger_timestamp":1590382502772,
>>          "latest_ack_timestamp":1590382502902,
>>          "state_size":4879,
>>          "end_to_end_duration":130,
>>          "alignment_buffered":0,
>>          "num_subtasks":2,
>>          "num_acknowledged_subtasks":2,
>>          "tasks":{
>>          },
>>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>>          "discarded":false
>>       },
>>       "savepoint":null,
>>       "failed":null,
>>       "restored":{
>>          "id":7093,
>>          "restore_timestamp":1590382478448,
>>          "is_savepoint":false,
>>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
>>       }
>>    },
>>    "history":[
>>       {
>>          "@class":"completed",
>>          "id":7094,
>>          "status":"COMPLETED",
>>          "is_savepoint":false,
>>          "trigger_timestamp":1590382502772,
>>          "latest_ack_timestamp":1590382502902,
>>          "state_size":4879,
>>          "end_to_end_duration":130,
>>          "alignment_buffered":0,
>>          "num_subtasks":2,
>>          "num_acknowledged_subtasks":2,
>>          "tasks":{
>>          },
>>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
>>          "discarded":false
>>       },
>>       {
>>          "@class":"completed",
>>          "id":7093,
>>          "status":"COMPLETED",
>>          "is_savepoint":false,
>>          "trigger_timestamp":1590382310195,
>>          "latest_ack_timestamp":1590382310220,
>>          "state_size":4879,
>>          "end_to_end_duration":25,
>>          "alignment_buffered":0,
>>          "num_subtasks":2,
>>          "num_acknowledged_subtasks":2,
>>          "tasks":{
>>          },
>>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
>>          "discarded":false
>>       },
>>       {
>>          "@class":"completed",
>>          "id":7092,
>>          "status":"COMPLETED",
>>          "is_savepoint":false,
>>          "trigger_timestamp":1590382190195,
>>          "latest_ack_timestamp":1590382190303,
>>          "state_size":4879,
>>          "end_to_end_duration":108,
>>          "alignment_buffered":0,
>>          "num_subtasks":2,
>>          "num_acknowledged_subtasks":2,
>>          "tasks":{
>>          },
>>          "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
>>          "discarded":true
>>       }
>>    ]
>> }
>>
>>
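The thread concludes that "restored" is the checkpoint the job last started from, while "completed" is the newest checkpoint it has finished since. An external monitor that wants the most recent retained checkpoint could therefore read "latest.completed" rather than "latest.restored". What follows is only a minimal sketch of that idea in Python, not anything shipped with Flink: the helper name `latest_retained_checkpoint_path` is hypothetical, the sample is a trimmed copy of the response quoted above, and the fallback to the "history" list assumes that a larger checkpoint id means a newer checkpoint.

```python
from typing import Optional


def latest_retained_checkpoint_path(response: dict) -> Optional[str]:
    """Return the external_path of the newest completed, non-discarded checkpoint.

    Hypothetical helper, not part of Flink. It assumes that checkpoint ids
    grow over time, so the largest id is the most recent checkpoint.
    """
    candidates = []
    latest_completed = (response.get("latest") or {}).get("completed")
    if latest_completed:
        candidates.append(latest_completed)
    # Older completed checkpoints from the history serve as a fallback.
    candidates.extend(
        entry for entry in response.get("history") or []
        if entry.get("status") == "COMPLETED"
    )
    # Skip entries that were already discarded or carry no path.
    usable = [
        c for c in candidates
        if not c.get("discarded", False) and c.get("external_path")
    ]
    if not usable:
        return None
    return max(usable, key=lambda c: c["id"])["external_path"]


# Trimmed copy of the response quoted in this thread.
BASE = "file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85"
sample = {
    "latest": {
        "completed": {
            "id": 7094,
            "status": "COMPLETED",
            "external_path": BASE + "/chk-7094",
            "discarded": False,
        },
        "restored": {"id": 7093, "external_path": BASE + "/chk-7093"},
    },
    "history": [
        {"id": 7094, "status": "COMPLETED",
         "external_path": BASE + "/chk-7094", "discarded": False},
        {"id": 7093, "status": "COMPLETED",
         "external_path": BASE + "/chk-7093", "discarded": False},
        {"id": 7092, "status": "COMPLETED",
         "external_path": BASE + "/chk-7092", "discarded": True},
    ],
}

# Picks chk-7094 (latest completed), not chk-7093 (last restored).
print(latest_retained_checkpoint_path(sample))
```

Filtering on the "discarded" flag is a design choice aimed at the failure described in case (5), where the monitor was handed a checkpoint that no longer existed. Whether the flag is always up to date for a failed job is not confirmed in this thread, so checking that the path still exists before resuming would be a sensible extra step.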
