[ 
https://issues.apache.org/jira/browse/FLINK-27127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521603#comment-17521603
 ] 

Chesnay Schepler commented on FLINK-27127:
------------------------------------------

Alright, so here's what is going on:

When the job enters the restart loop (job is canceled -> restarts -> requests 
the slots it needs -> is informed they aren't enough -> job is canceled -> ...) 
it can evict the information about previous execution attempts once enough 
restarts have occurred.
In your setup this happens quite easily because you haven't configured a 
restart strategy, so the job restarts once per second. After 16 (== [execution 
history size|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#jobmanager-execution-attempts-history-size])
 restarts, all information about the last successful execution is removed, and 
from that point on Flink no longer attempts to use local recovery for that 
execution.

This of course shouldn't happen, and I'll create a ticket to fix that.

As a workaround, you could increase the history size or the restart delay.
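For example (a minimal flink-conf.yaml sketch; the option names are from the 
Flink docs linked above, the concrete values are illustrative):

{code:yaml}
# Option 1: keep more attempts in the execution history so the information
# about the last successful execution is not evicted (default: 16).
jobmanager.execution.attempts-history-size: 128

# Option 2: slow down the restart loop. Without an explicit strategy the job
# restarts about once per second; note that 'attempts' defaults to 1 for
# fixed-delay, so set it high enough to keep the job restarting.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 10 s
{code}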




> Local recovery is not triggered on task manager process restart
> ---------------------------------------------------------------
>
>                 Key: FLINK-27127
>                 URL: https://issues.apache.org/jira/browse/FLINK-27127
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.15.0
>            Reporter: Abdullah alkhawatrah
>            Assignee: Chesnay Schepler
>            Priority: Blocker
>
> Hey,
> I am experimenting with the support for local recovery after process restarts 
> that was introduced in 1.15. I am trying this on minikube.
> So far, it seems that every time a pod restarts, remote recovery is triggered.
> I have created a repo with everything needed to test it locally with 
> minikube: [https://github.com/akhawatrahTW/flink-local-recovery-test].
> The readme contains the steps to reproduce.
>  
> Based on the documentation, I was expecting local recovery to be triggered 
> on pod restarts since the needed configs are set: 
> [https://github.com/akhawatrahTW/flink-local-recovery-test/blob/bfef14e45f475ba953a05b50b8829d9d33bdcec6/k8s/flink-configuration-configmap.yaml#L27].
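> For reference, the relevant settings look roughly like this (an illustrative 
> sketch of the Flink 1.15 option names, not a verbatim copy of the linked 
> configmap):
> {code:yaml}
> # Enable local recovery so task-local copies of the state are kept.
> state.backend.local-recovery: true
> # FLIP-198 working directory (new in 1.15): point it at a persistent volume
> # so the local state survives a task manager process/pod restart.
> process.taskmanager.working-dir: /pv
> # Stable task manager identity so the restarted process finds its old state;
> # the value here matches one pod and is purely illustrative.
> taskmanager.resource-id: flink-taskmanager-2
> {code}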
> So I was expecting to see something similar to this in the logs of the 
> recreated task manager pod:
> *Expected:*
> {code:java}
> 2022-04-07 09:17:17,637 INFO  
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation
>  [] - Starting to restore from state handle: 
> IncrementalLocalKeyedStateHandle{metaDataState=File State: 
> file:/pv/tm_flink-taskmanager-2/localState/aid_e56a834e076a6d8f9dc1a2997e97a91a/jid_f88542b420546fadbc94db66b00cb5a0/vtx_20ba6b65f97481d5570070de90e4e791_sti_2/chk_1208/c2756339-8938-4949-84ff-d7ee3f4c55cf
>  [479 bytes]} 
> DirectoryKeyedStateHandle{directoryStateHandle=DirectoryStateHandle{directory=/pv/tm_flink-taskmanager-2/localState/aid_e56a834e076a6d8f9dc1a2997e97a91a/jid_f88542b420546fadbc94db66b00cb5a0/vtx_20ba6b65f97481d5570070de90e4e791_sti_2/chk_1208/5455302ce9554a1f81365aee368f267e},
>  keyGroupRange=KeyGroupRange{startKeyGroup=86, endKeyGroup=127}} without 
> rescaling.{code}
> 
> But for some reason, remote recovery is triggered:
> *Actual:*
> {code:java}
> 2022-04-07 09:17:18,405 INFO  
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation
>  [] - Finished restoring from state handle: 
> IncrementalRemoteKeyedStateHandle{backendIdentifier=544f3300-36bd-40a6-9ee3-f78b0e47dfd6,
>  stateHandleId=c2753d01-2f6b-49f0-9ca1-df6b54c61490, 
> keyGroupRange=KeyGroupRange{startKeyGroup=0, endKeyGroup=42}, 
> checkpointId=1208, 
> sharedState={001526.sst=ByteStreamStateHandle{handleName='f5a113d0-8094-40e7-a1b1-adc4cfc690c2',
>  dataBytes=23107}, 
> 001527.sst=ByteStreamStateHandle{handleName='3806411e-8213-406a-bbd8-e498ab19d118',
>  dataBytes=15579}, 
> 001528.sst=ByteStreamStateHandle{handleName='4fef6ead-1522-4f61-a6ad-399b334b41ca',
>  dataBytes=15839}, 
> 001529.sst=ByteStreamStateHandle{handleName='f1324a0c-3eae-46b0-acc2-c03d32b0c24a',
>  dataBytes=16055}}, 
> privateState={OPTIONS-001237=ByteStreamStateHandle{handleName='2e36d07b-5f91-4c9d-9778-5a16bb6254d5',
>  dataBytes=9924}, 
> MANIFEST-001234=ByteStreamStateHandle{handleName='4c95b38a-4afa-4154-9c89-9518d6384a25',
>  dataBytes=27356}, 
> CURRENT=ByteStreamStateHandle{handleName='17bd5bab-c369-470a-bf29-e76279cef2ba',
>  dataBytes=16}}, 
> metaStateHandle=ByteStreamStateHandle{handleName='15827f44-0ab2-4562-b8eb-812b8d260206',
>  dataBytes=479}, registered=false} without rescaling.{code}
> 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
