[ https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452441#comment-17452441 ]

Enrique Lacal commented on FLINK-25098:
---------------------------------------

Hi Till,

I want to differentiate between a reinstall and this occurring while the Flink 
cluster is running. I agree that when uninstalling everything it is safer to 
manually delete the ConfigMaps, and that is what Adrian has seen above, with 
them pointing to an outdated job graph. Deleting the ConfigMaps solves the 
reinstall case, but not the latter one.

The other problem we are seeing is that, while a Flink cluster is running, the 
ConfigMap for one of the jobs becomes inconsistent for some reason, and when the 
leader JM goes down and a new follower tries to restore that job from the 
ConfigMap, it cannot find the referenced checkpoint. (This is what Neeraj has 
shared through the logs.) I've done some investigation and it seems that Flink 
updates the ConfigMap before creating the `completedCheckpoint` folder in the 
directory set by `high-availability.storageDir`; I watched the ConfigMap and the 
filesystem simultaneously. My assumption is that the leader JM goes down for 
some reason after the ConfigMap is updated but before the `completedCheckpoint` 
is written, leaving the state inconsistent. Flink then cannot recover from this 
state without manual intervention, which is a significant problem. Another 
possibility is that the checkpoint fails but the ConfigMap is prematurely 
updated; I think this is less likely.
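
To make the suspected ordering concrete, here is a minimal, self-contained 
sketch (plain Java, not Flink's actual implementation; `pointer` stands in for 
the ConfigMap entry and a temp directory stands in for 
`high-availability.storageDir`):

{code:java}
import java.nio.file.*;
import java.util.concurrent.atomic.AtomicReference;

public class PointerOrderingSketch {

    // Stand-in for the HA ConfigMap entry that names the latest checkpoint.
    static final AtomicReference<Path> pointer = new AtomicReference<>();

    // Ordering we believe we observed: reference first, data second.
    // A crash between the two steps leaves pointer -> missing file.
    static void publishThenWrite(Path haStorageDir, byte[] metadata) throws Exception {
        Path file = haStorageDir.resolve("completedCheckpoint-" + System.nanoTime());
        pointer.set(file);                 // step 1: reference published
        // <-- a crash here means recovery later cannot find the file
        Files.write(file, metadata);       // step 2: data written
    }

    // Ordering that would avoid the dangling reference: data first, reference second.
    static void writeThenPublish(Path haStorageDir, byte[] metadata) throws Exception {
        Path file = haStorageDir.resolve("completedCheckpoint-" + System.nanoTime());
        Files.write(file, metadata);       // step 1: data durably written
        pointer.set(file);                 // step 2: reference published
        // <-- a crash here only loses the newest checkpoint; the old pointer stays valid
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ha-storage");
        writeThenPublish(dir, "metadata".getBytes());
        System.out.println("latest checkpoint: " + pointer.get());
    }
}
{code}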

I understand you want to see some logs from when the checkpoint fails in order 
to understand the root cause, so we have set up persistence for our logs and 
will share them as soon as we can reproduce this issue. I also couldn't find a 
way of reproducing this by crashing the job, killing the leader pod, etc., and 
since the interval between the ConfigMap being updated and the file being 
created in the filesystem is so short, it's hard to crash the pod at exactly 
that moment. From my observation, after the Flink cluster is in this 
unrecoverable state the actual checkpoint data is stored in 
`state.checkpoints.dir` (e.g. `chk-<number>`), but the `completedCheckpoint` 
file doesn't exist, which means that the checkpoint has been taken correctly 
but the reference to it is not there. I believe this is the way it works, but 
I'm not 100% sure.
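
For what it's worth, this is the kind of check I've been doing by hand, written 
out as a small sketch (the two paths are assumptions about where the checkpoint 
and HA volumes are mounted; adjust them to your setup):

{code:java}
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class HaConsistencyCheck {
    public static void main(String[] args) throws IOException {
        Path checkpointsDir = Paths.get(args[0]); // state.checkpoints.dir/<job-id>
        Path haStorageDir   = Paths.get(args[1]); // high-availability.storageDir/<cluster-id>

        try (Stream<Path> chk = Files.list(checkpointsDir);
             Stream<Path> ha = Files.list(haStorageDir)) {
            long dataDirs = chk.filter(p -> p.getFileName().toString().startsWith("chk-")).count();
            long metaFiles = ha.filter(p -> p.getFileName().toString().startsWith("completedCheckpoint")).count();
            System.out.println("chk-* directories:          " + dataDirs);
            System.out.println("completedCheckpoint* files: " + metaFiles);
            if (dataDirs > 0 && metaFiles == 0) {
                System.out.println("Checkpoint data exists but no completedCheckpoint metadata"
                        + " -> matches the dangling-reference state described above.");
            }
        }
    }
}
{code}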

Just in case, these are the parameters used for checkpointing:
|Checkpointing Mode|Exactly Once|
|Checkpoint Storage|FileSystemCheckpointStorage|
|State Backend|EmbeddedRocksDBStateBackend|
|Interval|5s|
|Timeout|10m 0s|
|Minimum Pause Between Checkpoints|0ms|
|Maximum Concurrent Checkpoints|1|
|Unaligned Checkpoints|Disabled|
|Persist Checkpoints Externally|Enabled (retain on cancellation)|
|Tolerable Failed Checkpoints|0|
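
In code, that corresponds roughly to the following (standard `CheckpointConfig` 
API in 1.13; the checkpoint storage path is a placeholder, not our real one):

{code:java}
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettings {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setStateBackend(new EmbeddedRocksDBStateBackend());
        env.enableCheckpointing(5_000L, CheckpointingMode.EXACTLY_ONCE); // 5s interval, exactly once

        CheckpointConfig conf = env.getCheckpointConfig();
        conf.setCheckpointStorage("file:///<checkpoints-dir>");   // FileSystemCheckpointStorage
        conf.setCheckpointTimeout(600_000L);                      // 10m 0s
        conf.setMinPauseBetweenCheckpoints(0L);
        conf.setMaxConcurrentCheckpoints(1);
        conf.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        conf.setTolerableCheckpointFailureNumber(0);
        // unaligned checkpoints are left at the default (disabled)
    }
}
{code}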


Do you think there is a workaround for this issue? Maybe changing the above 
configuration to be less strict?
Thanks for your time!

> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
>                 Key: FLINK-25098
>                 URL: https://issues.apache.org/jira/browse/FLINK-25098
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.2, 1.13.3
>         Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>            Reporter: Adrian Vasiliu
>            Priority: Critical
>         Attachments: 
> iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log, 
> jm-flink-ha-jobmanager-log.txt, jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{{}high-availability.storageDir: file///<dir>{}}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.


