We are running a Flink session cluster with three TaskManagers, with cluster.evenly-spread-out-slots set to true. There are 13 jobs running, with an average parallelism of 4 per job. We are on Flink 1.11.2 and Java 11, running on AWS EC2 instances with EFS for high-availability.storageDir.

We are seeing very high checkpoint times, and checkpoints are timing out. The checkpoint timeout is the default of 10 minutes. This does not seem to be related to EFS limits or throttling. We started experiencing these timeouts after upgrading from Flink 1.9.2/Java 8. Are there any known issues that cause very high checkpoint times?

Also, I noticed that we did not set state.checkpoints.dir. I assume it is using high-availability.storageDir. Is that correct?

For now we plan on setting the following, and also explicitly setting state.checkpoints.dir:

execution.checkpointing.timeout<https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#execution-checkpointing-timeout>: 60 min
execution.checkpointing.tolerable-failed-checkpoints<https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#execution-checkpointing-tolerable-failed-checkpoints>: 12
execution.checkpointing.unaligned<https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#execution-checkpointing-unaligned>: true
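
For reference, here is a rough sketch of how the planned settings might look in flink-conf.yaml. The option names and values are the ones listed in this message. The state.checkpoints.dir path is only a placeholder (an assumed EFS mount point), not our actual directory.

```yaml
# Sketch of the planned checkpointing settings for flink-conf.yaml (Flink 1.11).

# Raise the checkpoint timeout from the default 10 minutes to 60 minutes.
execution.checkpointing.timeout: 60 min

# Tolerate up to 12 failed checkpoints before the job is failed.
execution.checkpointing.tolerable-failed-checkpoints: 12

# Enable unaligned checkpoints.
execution.checkpointing.unaligned: true

# Set the checkpoint directory explicitly.
# Placeholder path: assumes EFS is mounted at /mnt/efs on every node.
state.checkpoints.dir: file:///mnt/efs/flink/checkpoints
```

These are cluster-wide defaults, so in a session cluster they would apply to every job that does not override them in its own code.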