I am working with an application that hasn't gone to production yet. We run
Flink as a cluster within a K8s environment. It has the following attributes:

1) 2 Job Managers configured for HA, backed by ZooKeeper and HDFS
2) 4 Task Managers
3) The state backend is RocksDB. The RocksDB files are written to a
locally attached NVMe drive.
4) We checkpoint every 15 seconds, with a minimum delay of 7.5 seconds.
5) There is currently very little load going through the system (it's in a
test environment). The web console shows no back pressure.
6) The cluster is running Flink 1.9.0
7) I don't see anything unexpected in the logs
8) Checkpoints take longer than 10 minutes despite very little state (<1 MB),
and they fail due to timeout
9) Eventually the job fails because it can't checkpoint.
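
For reference, the checkpoint settings from item 4 would look roughly like
the sketch below when set in the job code. This is an illustration, not a
copy of our actual job: the class name, job name, and placeholder pipeline
are hypothetical, and the 10-minute timeout is assumed from item 8 (it also
matches Flink's default checkpoint timeout).

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name; illustrates the checkpoint settings only.
public class CheckpointSettingsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Item 4: trigger a checkpoint every 15 seconds.
        env.enableCheckpointing(15_000L);

        CheckpointConfig cfg = env.getCheckpointConfig();
        // Item 4: wait at least 7.5 seconds between the end of one
        // checkpoint and the start of the next.
        cfg.setMinPauseBetweenCheckpoints(7_500L);
        // Assumption: 10-minute timeout, inferred from item 8.
        cfg.setCheckpointTimeout(600_000L);

        // Placeholder pipeline so the sketch is a complete job.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-settings-sketch");
    }
}
```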

What steps beyond what I have already done should I consider to debug this?

-Steve
