Hi Mu,

Is there anything that looks like `Received late message for now expired checkpoint attempt ${checkpointID} from ${taskExecutionID} of job ${jobID}` in the JM log?
If yes, that means this task took too long to complete the checkpoint (maybe it received the barrier too late, or maybe it spent too much time doing the checkpoint itself; you can investigate further in the TM log).

Best,
Congxian

On May 9, 2019, 14:44 +0800, Mu Kong <kong.mu....@gmail.com>, wrote:
> Hi community,
>
> I'm glad that Flink 1.8.0 introduced cleanupInRocksdbCompactFilter to support state cleanup for the RocksDB backend.
> We have an application that relies heavily on managed keyed state.
> Since we are using RocksDB as the state backend, we were suffering from ever-growing state size. To be more specific, our checkpoint size grew to 200GB in 2 weeks.
>
> After upgrading to 1.8.0 and using the cleanupInRocksdbCompactFilter TTL config, the checkpoint size never grows beyond 10GB.
> However, two days after the upgrade, checkpointing started to fail with "Checkpoint expired before completing".
>
> From the log, I could not get anything useful.
> But in the Flink UI, the last successful checkpoint took 1m to finish, and our checkpoint timeout is set to 15m.
> It seems that the checkpoint duration became extremely long all of a sudden.
>
> Is there any way I can look into this further? Or is there any direction for tuning the TTL for this application?
>
> Thanks in advance!
>
> Best regards,
> Mu
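For reference, the two settings discussed above (the 15m checkpoint timeout and the cleanupInRocksdbCompactFilter TTL cleanup) look roughly like the sketch below. This is a minimal Java sketch against the Flink 1.8.0 API; the state name, TTL value, and checkpoint interval are illustrative assumptions, not values from this thread. Note also that in Flink 1.8 the RocksDB compaction filter must additionally be enabled globally via `state.backend.rocksdb.ttl.compaction.filter.enabled: true` in flink-conf.yaml.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TtlCheckpointSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every minute (illustrative interval), with the 15-minute
        // timeout mentioned in the thread; a checkpoint that takes longer than
        // this fails with "Checkpoint expired before completing".
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointTimeout(15 * 60 * 1000);

        // TTL config using the RocksDB compaction-filter cleanup strategy
        // introduced in Flink 1.8.0. The 7-day TTL is a hypothetical value.
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.days(7))
                .cleanupInRocksdbCompactFilter()
                .build();

        // Attach the TTL config to a (hypothetical) managed keyed state descriptor.
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("lastSeen", Long.class);
        descriptor.enableTimeToLive(ttlConfig);
    }
}
```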