Hi, Mu
Is there anything that looks like `Received late message for now expired 
checkpoint attempt ${checkpointID} from ${taskExecutionID} of job ${jobID}` in 
the JobManager (JM) log?

If yes, that means this task took too long to complete the checkpoint (maybe it 
received the barrier too late, or maybe it spent too much time doing the 
checkpoint itself); you can investigate further from the TM log.
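For reference, the cleanupInRocksdbCompactFilter setup you mentioned looks roughly like the sketch below (based on the Flink 1.8 API; the 7-day TTL and the state descriptor name are assumptions for illustration, not values from your job):

```java
// Sketch: enabling state TTL with RocksDB compaction-filter cleanup (Flink 1.8 API).
// The 7-day TTL and the "my-state" descriptor name are assumptions -- adjust to your job.
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.days(7))
    .cleanupInRocksdbCompactFilter()  // purge expired entries during RocksDB compactions
    .build();

ValueStateDescriptor<String> descriptor =
    new ValueStateDescriptor<>("my-state", String.class);
descriptor.enableTimeToLive(ttlConfig);
```

Note that in 1.8 the compaction filter must also be switched on globally via `state.backend.rocksdb.ttl.compaction.filter.enabled: true` in flink-conf.yaml, otherwise the cleanup strategy has no effect.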


Best
Congxian
On May 9, 2019, 14:44 +0800, Mu Kong <kong.mu....@gmail.com>, wrote:
> Hi community,
>
> I'm glad that Flink 1.8.0 introduced cleanupInRocksdbCompactFilter to 
> support state cleanup for the RocksDB backend.
> We have an application that heavily relies on managed keyed state.
> As we are using RocksDB as the state backend, we were suffering from an 
> ever-growing state size. To be more specific, our checkpoint size grew to 
> 200GB within 2 weeks.
>
> After upgrading to 1.8.0 and enabling the cleanupInRocksdbCompactFilter TTL 
> config, the checkpoint size never grew beyond 10GB.
> However, two days after the upgrade, checkpointing started to fail with 
> "Checkpoint expired before completing".
>
> From the log, I could not get anything useful.
> But in the Flink UI, the last successful checkpoint took 1m to finish, and 
> our checkpoint timeout is set to 15m.
> It seems that the checkpoint duration became extremely long all of a sudden.
>
> Is there any way I can look into this further? Or is there any direction in 
> which I can tune the TTL for the application?
>
> Thanks in advance!
>
> Best regards,
> Mu
