Hi Ravi, Consider moving to RocksDB state backend, where you can enable incremental checkpointing. This will make you checkpoints size stay pretty much constant even when your state becomes larger.
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/state_backends.html#the-rocksdbstatebackend Thanks, Rafi On Sat, Sep 7, 2019, 17:47 Ravi Bhushan Ratnakar < ravibhushanratna...@gmail.com> wrote: > Hi All, > > I am writing a streaming application using Flink 1.9. This application > consumes data from kinesis stream which is basically avro payload. > Application is using KeyedProcessFunction to execute business logic on the > basis of correlation id using event time characteristics with below > configuration -- > StateBackend - filesystem with S3 storage > registerTimeTimer duration for each key is - currentWatermark + 15 > seconds > checkpoint interval - 1min > minPauseBetweenCheckpointInterval - 1 min > checkpoint timeout - 10mins > > incoming data rate from kinesis - ~10 to 21GB/min > > Number of Task manager - 200 (r4.2xlarge -> 8cpu,61GB) > > First 2-4 checkpoints get completed within 1mins where the state size is > usually 50GB. As the state size grows beyond 50GB, then checkpointing time > starts taking more than 1mins and it increased till 10 mins and then > checkpoint fails. The moment the checkpoint starts taking more than 1 mins > to complete then application starts processing slow and start lagging in > output. > > Any suggestion to fine tune checkpoint performance would be highly > appreciated. > > Regards, > Ravi >