Ravi, have you looked at the io operation(iops) rate of the disk? You can monitoring the iops performance and tune it accordingly with your work load. This helped us in our project when we hit the wall tuning prototype much all the parameters.
Rohan ________________________________ From: Ravi Bhushan Ratnakar <ravibhushanratna...@gmail.com> Sent: Saturday, September 7, 2019 5:38 PM To: Rafi Aroch Cc: user Subject: Re: Checkpointing is not performing well Hi Rafi, Thank you for your quick response. I have tested with rocksdb state backend. Rocksdb required significantly more taskmanager to perform as compare to filesystem state backend. The problem here is that checkpoint process is not fast enough to complete. Our requirement is to do checkout as soon as possible like in 5 seconds to flush the output to output sink. As the incoming data rate is high, it is not able to complete quickly. If I increase the checkpoint duration, the state size grows much faster and hence takes much longer time to complete checkpointing. I also tried to use AT LEAST ONCE mode, but does not improve much. Adding more taskmanager to increase parallelism also does not improve the checkpointing performance. Is it possible to achieve checkpointing as short as 5 seconds with such high input volume? Regards, Ravi On Sat 7 Sep, 2019, 22:25 Rafi Aroch, <rafi.ar...@gmail.com<mailto:rafi.ar...@gmail.com>> wrote: Hi Ravi, Consider moving to RocksDB state backend, where you can enable incremental checkpointing. This will make you checkpoints size stay pretty much constant even when your state becomes larger. https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/state_backends.html#the-rocksdbstatebackend Thanks, Rafi On Sat, Sep 7, 2019, 17:47 Ravi Bhushan Ratnakar <ravibhushanratna...@gmail.com<mailto:ravibhushanratna...@gmail.com>> wrote: Hi All, I am writing a streaming application using Flink 1.9. This application consumes data from kinesis stream which is basically avro payload. Application is using KeyedProcessFunction to execute business logic on the basis of correlation id using event time characteristics with below configuration -- StateBackend - filesystem with S3 storage registerTimeTimer duration for each key is - currentWatermark + 15 seconds checkpoint interval - 1min minPauseBetweenCheckpointInterval - 1 min checkpoint timeout - 10mins incoming data rate from kinesis - ~10 to 21GB/min Number of Task manager - 200 (r4.2xlarge -> 8cpu,61GB) First 2-4 checkpoints get completed within 1mins where the state size is usually 50GB. As the state size grows beyond 50GB, then checkpointing time starts taking more than 1mins and it increased till 10 mins and then checkpoint fails. The moment the checkpoint starts taking more than 1 mins to complete then application starts processing slow and start lagging in output. Any suggestion to fine tune checkpoint performance would be highly appreciated. Regards, Ravi