Ravi, have you looked at the io operation(iops) rate of the disk? You can 
monitoring the iops performance and tune it accordingly with your work load. 
This helped us in our project when we hit the wall tuning prototype much all 
the parameters.

Rohan


________________________________
From: Ravi Bhushan Ratnakar <ravibhushanratna...@gmail.com>
Sent: Saturday, September 7, 2019 5:38 PM
To: Rafi Aroch
Cc: user
Subject: Re: Checkpointing is not performing well

Hi Rafi,

Thank you for your quick response.

I have tested with rocksdb state backend. Rocksdb required significantly more 
taskmanager to perform as compare to filesystem state backend. The problem here 
is that checkpoint process is not fast enough to complete.

Our requirement is to do checkout as soon as possible like in 5 seconds to 
flush the output to output sink. As the incoming data rate is high, it is not 
able to complete quickly. If I increase the checkpoint duration, the state size 
grows much faster and hence takes much longer time to complete checkpointing. I 
also tried to use AT LEAST ONCE mode, but does not improve much. Adding more 
taskmanager to increase parallelism also does not improve the checkpointing 
performance.

Is it possible to achieve checkpointing as short as 5 seconds with such high 
input volume?

Regards,
Ravi

On Sat 7 Sep, 2019, 22:25 Rafi Aroch, 
<rafi.ar...@gmail.com<mailto:rafi.ar...@gmail.com>> wrote:
Hi Ravi,

Consider moving to RocksDB state backend, where you can enable incremental 
checkpointing. This will make you checkpoints size stay pretty much constant 
even when your state becomes larger.

https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/state_backends.html#the-rocksdbstatebackend


Thanks,
Rafi

On Sat, Sep 7, 2019, 17:47 Ravi Bhushan Ratnakar 
<ravibhushanratna...@gmail.com<mailto:ravibhushanratna...@gmail.com>> wrote:
Hi All,

I am writing a streaming application using Flink 1.9. This application consumes 
data from kinesis stream which is basically avro payload. Application is using 
KeyedProcessFunction to execute business logic on the basis of correlation id 
using event time characteristics with below configuration --
StateBackend - filesystem with S3 storage
registerTimeTimer duration for each key is  -  currentWatermark  + 15 seconds
checkpoint interval - 1min
minPauseBetweenCheckpointInterval - 1 min
checkpoint timeout - 10mins

incoming data rate from kinesis -  ~10 to 21GB/min

Number of Task manager - 200 (r4.2xlarge -> 8cpu,61GB)

First 2-4 checkpoints get completed within 1mins where the state size is 
usually 50GB. As the state size grows beyond 50GB, then checkpointing time 
starts taking more than 1mins and it increased till 10 mins and then checkpoint 
fails. The moment the checkpoint starts taking more than 1 mins to complete 
then application starts processing slow and start lagging in output.

Any suggestion to fine tune checkpoint performance would be highly appreciated.

Regards,
Ravi

Reply via email to