Hi All,

I am writing a streaming application using Flink 1.9. This application
consumes data from kinesis stream which is basically avro payload.
Application is using KeyedProcessFunction to execute business logic on the
basis of correlation id using event time characteristics with below
configuration --
StateBackend - filesystem with S3 storage
registerTimeTimer duration for each key is  -  currentWatermark  + 15
seconds
checkpoint interval - 1min
minPauseBetweenCheckpointInterval - 1 min
checkpoint timeout - 10mins

incoming data rate from kinesis -  ~10 to 21GB/min

Number of Task manager - 200 (r4.2xlarge -> 8cpu,61GB)

First 2-4 checkpoints get completed within 1mins where the state size is
usually 50GB. As the state size grows beyond 50GB, then checkpointing time
starts taking more than 1mins and it increased till 10 mins and then
checkpoint fails. The moment the checkpoint starts taking more than 1 mins
to complete then application starts processing slow and start lagging in
output.

Any suggestion to fine tune checkpoint performance would be highly
appreciated.

Regards,
Ravi

Reply via email to