What is the upper limit on checkpoint size in Flink?

Regards,
Ravi
On Wed 11 Sep, 2019, 06:48 Vijay Bhaskar, <bhaskar.eba...@gmail.com> wrote:

> You have exceeded the upper limits of Flink's checkpointing system by a
> wide margin. Try to distribute events evenly over time by adding some form
> of controlled back pressure after receiving data from the Kinesis streams;
> otherwise the spike arriving within that 5-second window will always cause
> problems, and tomorrow it may double. The best solution in your case is to
> deliver messages at a configurable, constant rate after receiving them
> from the Kinesis streams; otherwise I am sure this will remain a problem
> whatever streaming engine you use. Tune your configuration to find the
> optimal rate so that the Flink checkpoint state stays healthy.
>
> Regards
> Bhaskar
>
> On Tue, Sep 10, 2019 at 11:16 PM Ravi Bhushan Ratnakar
> <ravibhushanratna...@gmail.com> wrote:
>
>> @Rohan - I am streaming data to a Kafka sink after applying the business
>> logic. For checkpoints, I am using S3 as the distributed file system. For
>> local recovery, I am using optimized-IOPS EBS volumes.
>>
>> @Vijay - I forgot to mention that the incoming data volume is ~10 to
>> 21 GB per minute of lz4-compressed Avro messages. Generally 90% of
>> correlated events arrive within 5 seconds, and 10% of correlated events
>> extend out to 65 minutes. Due to this business requirement, the state
>> size keeps growing for 65 minutes, after which it becomes more or less
>> stable. As the state size grows to around 350 GB at peak load, a
>> checkpoint can no longer complete within 1 minute. I want to checkpoint
>> as frequently as possible, ideally every 5 seconds.
>>
>> Thanks,
>> Ravi
>>
>> On Tue 10 Sep, 2019, 11:37 Vijay Bhaskar, <bhaskar.eba...@gmail.com> wrote:
>>
>>> The task count seems very high for the mentioned resource count. To
>>> rule out an issue with the state backend, can you start writing sink
>>> data to a no-operation sink, i.e., a sink that simply ignores the data,
>>> and try whether the job runs for a longer duration without any issue?
>>> Then start decreasing the task manager count until you find a decent
>>> count that has no side effects. Use that value as the task manager count
>>> and then add your state backend back in, trying RocksDB first. With a
>>> reduced task manager count you might get good results.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Sun, Sep 8, 2019 at 10:15 AM Rohan Thimmappa
>>> <rohan.thimma...@gmail.com> wrote:
>>>
>>>> Ravi, have you looked at the I/O operations (IOPS) rate of the disk?
>>>> You can monitor IOPS performance and tune it according to your
>>>> workload. This helped us in our project when we hit a wall after tuning
>>>> pretty much all the parameters.
>>>>
>>>> Rohan
>>>>
>>>> ------------------------------
>>>> *From:* Ravi Bhushan Ratnakar <ravibhushanratna...@gmail.com>
>>>> *Sent:* Saturday, September 7, 2019 5:38 PM
>>>> *To:* Rafi Aroch
>>>> *Cc:* user
>>>> *Subject:* Re: Checkpointing is not performing well
>>>>
>>>> Hi Rafi,
>>>>
>>>> Thank you for your quick response.
>>>>
>>>> I have tested with the RocksDB state backend. RocksDB required
>>>> significantly more task managers to perform comparably to the
>>>> filesystem state backend. The problem here is that the checkpoint
>>>> process is not fast enough to complete.
>>>>
>>>> Our requirement is to checkpoint as quickly as possible, ideally every
>>>> 5 seconds, to flush the output to the sink. As the incoming data rate
>>>> is high, checkpoints cannot complete that quickly. If I increase the
>>>> checkpoint interval, the state size grows much faster and hence
>>>> checkpointing takes even longer to complete. I also tried AT_LEAST_ONCE
>>>> mode, but it does not improve much. Adding more task managers to
>>>> increase parallelism also does not improve checkpointing performance.
>>>>
>>>> Is it possible to achieve checkpoint intervals as short as 5 seconds
>>>> with such a high input volume?
>>>>
>>>> Regards,
>>>> Ravi
>>>>
>>>> On Sat 7 Sep, 2019, 22:25 Rafi Aroch, <rafi.ar...@gmail.com> wrote:
>>>>
>>>>> Hi Ravi,
>>>>>
>>>>> Consider moving to the RocksDB state backend, where you can enable
>>>>> incremental checkpointing. This will keep your checkpoint size pretty
>>>>> much constant even as your state grows larger.
>>>>>
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/state_backends.html#the-rocksdbstatebackend
>>>>>
>>>>> Thanks,
>>>>> Rafi
>>>>>
>>>>> On Sat, Sep 7, 2019, 17:47 Ravi Bhushan Ratnakar
>>>>> <ravibhushanratna...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am writing a streaming application using Flink 1.9. The
>>>>>> application consumes data from a Kinesis stream, which is essentially
>>>>>> Avro payloads. The application uses a KeyedProcessFunction to execute
>>>>>> business logic keyed by correlation id, with event-time
>>>>>> characteristics and the configuration below:
>>>>>>
>>>>>> state backend - filesystem with S3 storage
>>>>>> registered event-time timer per key - currentWatermark + 15 seconds
>>>>>> checkpoint interval - 1 min
>>>>>> minPauseBetweenCheckpoints - 1 min
>>>>>> checkpoint timeout - 10 mins
>>>>>>
>>>>>> incoming data rate from Kinesis - ~10 to 21 GB/min
>>>>>> number of task managers - 200 (r4.2xlarge -> 8 vCPU, 61 GB)
>>>>>>
>>>>>> The first 2-4 checkpoints complete within 1 minute, while the state
>>>>>> size is around 50 GB. As the state size grows beyond 50 GB,
>>>>>> checkpointing starts taking more than 1 minute, the duration
>>>>>> increases up to 10 minutes, and then checkpoints fail. The moment a
>>>>>> checkpoint takes more than 1 minute to complete, the application
>>>>>> starts processing slowly and lagging in its output.
>>>>>>
>>>>>> Any suggestions for fine-tuning checkpoint performance would be
>>>>>> highly appreciated.
>>>>>>
>>>>>> Regards,
>>>>>> Ravi
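To put the numbers in this thread in perspective, a quick back-of-envelope calculation (plain Java; the ~350 GB state size, 200 task managers, and 1-minute interval come from the messages above, and the calculation assumes state is spread evenly across task managers) shows why a 5-second full checkpoint to S3 is so demanding:

```java
// Back-of-envelope checkpoint bandwidth estimate.
// Figures are taken from the thread; even distribution of state
// across task managers is an assumption.
class CheckpointMath {

    /** Sustained upload rate per task manager (MB/s) needed to finish
     *  a full checkpoint of stateGb within intervalSeconds. */
    static double mbPerSecondPerTaskManager(double stateGb, int taskManagers,
                                            double intervalSeconds) {
        double mbPerTaskManager = stateGb * 1024.0 / taskManagers;
        return mbPerTaskManager / intervalSeconds;
    }

    public static void main(String[] args) {
        // ~350 GB state, 200 task managers, 60 s interval:
        // roughly 30 MB/s of sustained S3 upload per task manager.
        System.out.println(mbPerSecondPerTaskManager(350, 200, 60));
        // The same state at a 5 s interval: roughly 360 MB/s per task
        // manager, which is why a full snapshot every 5 seconds cannot
        // keep up and incremental checkpoints become attractive.
        System.out.println(mbPerSecondPerTaskManager(350, 200, 5));
    }
}
```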
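Rafi's suggestion of RocksDB with incremental checkpointing can also be switched on entirely in flink-conf.yaml (Flink 1.9); the bucket path below is a placeholder, and local recovery is included since the thread mentions EBS-backed local recovery:

```yaml
# State backend with incremental checkpoints: only changed SST files
# are uploaded, so checkpoint size no longer tracks total state size.
state.backend: rocksdb
state.backend.incremental: true

# Placeholder bucket/path - substitute your own.
state.checkpoints.dir: s3://my-bucket/flink-checkpoints

# Keep a local copy of state on the EBS volume for faster recovery.
state.backend.local-recovery: true
```

The same settings can be made in code via RocksDBStateBackend, but the configuration file keeps the job jar unchanged while experimenting.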
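The "configurable constant rate" delivery Bhaskar recommends can be implemented as a token bucket applied once per record right after the Kinesis source (for example, inside a map function). The sketch below is plain Java with no Flink dependency; the class name and parameters are illustrative, not from the thread:

```java
import java.util.concurrent.locks.LockSupport;

// A minimal token-bucket rate limiter, sketched in plain Java.
// A job could call acquire() once per record after the Kinesis source
// to smooth spikes into a constant delivery rate.
class TokenBucket {
    private final long capacity;        // maximum burst, in records
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double recordsPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = recordsPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    /** Blocks the calling thread until one token is available.
     *  Assumes a single consuming thread (one source subtask). */
    void acquire() {
        while (true) {
            refill();
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return;
            }
            // Park roughly until the next token is due.
            long waitNanos = (long) Math.ceil((1.0 - tokens) / refillPerNano);
            LockSupport.parkNanos(Math.max(waitNanos, 100_000));
        }
    }

    private void refill() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
    }
}
```

A real deployment would make the rate configurable and, as Bhaskar suggests, tune it until checkpoints stay healthy; Guava's RateLimiter provides an off-the-shelf equivalent of this pattern.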