I meant the upper limit w.r.t. the resources you are using. Even if you increase resources, spiking data is always a problem that you need to handle anyway. The best option is to add more back pressure at the source.
Regards
Bhaskar

On Wed, Sep 11, 2019 at 1:43 PM Fabian Hueske <fhue...@gmail.com> wrote:

> Hi,
>
> There is no upper limit for state size in Flink. There are applications
> with 10+ TB of state.
> However, it is natural that checkpointing time increases with state size,
> as more data needs to be serialized (in the case of the FsStateBackend)
> and written to stable storage.
> (The same is, by the way, true for recovery, when the state needs to be
> loaded back.)
>
> There are a few tricks to reduce checkpointing time, like using
> incremental checkpoints, which you have tried already.
> You can also scale out the application to use more machines, and
> therefore more bandwidth + CPU (for serialization), during checkpoints.
>
> Fabian
>
> On Wed, Sep 11, 2019 at 09:38, Ravi Bhushan Ratnakar <
> ravibhushanratna...@gmail.com> wrote:
>
>> What is the upper limit of checkpoint size in Flink?
>>
>> Regards,
>> Ravi
>>
>> On Wed 11 Sep, 2019, 06:48 Vijay Bhaskar, <bhaskar.eba...@gmail.com>
>> wrote:
>>
>>> You have crossed the upper limits of Flink's checkpointing system by a
>>> wide margin. Try to distribute events evenly over time by adding some
>>> sort of controlled back pressure after receiving data from the Kinesis
>>> streams. Otherwise the spike arriving within a 5-second window will
>>> always create problems, and tomorrow it may double. So the best
>>> solution in your case is to deliver at a configurable constant rate
>>> after receiving messages from the Kinesis streams. Otherwise I am sure
>>> it will always be a problem, whatever kind of streaming engine you use.
>>> Tune your configuration to get the optimal rate so that the Flink
>>> checkpoint state stays healthy.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Sep 10, 2019 at 11:16 PM Ravi Bhushan Ratnakar <
>>> ravibhushanratna...@gmail.com> wrote:
>>>
>>>> @Rohan - I am streaming data to a Kafka sink after applying the
>>>> business logic. For checkpoints, I am using S3 as the distributed
>>>> file system. For local recovery, I am using optimized-IOPS EBS
>>>> volumes.
>>>>
>>>> @Vijay - I forgot to mention that the incoming data volume is ~10 to
>>>> 21 GB per minute of compressed (lz4) Avro messages. Generally, 90% of
>>>> correlated events arrive within 5 seconds, and 10% of the correlated
>>>> events extend up to 65 minutes. Due to this business requirement, the
>>>> state size keeps growing for 65 minutes, after which it becomes more
>>>> or less stable. As the state size grows, reaching around 350 GB at
>>>> peak load, the checkpoint cannot complete within 1 minute. I want to
>>>> checkpoint as quickly as possible, like every 5 seconds.
>>>>
>>>> Thanks,
>>>> Ravi
>>>>
>>>>
>>>> On Tue 10 Sep, 2019, 11:37 Vijay Bhaskar, <bhaskar.eba...@gmail.com>
>>>> wrote:
>>>>
>>>>> The task count seems huge for the mentioned resource count. To rule
>>>>> out an issue with the state backend, can you start writing the sink
>>>>> data to a no-operation sink, i.e., a sink that ignores the data, and
>>>>> try whether you can run the job for a longer duration without any
>>>>> issue? You can then start decreasing the task manager count until
>>>>> you find a decent count without any side effects. Use that value as
>>>>> the task manager count and then start adding your state backend
>>>>> back; first you can try with RocksDB. With a reduced task manager
>>>>> count you might get good results.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Sun, Sep 8, 2019 at 10:15 AM Rohan Thimmappa <
>>>>> rohan.thimma...@gmail.com> wrote:
>>>>>
>>>>>> Ravi, have you looked at the I/O operation (IOPS) rate of the disk?
>>>>>> You can monitor the IOPS performance and tune it according to your
>>>>>> workload. This helped us in our project when we hit the wall tuning
>>>>>> pretty much all the parameters.
>>>>>>
>>>>>> Rohan
>>>>>>
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Ravi Bhushan Ratnakar <ravibhushanratna...@gmail.com>
>>>>>> *Sent:* Saturday, September 7, 2019 5:38 PM
>>>>>> *To:* Rafi Aroch
>>>>>> *Cc:* user
>>>>>> *Subject:* Re: Checkpointing is not performing well
>>>>>>
>>>>>> Hi Rafi,
>>>>>>
>>>>>> Thank you for your quick response.
>>>>>>
>>>>>> I have tested with the RocksDB state backend. RocksDB required
>>>>>> significantly more task managers to perform comparably to the
>>>>>> filesystem state backend. The problem here is that the checkpointing
>>>>>> process is not fast enough to complete.
>>>>>>
>>>>>> Our requirement is to checkpoint as quickly as possible, like every
>>>>>> 5 seconds, to flush the output to the output sink. As the incoming
>>>>>> data rate is high, checkpointing cannot complete quickly. If I
>>>>>> increase the checkpoint interval, the state size grows much faster
>>>>>> and hence checkpointing takes much longer to complete. I also tried
>>>>>> AT LEAST ONCE mode, but it does not improve things much. Adding more
>>>>>> task managers to increase parallelism also does not improve the
>>>>>> checkpointing performance.
>>>>>>
>>>>>> Is it possible to achieve checkpointing as short as 5 seconds with
>>>>>> such a high input volume?
>>>>>>
>>>>>> Regards,
>>>>>> Ravi
>>>>>>
>>>>>> On Sat 7 Sep, 2019, 22:25 Rafi Aroch, <rafi.ar...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Ravi,
>>>>>>>
>>>>>>> Consider moving to the RocksDB state backend, where you can enable
>>>>>>> incremental checkpointing. This will keep your checkpoint size
>>>>>>> pretty much constant even as your state becomes larger.
>>>>>>>
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/state_backends.html#the-rocksdbstatebackend
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rafi
>>>>>>>
>>>>>>> On Sat, Sep 7, 2019, 17:47 Ravi Bhushan Ratnakar <
>>>>>>> ravibhushanratna...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am writing a streaming application using Flink 1.9. The
>>>>>>>> application consumes data from a Kinesis stream, which is
>>>>>>>> basically Avro payloads. The application uses a
>>>>>>>> KeyedProcessFunction to execute business logic on the basis of a
>>>>>>>> correlation id, using event-time characteristics, with the
>>>>>>>> configuration below:
>>>>>>>>
>>>>>>>> state backend - filesystem with S3 storage
>>>>>>>> registered event-time timer duration for each key -
>>>>>>>> currentWatermark + 15 seconds
>>>>>>>> checkpoint interval - 1 min
>>>>>>>> minPauseBetweenCheckpoints - 1 min
>>>>>>>> checkpoint timeout - 10 mins
>>>>>>>>
>>>>>>>> incoming data rate from Kinesis - ~10 to 21 GB/min
>>>>>>>>
>>>>>>>> number of task managers - 200 (r4.2xlarge -> 8 CPU, 61 GB)
>>>>>>>>
>>>>>>>> The first 2-4 checkpoints complete within 1 min, while the state
>>>>>>>> size is around 50 GB. As the state size grows beyond 50 GB,
>>>>>>>> checkpointing starts taking more than 1 min, increases up to 10
>>>>>>>> mins, and then the checkpoint fails. The moment a checkpoint
>>>>>>>> starts taking more than 1 min to complete, the application starts
>>>>>>>> processing slowly and begins lagging in its output.
>>>>>>>>
>>>>>>>> Any suggestion to fine-tune checkpoint performance would be
>>>>>>>> highly appreciated.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Ravi
>>>>>>>>
>>>>>>>
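Bhaskar's suggestion at the top of the thread, delivering records at a configurable constant rate after reading from Kinesis, can be sketched with a simple token bucket. This is an illustrative sketch only; the class and method names are made up (not a Flink API), and in a real job the check would sit in a throttling operator in front of the keyed stage:

```java
// Minimal token-bucket throttle: at most `refillPerSecond` records per
// second pass through, with bursts up to `capacity`. Hypothetical names.
public class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, long refillPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;           // start full
        this.lastRefill = System.nanoTime();
    }

    /** Returns true if a record may pass now; callers buffer or block otherwise. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // refill proportionally to elapsed time, capped at capacity
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Records rejected by `tryAcquire` would be buffered and retried, which is what smooths the 5-second spikes into a constant rate downstream.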
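Enabling the incremental checkpoints that Fabian and Rafi mention looks roughly like this in the Flink 1.9 API; the S3 path is a placeholder, and the second constructor argument is what turns on incremental checkpointing:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// true = incremental checkpoints: only changed SST files are uploaded,
// so checkpoint size no longer grows linearly with total state size
env.setStateBackend(new RocksDBStateBackend("s3://<bucket>/checkpoints", true));
```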
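A back-of-envelope calculation (assuming a full, non-incremental snapshot) shows why a 350 GB state struggles to checkpoint in 1 minute, let alone 5 seconds:

```java
public class CheckpointMath {
    /** MB/s each task manager must sustain to snapshot stateGb within intervalSec. */
    public static double perTaskManagerMBps(double stateGb, double intervalSec, int taskManagers) {
        double aggregateMBps = stateGb * 1024.0 / intervalSec; // total write rate
        return aggregateMBps / taskManagers;
    }

    public static void main(String[] args) {
        // 350 GB state, 60 s interval, 200 task managers -> ~29.9 MB/s per TM;
        // at a 5 s interval the same state would need ~358 MB/s per TM.
        System.out.printf("%.1f MB/s%n", perTaskManagerMBps(350, 60, 200));
        System.out.printf("%.1f MB/s%n", perTaskManagerMBps(350, 5, 200));
    }
}
```

These figures ignore serialization CPU and S3 request overhead, so real requirements are higher; they only indicate the order of magnitude and why incremental checkpoints or more machines are needed.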
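Vijay's no-operation-sink experiment can use Flink's built-in `DiscardingSink`, which drops every record; swapping it in for the Kafka sink isolates checkpointing behavior from sink back pressure. This fragment assumes a hypothetical `output` stream, standing in for the pipeline's actual result:

```java
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

// output is the DataStream produced by the KeyedProcessFunction;
// DiscardingSink ignores all records, ruling the sink out as a bottleneck.
output.addSink(new DiscardingSink<>());
```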
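Ravi's checkpoint settings, plus the AT_LEAST_ONCE mode he tried, map to roughly the following configuration (values taken from the thread, Flink 1.9 API; the S3 path is a placeholder):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new FsStateBackend("s3://<bucket>/checkpoints"));

// AT_LEAST_ONCE skips barrier alignment, which can shorten checkpoint time
env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);

CheckpointConfig cfg = env.getCheckpointConfig();
cfg.setMinPauseBetweenCheckpoints(60_000); // 1 min between checkpoints
cfg.setCheckpointTimeout(600_000);         // fail a checkpoint after 10 min
```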