Hi Stephan,

To verify whether S3 is making the pipeline stall, I replaced the S3 sink with HDFS and set the minimum pause between checkpoints to 5 minutes. I still see the same issue, with checkpoints failing.
If I keep the pause time at 20 seconds, all checkpoints complete; however, overall throughput takes a hit.

Regards,
Vinay Patil

On Fri, Feb 24, 2017 at 10:09 PM, Stephan Ewen [via Apache Flink User Mailing List archive.] <ml-node+s2336050n11891...@n4.nabble.com> wrote:

> Flink's state backends currently do a good number of "make sure this exists" operations on the file systems. Through Hadoop's S3 filesystem, that translates to S3 bucket list operations, where there is a limit on how many operations may happen per time interval. After that, S3 blocks.
>
> It seems that operations that are totally cheap on HDFS are hellishly expensive (and limited) on S3. It may be that you are affected by that.
>
> We are gradually trying to improve the behavior there and be more S3 aware.
>
> Both 1.3-SNAPSHOT and 1.2-SNAPSHOT already contain improvements there.
>
> Best,
> Stephan
>
> On Fri, Feb 24, 2017 at 4:42 PM, vinay patil <[hidden email]> wrote:
>
>> Hi Stephan,
>>
>> So do you mean that S3 is causing the stall? As I mentioned in my previous mail, I could not see any progress for 16 minutes because checkpoints were failing continuously.
>>
>> On Feb 24, 2017 8:30 PM, "Stephan Ewen [via Apache Flink User Mailing List archive.]" <[hidden email]> wrote:
>>
>>> Hi Vinay!
>>>
>>> True, the operator state (like Kafka) is currently not asynchronously checkpointed.
>>>
>>> While it is rather small state, we have seen before that on S3 it can cause trouble, because S3 frequently stalls uploads of even data amounts as low as kilobytes due to its throttling policies.
>>>
>>> That would be a super important fix to add!
>>>
>>> Best,
>>> Stephan
>>>
>>> On Fri, Feb 24, 2017 at 2:58 PM, vinay patil <[hidden email]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have attached a snapshot for reference:
>>>> As you can see, all 3 checkpoints failed. Checkpoint IDs 2 and 3 are stuck at the Kafka source after 50%.
>>>> (So far, Kafka source 1 has sent 65 GB and source 2 has sent 15 GB.)
>>>>
>>>> Within 10 minutes, 15M records were processed, and for the next 16 minutes the pipeline was stuck. I don't see any progress beyond 15M because checkpoints keep failing.
>>>>
>>>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11882/Checkpointing_Failed.png>
>>>>
>>>> --
>>>> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11882.html
>>>> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11901.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.