Hi Stephan,

Sure, I will do that the next time I observe the issue.

Regards,
Vinay Patil

On Thu, Jul 13, 2017 at 8:09 PM, Stephan Ewen <se...@apache.org> wrote:

> Is there any way you can pull a thread dump from the TMs at the point when
> that happens?
>
> On Wed, Jul 12, 2017 at 8:50 PM, vinay patil <vinay18.pa...@gmail.com>
> wrote:
>
>> Hi Gyula,
>>
>> I have observed a similar issue with FlinkKafkaConsumer09 and 010 and
>> posted it to the mailing list as well. The issue does not occur
>> consistently, but whenever it does, checkpoints fail or take a long
>> time to complete.
>>
>> Regards,
>> Vinay Patil
>>
>> On Wed, Jul 12, 2017 at 7:00 PM, Gyula Fóra [via Apache Flink User
>> Mailing List archive.] <[hidden email]> wrote:
>>
>>> I have added logging that will help determine this as well; next time
>>> this happens I will post the results. (Although there doesn't seem to
>>> be high backpressure.)
>>>
>>> Thanks for the tips,
>>> Gyula
>>>
>>> On Wed, Jul 12, 2017 at 15:27, Stephan Ewen <[hidden email]> wrote:
>>>
>>>> Can it be that the checkpoint thread is waiting to grab the lock, which
>>>> is held by the chain under backpressure?
>>>>
>>>> On Wed, Jul 12, 2017 at 12:23 PM, Gyula Fóra <[hidden email]> wrote:
>>>>
>>>>> Yes, that's definitely what I am about to do next; I just thought
>>>>> maybe someone had seen this before.
>>>>>
>>>>> I will post info next time it happens. (It is not guaranteed to happen
>>>>> soon, as it had not happened for a long time before.)
>>>>>
>>>>> Gyula
>>>>>
>>>>> On Wed, Jul 12, 2017, 12:13 Stefan Richter <[hidden email]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Could you add some logging to figure out which method call
>>>>>> introduces the delay?
>>>>>>
>>>>>> Best,
>>>>>> Stefan
>>>>>>
>>>>>> On 12.07.2017 at 11:37, Gyula Fóra <[hidden email]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We are using the latest 1.3.1.
>>>>>>
>>>>>> Gyula
>>>>>>
>>>>>> On Wed, Jul 12, 2017 at 10:44, Urs Schoenenberger <[hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Gyula,
>>>>>>>
>>>>>>> I don't know the cause, unfortunately, but we observed a similar
>>>>>>> issue on Flink 1.1.3. The problem seems to be gone after upgrading
>>>>>>> to 1.2.1. Which version are you running on?
>>>>>>>
>>>>>>> Urs
>>>>>>>
>>>>>>> On 12.07.2017 09:48, Gyula Fóra wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > I have noticed a strange behavior in one of our jobs: every once in
>>>>>>> > a while the Kafka source checkpointing time becomes extremely large
>>>>>>> > compared to what it usually is. (To be very specific, it is a Kafka
>>>>>>> > source chained with a stateless map operator.)
>>>>>>> >
>>>>>>> > To be more specific, checkpointing the offsets usually takes around
>>>>>>> > 10 ms, which sounds reasonable, but in some checkpoints this goes
>>>>>>> > into the 3-5 minute range, practically blocking the job for that
>>>>>>> > period of time. Yesterday I even observed 10-minute delays. First I
>>>>>>> > thought that some sources might trigger checkpoints later than
>>>>>>> > others, but after adding some logging and comparing, it seems that
>>>>>>> > the triggerCheckpoint was received at the same time.
>>>>>>> >
>>>>>>> > Interestingly, only one of the 3 Kafka sources in the job seems to
>>>>>>> > be affected (last time I checked, at least). We are still using the
>>>>>>> > 0.8 consumer with commit on checkpoints. Also, I don't see this
>>>>>>> > happen in other jobs.
>>>>>>> >
>>>>>>> > Any clue on what might cause this?
>>>>>>> >
>>>>>>> > Thanks :)
>>>>>>> > Gyula
>>>>>>>
>>>>>>> --
>>>>>>> Urs Schönenberger - [hidden email]
>>>>>>>
>>>>>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>>>>>> Managing Directors: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>>>>>> Registered office: Unterföhring * Amtsgericht München * HRB 135082
>>
>> View this message in context: Re: Why would a kafka source checkpoint
>> take so long?
>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-would-a-kafka-source-checkpoint-take-so-long-tp14193p14232.html>
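
For reference, Stephan's hypothesis in the quoted thread (the checkpoint thread waiting for a lock held by the chain under backpressure) can be reproduced in isolation. The following is a minimal plain-Java sketch, not Flink code: the class name, the method name, and the Thread.sleep that stands in for a blocked emit call are all assumptions made for illustration. It only shows that a thread which needs the same monitor as a stalled emit loop is delayed for as long as that loop holds the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; none of these names come from Flink.
public class CheckpointLockDemo {

    /**
     * Starts a "source" thread that holds a shared lock for holdMillis
     * (a stand-in for an emit call blocked by backpressure) and returns
     * how many milliseconds the calling "checkpoint" thread had to wait
     * before it could acquire the same lock.
     */
    public static long measureLockWaitMillis(long holdMillis) {
        final Object checkpointLock = new Object();
        final CountDownLatch lockHeld = new CountDownLatch(1);

        Thread sourceThread = new Thread(() -> {
            synchronized (checkpointLock) {
                lockHeld.countDown(); // signal that the lock is now taken
                try {
                    Thread.sleep(holdMillis); // assumed: emit stalls here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "source-emit-loop");
        sourceThread.start();

        try {
            lockHeld.await(); // make sure the source owns the lock first
            long start = System.nanoTime();
            synchronized (checkpointLock) {
                // The offset snapshot itself would be cheap; the cost is
                // the time spent waiting to enter this block.
            }
            long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            sourceThread.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        long waited = measureLockWaitMillis(500);
        System.out.println("checkpoint thread waited ~" + waited + " ms for the lock");
    }
}
```

If this is what happens in the job, a thread dump taken during a slow checkpoint would be expected to show one thread in state BLOCKED on the lock object and the emitting thread as its owner, which is why the thread dump requested above is the most direct way to confirm or rule it out.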