Is there any way you can pull a thread dump from the TMs at the point when that happens?
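(For reference, running jstack <TaskManager pid> on the TM host while the checkpoint is stuck is the usual way to capture one. If it is easier to trigger from inside the JVM, a rough, Flink-agnostic sketch using only the standard Java management API could look like the following; the class name is just an example, nothing here is a Flink API.)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {

    /** Prints name, state, lock info and full stack trace of every live thread in this JVM. */
    public static void printThreadDump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Request monitor/synchronizer info as well, so the dump shows which thread
        // holds a contended lock (e.g. the checkpoint lock mentioned further down in this thread).
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
            if (info.getLockName() != null) {
                System.out.println("    waiting on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        printThreadDump();
    }
}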
On Wed, Jul 12, 2017 at 8:50 PM, vinay patil <vinay18.pa...@gmail.com> wrote:

> Hi Gyula,
>
> I have observed a similar issue with FlinkConsumer09 and 010 and posted it
> to the mailing list as well. The issue is not consistent, but whenever it
> happens it leads to checkpoints failing or taking a long time to complete.
>
> Regards,
> Vinay Patil
>
> On Wed, Jul 12, 2017 at 7:00 PM, Gyula Fóra [via Apache Flink User Mailing
> List archive.] <[hidden email]> wrote:
>
>> I have added logging that will help determine this as well; next time
>> this happens I will post the results. (Although there doesn't seem to be
>> high backpressure.)
>>
>> Thanks for the tips,
>> Gyula
>>
>> On Wed, Jul 12, 2017 at 15:27, Stephan Ewen <[hidden email]> wrote:
>>
>>> Can it be that the checkpoint thread is waiting to grab the lock, which
>>> is held by the chain under backpressure?
>>>
>>> On Wed, Jul 12, 2017 at 12:23 PM, Gyula Fóra <[hidden email]> wrote:
>>>
>>>> Yes, that's definitely what I am about to do next, but I thought maybe
>>>> someone has seen this before.
>>>>
>>>> Will post info next time it happens. (Not guaranteed to happen soon, as
>>>> it didn't happen for a long time before.)
>>>>
>>>> Gyula
>>>>
>>>> On Wed, Jul 12, 2017, 12:13 Stefan Richter <[hidden email]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> could you introduce some logging to figure out from which method call
>>>>> the delay is introduced?
>>>>>
>>>>> Best,
>>>>> Stefan
>>>>>
>>>>> On 12.07.2017 at 11:37, Gyula Fóra <[hidden email]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are using the latest 1.3.1.
>>>>>
>>>>> Gyula
>>>>>
>>>>> On Wed, Jul 12, 2017 at 10:44, Urs Schoenenberger <[hidden email]> wrote:
>>>>>
>>>>>> Hi Gyula,
>>>>>>
>>>>>> I don't know the cause unfortunately, but we observed a similar issue
>>>>>> on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1.
>>>>>> Which version are you running on?
>>>>>>
>>>>>> Urs
>>>>>>
>>>>>> On 12.07.2017 09:48, Gyula Fóra wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > I have noticed a strange behavior in one of our jobs: every once in a
>>>>>> > while the Kafka source checkpointing time becomes extremely large
>>>>>> > compared to what it usually is. (To be very specific, it is a Kafka
>>>>>> > source chained with a stateless map operator.)
>>>>>> >
>>>>>> > To be more specific, checkpointing the offsets usually takes around
>>>>>> > 10 ms, which sounds reasonable, but in some checkpoints this goes into
>>>>>> > the 3-5 minute range, practically blocking the job for that period of
>>>>>> > time. Yesterday I even observed 10-minute delays. First I thought that
>>>>>> > some sources might trigger checkpoints later than others, but after
>>>>>> > adding some logging and comparing, it seems that triggerCheckpoint was
>>>>>> > received at the same time.
>>>>>> >
>>>>>> > Interestingly, only one of the 3 Kafka sources in the job seems to be
>>>>>> > affected (last time I checked at least). We are still using the 0.8
>>>>>> > consumer with commit on checkpoints. Also, I don't see this happen in
>>>>>> > other jobs.
>>>>>> >
>>>>>> > Any clue on what might cause this?
>>>>>> >
>>>>>> > Thanks :)
>>>>>> > Gyula
>>>>>>
>>>>>> --
>>>>>> Urs Schönenberger - [hidden email]
>>>>>>
>>>>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>>>>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>>>>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
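To make Stefan's logging suggestion in the quoted thread concrete: below is a minimal, framework-agnostic sketch of such timing instrumentation. It assumes slf4j on the classpath (which Flink already provides); the class name, label and threshold are made up for illustration, and the calls you wrap would be whichever steps of the offset snapshot/commit you suspect.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckpointTimingLog {

    private static final Logger LOG = LoggerFactory.getLogger(CheckpointTimingLog.class);

    /** Runs the given action and logs how long it took; warns if it exceeded thresholdMs. */
    public static void timed(String label, long thresholdMs, Runnable action) {
        long start = System.nanoTime();
        try {
            action.run();
        } finally {
            long tookMs = (System.nanoTime() - start) / 1_000_000;
            if (tookMs > thresholdMs) {
                LOG.warn("{} took {} ms", label, tookMs);
            } else {
                LOG.debug("{} took {} ms", label, tookMs);
            }
        }
    }
}

For example, wrapping the offset commit as timed("commit offsets", 1000, () -> commitOffsets()) (commitOffsets being a stand-in for the actual call) would show whether the minutes are spent inside one call or waiting to acquire the lock before it.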