I have added logging that will help determine this as well; next time it happens I will post the results. (Although there doesn't seem to be high backpressure.)
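Roughly the kind of timing I am logging (an illustrative sketch only, not the exact code we run; CheckpointTimingMap is a made-up name): a stateless pass-through map in the chain that records how long after the trigger its snapshot actually starts, and so would show whether the minutes are spent waiting for the snapshot to begin rather than in the snapshot itself:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Stateless pass-through map that implements CheckpointedFunction only to log
 * when the chain's snapshot actually runs relative to the checkpoint trigger.
 */
public class CheckpointTimingMap<T> implements MapFunction<T, T>, CheckpointedFunction {

    private static final Logger LOG = LoggerFactory.getLogger(CheckpointTimingMap.class);

    @Override
    public T map(T value) {
        return value; // pass-through, no state
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) {
        long now = System.currentTimeMillis();
        // getCheckpointTimestamp() is the time the checkpoint was triggered on the
        // JobManager; the difference shows how long the chain waited before its
        // snapshot could start (e.g. waiting for the checkpoint lock). Note this
        // also includes RPC latency and any clock skew between JM and TM.
        LOG.info("Checkpoint {}: snapshot started {} ms after trigger",
                ctx.getCheckpointId(), now - ctx.getCheckpointTimestamp());
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) {
        // nothing to restore, the map is stateless
    }
}

If the gap between trigger and snapshot start is where the minutes go, that would point towards the lock/backpressure theory rather than the offset commit itself.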
Thanks for the tips,
Gyula

Stephan Ewen <se...@apache.org> wrote (on Wed, 12 Jul 2017, 15:27):

> Can it be that the checkpoint thread is waiting to grab the lock, which is
> held by the chain under backpressure?
>
> On Wed, Jul 12, 2017 at 12:23 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Yes, that's definitely what I am about to do next, but I just thought maybe
>> someone has seen this before.
>>
>> Will post info next time it happens. (Not guaranteed to happen soon, as it
>> didn't happen for a long time before.)
>>
>> Gyula
>>
>> On Wed, Jul 12, 2017, 12:13 Stefan Richter <s.rich...@data-artisans.com>
>> wrote:
>>
>>> Hi,
>>>
>>> could you introduce some logging to figure out from which method call
>>> the delay is introduced?
>>>
>>> Best,
>>> Stefan
>>>
>>> On 12.07.2017 at 11:37, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> We are using the latest 1.3.1.
>>>
>>> Gyula
>>>
>>> Urs Schoenenberger <urs.schoenenber...@tngtech.com> wrote (on Wed, 12 Jul
>>> 2017, 10:44):
>>>
>>>> Hi Gyula,
>>>>
>>>> I don't know the cause unfortunately, but we observed a similar issue
>>>> on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1.
>>>> Which version are you running on?
>>>>
>>>> Urs
>>>>
>>>> On 12.07.2017 09:48, Gyula Fóra wrote:
>>>> > Hi,
>>>> >
>>>> > I have noticed a strange behavior in one of our jobs: every once in a
>>>> > while the Kafka source checkpointing time becomes extremely large
>>>> > compared to what it usually is. (To be very specific, it is a Kafka
>>>> > source chained with a stateless map operator.)
>>>> >
>>>> > To be more specific, checkpointing the offsets usually takes around
>>>> > 10 ms, which sounds reasonable, but in some checkpoints this goes into
>>>> > the 3-5 minute range, practically blocking the job for that period of
>>>> > time. Yesterday I observed even 10-minute delays. First I thought that
>>>> > some sources might trigger checkpoints later than others, but after
>>>> > adding some logging and comparing it, it seems that the
>>>> > triggerCheckpoint was received at the same time.
>>>> >
>>>> > Interestingly, only one of the 3 Kafka sources in the job seems to be
>>>> > affected (last time I checked at least). We are still using the 0.8
>>>> > consumer with commit on checkpoints. Also, I don't see this happen in
>>>> > other jobs.
>>>> >
>>>> > Any clue on what might cause this?
>>>> >
>>>> > Thanks :)
>>>> > Gyula
>>>>
>>>> --
>>>> Urs Schönenberger - urs.schoenenber...@tngtech.com
>>>>
>>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>>> Managing Directors: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>>> Registered office: Unterföhring * Amtsgericht München * HRB 135082
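For context on Stephan's checkpoint-lock question above: Flink's legacy sources emit records while holding the checkpoint lock, and the checkpoint itself must acquire that same lock, so a chain that blocks on backpressure inside the synchronized section delays the snapshot. Below is a minimal, illustrative SourceFunction showing that pattern (a sketch only, not the actual Kafka consumer code; LockIllustrationSource is a made-up name):

import org.apache.flink.streaming.api.functions.source.SourceFunction;

/**
 * Minimal source illustrating the locking pattern: records are emitted while
 * holding the checkpoint lock, and the checkpoint needs the same lock. If
 * collect() blocks on backpressure inside the synchronized block, the
 * checkpoint has to wait.
 */
public class LockIllustrationSource implements SourceFunction<Long> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0;
        while (running) {
            // Same pattern the Kafka sources use: emission and offset/state
            // updates happen under the checkpoint lock so they stay consistent.
            synchronized (ctx.getCheckpointLock()) {
                // If downstream buffers are full, collect() blocks here while we
                // still hold the lock, so the checkpoint cannot proceed.
                ctx.collect(counter++);
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}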