Can it be that the checkpoint thread is waiting to grab the lock, which is held by the chain under backpressure?
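To make that hypothesis concrete, here is a stripped-down sketch of the pattern a legacy checkpointed source follows (this is not the actual FlinkKafkaConsumer08 code, just the shape of it, with made-up names): records are emitted under the checkpoint lock, and the offset snapshot needs that same lock, so if collect() blocks on backpressure while the lock is held, the checkpoint has to wait.

```java
import java.util.Collections;
import java.util.List;

import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Simplified stand-in for a checkpointed source; NOT the real Kafka consumer code.
public class LockHoldingSource implements SourceFunction<String>, ListCheckpointed<Long> {

    private volatile boolean running = true;
    private long offset = 0L;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            String record = fetchNextRecord(); // placeholder for polling Kafka
            // Emission and offset update happen under the checkpoint lock.
            synchronized (ctx.getCheckpointLock()) {
                // Under backpressure, collect() can block while the lock is held...
                ctx.collect(record);
                offset++;
            }
        }
    }

    @Override
    public List<Long> snapshotState(long checkpointId, long timestamp) {
        // ...and the snapshot runs under that same lock, so it waits until
        // the blocked emission above finishes.
        return Collections.singletonList(offset);
    }

    @Override
    public void restoreState(List<Long> state) {
        offset = state.isEmpty() ? 0L : state.get(0);
    }

    @Override
    public void cancel() {
        running = false;
    }

    private String fetchNextRecord() {
        return "record"; // placeholder
    }
}
```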
On Wed, Jul 12, 2017 at 12:23 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:

> Yes, that's definitely what I am about to do next, but I just thought maybe
> someone has seen this before.
>
> Will post info next time it happens. (Not guaranteed to happen soon, as it
> didn't happen for a long time before.)
>
> Gyula
>
> On Wed, Jul 12, 2017, 12:13 Stefan Richter <s.rich...@data-artisans.com> wrote:
>
>> Hi,
>>
>> could you introduce some logging to figure out from which method call the
>> delay is introduced?
>>
>> Best,
>> Stefan
>>
>> On 12.07.2017 at 11:37, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>> Hi,
>>
>> We are using the latest 1.3.1.
>>
>> Gyula
>>
>> On Wed, 12 Jul 2017 at 10:44, Urs Schoenenberger
>> <urs.schoenenber...@tngtech.com> wrote:
>>
>>> Hi Gyula,
>>>
>>> I don't know the cause unfortunately, but we observed a similar issue
>>> on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1.
>>> Which version are you running on?
>>>
>>> Urs
>>>
>>> On 12.07.2017 09:48, Gyula Fóra wrote:
>>> > Hi,
>>> >
>>> > I have noticed a strange behavior in one of our jobs: every once in a
>>> > while the Kafka source checkpointing time becomes extremely large
>>> > compared to what it usually is. (To be very specific, it is a Kafka
>>> > source chained with a stateless map operator.)
>>> >
>>> > To be more specific, checkpointing the offsets usually takes around
>>> > 10 ms, which sounds reasonable, but in some checkpoints this goes into
>>> > the 3-5 minute range, practically blocking the job for that period of
>>> > time. Yesterday I observed even 10-minute delays. First I thought that
>>> > some sources might trigger checkpoints later than others, but after
>>> > adding some logging and comparing, it seems that the triggerCheckpoint
>>> > was received at the same time.
>>> >
>>> > Interestingly, only one of the 3 Kafka sources in the job seems to be
>>> > affected (last time I checked, at least). We are still using the 0.8
>>> > consumer with commit on checkpoints. Also, I don't see this happening
>>> > in other jobs.
>>> >
>>> > Any clue on what might cause this?
>>> >
>>> > Thanks :)
>>> > Gyula
>>>
>>> --
>>> Urs Schönenberger - urs.schoenenber...@tngtech.com
>>>
>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
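Along the lines of Stefan's logging suggestion above, one way to narrow this down is to log separately how long the checkpoint waits for the lock versus how long the snapshot itself takes. The helper and method names below are hypothetical, not Flink internals; it is just a sketch of the kind of measurement that would tell the two apart.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical timing helper: wraps a snapshot action so the log shows whether
// the time goes into acquiring the checkpoint lock or into the snapshot itself.
public final class CheckpointTimer {

    private static final Logger LOG = LoggerFactory.getLogger(CheckpointTimer.class);

    public static void timedSnapshot(Object checkpointLock, long checkpointId, Runnable snapshot) {
        long beforeLock = System.currentTimeMillis();
        synchronized (checkpointLock) {
            long afterLock = System.currentTimeMillis();
            snapshot.run(); // placeholder for the actual snapshot call
            long afterSnapshot = System.currentTimeMillis();
            LOG.info("Checkpoint {}: waited {} ms for the checkpoint lock, snapshot took {} ms",
                    checkpointId, afterLock - beforeLock, afterSnapshot - afterLock);
        }
    }
}
```

If the logged lock wait dominates while the snapshot stays in the usual ~10 ms range, that would point at lock contention from the backpressured chain rather than at the offset snapshot itself.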