Is there any way you can pull a thread dump from the TMs at the point when that happens?
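(For reference, running jstack <TaskManager pid> on the TM host while the checkpoint is stuck is the usual way to capture one. If it is easier to trigger from inside the JVM, a rough, Flink-agnostic sketch using only the standard Java management API could look like the following; the class name is just an example, nothing here is a Flink API.)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {

    /** Prints name, state, lock info and full stack trace of every live thread in this JVM. */
    public static void printThreadDump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Request monitor/synchronizer info as well, so the dump shows which thread
        // holds a contended lock (e.g. the checkpoint lock mentioned further down in this thread).
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
            if (info.getLockName() != null) {
                System.out.println("    waiting on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        printThreadDump();
    }
}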
On Wed, Jul 12, 2017 at 8:50 PM, vinay patil <vinay18.pa...@gmail.com> wrote:

> Hi Gyula,
>
> I have observed a similar issue with FlinkConsumer09 and 010 and posted it
> to the mailing list as well. The issue is not consistent, but whenever it
> happens it leads to checkpoints failing or taking a long time to complete.
>
> Regards,
> Vinay Patil
>
> On Wed, Jul 12, 2017 at 7:00 PM, Gyula Fóra [via Apache Flink User Mailing
> List archive.] <[hidden email]> wrote:
>
>> I have added logging that will help determine this as well; next time
>> this happens I will post the results. (Although there doesn't seem to be
>> high backpressure.)
>>
>> Thanks for the tips,
>> Gyula
>>
>> On Wed, Jul 12, 2017 at 15:27, Stephan Ewen <[hidden email]> wrote:
>>
>>> Can it be that the checkpoint thread is waiting to grab the lock, which
>>> is held by the chain under backpressure?
>>>
>>> On Wed, Jul 12, 2017 at 12:23 PM, Gyula Fóra <[hidden email]> wrote:
>>>
>>>> Yes, that's definitely what I am about to do next, but I thought maybe
>>>> someone has seen this before.
>>>>
>>>> Will post info next time it happens. (Not guaranteed to happen soon, as
>>>> it didn't happen for a long time before.)
>>>>
>>>> Gyula
>>>>
>>>> On Wed, Jul 12, 2017, 12:13 Stefan Richter <[hidden email]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> could you introduce some logging to figure out from which method call
>>>>> the delay is introduced?
>>>>>
>>>>> Best,
>>>>> Stefan
>>>>>
>>>>> On 12.07.2017 at 11:37, Gyula Fóra <[hidden email]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are using the latest 1.3.1.
>>>>>
>>>>> Gyula
>>>>>
>>>>> On Wed, Jul 12, 2017 at 10:44, Urs Schoenenberger <[hidden email]> wrote:
>>>>>
>>>>>> Hi Gyula,
>>>>>>
>>>>>> I don't know the cause unfortunately, but we observed a similar issue
>>>>>> on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1.
>>>>>> Which version are you running on?
>>>>>>
>>>>>> Urs
>>>>>>
>>>>>> On 12.07.2017 09:48, Gyula Fóra wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > I have noticed a strange behavior in one of our jobs: every once in a
>>>>>> > while the Kafka source checkpointing time becomes extremely large
>>>>>> > compared to what it usually is. (To be very specific, it is a Kafka
>>>>>> > source chained with a stateless map operator.)
>>>>>> >
>>>>>> > To be more specific, checkpointing the offsets usually takes around
>>>>>> > 10 ms, which sounds reasonable, but in some checkpoints this goes into
>>>>>> > the 3-5 minute range, practically blocking the job for that period of
>>>>>> > time. Yesterday I even observed 10-minute delays. First I thought that
>>>>>> > some sources might trigger checkpoints later than others, but after
>>>>>> > adding some logging and comparing, it seems that triggerCheckpoint was
>>>>>> > received at the same time.
>>>>>> >
>>>>>> > Interestingly, only one of the 3 Kafka sources in the job seems to be
>>>>>> > affected (last time I checked at least). We are still using the 0.8
>>>>>> > consumer with commit on checkpoints. Also, I don't see this happen in
>>>>>> > other jobs.
>>>>>> >
>>>>>> > Any clue on what might cause this?
>>>>>> >
>>>>>> > Thanks :)
>>>>>> > Gyula
>>>>>>
>>>>>> --
>>>>>> Urs Schönenberger - [hidden email]
>>>>>>
>>>>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>>>>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>>>>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
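To make Stefan's logging suggestion in the quoted thread concrete: below is a minimal, framework-agnostic sketch of such timing instrumentation. It assumes slf4j on the classpath (which Flink already provides); the class name, label and threshold are made up for illustration, and the calls you wrap would be whichever steps of the offset snapshot/commit you suspect.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckpointTimingLog {

    private static final Logger LOG = LoggerFactory.getLogger(CheckpointTimingLog.class);

    /** Runs the given action and logs how long it took; warns if it exceeded thresholdMs. */
    public static void timed(String label, long thresholdMs, Runnable action) {
        long start = System.nanoTime();
        try {
            action.run();
        } finally {
            long tookMs = (System.nanoTime() - start) / 1_000_000;
            if (tookMs > thresholdMs) {
                LOG.warn("{} took {} ms", label, tookMs);
            } else {
                LOG.debug("{} took {} ms", label, tookMs);
            }
        }
    }
}

For example, wrapping the offset commit as timed("commit offsets", 1000, () -> commitOffsets()) (commitOffsets being a stand-in for the actual call) would show whether the minutes are spent inside one call or waiting to acquire the lock before it.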