Re: Why would a kafka source checkpoint take so long?

Vinay Patil Thu, 13 Jul 2017 07:44:18 -0700

Hi Stephan,

Sure, I will do that the next time I observe it.


Regards,
Vinay Patil

On Thu, Jul 13, 2017 at 8:09 PM, Stephan Ewen <se...@apache.org> wrote:

> Is there any way you can pull a thread dump from the TMs at the point when
> that happens?
>
> On Wed, Jul 12, 2017 at 8:50 PM, vinay patil <vinay18.pa...@gmail.com>
> wrote:
>
>> Hi Gyula,
>>
>> I have observed a similar issue with FlinkKafkaConsumer09 and
>> FlinkKafkaConsumer010 and posted it to the mailing list as well. The issue
>> does not occur consistently, but whenever it does, checkpoints either fail
>> or take a long time to complete.
>>
>> Regards,
>> Vinay Patil
>>
>> On Wed, Jul 12, 2017 at 7:00 PM, Gyula Fóra [via Apache Flink User
>> Mailing List archive.] <[hidden email]> wrote:
>>
>>> I have added logging that will help determine this as well; the next time
>>> this happens I will post the results. (There doesn't seem to be high
>>> backpressure, though.)
>>>
>>> Thanks for the tips,
>>> Gyula
>>>
>>> On Wed, Jul 12, 2017 at 15:27, Stephan Ewen <[hidden email]> wrote:
>>>
>>>> Can it be that the checkpoint thread is waiting to grab the lock, which
>>>> is held by the chain under backpressure?
>>>>
>>>> On Wed, Jul 12, 2017 at 12:23 PM, Gyula Fóra <[hidden email]> wrote:
>>>>
>>>>> Yes, that's definitely what I am about to do next, but I just thought
>>>>> maybe someone had seen this before.
>>>>>
>>>>> I will post the info the next time it happens. (It is not guaranteed to
>>>>> happen soon, as it had not happened for a long time before this.)
>>>>>
>>>>> Gyula
>>>>>
>>>>> On Wed, Jul 12, 2017, 12:13 Stefan Richter <[hidden email]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Could you add some logging to figure out which method call introduces
>>>>>> the delay?
>>>>>>
>>>>>> Best,
>>>>>> Stefan
>>>>>>
>>>>>> On 12.07.2017 at 11:37, Gyula Fóra <[hidden email]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We are using the latest release, 1.3.1.
>>>>>>
>>>>>> Gyula
>>>>>>
>>>>>> On Wed, Jul 12, 2017 at 10:44, Urs Schoenenberger <[hidden email]> wrote:
>>>>>>
>>>>>>> Hi Gyula,
>>>>>>>
>>>>>>> I don't know the cause, unfortunately, but we observed a similar issue
>>>>>>> on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1.
>>>>>>> Which version are you running?
>>>>>>>
>>>>>>> Urs
>>>>>>>
>>>>>>> On 12.07.2017 09:48, Gyula Fóra wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > I have noticed strange behavior in one of our jobs: every once in a
>>>>>>> > while, the Kafka source checkpointing time becomes extremely large
>>>>>>> > compared to what it usually is. (To be precise, it is a Kafka source
>>>>>>> > chained with a stateless map operator.)
>>>>>>> >
>>>>>>> > Checkpointing the offsets usually takes around 10 ms, which sounds
>>>>>>> > reasonable, but in some checkpoints this goes into the 3-5 minute
>>>>>>> > range, practically blocking the job for that period of time.
>>>>>>> > Yesterday I even observed 10-minute delays. At first I thought that
>>>>>>> > some sources might trigger checkpoints later than others, but after
>>>>>>> > adding some logging and comparing the output, it seems that
>>>>>>> > triggerCheckpoint was received at the same time.
>>>>>>> >
>>>>>>> > Interestingly, only one of the 3 Kafka sources in the job seems to be
>>>>>>> > affected (the last time I checked, at least). We are still using the
>>>>>>> > 0.8 consumer with commit on checkpoints. Also, I don't see this
>>>>>>> > happening in other jobs.
>>>>>>> >
>>>>>>> > Any clue as to what might cause this?
>>>>>>> >
>>>>>>> > Thanks :)
>>>>>>> > Gyula
>>>>>>>
>>>>>>> --
>>>>>>> Urs Schönenberger - [hidden email]
>>>>>>>
>>>>>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>>>>>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>>>>>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>
>
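
A note on the two suggestions in this thread (the checkpoint thread waiting on a lock held by the backpressured chain, and pulling a thread dump from the TMs): the Java sketch below shows how the two fit together. It is only an illustration under stated assumptions, not Flink code. The class, method, and thread names are all hypothetical, and a latch stands in for a source whose emit call is stalled by backpressure while it holds the checkpoint lock. The thread never confirmed that this was the cause of the slow checkpoints; the sketch only shows what such contention would look like in a dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; none of these names come from Flink.
class CheckpointLockDemo {

    /**
     * Simulates a source thread that holds the checkpoint lock while stalled,
     * and a checkpoint thread that needs the same lock. Returns the name of
     * the thread that owns the lock the checkpoint thread is blocked on, or
     * null if the blocked state was not observed within the timeout.
     */
    static String observeBlockedCheckpoint() {
        final Object checkpointLock = new Object();
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        Thread source = new Thread(() -> {
            synchronized (checkpointLock) {
                lockHeld.countDown();
                try {
                    // Stands in for an emit call stalled by backpressure.
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "demo-source-thread");

        Thread checkpoint = new Thread(() -> {
            synchronized (checkpointLock) {
                // The offset snapshot would be taken here; it is fast once
                // the lock has been acquired.
            }
        }, "demo-checkpoint-thread");

        try {
            source.start();
            lockHeld.await();   // make sure the source owns the lock first
            checkpoint.start();

            // Programmatic equivalent of reading one entry of a thread dump.
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            String owner = null;
            long deadline = System.currentTimeMillis() + 5000;
            while (System.currentTimeMillis() < deadline) {
                ThreadInfo info = mx.getThreadInfo(checkpoint.getId());
                if (info != null
                        && info.getThreadState() == Thread.State.BLOCKED
                        && info.getLockOwnerName() != null) {
                    owner = info.getLockOwnerName();
                    break;
                }
                Thread.sleep(10);
            }

            release.countDown(); // "backpressure" clears, the lock is freed
            source.join();
            checkpoint.join();
            return owner;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing threads", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("checkpoint thread blocked on lock owned by: "
                + observeBlockedCheckpoint());
    }
}
```

On a real TaskManager the same information is usually obtained by running `jstack <pid>` against the TM process while a checkpoint is stuck, then looking for a BLOCKED thread and the owner of the monitor it is waiting on.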
