Re: Trying to figure out why a slot takes a long time to checkpoint

Julio Biason Wed, 19 Sep 2018 13:31:23 -0700

Hey guys,

So, switching to Ceph/S3 didn't shed any new light on the issue. Although
the times are a bit higher, just a few slots are taking an order of
magnitude longer to save. So I changed the log level to DEBUG.


The problem is: I'm not seeing anything that seems relevant; only pings
from ZooKeeper, heartbeats and S3 connections being closed for being idle.

Is there anything else that I should change to DEBUG? Akka? Kafka? Hadoop?
ZooKeeper? (Those are set to INFO by the default config.)

All of those?
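
For reference, a minimal sketch of what that change would look like,
assuming the stock `conf/log4j.properties` that ships with Flink 1.x
(log4j 1.x syntax). The logger names and the `file` appender name are
assumptions based on that default file; check them against the copy in
your own distribution before relying on them.

```properties
# Assumed layout of Flink's default conf/log4j.properties (log4j 1.x).
# Root logger raised to DEBUG; "file" is the appender the stock config
# is assumed to define.
log4j.rootLogger=DEBUG, file

# These library loggers are set explicitly in the default config, so
# changing the root level does not affect them; each one has to be
# changed by hand.
log4j.logger.akka=DEBUG
log4j.logger.org.apache.kafka=DEBUG
log4j.logger.org.apache.hadoop=DEBUG
log4j.logger.org.apache.zookeeper=DEBUG
```

Enabling these one at a time, rather than all together, keeps the log
volume manageable.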

On Tue, Sep 18, 2018 at 12:34 PM, Julio Biason <julio.bia...@azion.com>
wrote:

> Hey Till (and others),
>
> We don't have debug logs yet, but we decided to remove a related
> component: HDFS.
>
> We are moving the storage to our Ceph install (using S3), which has been
> running for longer than our HDFS install and which we know, for sure, runs
> without any problems (especially because we have more people who understand
> Ceph than people who know HDFS at this point).
>
> If, for some reason, the problem persists, we know it's not the underlying
> storage and may be something with our pipeline itself. I'll enable debug
> logs, then.
>
> On Tue, Sep 18, 2018 at 4:20 AM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> This behavior seems very odd, Julio. Could you indeed share the debug logs
>> of all Flink processes in order to see why things are taking so long?
>>
>> The checkpoint size of task #8 is twice as big as the second biggest
>> checkpoint. But this should not increase the checkpoint time by a
>> factor of 8.
>>
>> Cheers,
>> Till
>>
>> On Mon, Sep 17, 2018 at 5:25 AM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>>
>>> Hi, Julio:
>>> Does this happen frequently? What state backend do you use? The async
>>> and sync checkpoint durations seem normal compared to the others; it
>>> seems that most of the time is spent acknowledging the checkpoint.
>>>
>>> On Sun, Sep 16, 2018 at 9:24 AM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>> Hi Julio,
>>>>
>>>> Yes, fifty-five minutes does seem really long.
>>>> However, it is linear with the time and size of the previous task
>>>> adjacent to it in the diagram.
>>>> I think your real concern is why Flink accesses HDFS so slowly.
>>>> You can enable DEBUG logging to see if you can find any clues, or post
>>>> the log to the mailing list so others can help analyze the problem.
>>>>
>>>> Thanks, vino.
>>>>
>>>> On Sat, Sep 15, 2018 at 7:03 AM, Julio Biason <julio.bia...@azion.com> wrote:
>>>>
>>>>> (Just an addendum: Although it's not a huge problem -- we can always
>>>>> increase the checkpoint timeout -- this anomalous situation makes me
>>>>> think there is something wrong in our pipeline or in our cluster, and that
>>>>> is what is making the checkpoint creation go crazy.)
>>>>>
>>>>> On Fri, Sep 14, 2018 at 8:00 PM, Julio Biason <julio.bia...@azion.com>
>>>>> wrote:
>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> On our pipeline, we have a single slot that is taking longer to
>>>>>> create the checkpoint compared to other slots and we are wondering what
>>>>>> could be causing it.
>>>>>>
>>>>>> The operator in question is the window metric -- the only element in
>>>>>> the pipeline that actually uses the state. While the other slots take 7
>>>>>> mins to create the checkpoint, this one -- and only this one -- takes
>>>>>> 55mins.
>>>>>>
>>>>>> Is there something I should look at to understand what's going on?
>>>>>>
>>>>>> (We are storing all checkpoints in HDFS, in case that helps.)
>>>>>>
>>>>>> --
>>>>>> *Julio Biason*, Software Engineer
>>>>>> *AZION*  |  Deliver. Accelerate. Protect.
>>>>>> Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
>>>>>>
>>>>>
>>>>>
>>>> --
>>> Liu, Renjie
>>> Software Engineer, MVAD
>>>
>>
>
>



-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
