Debug logs for the state backend and the checkpoint coordinator could also help.
> On 20.09.2018, at 14:19, Stefan Richter <s.rich...@data-artisans.com> wrote:
>
> Hi,
>
> if some tasks take around 50 minutes, could you wait until such a checkpoint
> is in progress and (let's say after 10 minutes) log into the node and create
> a thread dump (or multiple over time) for the JVM that runs the slow
> checkpointing task? This could help to figure out where it is stuck or
> waiting.
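>
> If it helps, a small helper along these lines could collect a few dumps in a
> row. This is just a sketch: the PID is whatever jps -l reports for the
> TaskManager on that node, jstack must be on the PATH, and the count/interval
> and output directory are arbitrary choices.
>
> import java.io.IOException;
> import java.nio.file.Path;
> import java.nio.file.Paths;
> import java.time.LocalDateTime;
> import java.time.format.DateTimeFormatter;
>
> // Sketch: take several thread dumps of a running TaskManager JVM via jstack.
> public class ThreadDumpCollector {
>     public static void main(String[] args) throws IOException, InterruptedException {
>         long pid = Long.parseLong(args[0]);                       // TaskManager PID (from jps -l)
>         Path outDir = Paths.get(args.length > 1 ? args[1] : "."); // where the dumps are written
>         DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");
>         for (int i = 0; i < 5; i++) {                             // 5 dumps, 60 seconds apart (arbitrary)
>             Path out = outDir.resolve("threaddump-" + pid + "-"
>                     + LocalDateTime.now().format(fmt) + ".txt");
>             new ProcessBuilder("jstack", Long.toString(pid))
>                     .redirectErrorStream(true)                    // merge stderr into the dump file
>                     .redirectOutput(out.toFile())
>                     .start()
>                     .waitFor();                                   // wait for jstack to finish writing
>             Thread.sleep(60_000);
>         }
>     }
> }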
>
> Best,
> Stefan
>
>> On 19.09.2018, at 22:30, Julio Biason <julio.bia...@azion.com> wrote:
>>
>> Hey guys,
>>
>> So, switching to Ceph/S3 didn't shed any new light on the issue. Although
>> the times are a bit higher, only a few slots are taking an order of
>> magnitude longer to save. So I changed the logs to DEBUG.
>>
>> The problem is: I'm not seeing anything that seems relevant; only pings from
>> ZooKeeper, heartbeats, and the S3 connection dropping after being idle.
>>
>> Is there anything else that I should change to DEBUG? Akka? Kafka? Hadoop?
>> ZooKeeper? (Those are, by the default config, set to INFO.)
>>
>> All of those?
>>
>> On Tue, Sep 18, 2018 at 12:34 PM, Julio Biason <julio.bia...@azion.com> wrote:
>> Hey Till (and others),
>>
>> We don't have debug logs yet, but we decided to remove a related component:
>> HDFS.
>>
>> We are moving the storage to our Ceph install (using S3), which has been
>> running for longer than our HDFS install and which we know, for sure, runs
>> without any problems (especially because we have more people who understand
>> Ceph than people who know HDFS at this point).
>>
>> If, for some reason, the problem persists, we'll know it's not the underlying
>> storage and that it may be something in our pipeline itself. I'll enable
>> debug logs then.
>>
>> On Tue, Sep 18, 2018 at 4:20 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>> This behavior seems very odd, Julio. Could you indeed share the debug logs
>> of all Flink processes so we can see why things are taking so long?
>>
>> The checkpoint size of task #8 is twice as big as the second-biggest
>> checkpoint, but that should not cause an increase in checkpoint time by a
>> factor of 8.
>>
>> Cheers,
>> Till
>>
>> On Mon, Sep 17, 2018 at 5:25 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> Hi Julio,
>> Does this happen frequently? What state backend do you use? The async and
>> sync checkpoint durations seem normal compared to the others; it seems that
>> most of the time is spent acking the checkpoint.
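>>
>> (Just to be concrete, the backend is whatever the job sets on the execution
>> environment, roughly like the sketch below; the checkpoint URI and the
>> incremental flag here are only placeholders, not a recommendation.)
>>
>> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>> public class BackendSketch {
>>     public static void main(String[] args) throws Exception {
>>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>         // Placeholder URI; the second argument enables incremental checkpoints.
>>         env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
>>     }
>> }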
>>
>> On Sun, Sep 16, 2018 at 9:24 AM vino yang <yanghua1...@gmail.com> wrote:
>> Hi Julio,
>>
>> Yes, fifty-five minutes really does seem long.
>> However, it is roughly linear with the time and size of the adjacent task in
>> the diagram.
>> I think your real concern is why Flink accesses HDFS so slowly.
>> You can enable DEBUG logging to see if you can find any clues, or post the
>> logs to the mailing list so others can help you analyze the problem.
>>
>> Thanks, vino.
>>
>> Julio Biason <julio.bia...@azion.com> wrote on Sat, Sep 15, 2018 at 7:03 AM:
>> (Just an addendum: although it's not a huge problem -- we can always
>> increase the checkpoint timeout -- this anomalous situation makes me think
>> there is something wrong in our pipeline or in our cluster, and that is
>> what is making the checkpoint creation go crazy.)
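>>
>> (For what it's worth, bumping the timeout would just be a one-liner on the
>> checkpoint config -- a rough sketch below, with arbitrary interval and
>> timeout values:)
>>
>> import java.util.concurrent.TimeUnit;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>> public class TimeoutSketch {
>>     public static void main(String[] args) {
>>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>         env.enableCheckpointing(TimeUnit.MINUTES.toMillis(10));      // checkpoint interval (arbitrary)
>>         env.getCheckpointConfig()
>>            .setCheckpointTimeout(TimeUnit.MINUTES.toMillis(90));     // give slow checkpoints room to finish
>>     }
>> }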
>>
>> On Fri, Sep 14, 2018 at 8:00 PM, Julio Biason <julio.bia...@azion.com> wrote:
>> Hey guys,
>>
>> On our pipeline, we have a single slot that is taking longer to create its
>> checkpoint compared to the other slots, and we are wondering what could be
>> causing it.
>>
>> The operator in question is the window metric -- the only element in the
>> pipeline that actually uses state. While the other slots take 7 minutes to
>> create the checkpoint, this one -- and only this one -- takes 55 minutes.
>>
>> Is there something I should look at to understand what's going on?
>>
>> (We are storing all checkpoints in HDFS, in case that helps.)
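>>
>> (For context, the checkpoint storage is pointed at HDFS more or less like
>> the sketch below; the namenode address and path are placeholders, not our
>> actual setup:)
>>
>> import org.apache.flink.runtime.state.filesystem.FsStateBackend;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>> public class HdfsCheckpointSketch {
>>     public static void main(String[] args) throws Exception {
>>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>         // Placeholder namenode/path; checkpoints (including the window state) land under this URI.
>>         env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));
>>     }
>> }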
>>
>> --
>> Julio Biason, Software Engineer
>> AZION | Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
>>
>>
>>
>> --
>> Julio Biason, Software Engineer
>> AZION | Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
>> --
>> Liu, Renjie
>> Software Engineer, MVAD
>>
>>
>>
>> --
>> Julio Biason, Software Engineer
>> AZION | Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
>>
>>
>>
>> --
>> Julio Biason, Software Engineer
>> AZION | Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
>