Debug logs for the state backend and the checkpoint coordinator could also help.
> On 20.09.2018, at 14:19, Stefan Richter <s.rich...@data-artisans.com> wrote:
>
> Hi,
>
> if some tasks take around 50 minutes, could you wait until such a checkpoint
> is in progress and (let's say after 10 minutes) log into the node and create
> a thread dump (or multiple over time) for the JVM that runs the slow
> checkpointing task? This could help to figure out where it is stuck or
> waiting.
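>
> If it helps, a small helper along these lines could collect a few dumps in a
> row. This is just a sketch: the PID is whatever jps -l reports for the
> TaskManager on that node, jstack must be on the PATH, and the count/interval
> and output directory are arbitrary choices.
>
> import java.io.IOException;
> import java.nio.file.Path;
> import java.nio.file.Paths;
> import java.time.LocalDateTime;
> import java.time.format.DateTimeFormatter;
>
> // Sketch: take several thread dumps of a running TaskManager JVM via jstack.
> public class ThreadDumpCollector {
>     public static void main(String[] args) throws IOException, InterruptedException {
>         long pid = Long.parseLong(args[0]);                       // TaskManager PID (from jps -l)
>         Path outDir = Paths.get(args.length > 1 ? args[1] : "."); // where the dumps are written
>         DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");
>         for (int i = 0; i < 5; i++) {                             // 5 dumps, 60 seconds apart (arbitrary)
>             Path out = outDir.resolve("threaddump-" + pid + "-"
>                     + LocalDateTime.now().format(fmt) + ".txt");
>             new ProcessBuilder("jstack", Long.toString(pid))
>                     .redirectErrorStream(true)                    // merge stderr into the dump file
>                     .redirectOutput(out.toFile())
>                     .start()
>                     .waitFor();                                   // wait for jstack to finish writing
>             Thread.sleep(60_000);
>         }
>     }
> }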
>
> Best,
> Stefan
>
>> On 19.09.2018, at 22:30, Julio Biason <julio.bia...@azion.com> wrote:
>>
>> Hey guys,
>>
>> So, switching to Ceph/S3 didn't shed any new light on the issue. Although
>> the times are a bit higher, only a few slots are taking an order of
>> magnitude longer to save. So I changed the logs to DEBUG.
>>
>> The problem is: I'm not seeing anything that seems relevant; only pings from
>> ZooKeeper, heartbeats, and the S3 connection dropping after being idle.
>>
>> Is there anything else that I should change to DEBUG? Akka? Kafka? Hadoop?
>> ZooKeeper? (Those are, by the default config, set to INFO.)
>>
>> All of those?
>>
>> On Tue, Sep 18, 2018 at 12:34 PM, Julio Biason <julio.bia...@azion.com> wrote:
>> Hey Till (and others),
>>
>> We don't have debug logs yet, but we decided to remove a related component:
>> HDFS.
>>
>> We are moving the storage to our Ceph install (using S3), which has been
>> running for longer than our HDFS install and which we know, for sure, runs
>> without any problems (especially because we have more people who understand
>> Ceph than people who know HDFS at this point).
>>
>> If, for some reason, the problem persists, we'll know it's not the underlying
>> storage and that it may be something in our pipeline itself. I'll enable
>> debug logs then.
>>
>> On Tue, Sep 18, 2018 at 4:20 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>> This behavior seems very odd, Julio. Could you indeed share the debug logs
>> of all Flink processes so we can see why things are taking so long?
>>
>> The checkpoint size of task #8 is twice as big as the second-biggest
>> checkpoint, but that should not cause an increase in checkpoint time by a
>> factor of 8.
>>
>> Cheers,
>> Till
>>
>> On Mon, Sep 17, 2018 at 5:25 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> Hi Julio,
>> Does this happen frequently? What state backend do you use? The async and
>> sync checkpoint durations seem normal compared to the others; it seems that
>> most of the time is spent acking the checkpoint.
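>>
>> (Just to be concrete, the backend is whatever the job sets on the execution
>> environment, roughly like the sketch below; the checkpoint URI and the
>> incremental flag here are only placeholders, not a recommendation.)
>>
>> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>> public class BackendSketch {
>>     public static void main(String[] args) throws Exception {
>>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>         // Placeholder URI; the second argument enables incremental checkpoints.
>>         env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
>>     }
>> }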
>>
>> On Sun, Sep 16, 2018 at 9:24 AM vino yang <yanghua1...@gmail.com> wrote:
>> Hi Julio,
>>
>> Yes, fifty-five minutes really does seem long.
>> However, it is roughly linear with the time and size of the adjacent task in
>> the diagram.
>> I think your real concern is why Flink accesses HDFS so slowly.
>> You can enable DEBUG logging to see if you can find any clues, or post the
>> logs to the mailing list so others can help you analyze the problem.
>>
>> Thanks, vino.
>>
>> Julio Biason <julio.bia...@azion.com> wrote on Sat, Sep 15, 2018 at 7:03 AM:
>> (Just an addendum: although it's not a huge problem -- we can always
>> increase the checkpoint timeout -- this anomalous situation makes me think
>> there is something wrong in our pipeline or in our cluster, and that is
>> what is making the checkpoint creation go crazy.)
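>>
>> (For what it's worth, bumping the timeout would just be a one-liner on the
>> checkpoint config -- a rough sketch below, with arbitrary interval and
>> timeout values:)
>>
>> import java.util.concurrent.TimeUnit;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>> public class TimeoutSketch {
>>     public static void main(String[] args) {
>>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>         env.enableCheckpointing(TimeUnit.MINUTES.toMillis(10));      // checkpoint interval (arbitrary)
>>         env.getCheckpointConfig()
>>            .setCheckpointTimeout(TimeUnit.MINUTES.toMillis(90));     // give slow checkpoints room to finish
>>     }
>> }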
>>
>> On Fri, Sep 14, 2018 at 8:00 PM, Julio Biason <julio.bia...@azion.com> wrote:
>> Hey guys,
>>
>> On our pipeline, we have a single slot that is taking longer to create its
>> checkpoint compared to the other slots, and we are wondering what could be
>> causing it.
>>
>> The operator in question is the window metric -- the only element in the
>> pipeline that actually uses state. While the other slots take 7 minutes to
>> create the checkpoint, this one -- and only this one -- takes 55 minutes.
>>
>> Is there something I should look at to understand what's going on?
>>
>> (We are storing all checkpoints in HDFS, in case that helps.)
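>>
>> (For context, the checkpoint storage is pointed at HDFS more or less like
>> the sketch below; the namenode address and path are placeholders, not our
>> actual setup:)
>>
>> import org.apache.flink.runtime.state.filesystem.FsStateBackend;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>> public class HdfsCheckpointSketch {
>>     public static void main(String[] args) throws Exception {
>>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>         // Placeholder namenode/path; checkpoints (including the window state) land under this URI.
>>         env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));
>>     }
>> }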
>>
>> --
>> Julio Biason, Software Engineer
>> AZION | Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
>>
>>
>>
>> --
>> Julio Biason, Software Engineer
>> AZION | Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
>> --
>> Liu, Renjie
>> Software Engineer, MVAD
>>
>>
>>
>> --
>> Julio Biason, Software Engineer
>> AZION | Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
>>
>>
>>
>> --
>> Julio Biason, Software Engineer
>> AZION | Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
>