Hi,

if some tasks take around 50 minutes, could you wait until such a checkpoint is 
in progress, log into the node (say, 10 minutes in), and take a thread dump (or 
several over time) of the JVM that runs the slow checkpointing task? This could 
help figure out where it is stuck or waiting.
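
For example, something along these lines on the affected node (the exact 
process name depends on your deployment; jps and jstack ship with the JDK):

    # find the PID of the TaskManager JVM
    jps -l | grep -i taskmanager

    # take a thread dump; repeat a few times while the checkpoint is running
    jstack <pid> > /tmp/taskmanager-threads-$(date +%s).txt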

Best,
Stefan

> On 19.09.2018, at 22:30, Julio Biason <julio.bia...@azion.com> wrote:
> 
> Hey guys,
> 
> So, switching to Ceph/S3 didn't shed any new light on the issue. Although the 
> times are a bit higher, still just a few slots are taking an order of 
> magnitude longer to save. So I changed the logs to DEBUG.
> 
> The problem is: I'm not seeing anything that seems relevant; only pings from 
> ZooKeeper, heartbeats, and S3 disconnecting after being idle.
> 
> Is there anything else that I should switch to DEBUG? Akka? Kafka? Hadoop? 
> ZooKeeper? (Those are set to INFO by the default config.)
> 
> All of those?
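> 
> (Concretely, I mean flipping these lines in conf/log4j.properties, which the 
> stock config ships at INFO:)
> 
>     log4j.logger.akka=DEBUG
>     log4j.logger.org.apache.kafka=DEBUG
>     log4j.logger.org.apache.hadoop=DEBUG
>     log4j.logger.org.apache.zookeeper=DEBUG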
> 
> On Tue, Sep 18, 2018 at 12:34 PM, Julio Biason <julio.bia...@azion.com> wrote:
> Hey Till (and others),
> 
> We don't have debug logs yet, but we decided to remove a related component: 
> HDFS.
> 
> We are moving the storage to our Ceph install (using S3), which has been 
> running longer than our HDFS install and which we know, for sure, runs without 
> any problems (especially because, at this point, we have more people who 
> understand Ceph than people who know HDFS).
> 
> If, for some reason, the problem persists, we'll know it's not the underlying 
> storage and that it may be something in our pipeline itself. I'll enable debug 
> logs then.
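> 
> (For reference, the storage switch is just the checkpoint URI in 
> flink-conf.yaml -- bucket name and endpoint below are made up:)
> 
>     state.checkpoints.dir: s3://some-bucket/flink/checkpoints
>     # Ceph's RadosGW is S3-compatible, so the S3 filesystem only needs
>     # its endpoint pointed at the gateway
>     s3.endpoint: http://ceph-rgw.example.local:7480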
> 
> On Tue, Sep 18, 2018 at 4:20 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> This behavior seems very odd, Julio. Could you indeed share the debug logs of 
> all Flink processes so that we can see why things are taking so long?
> 
> The checkpoint size of task #8 is twice as big as the second-biggest 
> checkpoint, but that alone should not increase the checkpoint time by a 
> factor of 8.
> 
> Cheers,
> Till
> 
> On Mon, Sep 17, 2018 at 5:25 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
> Hi, Julio:
> Does this happen frequently? What state backend do you use? The async and 
> sync checkpoint durations seem normal compared to the others; it looks like 
> most of the time is spent acking the checkpoint.
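> 
> (For context, the backend is set either via state.backend in flink-conf.yaml 
> or in code. A minimal sketch of the in-code variant, assuming the RocksDB 
> backend and a made-up bucket:)
> 
>     import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> 
>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>     // 'true' enables incremental checkpoints, so only changed RocksDB
>     // files are uploaded instead of the full state on every checkpoint
>     env.setStateBackend(new RocksDBStateBackend("s3://some-bucket/checkpoints", true));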
> 
> On Sun, Sep 16, 2018 at 9:24 AM vino yang <yanghua1...@gmail.com> wrote:
> Hi Julio,
> 
> Yes, fifty-five minutes really is long. 
> However, it grows linearly with the time and size of the adjacent previous 
> task in the diagram. 
> I think the real question for your application is why Flink accesses HDFS so 
> slowly. 
> You can enable DEBUG logging to see if you can find any clues, or post the 
> log to the mailing list so others can help analyze the problem.
> 
> Thanks, vino.
> 
> On Sat, Sep 15, 2018 at 7:03 AM, Julio Biason <julio.bia...@azion.com> wrote:
> (Just an addendum: although it's not a huge problem -- we can always increase 
> the checkpoint timeout -- this anomalous situation makes me think there is 
> something wrong in our pipeline or our cluster, and that this is what is 
> making checkpoint creation go crazy.)
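> 
> (Raising the timeout itself is a one-liner on the StreamExecutionEnvironment's 
> env; the 90 minutes here is just an example value:)
> 
>     // the default checkpoint timeout is 10 minutes; raise it to 90
>     env.getCheckpointConfig().setCheckpointTimeout(90 * 60 * 1000);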
> 
> On Fri, Sep 14, 2018 at 8:00 PM, Julio Biason <julio.bia...@azion.com> wrote:
> Hey guys,
> 
> In our pipeline, we have a single slot that takes much longer to create its 
> checkpoint than the other slots, and we are wondering what could be causing 
> it.
> 
> The operator in question is the window metric -- the only element in the 
> pipeline that actually uses state. While the other slots take 7 minutes to 
> create the checkpoint, this one -- and only this one -- takes 55 minutes.
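> 
> (For the shape of it, the operator is essentially a keyed tumbling window 
> doing an aggregation -- names and window size below are illustrative, not 
> our real code:)
> 
>     // hypothetical sketch of the stateful window operator
>     metrics
>         .keyBy(m -> m.getKey())
>         .window(TumblingEventTimeWindows.of(Time.minutes(1)))
>         .reduce((a, b) -> a.merge(b));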
> 
> Is there something I should look at to understand what's going on?
> 
> (We are storing all checkpoints in HDFS, in case that helps.)
> 
> -- 
> Julio Biason, Software Engineer
> AZION  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
> 
> 
> 
> -- 
> Julio Biason, Software Engineer
> AZION  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
> -- 
> Liu, Renjie
> Software Engineer, MVAD
> 
> 
> 
> -- 
> Julio Biason, Software Engineer
> AZION  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
> 
> 
> 
> -- 
> Julio Biason, Software Engineer
> AZION  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
