Hi,

if some tasks take around 50 minutes, could you wait until such a checkpoint is in progress and then (say, after 10 minutes) log into the node and create a thread dump (or several over time) of the JVM that runs the slow checkpointing task? This could help figure out where it is stuck or waiting.
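On the node, that could look something like the following (a sketch: `jps`/`jstack` assume a JDK is installed on the node, and the `TaskManagerRunner` main-class name may differ between Flink versions):

```shell
# Find the PID of the TaskManager JVM (the main-class name is an
# assumption -- check the output of `jps -l` on your node).
PID=$(jps -l | awk '/TaskManagerRunner/ {print $1}')

# While the slow checkpoint is in progress, take a few thread dumps
# spaced one minute apart; comparing them shows whether threads sit on
# the same stack (stuck/waiting) or are merely making slow progress.
for i in 1 2 3; do
  jstack "$PID" > "threaddump-$(date +%s).txt"
  sleep 60
done
```

Several dumps taken over time are more useful than one: a thread blocked on the same lock in every dump is stuck, while one whose stack changes is just slow.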
Best,
Stefan

> On 19 Sep 2018, at 22:30, Julio Biason <julio.bia...@azion.com> wrote:
>
> Hey guys,
>
> So, switching to Ceph/S3 didn't shine any new light on the issue. Although
> the times are a bit higher, just a few slots are taking an order of magnitude
> longer to save. So I changed the logs to DEBUG.
>
> The problem is: I'm not seeing anything that seems relevant; only pings from
> ZooKeeper, heartbeats, and the S3 connection disconnecting after going idle.
>
> Is there anything else that I should change to DEBUG? Akka? Kafka? Hadoop?
> ZooKeeper? (Those are, by the default config, set to INFO.)
>
> All of those?
>
> On Tue, Sep 18, 2018 at 12:34 PM, Julio Biason <julio.bia...@azion.com> wrote:
> Hey Till (and others),
>
> We don't have debug logs yet, but we decided to remove a related component:
> HDFS.
>
> We are moving the storage to our Ceph install (using S3), which has been
> running for longer than our HDFS install and which we know, for sure, runs
> without any problems (especially because we have more people who understand
> Ceph than people who know HDFS at this point).
>
> If, for some reason, the problem persists, we'll know it's not the underlying
> storage and that it may be something in our pipeline itself. I'll enable
> debug logs then.
>
> On Tue, Sep 18, 2018 at 4:20 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> This behavior seems very odd, Julio. Could you indeed share the debug logs of
> all Flink processes so we can see why things are taking so long?
>
> The checkpoint size of task #8 is twice as big as the second-biggest
> checkpoint, but that should not cause an eight-fold increase in checkpoint
> time.
>
> Cheers,
> Till
>
> On Mon, Sep 17, 2018 at 5:25 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
> Hi, Julio:
> Does this happen frequently? What state backend do you use?
> The async checkpoint duration and the sync checkpoint duration seem normal
> compared to the others; it seems that most of the time is spent acking the
> checkpoint.
>
> On Sun, Sep 16, 2018 at 9:24 AM vino yang <yanghua1...@gmail.com> wrote:
> Hi Julio,
>
> Yes, fifty-five minutes really is a long time.
> However, it is linear with the time and size of the adjacent previous task
> in the diagram.
> I think the real question is why Flink accesses HDFS so slowly.
> You can enable the DEBUG log to see if you can find any clues, or post the
> log to the mailing list so others can help you analyze the problem.
>
> Thanks, vino.
>
> On Sat, Sep 15, 2018 at 7:03 AM, Julio Biason <julio.bia...@azion.com> wrote:
> (Just an addendum: although it's not a huge problem -- we can always increase
> the checkpoint timeout -- this anomalous situation makes me think there is
> something wrong in our pipeline or in our cluster, and that is what is making
> the checkpoint creation go crazy.)
>
> On Fri, Sep 14, 2018 at 8:00 PM, Julio Biason <julio.bia...@azion.com> wrote:
> Hey guys,
>
> In our pipeline, we have a single slot that is taking longer to create its
> checkpoint than the other slots, and we are wondering what could be causing
> it.
>
> The operator in question is the window metric -- the only element in the
> pipeline that actually uses state. While the other slots take 7 minutes to
> create the checkpoint, this one -- and only this one -- takes 55 minutes.
>
> Is there something I should look at to understand what's going on?
>
> (We are storing all checkpoints in HDFS, in case that helps.)
>
> --
> Julio Biason, Software Engineer
> AZION | Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
> --
> Liu, Renjie
> Software Engineer, MVAD
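For reference, the DEBUG levels discussed in the thread can be raised in Flink's conf/log4j.properties (a sketch; the logger names follow the defaults shipped with Flink, so adjust them to your actual config):

```properties
# Keep the root logger at INFO so the files stay readable.
log4j.rootLogger=INFO, file

# Bump the suspects mentioned in the thread from INFO to DEBUG.
log4j.logger.akka=DEBUG
log4j.logger.org.apache.kafka=DEBUG
log4j.logger.org.apache.hadoop=DEBUG
log4j.logger.org.apache.zookeeper=DEBUG
```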
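And the checkpoint-timeout increase Julio mentions as a workaround can be set on the job itself (a sketch against the DataStream API; the interval and the 90-minute timeout are only example values):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 10 minutes.
        env.enableCheckpointing(10 * 60 * 1000L);

        // Let a slow checkpoint run up to 90 minutes before it is
        // aborted (the default timeout is 10 minutes).
        env.getCheckpointConfig().setCheckpointTimeout(90 * 60 * 1000L);
    }
}
```

Raising the timeout only hides the symptom, of course; as the thread notes, the interesting question is why one slot takes eight times longer than its peers.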