Hey guys,

So, switching to Ceph/S3 didn't shed any new light on the issue. Although the times are a bit higher, just a few slots are taking an order of magnitude longer to save. So I changed the log level to DEBUG.
The problem is: I'm not seeing anything that seems relevant; only pings from ZooKeeper, heartbeats, and the S3 connection being dropped for being idle.

Is there anything else that I should change to DEBUG? Akka? Kafka? Hadoop? ZooKeeper? (Those are set to INFO by the default config.) All of those?

On Tue, Sep 18, 2018 at 12:34 PM, Julio Biason <julio.bia...@azion.com> wrote:

> Hey Till (and others),
>
> We don't have debug logs yet, but we decided to remove a related
> component: HDFS.
>
> We are moving the storage to our Ceph install (using S3), which has been
> running for longer than our HDFS install and we know, for sure, it runs
> without any problems (especially 'cause we have more people that understand
> Ceph than people that know HDFS at this point).
>
> If, for some reason, the problem persists, we know it's not the underlying
> storage and may be something with our pipeline itself. I'll enable debug
> logs, then.
>
> On Tue, Sep 18, 2018 at 4:20 AM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> This behavior seems very odd, Julio. Could you indeed share the debug logs
>> of all Flink processes in order to see why things are taking so long?
>>
>> The checkpoint size of task #8 is twice as big as the second biggest
>> checkpoint. But this should not cause an increase in checkpoint time of a
>> factor of 8.
>>
>> Cheers,
>> Till
>>
>> On Mon, Sep 17, 2018 at 5:25 AM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>>
>>> Hi, Julio:
>>> Does this happen frequently? What state backend do you use? The async
>>> checkpoint duration and sync checkpoint duration seem normal compared to
>>> the others; it seems that most of the time is spent acking the checkpoint.
>>>
>>> On Sun, Sep 16, 2018 at 9:24 AM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>> Hi Julio,
>>>>
>>>> Yes, it seems that fifty-five minutes is really long.
>>>> However, it is linear with the time and size of the previous task
>>>> adjacent to it in the diagram.
>>>> I think your real concern is why Flink accesses HDFS so slowly.
>>>> You can enable the DEBUG log to see if you can find any clues, or post
>>>> the log to the mailing list so others can help analyze the problem.
>>>>
>>>> Thanks, vino.
>>>>
>>>> On Sat, Sep 15, 2018 at 7:03 AM, Julio Biason <julio.bia...@azion.com>
>>>> wrote:
>>>>
>>>>> (Just an addendum: Although it's not a huge problem -- we can always
>>>>> increase the checkpoint timeout -- this anomalous situation makes me
>>>>> think there is something wrong in our pipeline or in our cluster, and
>>>>> that is what is making the checkpoint creation go crazy.)
>>>>>
>>>>> On Fri, Sep 14, 2018 at 8:00 PM, Julio Biason <julio.bia...@azion.com>
>>>>> wrote:
>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> On our pipeline, we have a single slot that is taking longer to
>>>>>> create the checkpoint compared to other slots and we are wondering what
>>>>>> could be causing it.
>>>>>>
>>>>>> The operator in question is the window metric -- the only element in
>>>>>> the pipeline that actually uses the state. While the other slots take 7
>>>>>> mins to create the checkpoint, this one -- and only this one -- takes
>>>>>> 55 mins.
>>>>>>
>>>>>> Is there something I should look at to understand what's going on?
>>>>>>
>>>>>> (We are storing all checkpoints in HDFS, in case that helps.)
>>>>>>
>>>
>>> --
>>> Liu, Renjie
>>> Software Engineer, MVAD
>>>

--
*Julio Biason*, Software Engineer
*AZION* | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554