Hi Kostas,

Attaching the taskmanager logs regarding this issue. I have also attached the Kafka-related metrics; I hope you can see them this time.
Not sure why we get so many disconnects from Kafka. Possibly because of these interruptions, our processing seems to slow down. At some point memory usage also increases and the workers almost stagnate, doing hardly any processing. I have 3 GB of heap committed and have allotted 5 GB of memory to the pods.

Thanks for your help.
~Ramya

On Tue, Sep 22, 2020 at 9:18 PM Kostas Kloudas <kklou...@gmail.com> wrote:

> Hi Ramya,
>
> Unfortunately your images are blocked. Could you upload them somewhere and
> post the links here?
> Also, I think the TaskManager logs may help a bit more.
> Could you please provide them here?
>
> Cheers,
> Kostas
>
> On Tue, Sep 22, 2020 at 8:58 AM Ramya Ramamurthy <hair...@gmail.com>
> wrote:
>
> > Hi,
> >
> > We are seeing an issue with Flink in production. The version we use is
> > 1.7. We started seeing sudden lag on Kafka, and the consumers were no
> > longer working/accepting messages. On enabling debug mode, the errors
> > below were seen:
> >
> > [image: image.jpeg]
> >
> > I am not sure why this occurs every day; when it happens, the remaining
> > workers aren't able to handle the load. Unless I restart my jobs, I am
> > unable to start processing again, so there is data loss as well.
> >
> > In the graph below, there is a slight dip in consumption before 5:30.
> > That is when this incident happened, correlated with the logs.
> >
> > [image: image.jpeg]
> >
> > Any pointers/suggestions would be appreciated.
> >
> > Thanks.
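[Not part of the original thread.] For readers hitting similar periodic Kafka disconnects, below is a minimal sketch of consumer properties that commonly matter here: brokers close connections idle longer than `connections.max.idle.ms` (Kafka's default is 540000 ms / 9 minutes), which can show up as regular disconnect/reconnect noise in Flink's Kafka connector logs. The broker address, group id, and the specific values chosen are illustrative assumptions, not taken from Ramya's setup.

```java
import java.util.Properties;

public class KafkaConsumerProps {

    // Build consumer properties to pass to a FlinkKafkaConsumer
    // (Flink 1.7 connector). Addresses and values are placeholders.
    public static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder
        props.setProperty("group.id", "my-consumer-group");     // placeholder
        // Brokers drop connections idle longer than this (default 540000 ms).
        // Raising it can reduce periodic idle-connection disconnects.
        props.setProperty("connections.max.idle.ms", "900000");
        // How long the client waits for a broker response before retrying.
        props.setProperty("request.timeout.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        // In a Flink job these props would be handed to the consumer, e.g.:
        //   new FlinkKafkaConsumer<>("topic", schema, build());
        Properties p = build();
        System.out.println(p.getProperty("connections.max.idle.ms"));
    }
}
```

Whether this helps depends on whether the disconnects are idle-connection reaping or something else (broker-side issues, network policy timeouts in the pod network, etc.), which is why correlating with the taskmanager logs, as Kostas suggested, is the right first step.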