Thanks @Matthias Pohl <matthias.p...@aiven.io>. This is informative. So in general, in a session cluster running more than one job, will we still face the same problem even if only one of the jobs has this issue?
Regards
Ram

On Mon, Sep 26, 2022 at 4:32 PM Matthias Pohl <matthias.p...@aiven.io> wrote:

> I see. Thanks for sharing the logs. It's related to FLINK-9097 [1]. In
> order for the job to not be cleaned up entirely after a failure while
> submitting the job, the JobManager is failed fatally, resulting in a
> failover. That's what you're experiencing.
>
> One solution is to fix the permission issue to make the job recover
> without problems. If that's not what you want to do, you could delete the
> entry with the key 'jobGraph-04ae99777ee2ed34c13fe8120e68436e' from the
> JobGraphStore ConfigMap (based on your logs it should
> be flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map). This will
> prevent the JobManager from recovering this specific job. Keep in mind that
> you have to clean up any job-related data by yourself in that case.
>
> I hope that helps.
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-9097
>
> On Mon, Sep 26, 2022 at 12:26 PM ramkrishna vasudevan <ramvasu.fl...@gmail.com> wrote:
>
>> I got some logs and stack traces from our backend storage. This is not
>> the entire log though. Can this be useful? With this set of log messages,
>> the job manager kept restarting.
>>
>> Regards
>> Ram
>>
>> On Mon, Sep 26, 2022 at 3:11 PM ramkrishna vasudevan <ramvasu.fl...@gmail.com> wrote:
>>
>>> Thank you very much for the reply. I lost the k8s cluster in this
>>> case before I could capture the logs. I will try to repro this and get back
>>> to you.
>>>
>>> Regards
>>> Ram
>>>
>>> On Mon, Sep 26, 2022 at 12:42 PM Matthias Pohl <matthias.p...@aiven.io> wrote:
>>>
>>>> Hi Ramkrishna,
>>>> thanks for reaching out to the Flink community. Could you share the
>>>> JobManager logs to get a better understanding of what's going on? I'm
>>>> wondering why the JobManager is failing when the actual problem is that the
>>>> job is struggling to access a folder. It sounds like there are multiple
>>>> problems here.
>>>>
>>>> Best,
>>>> Matthias
>>>>
>>>> On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <ramvasu.fl...@gmail.com> wrote:
>>>>
>>>>> Hi all
>>>>>
>>>>> I have a simple job where we read from a given path in cloud storage to
>>>>> watch for new files in a given folder. When I set up my job there was a
>>>>> permission issue on the folder. The job is a STREAMING job.
>>>>> The cluster is set up in session mode and is running on Kubernetes.
>>>>> Since then the job manager has been failing to come back up, and every time
>>>>> it fails with the permission issue. But the point is: how should I recover
>>>>> my cluster in this case? Since the JM is not there, the UI is also not working,
>>>>> so how do I remove the bad job from the JM?
>>>>>
>>>>> Regards
>>>>> Ram
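
For readers who want to try the workaround Matthias describes (deleting the 'jobGraph-...' entry from the JobGraphStore ConfigMap), here is a minimal sketch in Python. It only builds a JSON Patch document and the matching `kubectl patch` command line; it does not contact a cluster. The helper names `build_remove_key_patch` and `kubectl_patch_command` and the namespace `default` are assumptions made for illustration. The ConfigMap name and key are copied verbatim from the thread, so check them against your own cluster (for example with `kubectl get configmaps`) before running anything.

```python
import json
import shlex


def build_remove_key_patch(key: str) -> list:
    """Build a JSON Patch (RFC 6902) that removes one key from a ConfigMap's data."""
    # JSON Pointer escaping (RFC 6901): "~" becomes "~0", then "/" becomes "~1".
    escaped = key.replace("~", "~0").replace("/", "~1")
    return [{"op": "remove", "path": "/data/" + escaped}]


def kubectl_patch_command(configmap: str, key: str, namespace: str) -> list:
    """Return the kubectl argument list that applies the removal patch."""
    patch = json.dumps(build_remove_key_patch(key))
    return [
        "kubectl", "patch", "configmap", configmap,
        "--namespace", namespace,
        "--type=json",
        "--patch", patch,
    ]


if __name__ == "__main__":
    cmd = kubectl_patch_command(
        # ConfigMap name and key as written in the thread; verify both first.
        configmap="flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map",
        key="jobGraph-04ae99777ee2ed34c13fe8120e68436e",
        namespace="default",  # assumption: replace with your Flink namespace
    )
    # Print a shell-quoted command so it can be reviewed before being run.
    print(shlex.join(cmd))
```

The script prints the command rather than executing it, so the patch can be reviewed first. As noted in the thread, removing the entry only stops the JobManager from recovering that job; any job-related data still has to be cleaned up by hand.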