Thanks @Matthias Pohl <matthias.p...@aiven.io>. This is informative. So in
general, if a session cluster runs more than one job and only one of them
has this issue, will we still face the same problem?

Regards
Ram

On Mon, Sep 26, 2022 at 4:32 PM Matthias Pohl <matthias.p...@aiven.io>
wrote:

> I see. Thanks for sharing the logs. It's related to FLINK-9097 [1]. So
> that the job is not cleaned up entirely after a failure while submitting
> the job, the JobManager fails fatally, which results in a failover.
> That's what you're experiencing.
>
> One solution is to fix the permission issue to make the job recover
> without problems. If that's not what you want to do, you could delete the
> entry with the key 'jobGraph-04ae99777ee2ed34c13fe8120e68436e' from the
> JobGraphStore ConfigMap (based on your logs it should
> be flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map). This will
> prevent the JobManager from recovering this specific job. Keep in mind that
> you have to clean up any job-related data by yourself in that case.
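
The manual cleanup described above could be sketched roughly as follows. This is only an illustration, not an official Flink procedure: it builds a JSON Patch (RFC 6902) that removes one key from a ConfigMap's data section and prints a `kubectl patch` command for it. The key and ConfigMap name are copied verbatim from the message above (check the actual name with `kubectl get configmaps` first); the namespace `default` and the helper function names are assumptions made for this example.

```python
import json
import shlex


def escape_pointer_token(key: str) -> str:
    """Escape a key for a JSON Pointer path (RFC 6901): '~' -> '~0', '/' -> '~1'."""
    return key.replace("~", "~0").replace("/", "~1")


def build_remove_patch(data_key: str) -> str:
    """Return a JSON Patch document that removes one entry from a ConfigMap's data."""
    patch = [{"op": "remove", "path": "/data/" + escape_pointer_token(data_key)}]
    return json.dumps(patch)


def build_kubectl_command(config_map: str, data_key: str, namespace: str) -> str:
    """Return a shell-quoted kubectl command that applies the removal patch."""
    args = [
        "kubectl", "patch", "configmap", config_map,
        "--namespace", namespace,
        "--type=json",
        "-p", build_remove_patch(data_key),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Values copied from the thread; the namespace is an assumption.
    print(build_kubectl_command(
        config_map="flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map",
        data_key="jobGraph-04ae99777ee2ed34c13fe8120e68436e",
        namespace="default",
    ))
```

The script only prints the command so it can be reviewed before anything is changed in the cluster. As noted above, removing the entry does not clean up other job-related data.
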
>
> I hope that helps.
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-9097
>
> On Mon, Sep 26, 2022 at 12:26 PM ramkrishna vasudevan <
> ramvasu.fl...@gmail.com> wrote:
>
>> I got some logs and stack traces from our backend storage. This is not
>> the entire log though. Can this be useful? With this set of log messages,
>> the job manager kept restarting.
>>
>> Regards
>> Ram
>>
>> On Mon, Sep 26, 2022 at 3:11 PM ramkrishna vasudevan <
>> ramvasu.fl...@gmail.com> wrote:
>>
>>> Thank you very much for the reply. I have lost the k8s cluster in this
>>> case before I could capture the logs. I will try to repro this and get back
>>> to you.
>>>
>>> Regards
>>> Ram
>>>
>>> On Mon, Sep 26, 2022 at 12:42 PM Matthias Pohl <matthias.p...@aiven.io>
>>> wrote:
>>>
>>>> Hi Ramkrishna,
>>>> thanks for reaching out to the Flink community. Could you share the
>>>> JobManager logs to get a better understanding of what's going on? I'm
>>>> wondering why the JobManager is failing when the actual problem is that the
>>>> job is struggling to access a folder. It sounds like there are multiple
>>>> problems here.
>>>>
>>>> Best,
>>>> Matthias
>>>>
>>>> On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
>>>> ramvasu.fl...@gmail.com> wrote:
>>>>
>>>>> Hi all
>>>>>
>>>>> I have a simple job that reads from a given path in cloud storage and
>>>>> watches for new files in a given folder. When I set up my job, there was
>>>>> a permission issue on the folder. The job is a STREAMING job.
>>>>> The cluster is set up in session mode and is running on Kubernetes.
>>>>> Since then, the job manager has been failing to come back up, and every
>>>>> time it fails with the permission issue. The question is: how should I
>>>>> recover my cluster in this case? Since the JM is down, the UI is not
>>>>> working either, so how do I remove the bad job from the JM?
>>>>>
>>>>> Regards
>>>>> Ram
>>>>>
>>>>
