We are currently using version 1.14. The final manifestation of the issue
shows up as the trace I pasted above, and then the job keeps restarting.
When we trace back, we see various exceptions depending on the job; for
one of the jobs, for example, some tasks were failing with out-of-memory
exceptions. We resolve the issue by deleting all of the TaskManager pods
from the Kubernetes cluster. As soon as we delete all of the TaskManagers,
new pods are created and the job starts up normally. My suspicion is that
the scheduler tries to start the new job very aggressively, and so it is
not able to find enough resources.
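
For context, here is a rough sketch of how the restart aggressiveness can
be tuned, assuming the exponential-delay restart strategy that ships with
1.14. The values below are purely illustrative, not something we have
validated; the idea is just to give Kubernetes time to bring replacement
TaskManager pods up before the scheduler requests slots again:

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartBackoffExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Back off between restart attempts so the cluster has time to
            // provision new TaskManager pods before slots are requested again.
            env.setRestartStrategy(RestartStrategies.exponentialDelayRestart(
                    Time.seconds(10),   // initial backoff between restarts
                    Time.minutes(2),    // cap on the backoff
                    2.0,                // backoff multiplier per attempt
                    Time.minutes(10),   // reset backoff after this much stable time
                    0.1));              // jitter factor

            // ... build the actual job topology here, then call env.execute().
        }
    }

The same behavior can also be configured cluster-wide through the
restart-strategy.exponential-delay.* keys in flink-conf.yaml instead of in
code.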

On Sun, Jul 31, 2022 at 6:59 PM Lijie Wang <wangdachui9...@gmail.com> wrote:

> Hi,
>
> Which version are you using? Has any job failover occurred? It would be
> better if you can provide the full log of JM.
>
> Best,
> Lijie
>
> Hemanga Borah <borah.hema...@gmail.com> wrote on Mon, Aug 1, 2022 at 01:47:
>
>> Hello guys,
>>  We have been seeing an issue with our Flink applications. Our
>> applications run fine for several hours, and then we see an error/exception
>> like so:
>>
>> java.util.concurrent.CompletionException: 
>> java.util.concurrent.CompletionException:
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Could not acquire the minimum required resources.
>>
>> For some applications, this error/exception appears once, stays in the
>> exception history for a while, and the job recovers. For other
>> applications, we see this error thrown repeatedly and the application
>> gets into a crash loop.
>>
>> Since our application had been running fine for several hours before we
>> saw this message, our suspicion is that when the crash happens, the job
>> manager aggressively tries to restart the job and is not able to acquire
>> enough resources because the previous job has not been cleaned up yet.
>>
>> Has anyone else been seeing this issue? If so, what did you guys try to
>> fix it?
>>
>> Thanks,
>> HKB
>>
>>