Thanks for the help, Arvid. I'm going to make these changes now and expect they will address the issue we've been having.
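In case it's useful to anyone else who hits this, the change amounts to letting the exception from open() propagate instead of calling exit(). A simplified sketch of our PGSink is below (the JDBC URL is a placeholder and String stands in for our actual record type):

    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    // Simplified sketch of our PGSink; the JDBC URL is a placeholder and
    // String stands in for our actual record type.
    public class PGSink extends RichSinkFunction<String> {

        private transient Connection connection;

        @Override
        public void open(Configuration parameters) throws Exception {
            // Previously we caught failures here and called Runtime.getRuntime().exit(1),
            // which killed the whole TaskManager JVM. Now we let the exception propagate
            // so Flink fails the task and applies the configured restart strategy.
            connection = DriverManager.getConnection("jdbc:postgresql://db-host:5432/measures");
        }

        @Override
        public void invoke(String value, Context context) throws Exception {
            // write the record to Postgres (omitted)
        }

        @Override
        public void close() throws Exception {
            if (connection != null) {
                connection.close();
            }
        }
    }

With that change, a failed Postgres connection just fails the task attempt and recovery is left to Flink's restart strategy, instead of taking down the whole TaskManager.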
On Mon, Sep 6, 2021 at 12:03 PM Arvid Heise <ar...@apache.org> wrote:
> Sorry for not immediately writing that: yes, you should just throw an exception. Flink will then try to close everything and restart the job, if the job is configured for restarts. [1]
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/
>
> On Mon, Sep 6, 2021 at 6:59 PM Jordan Hurwich <jhurw...@pulsasensors.com> wrote:
>
>> Thanks very much for the quick response, Arvid.
>>
>> I'm removing our calls to Runtime.getRuntime().exit(1) now. If something goes wrong when setting up the task, what's the preferred way to shut down gracefully so the TaskManager can restart the job successfully? Should I just allow an Exception to propagate up, or is there another method I should call instead of exit()?
>>
>> On Mon, Sep 6, 2021 at 9:41 AM Arvid Heise <ar...@apache.org> wrote:
>>
>>> Hi Jordan,
>>>
>>> Please never call Runtime.getRuntime().exit(1). This robs Flink of any way of shutting down gracefully and recovering. Since you mentioned EC2 (and not a managed service such as ECS or K8S), I'm also assuming that the task manager process is never restarted, and that's why you always end up with too few resources.
>>>
>>> If you have any other way of restarting the task manager, please let me know and I will look more closely into the logs.
>>>
>>> On Sun, Sep 5, 2021 at 8:02 PM Jordan Hurwich <jhurw...@pulsasensors.com> wrote:
>>>
>>>> Hi Flink Users,
>>>> We're having an issue where Flink jobs reliably get stuck in the "RESTARTING" state the first night after our deployments, with the Jobmanager logging "Could not fulfill resource requirements of job <job-id>. Free slots: 0". This is surprising because we expect the restarting job to reuse its existing slot, and because the issue still occurs even after increasing from our minimum required 2 slots to 6, though $ flink list indicates only 2 jobs.
>>>>
>>>> Manually stopping all jobs and the jobmanager, and restarting each, reliably addresses the issue, and it does not appear to occur again. We suspect some currently unknown root cause, but are particularly interested in ensuring the Flink jobs can automatically recover from this state without manual intervention.
>>>>
>>>> Details below, with logs and flink-conf.yaml attached. Your support and feedback are greatly appreciated,
>>>> Jordan, Pulsa Inc
>>>>
>>>> - $ flink --version: Version: 1.13.2, Commit ID: 5f007ff
>>>>
>>>> - What we observe
>>>>   - In this example we observe the issue beginning at about 2021-09-05 03:20:34; logs for "standalonesession" and "taskexecutor" for the relevant time period are attached, more available on request.
>>>>   - The earliest unique indication of the issue (at 2021-09-05 03:20:34) appears to be a failure of one of our two jobs ("Measure") to open a Postgres connection; the other job ("Signal") does not rely on Postgres. Logs attached in taskexecutor.out.
>>>>     - This occurs in the open() of our custom RichSinkFunction "PGSink", where the exception is caught and we call Runtime.getRuntime().exit(1).
>>>>   - Moments before the issue starts (at 2021-09-05 03:19:19) we observe the TaskManager canceling the previous instance of this job, as indicated in taskexecutor.log.
>>>>     - This corresponds with logs in standalonesession.log of "Trying to recover from a global failure...FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold".
>>>>     - standalonesession.log indicates this exception happens multiple times before this event, but in the earlier cases automatic recovery seems to work correctly.
>>>>   - Moments after the issue starts (at 2021-09-05 03:20:37), standalonesession.log shows another occurrence of "Exceeded checkpoint tolerable failure threshold", followed by considerably different logs than the previous occurrences.
>>>>   - At about this time and afterwards we repeatedly observe logs in standalonesession.log reporting "akka...NettyTransport...Remote connection to [null] failed with java.net.ConnectException: Connection refused: /172.30.172.30:40763".
>>>>   - In standalonesession.log we then see one occurrence of "...TimeoutException: Heartbeat of TaskManager with id 172.30.172.30:40763-75a3ae timed out" (at 2021-09-05 03:21:19); this exception never appears to be repeated.
>>>>   - Immediately following that exception, the repeating failure loop seems to begin:
>>>>     - The first log of the loop, in standalonesession.log, appears to be "Discarding the results produced by task execution 905e1...", where 905e1... corresponds with the "TimeoutException: Heartbeat..." exception.
>>>>     - The last log before repeating is "...CompletionException: ...NoResourceAvailableException: Could not acquire the minimum required resources", and this line is followed by a "Discarding the results..." entry with the hash id matching this exception's.
>>>>     - This loop then repeats indefinitely.
>>>>   - During the issue, $ flink list on each machine reports the following:
>>>>     ------------------ Running/Restarting Jobs -------------------
>>>>     04.09.2021 16:03:39 : 3143a6c3ab0a5f58130384c13b2eca45 : FlinkSignalProcessor (RESTARTING)
>>>>     04.09.2021 16:45:08 : 4d531101945f0726432f73f37a38ee48 : FlinkMeasureProcessor (RESTARTING)
>>>>
>>>> - When we observe the issue
>>>>   - On each deployment we use Terraform to deploy a cluster of "funnel" machines on AWS EC2, each running two Flink jobs, "...Measure..." and "...Signal...", that pull from an upstream RabbitMQ queue and output to our Postgres db and Redis cache respectively.
>>>>   - After deployment we always observe these jobs functioning properly in the RUNNING state on each machine, but some time during the night after deployment the jobs become nonfunctional, reporting RESTARTING indefinitely.
>>>>   - This appears to occur to both jobs on every machine in the cluster simultaneously, implying some currently unknown root cause.
>>>>   - Stopping the jobs and jobmanager on each machine, and restarting them, reliably fixes the issue, and it does not occur again after that first night following deployment.
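P.S. For anyone who finds this thread later: the recovery behavior Arvid describes in [1] depends on the restart strategy configured in flink-conf.yaml. Ours is along these lines (the values here are illustrative, not our exact settings):

    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
    restart-strategy.fixed-delay.delay: 10 s

If no restart strategy is configured and checkpointing is disabled, the default is no restarts, so the job simply fails once the exception propagates.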