Hi Jordan,

Please never call Runtime.getRuntime().exit(1). It robs Flink of any chance
of a graceful shutdown and recovery. Since you mentioned EC2 (and not a
managed service such as ECS or Kubernetes), I'm also assuming that the
TaskManager process is never restarted, and that's why you always end up with
too few resources.
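
For reference, here is a minimal sketch of what the open() of a sink like
your "PGSink" could look like without the exit(1) call; the class shape,
field names and JDBC details are my assumptions, since the actual code wasn't
shared. If the exception simply propagates, Flink fails the task, frees its
slot cleanly and lets the configured restart strategy retry the job:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class PGSink extends RichSinkFunction<String> {

    private final String jdbcUrl;
    private final String user;
    private final String password;
    private transient Connection connection;

    public PGSink(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        try {
            connection = DriverManager.getConnection(jdbcUrl, user, password);
        } catch (SQLException e) {
            // Instead of Runtime.getRuntime().exit(1): rethrow so Flink fails
            // the task, releases its slot and retries per the restart strategy.
            throw new IOException("Could not open Postgres connection", e);
        }
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        // ... write the record to Postgres ...
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}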

If you have some other way of restarting the TaskManager in place, please let
me know and I'll take a closer look at the logs.
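
On a plain EC2 host, one way to get the TaskManager restarted automatically
is to let the OS supervise the process, e.g. with a systemd unit along these
lines (install path and user are assumptions, adjust to your setup):

[Unit]
Description=Flink TaskManager (standalone)
After=network.target

[Service]
User=flink
# Path is an assumption -- point this at your Flink installation.
ExecStart=/opt/flink/bin/taskmanager.sh start-foreground
# Restart the process whenever it exits, including after a JVM exit(1).
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

With something like this in place, a crashed TaskManager comes back on its
own and re-registers its slots with the JobManager.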



On Sun, Sep 5, 2021 at 8:02 PM Jordan Hurwich <jhurw...@pulsasensors.com>
wrote:

> Hi Flink Users,
> We're having an issue where Flink jobs reliably get stuck in the "RESTARTING"
> state the first night after our deployments, with logs from the JobManager
> reporting "Could not fulfill resource requirements of job
> <job-id>. Free slots: 0". This is surprising because we expect the
> restarting job to reuse its existing slot, and because the issue appears to
> still occur even after increasing from our minimum required 2 slots to 6,
> though $ flink list indicates only 2 jobs.
>
> Manually stopping all jobs and the jobmanager, and restarting each,
> reliably addresses the issue and it does not appear to occur again. We
> suspect some currently unknown root cause, but are particularly interested
> in ensuring the Flink jobs can automatically recover from this state
> without manual intervention.
>
> Details below with logs and flink-conf.yaml attached. Your support and
> feedback is greatly appreciated,
> Jordan, Pulsa Inc
>
>    - $ flink --version: Version: 1.13.2, Commit ID: 5f007ff
>
>    - What we observe
>    - In this example we observe the issue beginning at about 2021-09-05
>       03:20:34; logs for "standalonesession" and "taskexecutor" for the
>       relevant time period are attached, more available on request.
>       - The earliest unique indication of the issue (at 2021-09-05
>       03:20:34) appears to be a failure of one of our two jobs ("Measure") to
>       open a Postgres connection; the other job, "Signal", does not rely on
>       Postgres. Logs attached in taskexecutor.out
>          - This occurs in the open() of our custom RichSinkFunction
>          "PGSink", where the exception is caught and we call
>          Runtime.getRuntime().exit(1)
>       - Moments before the issue starts (at 2021-09-05 03:19:19) we
>       observe the TaskManager canceling the previous instance of this job, as
>       indicated in taskexecutor.log
>       - This corresponds with logs in standalonesession.log of "Trying to
>          recover from a global failure...FlinkRuntimeException: Exceeded
>          checkpoint tolerable failure threshold"
>          - standalonesession.log indicates this exception happens
>          multiple times before this event, but in the earlier cases
>          automatic recovery seems to work correctly.
>       - Moments after the issue starts (at 2021-09-05 03:20:37) we see in
>       standalonesession.log another occurrence of "Exceeded checkpoint
>       tolerable failure threshold", followed by considerably different logs
>       than in the previous occurrences.
>          - At about this time and afterwards we repeatedly observe logs in
>          standalonesession.log reporting "akka...NettyTransport...Remote
>          connection to [null] failed with java.net.ConnectException:
>          Connection refused: /172.30.172.30:40763"
>       - In standalonesession.log we then see one occurrence of
>       "...TimeoutException: Heartbeat of TaskManager with id
>       172.30.172.30:40763-75a3ae timed out" (at 2021-09-05 03:21:19) - this
>       exception never appears to be repeated.
>       - Immediately following that exception the repeating failure loop
>       seems to begin:
>          - The first log of the loop, in standalonesession.log, appears
>          to be "Discarding the results produced by task execution 905e1...",
>          where 905e1... corresponds with the "TimeoutException:
>          Heartbeat..." exception
>          - The last log before the loop repeats is "...CompletionException:
>          ...NoResourceAvailableException: Could not acquire the minimum
>          required resources", and this line is followed by "Discarding the
>          results..." with the hash id matching this exception's.
>          - This loop then repeats indefinitely.
>       - During the issue, $ flink list on each machine reports the
>       following:
>       ------------------ Running/Restarting Jobs -------------------
>       04.09.2021 16:03:39 : 3143a6c3ab0a5f58130384c13b2eca45 : FlinkSignalProcessor (RESTARTING)
>       04.09.2021 16:45:08 : 4d531101945f0726432f73f37a38ee48 : FlinkMeasureProcessor (RESTARTING)
>
>
>    - When we observe the issue
>       - On each deployment we use Terraform to deploy a cluster of
>       "funnel" machines on AWS EC2, each running two Flink jobs,
>       "...Measure..." and "...Signal...", that pull from some upstream
>       RabbitMQ queue and output into our Postgres db and Redis cache
>       respectively.
>       - After deployment we always observe these jobs functioning
>       properly in the RUNNING state on each machine, but some time during
>       the night after deployment the jobs become nonfunctional, reporting
>       RESTARTING indefinitely.
>       - This appears to occur to both jobs on every machine in the
>       cluster simultaneously, implying some currently unknown root cause.
>       - Stopping the jobs and jobmanager on each machine, and restarting
>       them, reliably fixes the issue and it does not occur again after
>       that first night following deployment.
>
>
