Community,
I am currently doing some fault tolerance testing for Flink (1.10)
running on Kubernetes (1.18) and am encountering an issue where, after a
running job experiences a single injected failure, the job eventually
fails completely instead of recovering.
A Flink session cluster has been created following the documentation
here:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html.
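For reference, the cluster was brought up with the resource definitions
from that page, roughly:

    kubectl create -f flink-configuration-configmap.yaml
    kubectl create -f jobmanager-service.yaml
    kubectl create -f jobmanager-deployment.yaml
    kubectl create -f taskmanager-deployment.yaml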
The job is then uploaded and deployed via the web interface, and
everything runs smoothly. The job has a parallelism of 24, with 3 worker
nodes held in reserve as failovers. Each worker is assigned 1 task slot
(27 slots in total).
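Concretely, that translates into 27 TaskManager replicas with a single
slot each; the relevant settings (values reproduced here purely for
illustration) look like this:

    # flink-conf.yaml, shipped via the ConfigMap
    taskmanager.numberOfTaskSlots: 1

    # taskmanager-deployment.yaml
    spec:
      replicas: 27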
The next step is to inject a failure, for which I use the Pumba chaos
testing tool (https://github.com/alexei-led/pumba) to pause a random
worker process. For the moment, the selection and pausing are done
manually.
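To illustrate, the pause is triggered with something along these lines
(the container name pattern is an assumption; Pumba acts on the Docker
containers of the node it runs on):

    # pause one TaskManager container for 60 seconds
    pumba pause --duration 60s re2:^k8s_taskmanager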
Looking at the logs, Flink does detect the failure once the heartbeat
times out (the heartbeat timeout has been set to 20 seconds):
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with
id 768848f91ebdbccc8d518e910160414d timed out.
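For completeness, that timeout comes from the heartbeat settings in
flink-conf.yaml:

    heartbeat.timeout: 20000     # ms, lowered from the 50000 ms default
    heartbeat.interval: 10000    # ms, the default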
After the failure has been detected, the job restores from the latest
completed checkpoint and restarts under the configured restart strategy
(sketched after the stack trace below). It catches up nicely and resumes
normal processing... however, after about 3 minutes, the following error
occurs:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
Connection unexpectedly closed by remote task manager
'/10.45.128.1:6121'. This might indicate that the remote task manager
was lost.
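For context, restart behavior here is assumed to be a standard
fixed-delay restart strategy; the exact values below are illustrative,
not necessarily what is deployed:

    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
    restart-strategy.fixed-delay.delay: 10 s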
The job then fails and is unable to restart because the number of
available task slots has been reduced to zero. Looking at the Kubernetes
cluster, however, all containers are still running...
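For what it's worth, the state can be cross-checked with something like
this (the JobManager address is a placeholder):

    # all pods report Running
    kubectl get pods

    # yet the JobManager's REST API lists no registered TaskManagers
    curl http://<jobmanager-address>:8081/taskmanagers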
Has anyone else run into this? What am I missing? The same thing
happens when the containers are deleted instead of paused.
Regards,
M.