I have a Flink 1.15 app running in Kubernetes (v1.22) deployed via operator 1.2, using S3-based HA with 2 jobmanagers and 2 taskmanagers.
The app consumes a high-traffic Kafka topic and writes to a Cassandra database. It had been running fine for 4 days, but at some point the taskmanagers crashed. Looking through my logs, the oldest messages I see are "Exec Failure java.io.EOFException null" from both taskmanagers at exactly the same time, but there is no associated stack trace. After that, the taskmanagers try to restart, but I see another "Exec Failure java.io.EOFException null" message from one jobmanager and shortly after the task manager sets the newly started taskmanagers to a failed state with the message below. This repeats a couple more times until finally no more taskmanagers try to come up, and the jobmanager sit there throwing RecipientUnreachableExceptions because there are no more taskmanagers around. Any idea what that "Exec Failure java.io.EOFException null" message is about, or what can I do to debug it if it happens again? Thanks, Javier Vegas message from jobmanager Source: event-activity (2/4)#0 (090ef433d97011f8f595885a9bb39a28) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for b7c72bb5b7570a7c981f761afa1b7ea6. at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1679) at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$closeJob$18(TaskExecutor.java:1660) at java.base/java.util.Optional.ifPresent(Unknown Source) at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJob(TaskExecutor.java:1658) at org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:462) at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.lambda$terminate$0(AkkaRpcActor.java:568) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:567) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:191) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:537) at akka.actor.Actor.aroundReceive$(Actor.scala:535) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) at akka.actor.ActorCell.invoke(ActorCell.scala:548) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) at akka.dispatch.Mailbox.run(Mailbox.scala:231) at akka.dispatch.Mailbox.exec(Mailbox.scala:243) at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) Caused by: org.apache.flink.util.FlinkException: The TaskExecutor is shutting down. at org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:456) ... 25 more