Hi Ori,

Thanks for reaching out! I'm afraid there's not much we can help with here. As you mentioned, it looks like there's a network issue, which would be on the Google side of things. I'm assuming that the mentioned Flink version corresponds to Flink 1.12 [1], which is no longer supported by the Flink community. Are you restarting the job from a savepoint, or starting fresh without any state at all?
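If the job is currently being resubmitted without any state, it may be worth retaining externalized checkpoints so there's always something recent to resume from after a failure. A rough sketch using plain Flink 1.12 APIs (the interval, class name, and trivial source below are just placeholders for illustration):

    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointedJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 60s so a restarted job has recent state to resume from.
            env.enableCheckpointing(60_000);

            // Keep the latest checkpoint after failure/cancellation, so it can be
            // restored like a savepoint via: flink run -s <checkpoint path> ...
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // Placeholder pipeline; the real job logic goes here.
            env.fromElements(1, 2, 3).print();
            env.execute("checkpointed-job");
        }
    }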
Best regards,

Martijn

[1] https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0

On Sun, Oct 2, 2022 at 3:38 AM Ori Popowski <ori....@gmail.com> wrote:

> Hi,
>
> We're using Flink 2.10.2 on Google Dataproc.
>
> Lately we've been experiencing a very unusual problem: the job fails, and
> when it tries to recover we get this error:
>
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout
>
> I investigated what happened and saw that the failure is caused by a
> heartbeat timeout to one of the containers. I looked at the container's
> logs and saw something unusual:
>
> 1. Eight minutes before the heartbeat timeout, the logs show connection
> problems to the Confluent Kafka topic and also to Datadog, which means
> there's a network issue with the whole node or just that specific container.
> 2. The container logs stop at this point, but the node logs show multiple
> garbage collection pauses, ranging from 10 seconds to 215 (!) seconds.
>
> It looks like right after the network issue the node itself gets into an
> endless GC phase, and my theory is that the slots are not fulfillable
> because the node is unavailable while it's stuck in GC.
>
> I want to note that we've been running this job for months without any
> issues. The problems started a month ago arbitrarily, not following a Flink
> version upgrade, a job code change, a change in the amount or type of data
> being processed, or a Dataproc image version change.
>
> Attached are the job manager logs, container logs, and node logs.
>
> How can we recover from this issue?
>
> Thanks!
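For completeness: the two timeouts in that error path (the TaskManager heartbeat and the slot request) are configurable, although if the 215-second GC pauses are the real culprit, raising them only buys time rather than fixing the node. A rough sketch of the relevant option classes in Flink 1.12 (the values and class name are placeholders, and on Dataproc these settings would normally live in flink-conf.yaml rather than in code):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.HeartbeatManagerOptions;
    import org.apache.flink.configuration.JobManagerOptions;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class TimeoutConfigSketch {
        public static void main(String[] args) throws Exception {
            // The two timeouts behind the messages in the logs. Values here are
            // only illustrative, not recommendations.
            Configuration conf = new Configuration();
            conf.set(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 120_000L); // default 50000 ms
            conf.set(JobManagerOptions.SLOT_REQUEST_TIMEOUT, 600_000L);    // default 300000 ms

            // Local mini-cluster just to show the options being picked up.
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.createLocalEnvironment(1, conf);
            env.fromElements(1, 2, 3).print();
            env.execute("timeout-config-sketch");
        }
    }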