Thanks for the suggestion Qi. I tried increasing slot.idle.timeout to 3600000 but it seems to still have encountered the issue. Does this mean if a slot or "flink worker" has not processed items for 1 hour, that it will be removed?
Would any other flink configuration properties help for this? slot.request.timeout web.timeout heartbeat.interval heartbeat.timeout ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Sunday, November 25, 2018 6:56 PM, 罗齐 <[email protected]> wrote: > Hi, > > It looks that some of your slots were freed during the job execution > (possibly due to idle for too long). AFAIK the exception was thrown when a > pending Slot request was removed. You can try increase the > “Slot.idle.timeout” to mitigate this issue (default is 50000, try 3600000 or > higher). > > Regards, > Qi > >> On Nov 26, 2018, at 7:36 AM, Flink Developer <[email protected]> >> wrote: >> >> Hi, I have a Flink application sourcing from a topic in Kafka (400 >> partitions) and sinking to S3 using bucketingsink and using RocksDb for >> checkpointing every 2 mins. The Flink app runs with parallelism 400 so that >> each worker handles a partition. This is using Flink 1.5.2. The Flink >> cluster uses 10 task managers with 40 slots each. >> >> After running for a few days straight, it encounters a Flink exception: >> Org.apache.flink.util.FlinkException: The assigned slot >> container_1234567_0003_01_000009_1 was removed. >> >> This causes the Flink job to fail. It is odd to me. I am unsure what causes >> this. Also, during this time, I see some checkpoints stating "checkpoint was >> declined (tasks not ready)". At this point, the job is unable to recover and >> fails. Does this happen if a slot or worker is not doing processing for X >> amount of time? Would I need to increase the Flink config properties for the >> following when creating the Flink cluster in yarn? >> >> Slot.idle.timeout >> Slot.request.timeout >> Web.timeout >> Heartbeat.interval >> Heartbeat.timeout >> >> Any help would be greatly appreciated.
