Hi, I have a Flink application that sources from a Kafka topic (400 partitions) and sinks to S3 using BucketingSink, with RocksDB as the state backend and checkpointing every 2 minutes. The job runs with parallelism 400 so that each slot handles one partition. This is on Flink 1.5.2, and the cluster uses 10 task managers with 40 slots each.
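For reference, here is a stripped-down sketch of what the job looks like (the topic name, S3 paths, group id, and the exact Kafka consumer version are placeholders, not the real values):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class KafkaToS3Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // One slot per Kafka partition: 400 partitions -> parallelism 400.
        env.setParallelism(400);

        // Checkpoint every 2 minutes, with RocksDB as the state backend.
        env.enableCheckpointing(2 * 60 * 1000);
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints"));

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "broker-1:9092");
        kafkaProps.setProperty("group.id", "kafka-to-s3");

        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), kafkaProps));

        // BucketingSink writes the records out to S3.
        stream.addSink(new BucketingSink<String>("s3://my-bucket/flink/output"));

        env.execute("Kafka to S3");
    }
}
```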
After running for a few days straight, the job fails with a Flink exception: org.apache.flink.util.FlinkException: The assigned slot container_1234567_0003_01_000009_1 was removed. This is odd to me and I am not sure what causes it. Around the same time I also see some checkpoints failing with "checkpoint was declined (tasks not ready)", and at that point the job is unable to recover and fails. Does this happen if a slot or worker does no processing for a certain amount of time? Would I need to increase any of the following Flink config properties when creating the Flink cluster on YARN: slot.idle.timeout, slot.request.timeout, web.timeout, heartbeat.interval, heartbeat.timeout? Any help would be greatly appreciated.
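In case it is useful, this is roughly where I would expect to set those options, in flink-conf.yaml before starting the YARN session (the millisecond values below are only illustrative, not what the cluster currently uses and not intended as recommendations):

```yaml
# flink-conf.yaml -- illustrative values in milliseconds
slot.idle.timeout: 50000
slot.request.timeout: 600000
web.timeout: 60000
heartbeat.interval: 10000
heartbeat.timeout: 120000
```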