Hi,

It looks like some of your slots were freed during the job execution (possibly 
because they were idle for too long). AFAIK the exception is thrown when a pending 
slot request is removed. You can try increasing "slot.idle.timeout" to mitigate 
this issue (the default is 50000 ms; try 3600000 or higher).
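
For example, in flink-conf.yaml (a rough sketch; 3600000 is just a suggested 
starting point, and the defaults noted below are from memory for Flink 1.5, so 
please double-check them against the docs for your version):

    slot.idle.timeout: 3600000      # ms, default 50000
    slot.request.timeout: 600000    # ms, default 300000
    heartbeat.interval: 10000       # ms, default 10000
    heartbeat.timeout: 120000       # ms, default 50000; only raise if you also
                                    # see heartbeat timeouts in the logs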

Regards,
Qi

> On Nov 26, 2018, at 7:36 AM, Flink Developer <developer...@protonmail.com> 
> wrote:
> 
> Hi, I have a Flink application sourcing from a topic in Kafka (400 
> partitions) and sinking to S3 using the BucketingSink, with RocksDB as the 
> state backend and checkpointing every 2 mins. The Flink app runs with 
> parallelism 400 so that each worker handles a partition. This is using Flink 
> 1.5.2. The Flink cluster uses 10 task managers with 40 slots each.
> 
> After running for a few days straight, it encounters a Flink exception:
> org.apache.flink.util.FlinkException: The assigned slot 
> container_1234567_0003_01_000009_1 was removed.
> 
> This causes the Flink job to fail, and I am unsure what causes it. Also, 
> during this time, I see some checkpoints stating "checkpoint was declined 
> (tasks not ready)". At this point, the job is unable to recover and fails. 
> Does this happen if a slot or worker is not doing any processing for X amount 
> of time? Would I need to increase the following Flink config properties when 
> creating the Flink cluster in YARN?
> 
> slot.idle.timeout
> slot.request.timeout
> web.timeout
> heartbeat.interval
> heartbeat.timeout
> 
> Any help would be greatly appreciated.
> 
