This is weird. Could you paste your entire exception trace here?

> On Nov 26, 2018, at 4:37 PM, Flink Developer <developer...@protonmail.com> 
> wrote:
> 
> In addition, after the Flink job has failed with the above exception, it is 
> unable to recover from the previous checkpoint. Is this the expected 
> behavior? How can the job be recovered successfully from this?
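> 
> Is the expected recovery path something like the following, assuming 
> externalized (retained) checkpoints are enabled for the job? The paths and 
> jar name below are only placeholders:
> 
>     # in the job, retain checkpoints on cancellation/failure:
>     #   env.getCheckpointConfig().enableExternalizedCheckpoints(
>     #       ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
>     # then resubmit from the retained checkpoint metadata:
>     bin/flink run -s s3://<bucket>/<checkpoint-dir>/chk-<n>/_metadata -p 400 my-job.jar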
> 
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, November 26, 2018 12:30 AM, Flink Developer 
> <developer...@protonmail.com> wrote:
> 
>> Thanks for the suggestion, Qi. I tried increasing slot.idle.timeout to 
>> 3600000, but I still ran into the issue. Does this mean that if a slot or 
>> "flink worker" has not processed items for 1 hour, it will be removed?
>> 
>> Would any other Flink configuration properties help with this, such as the 
>> ones below? (A rough flink-conf.yaml sketch follows the list.)
>> 
>> slot.request.timeout
>> web.timeout
>> heartbeat.interval
>> heartbeat.timeout
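>> 
>> To make it concrete, here is a rough flink-conf.yaml sketch of what I mean; 
>> the values are placeholders I made up, not recommendations:
>> 
>>     # all values are in milliseconds
>>     slot.idle.timeout: 3600000      # how long an unused slot is kept before release
>>     slot.request.timeout: 600000    # how long a pending slot request may wait
>>     web.timeout: 600000             # timeout for web/REST async operations
>>     heartbeat.interval: 10000       # interval between heartbeats
>>     heartbeat.timeout: 180000       # mark a peer dead after this long without a heartbeat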
>> 
>> 
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Sunday, November 25, 2018 6:56 PM, 罗齐 <luoqi...@bytedance.com> wrote:
>> 
>>> Hi,
>>> 
>>> It looks like some of your slots were freed during the job execution 
>>> (possibly due to being idle for too long). AFAIK the exception is thrown when a 
>>> pending slot request is removed. You can try increasing 
>>> “slot.idle.timeout” to mitigate this issue (the default is 50000; try 3600000 
>>> or higher).
>>> 
>>> Regards,
>>> Qi
>>> 
>>>> On Nov 26, 2018, at 7:36 AM, Flink Developer <developer...@protonmail.com> wrote:
>>>> 
>>>> Hi, I have a Flink application sourcing from a topic in Kafka (400 
>>>> partitions) and sinking to S3 using BucketingSink, with RocksDB as the state 
>>>> backend and checkpointing every 2 minutes. The Flink app runs with parallelism 
>>>> 400 so that each worker handles one partition. This is on Flink 1.5.2. The Flink 
>>>> cluster uses 10 task managers with 40 slots each.
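>>>> 
>>>> For context, the job wiring is roughly the sketch below (Flink 1.5-style 
>>>> APIs); the topic, brokers, S3 paths, and class name are placeholders rather 
>>>> than the actual job code:
>>>> 
>>>>     import java.util.Properties;
>>>> 
>>>>     import org.apache.flink.api.common.serialization.SimpleStringSchema;
>>>>     import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>>>>     import org.apache.flink.streaming.api.datastream.DataStream;
>>>>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>>>     import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
>>>>     import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;
>>>>     import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
>>>> 
>>>>     public class KafkaToS3Job {
>>>>         public static void main(String[] args) throws Exception {
>>>>             StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>>>             env.setParallelism(400);                  // one subtask per Kafka partition
>>>>             env.enableCheckpointing(2 * 60 * 1000);   // checkpoint every 2 minutes
>>>>             env.setStateBackend(new RocksDBStateBackend("s3://<bucket>/checkpoints"));
>>>> 
>>>>             // Kafka source over the 400-partition topic
>>>>             Properties props = new Properties();
>>>>             props.setProperty("bootstrap.servers", "<brokers>");
>>>>             props.setProperty("group.id", "<consumer-group>");
>>>>             DataStream<String> stream = env.addSource(
>>>>                     new FlinkKafkaConsumer011<>("<topic>", new SimpleStringSchema(), props));
>>>> 
>>>>             // BucketingSink writing to S3, bucketed by hour
>>>>             BucketingSink<String> sink = new BucketingSink<>("s3://<bucket>/output");
>>>>             sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HH"));
>>>>             stream.addSink(sink);
>>>> 
>>>>             env.execute("kafka-to-s3");
>>>>         }
>>>>     }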
>>>> 
>>>> After running for a few days straight, it encounters a Flink exception:
>>>> org.apache.flink.util.FlinkException: The assigned slot 
>>>> container_1234567_0003_01_000009_1 was removed.
>>>> 
>>>> This causes the Flink job to fail, and I am unsure what causes it. Also, during 
>>>> this time, I see some checkpoints stating "checkpoint was declined (tasks not 
>>>> ready)". At this point, the job is unable to recover and fails. Does this happen 
>>>> if a slot or worker has not been doing any processing for X amount of time? 
>>>> Would I need to increase the following Flink config properties when creating 
>>>> the Flink cluster in YARN?
>>>> 
>>>> slot.idle.timeout
>>>> slot.request.timeout
>>>> web.timeout
>>>> heartbeat.interval
>>>> heartbeat.timeout
>>>> 
>>>> Any help would be greatly appreciated.
>>>> 
>> 
> 
