Hi,

It can also be that some task managers failed; in that case, pending checkpoints are declined and slots are removed. Can you also have a look into the task manager logs?
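If you do end up raising the timeouts discussed further down, the corresponding flink-conf.yaml entries would look roughly like this. This is a sketch only: all values are in milliseconds, the numbers are examples rather than recommendations, and the defaults are worth double-checking against the 1.5 docs:

    # flink-conf.yaml -- example values, not recommendations
    slot.idle.timeout: 3600000      # default 50000; how long an unused slot is kept
    slot.request.timeout: 600000    # how long a pending slot request may wait
    heartbeat.interval: 10000       # interval between TM/JM heartbeats
    heartbeat.timeout: 120000       # default 50000; a TM is marked dead after this
    web.timeout: 600000             # timeout for web monitor async operations

Whatever values you pick, heartbeat.timeout should stay well above heartbeat.interval.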
Best,
Andrey

> On 26 Nov 2018, at 12:19, qi luo <luoqi...@gmail.com> wrote:
>
> This is weird. Could you paste your entire exception trace here?
>
>> On Nov 26, 2018, at 4:37 PM, Flink Developer <developer...@protonmail.com> wrote:
>>
>> In addition, after the Flink job has failed from the above exception, it is unable to recover from the previous checkpoint. Is this the expected behavior? How can the job be recovered successfully from this?
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Monday, November 26, 2018 12:30 AM, Flink Developer <developer...@protonmail.com> wrote:
>>
>>> Thanks for the suggestion, Qi. I tried increasing slot.idle.timeout to 3600000, but it still encountered the issue. Does this mean that if a slot or "Flink worker" has not processed items for 1 hour, it will be removed?
>>>
>>> Would any other Flink configuration properties help with this?
>>>
>>> slot.request.timeout
>>> web.timeout
>>> heartbeat.interval
>>> heartbeat.timeout
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>> On Sunday, November 25, 2018 6:56 PM, 罗齐 <luoqi...@bytedance.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> It looks like some of your slots were freed during job execution (possibly due to being idle for too long). AFAIK the exception is thrown when a pending slot request is removed. You can try increasing "slot.idle.timeout" to mitigate this issue (the default is 50000; try 3600000 or higher).
>>>>
>>>> Regards,
>>>> Qi
>>>>
>>>>> On Nov 26, 2018, at 7:36 AM, Flink Developer <developer...@protonmail.com> wrote:
>>>>>
>>>>> Hi, I have a Flink application sourcing from a topic in Kafka (400 partitions) and sinking to S3 using BucketingSink, with RocksDB for checkpointing every 2 mins. The Flink app runs with parallelism 400 so that each worker handles a partition. This is using Flink 1.5.2. The Flink cluster uses 10 task managers with 40 slots each.
>>>>>
>>>>> After running for a few days straight, it encounters a Flink exception:
>>>>> org.apache.flink.util.FlinkException: The assigned slot container_1234567_0003_01_000009_1 was removed.
>>>>>
>>>>> This causes the Flink job to fail. It is odd to me, and I am unsure what causes this. Also, during this time, I see some checkpoints stating "checkpoint was declined (tasks not ready)". At this point, the job is unable to recover and fails. Does this happen if a slot or worker has not done any processing for X amount of time? Would I need to increase the following Flink config properties when creating the Flink cluster in YARN?
>>>>>
>>>>> slot.idle.timeout
>>>>> slot.request.timeout
>>>>> web.timeout
>>>>> heartbeat.interval
>>>>> heartbeat.timeout
>>>>>
>>>>> Any help would be greatly appreciated.
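On the recovery question raised above: one common approach (a sketch under assumptions, not necessarily what the original job did) is to enable externalized checkpoints plus a restart strategy in the job, and, if the job still fails terminally, resubmit it from the retained checkpoint with "flink run -s <checkpoint-path>". A minimal Java sketch follows; the 2-minute checkpoint interval comes from the thread, while the class name, retry count, and delay are illustrative:

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointRecoverySketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 2 minutes, matching the job described above.
            env.enableCheckpointing(2 * 60 * 1000);

            // Keep the last checkpoint even when the job fails or is cancelled,
            // so it can be used to resubmit the job. This requires
            // state.checkpoints.dir to be set in flink-conf.yaml.
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // Let the job restart itself a few times before failing for good.
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                    3, Time.of(10, TimeUnit.SECONDS)));

            // ... Kafka source, BucketingSink, etc. elided ...

            env.execute("checkpoint-recovery-sketch");
        }
    }

With RETAIN_ON_CANCELLATION, the latest checkpoint directory (a chk-<n> directory under state.checkpoints.dir) survives the failure and can be passed to the -s option when resubmitting the job.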