Hi Garrett,

Have you set a restart strategy for your job [1]? In order to recover from
failures, you need to specify one. Otherwise, Flink will terminally fail the
job when a failure occurs.

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/restart_strategies.html
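
For example, a fixed-delay restart strategy can be set in flink-conf.yaml
(the exact values below are just placeholders):

  restart-strategy: fixed-delay
  restart-strategy.fixed-delay.attempts: 3
  restart-strategy.fixed-delay.delay: 10 s

or programmatically on the execution environment:

  import org.apache.flink.api.common.restartstrategy.RestartStrategies;
  import org.apache.flink.api.common.time.Time;
  import org.apache.flink.api.java.ExecutionEnvironment;
  import java.util.concurrent.TimeUnit;

  ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  // restart up to 3 times, waiting 10 seconds between attempts
  env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
      3, Time.of(10, TimeUnit.SECONDS)));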

Cheers,
Till

On Thu, Jun 21, 2018 at 7:43 PM Garrett Barton <garrett.bar...@gmail.com>
wrote:

> Actually, random thought: could YARN preemption be causing this?  What is
> the expected failure-handling behavior when a task manager that is doing
> real work goes down in YARN?  The docs make it sound like Flink should fire
> up another TM and get back to work out of the box, but I'm not seeing that.
>
>
> On Thu, Jun 21, 2018 at 1:20 PM Garrett Barton <garrett.bar...@gmail.com>
> wrote:
>
>> Thank you all for the reply!
>>
>> I am running batch jobs: I read in a handful of files from HDFS and
>> output to HBase, HDFS, and Kafka.  I run into this when I have only
>> partial usage of the cluster as the job runs.  Right now I spin up 20
>> nodes with 3 slots each; my job at peak uses all 60 slots, but since my
>> outputs are all forced to parallelism 1 while I work out the kinks, by the
>> end it typically runs in only one or two task managers.  The other 18-19
>> task managers die off.  The problem is that as soon as any task manager
>> dies off, my client throws the above exception and the job fails.
>>
>> I cannot share logs, but I was thinking about writing a dirt-simple
>> map/reduce flow based on the WordCount example.  The example would have a
>> wide map phase that generates data, and then I'd run it through a reducer
>> that sleeps maybe 1 second per record.  I believe that will simulate my
>> condition very well, where I go from 100% used slots to only 1-2 used
>> slots as I hit that timeout.  I'll do that today and let you know; if it
>> works I can share the code in here as an example.
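>>
>> Roughly, what I have in mind is something like this (untested sketch, just
>> to show the shape of it; the class name is made up):
>>
>>   import org.apache.flink.api.common.functions.MapFunction;
>>   import org.apache.flink.api.common.functions.MapPartitionFunction;
>>   import org.apache.flink.api.java.ExecutionEnvironment;
>>   import org.apache.flink.api.java.io.DiscardingOutputFormat;
>>   import org.apache.flink.util.Collector;
>>
>>   public class SlowSinkRepro {
>>       public static void main(String[] args) throws Exception {
>>           ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>>
>>           env.generateSequence(1, 100_000)
>>               // wide "map" phase: fans out across all slots and generates data
>>               .map(new MapFunction<Long, String>() {
>>                   @Override
>>                   public String map(Long i) {
>>                       return "word-" + (i % 1000);
>>                   }
>>               })
>>               // slow parallelism-1 "reducer": sleeps ~1 second per record, so the
>>               // upstream task managers go idle long before the job finishes
>>               .mapPartition(new MapPartitionFunction<String, String>() {
>>                   @Override
>>                   public void mapPartition(Iterable<String> values, Collector<String> out)
>>                           throws Exception {
>>                       for (String v : values) {
>>                           Thread.sleep(1000);
>>                           out.collect(v);
>>                       }
>>                   }
>>               })
>>               .setParallelism(1)
>>               .output(new DiscardingOutputFormat<String>());
>>
>>           env.execute("taskmanager-timeout-repro");
>>       }
>>   }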
>>
>> On Thu, Jun 21, 2018 at 5:01 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hi Garrett,
>>>
>>> The killing of idle TaskManagers should not affect the execution of the
>>> job. By definition, a TaskManager only idles if it does not execute any tasks.
>>> Could you maybe share the complete logs (of the cluster entrypoint and all
>>> TaskManagers) with us?
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Jun 21, 2018 at 10:26 AM Fabian Hueske <fhue...@gmail.com>
>>> wrote:
>>>
>>>> Hi Garrett,
>>>>
>>>> I agree, there seems to be an issue here, and increasing the timeout is
>>>> not the right approach to solve it.
>>>> Are you running streaming or batch jobs, i.e., do some of the tasks
>>>> finish much earlier than others?
>>>>
>>>> I'm adding Till to this thread who's very familiar with scheduling and
>>>> process communication.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2018-06-19 0:03 GMT+02:00 Garrett Barton <garrett.bar...@gmail.com>:
>>>>
>>>>> Hey all,
>>>>>
>>>>>  The jobs I am trying to write in Flink 1.5 are failing after a few
>>>>> minutes.  I think it's because the idle task managers are shutting down,
>>>>> which seems to kill the client and the running job.  The running job
>>>>> itself was still going on one of the other task managers.  I get:
>>>>>
>>>>> org.apache.flink.client.program.ProgramInvocationException:
>>>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>>>>> Connection unexpectedly closed by remote task manager 'xxxx'. This might
>>>>> indicate that the remote task manager was lost.
>>>>> at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:143)
>>>>>
>>>>> Now, I happen to have the last part of the flow set to parallelism 1
>>>>> right now for debugging, so of the 4 task managers that are spun up, 3
>>>>> of them hit the timeout period (currently set to 240000).  I think as
>>>>> soon as the first one goes, the client throws up and the whole job dies
>>>>> as a result.
>>>>>
>>>>>  Is this expected behavior, and if so, is there another way around it?
>>>>> Do I keep increasing slotmanager.taskmanager-timeout to a really, really
>>>>> large number?  I have verified that setting the timeout to 840000 lets
>>>>> the job complete without error.
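>>>>>
>>>>> For reference, this is the setting I'm changing in flink-conf.yaml (the
>>>>> value is in milliseconds):
>>>>>
>>>>>   slotmanager.taskmanager-timeout: 840000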
>>>>>
>>>>> Thank you!
>>>>>
>>>>
>>>>
