Thank you all for the replies! I am running batch jobs: I read in a handful of files from HDFS and output to HBase, HDFS, and Kafka. I run into this when only part of the cluster is in use late in the job. Right now I spin up 20 nodes with 3 slots each, and at peak the job uses all 60 slots. But because my outputs are all forced to parallelism 1 while I work out kinks, by the end of the job everything typically runs in one or two task managers at most, and the other 18-19 task managers die off. The problem is that as soon as any task manager dies, my client throws the exception from my original mail (quoted below) and the job fails.
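For context, the tail of the job is shaped roughly like this. This is only a simplified sketch, not my real code; the class name and paths are placeholders, and the HBase and Kafka sinks are pinned to parallelism 1 the same way:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class SinkShapeSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Wide part of the job: read from HDFS and fan out across the 60 slots.
        DataSet<String> result = env.readTextFile("hdfs:///tmp/input")   // placeholder path
                .setParallelism(60);

        // While I work out kinks, every output is pinned to parallelism 1,
        // so the tail of the job ends up in one or two TaskManagers at most.
        result.writeAsText("hdfs:///tmp/debug-output")   // placeholder path; HBase/Kafka sinks are pinned the same way
                .setParallelism(1);

        env.execute("sink shape sketch");
    }
}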
I cannot share logs, but I was thinking of writing a dirt-simple map/reduce flow based on the WordCount example: a wide map phase that generates data, followed by a reducer that sleeps for about one second per record (rough sketch at the bottom of this mail). I believe that will simulate my condition very well, going from 100% of slots in use down to only 1-2 used slots as the timeout is hit. I'll try that today and let you know; if it works I can share the code here as an example.

On Thu, Jun 21, 2018 at 5:01 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Garrett,
>
> Killing of idle TaskManagers should not affect the execution of the job. By
> definition, a TaskManager only idles if it does not execute any tasks. Could
> you maybe share the complete logs (of the cluster entrypoint and all
> TaskManagers) with us?
>
> Cheers,
> Till
>
> On Thu, Jun 21, 2018 at 10:26 AM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Garrett,
>>
>> I agree, there seems to be an issue, and increasing the timeout should not
>> be the right approach to solve it.
>> Are you running streaming or batch jobs, i.e., do some of the tasks
>> finish much earlier than others?
>>
>> I'm adding Till to this thread, who's very familiar with scheduling and
>> process communication.
>>
>> Best, Fabian
>>
>> 2018-06-19 0:03 GMT+02:00 Garrett Barton <garrett.bar...@gmail.com>:
>>
>>> Hey all,
>>>
>>> The jobs that I am trying to write in Flink 1.5 are failing after a few
>>> minutes. I think it's because the idle task managers are shutting down,
>>> which seems to kill the client and the running job, even though the job
>>> itself was still running on one of the other task managers. I get:
>>>
>>> org.apache.flink.client.program.ProgramInvocationException:
>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>>> Connection unexpectedly closed by remote task manager 'xxxx'. This might
>>> indicate that the remote task manager was lost.
>>>     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:143)
>>>
>>> Now, I happen to have the last part of the flow set to parallelism 1 right
>>> now for debugging, so of the 4 task managers that are spun up, 3 of them
>>> hit the timeout period (currently set to 240000). I think as soon as the
>>> first one goes, the client throws up and the whole job dies as a result.
>>>
>>> Is this expected behavior, and if so, is there another way around it? Do
>>> I keep increasing slotmanager.taskmanager-timeout to a really, really
>>> large number? I have verified that setting the timeout to 840000 lets the
>>> job complete without error.
>>>
>>> Thank you!
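For reference, here is roughly what I have in mind for the reproducer. This is only a sketch I have not run yet; the class name, output path, and the 60-slot figure are assumptions based on my setup above, and it sleeps once per group rather than per input record so the run stays bounded:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class IdleTaskManagerRepro {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Wide phase: use every slot to generate synthetic "words".
        DataSet<Tuple2<String, Long>> words = env
                .generateSequence(0, 5_000_000)
                .flatMap(new FlatMapFunction<Long, Tuple2<String, Long>>() {
                    @Override
                    public void flatMap(Long value, Collector<Tuple2<String, Long>> out) {
                        out.collect(new Tuple2<>("key-" + (value % 1000), 1L));
                    }
                })
                .setParallelism(60); // assumption: 20 TMs x 3 slots, as in the job above

        // Narrow, slow phase: one slot, ~1 second per group, so the other
        // TaskManagers sit idle well past slotmanager.taskmanager-timeout.
        words.groupBy(0)
                .reduceGroup(new GroupReduceFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
                    @Override
                    public void reduce(Iterable<Tuple2<String, Long>> in,
                                       Collector<Tuple2<String, Long>> out) throws Exception {
                        long count = 0;
                        String key = null;
                        for (Tuple2<String, Long> t : in) {
                            key = t.f0;
                            count += t.f1;
                        }
                        Thread.sleep(1000); // simulate the slow parallelism-1 tail
                        out.collect(new Tuple2<>(key, count));
                    }
                })
                .setParallelism(1)
                .writeAsCsv("hdfs:///tmp/idle-tm-repro") // placeholder output path
                .setParallelism(1);

        env.execute("idle TaskManager reproducer");
    }
}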