Great to hear that you could solve your problem, Garrett. What happens when
you call `collect` is that Flink sends the job defined up to that point to
the cluster for execution and waits until it has retrieved the result. Once
the result has been obtained, the Flink progra
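To make the distinction concrete, here is a minimal sketch of the two submission paths, assuming the Flink 1.5 DataSet API (the class name and commented-out output path are illustrative, not from the thread):

```java
import java.util.List;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class CollectVsExecute {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Integer> numbers = env.fromElements(1, 2, 3);

        // Variant 1: collect() is itself an action. It submits the job
        // defined so far and blocks until the result is back, so no
        // separate env.execute() call is needed.
        List<Integer> result = numbers.collect();
        System.out.println(result);

        // Variant 2: define a sink, then trigger the job explicitly.
        // Calling env.execute() without defining a new sink after the
        // collect() above would fail, because collect() already consumed
        // the job defined up to that point.
        // numbers.writeAsText("/tmp/out");
        // env.execute("example job");
    }
}
```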
I don't know why yet, but I did figure it out. After my sample long-running
map-reduce test ran fine all night, I tried a ton of things. It turns out
there is a difference between env.execute() and collect().
My flow had reading from HDFS, decrypting, processing, and finally writing
to HDFS, at
Hi Garrett,
have you set a restart strategy for your job [1]? To recover from failures,
you need to specify one; otherwise, Flink will terminally fail the job when
a failure occurs.
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/restart_strategies.html
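For example, a fixed-delay strategy can be configured cluster-wide in flink-conf.yaml (the attempt count and delay below are illustrative values, not a recommendation):

```yaml
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

It can also be set per job via ExecutionEnvironment#setRestartStrategy, as described in [1].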
Cheers,
Till
On
Actually, random thought: could YARN preemption be causing this? What is
the failure scenario when a TaskManager that is doing real work goes down
in YARN? The docs make it sound like Flink should fire up another TM and
get back to work out of the box, but I'm not seeing that.
On Thu, Jun
Thank you all for the reply!
I am running batch jobs: I read in a handful of files from HDFS and output
to HBase, HDFS, and Kafka. I run into this when the job only partially uses
the cluster as it runs. Right now I spin up 20 nodes with 3 slots each; my
job at peak uses all 60 slots, but by the
Hi Garrett,
killing idle TaskManagers should not affect the execution of the job. By
definition, a TaskManager only idles if it does not execute any tasks. Could
you maybe share the complete logs (of the cluster entrypoint and all
TaskManagers) with us?
Cheers,
Till
On Thu, Jun 21, 2018 at 10:2
Hi Garrett,
I agree, there seems to be an issue, and increasing the timeout is not the
right way to solve it.
Are you running streaming or batch jobs, i.e., do some of the tasks finish
much earlier than others?
I'm adding Till to this thread, who's very familiar with scheduling and
proc
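For context, the timeout in question is presumably the idle-TaskManager release timeout of the resource manager introduced with Flink 1.5's new deployment model; if I remember correctly, it is set in flink-conf.yaml in milliseconds (the value below is the 1.5 default, shown only for illustration):

```yaml
# How long the resource manager keeps an idle TaskManager around
# before releasing its resources.
resourcemanager.taskmanager-timeout: 30000
```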
Hey all,
My jobs, which I am trying to write in Flink 1.5, are failing after a few
minutes. I think it's because the idle TaskManagers are shutting down,
which seems to kill the client and the running job. The running job itself
was still going on one of the other TaskManagers. I get:
org.apache