I am running a Spark Streaming application on a cluster composed of three
nodes, each one with a worker and three executors (so a total of 9
executors). I am using Spark standalone mode (version 2.1.1).

The application is launched with spark-submit, using the options
"--deploy-mode client" and "--conf
spark.streaming.stopGracefullyOnShutdown=true". The submit command is run
from one of the nodes, let's call it node 1.
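
For reference, here is a minimal sketch of how the submit command is put
together, written as a small Python helper. Only the two options quoted
above come from my setup; the master URL, jar name and main class below
are hypothetical placeholders.

```python
import shlex


def build_submit_command(master_url, app_jar, main_class):
    """Assemble a spark-submit argument list with the options described above."""
    return [
        "spark-submit",
        "--master", master_url,
        "--deploy-mode", "client",
        "--conf", "spark.streaming.stopGracefullyOnShutdown=true",
        "--class", main_class,
        app_jar,
    ]


# Hypothetical values: replace with the real master URL, class and jar.
cmd = build_submit_command(
    "spark://node1:7077",
    "streaming-app.jar",
    "com.example.StreamingApp",
)

# Print the command as a single shell-quoted line.
print(shlex.join(cmd))
```

The printed line is what would be run on node 1, where the driver then
lives because of client deploy mode.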

As a fault tolerance test I am stopping the worker on node 2 by calling the
script "stop-slave.sh".

In executor logs on node 2 I can see several errors related to a
FileNotFoundException during a shuffle operation:



I can see 4 errors of this kind on the same task in each of the 3 executors
on node 2.

In driver logs I can see:



This takes down the application, as expected: a single task failed
spark.task.maxFailures times, so the job is aborted and the application is
then stopped.

I ran several tests and all of them but one ended with the app stopped. My
guess is that the behaviour varies depending on the precise step of the
streaming job at which I ask the worker to stop. In any case, all the other
tests failed with the same error described above.

Increasing the parameter spark.task.maxFailures to 8 did not help either:
the TaskSetManager simply reported that the task failed 8 times instead of 4.
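
To illustrate why a higher limit changes nothing here, below is a toy model
of a retry loop. It is not Spark's scheduler code, only a sketch under the
assumption that every retry hits the same missing shuffle file; the function
names are hypothetical.

```python
def attempts_until_abort(task, max_failures):
    """Retry `task` until it succeeds or has failed `max_failures` times.

    Returns (number_of_failures, succeeded).
    """
    failures = 0
    while failures < max_failures:
        try:
            task()
            return failures, True
        except Exception:
            failures += 1
    return failures, False


def shuffle_file_missing():
    # Stand-in for a task whose shuffle input is gone: it fails on every attempt.
    raise FileNotFoundError("shuffle file no longer available")


for limit in (4, 8):
    failures, succeeded = attempts_until_abort(shuffle_file_missing, limit)
    print(f"maxFailures={limit}: failed {failures} times, succeeded={succeeded}")
```

If the cause of the failure does not go away between attempts, the limit
only changes how many failures are logged before the abort.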

What if the worker is killed?


I also ran a different test: I killed the worker and the 3 executor
processes on node 2 with the command "kill -9". In this case, the streaming
app adapted to the remaining resources and kept working.

In the driver log we can see the driver noticing the missing executors:



Then we see a long series of the following errors:



These errors appear in the log until the killed worker is started again (as
said before, these errors do not cause the application to stop).

Conclusion


Stopping a worker with the dedicated script has an unexpected behaviour: the
app should be able to cope with the missing worker, adapting to the
remaining resources and continuing to work (as it does in the case of kill).

What are your observations on this issue?

Thank you, Davide



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
