Hi Ben,

I haven't tried it with Python, but the instructions are the same as for compiled Scala (jar) apps. What the docs are saying is that it's not possible to offload the entire work to the master (a la Hadoop) in a fire-and-forget (or rather submit-and-forget) manner when running on standalone.

There are two deployment modes - client and cluster. For standalone, only client is supported. What this means is that the "submitting process" will be the driver process (not to be confused with the "master"). It should very well be possible to submit from your laptop to a standalone cluster, but the process running spark-submit will stay alive until the job finishes. If you terminate that process (via kill -9 or otherwise), the job will be terminated as well. The driver process submits the work to the Spark master, which does the usual divvying up of tasks, distribution, fault tolerance, etc., and the results get reported back to the driver process.

Often it's not possible to have arbitrary access to the Spark master, and if jobs take hours to complete, it's not feasible to keep the process running on a laptop without interruptions, disconnects, etc. As such, a "gateway" machine closer to the Spark master is used to submit jobs from. That way, the process on the gateway machine lives for the duration of the job, and no connection from the laptop is needed. It's also not uncommon to put an API in front of the gateway machine; for example, Ooyala's job server (https://github.com/ooyala/spark-jobserver) provides a RESTful interface for submitting jobs.
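To make that concrete: since in client mode the spark-submit process *is* the driver, keeping a long job running from a gateway box is just a matter of keeping that one process alive. A minimal sketch (the master URL and log file name here are placeholders for your setup):

    # On the gateway machine: keep the driver process alive even if
    # the SSH session drops. spark://master-host:7077 and pi.log are
    # placeholder values.
    nohup ./bin/spark-submit \
      --master spark://master-host:7077 \
      examples/src/main/python/pi.py \
      > pi.log 2>&1 &

screen or tmux would do just as well; the only requirement is that the process running spark-submit outlives your laptop's connection.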
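With the job server route, submission becomes plain HTTP. Roughly like the following (the host, port, app name and class path are illustrative values; check the project's README for the actual endpoints):

    # Upload the application jar once, then trigger jobs over HTTP.
    # localhost:8090, "myapp" and com.example.MyJob are illustrative.
    curl --data-binary @target/myapp.jar localhost:8090/jars/myapp
    curl -d "" 'localhost:8090/jobs?appName=myapp&classPath=com.example.MyJob'

That way the submitting process lives server-side for the duration of the job, and the laptop only needs to stay connected long enough to make the HTTP call.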
Does that help?

Regards,
Ashic.


Date: Fri, 14 Nov 2014 13:40:43 -0600
Subject: Submitting Python Applications from Remote to Master
From: quasi...@gmail.com
To: user@spark.apache.org

Hi All,

I'm not quite clear on whether submitting a Python application to Spark standalone on EC2 is possible. Am I reading this correctly:

"A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the client spark-submit process, with the input and output of the application attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications."

So I shouldn't be able to do something like:

    ./bin/spark-submit --master spark://xxxxx.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py

from a laptop connecting to a previously launched Spark cluster using the default spark-ec2 script, correct? If I'm not mistaken about this, then the docs are slightly confusing -- the above is more or less the example given here: https://spark.apache.org/docs/1.1.0/submitting-applications.html

If I am mistaken, apologies; can you help me figure out where I went wrong? I've also opened port 7077 to 0.0.0.0/0.

--Ben