Hi Ben,

I haven't tried it with Python, but the instructions are the same as for 
Scala compiled (jar) apps. What it's saying is that it's not possible to 
offload the entire work to the master (a la Hadoop) in a fire-and-forget (or 
rather submit-and-forget) manner when running on standalone. There are two 
deployment modes - client and cluster. For standalone, only client is 
supported. What this means is that the "submitting process" will be the driver 
process (not to be confused with the "master"). It's entirely possible to 
submit from your laptop to a standalone cluster, but the process running 
spark-submit will stay alive until the job finishes. If you terminate that 
process (via kill -9 or otherwise), then the job will be terminated as well. 
The driver process will submit the work to the Spark master, which will do the 
usual divvying up of tasks, distribution, fault tolerance, etc., and the 
results will get reported back to the driver process.
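
To make that concrete, here's a minimal PySpark sketch (a made-up toy app, not 
anything from the docs). When a file like this is passed to spark-submit in 
client mode, the code runs inside the spark-submit process itself - that 
process is the driver:

# toy_app.py - minimal PySpark app (illustrative sketch only).
# In client mode this code executes inside the spark-submit process;
# killing that process kills the job.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("client-mode-demo")
sc = SparkContext(conf=conf)

# The master divvies the map tasks up across the workers; the summed
# result is shipped back to this driver process.
total = sc.parallelize(range(1000)).map(lambda x: x * 2).sum()
print("sum = %d" % total)

sc.stop()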
Often it's not possible to have arbitrary access to the Spark master, and if 
jobs take hours to complete, it's not feasible to keep the process running on 
the laptop without interruptions, disconnects, etc. As such, a "gateway" 
machine closer to the Spark master is used to submit jobs from. That way, the 
process on the gateway machine lives for the duration of the job, and no 
connection from the laptop, etc. is needed. It's not uncommon to actually put 
an API in front of the gateway machine. For example, Ooyala's job server 
(https://github.com/ooyala/spark-jobserver) provides a RESTful interface for 
submitting jobs.
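
As a rough sketch of that pattern (the endpoints follow the job server's 
README, but the host, port, app name, and job class below are made-up 
placeholders), submitting from a laptop looks something like this - note it 
takes jars, so it's for Scala/Java apps rather than Python:

import requests

BASE = "http://gateway.example.com:8090"  # placeholder gateway host/port

# Upload the application jar once.
with open("target/myapp.jar", "rb") as jar:
    requests.post(BASE + "/jars/myapp", data=jar)

# Kick off a job; the long-lived process now runs server-side, so the
# laptop can disconnect without killing it.
resp = requests.post(
    BASE + "/jobs",
    params={"appName": "myapp", "classPath": "com.example.MyJob"},
)
print(resp.json())  # includes a job id you can poll via GET /jobs/<id>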
Does that help?
Regards,
Ashic.
Date: Fri, 14 Nov 2014 13:40:43 -0600
Subject: Submitting Python Applications from Remote to Master
From: quasi...@gmail.com
To: user@spark.apache.org

Hi All,
I'm not quite clear on whether submitting a Python application to Spark 
standalone on EC2 is possible.
Am I reading this correctly:
"A common deployment strategy is to submit your application from a gateway 
machine that is physically co-located with your worker machines (e.g. Master 
node in a standalone EC2 cluster). In this setup, client mode is appropriate. 
In client mode, the driver is launched directly within the client spark-submit 
process, with the input and output of the application attached to the console. 
Thus, this mode is especially suitable for applications that involve the REPL 
(e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the 
worker machines (e.g. locally on your laptop), it is common to use cluster 
mode to minimize network latency between the drivers and the executors. Note 
that cluster mode is currently not supported for standalone clusters, Mesos 
clusters, or Python applications."
So I shouldn't be able to do something like:

./bin/spark-submit --master spark://xxxxx.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py

from a laptop connecting to a previously launched Spark cluster using the 
default spark-ec2 script, correct?
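
For reference, pi.py is just the bundled Monte Carlo example - roughly along 
these lines, sketched from memory rather than copied verbatim from the repo:

from operator import add
from random import random

from pyspark import SparkContext

sc = SparkContext(appName="PythonPi")
n = 100000

def inside(_):
    # Sample a point in the unit square; count it if it lands in the circle.
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

count = sc.parallelize(range(1, n + 1)).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()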
If I am not mistaken about this, then the docs are slightly confusing -- the 
above example is more or less the one given here: 
https://spark.apache.org/docs/1.1.0/submitting-applications.html
If I am mistaken, apologies - can you help me figure out where I went wrong? 
I've also taken to opening port 7077 to 0.0.0.0/0.
--Ben