One clarification: there *are* Python interpreters running on executors so that Python UDFs and RDD API code can be executed. Some slightly-outdated but mostly-correct reference material for this can be found at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.
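As a concrete illustration (a minimal sketch, assuming a local PySpark installation), the lambda below is a Python closure, so it cannot be evaluated inside the JVM; each executor launches a Python worker process (python/pyspark/worker.py) to run it over its partitions:

    # Minimal sketch, assuming a local PySpark installation.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "python-worker demo")   # app name is arbitrary
    rdd = sc.parallelize(range(10))
    # The lambda is pickled on the driver, shipped to the executors, and
    # evaluated there by Python worker processes, not by the JVM.
    doubled = rdd.map(lambda x: x * 2)
    print(doubled.collect())   # [0, 2, 4, ..., 18]
    sc.stop()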
See also: search the Spark codebase for PythonRDD and look at python/pyspark/worker.py

On Tue, Mar 29, 2016 at 8:21 AM Adam Roberts <arobe...@uk.ibm.com> wrote:

> Hi, I'm interested in figuring out how the Python API for Spark works.
> I've come to the following conclusions and want to share them with the
> community; they could be of use in the PySpark docs here
> <https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals>,
> specifically the "Execution and pipelining" part.
>
> Any sanity checking would be much appreciated; here's the trivial Python
> example I've traced:
>
> from pyspark import SparkContext
> sc = SparkContext("local[1]", "Adam test")
> sc.setCheckpointDir("foo checkpoint dir")
>
> I added this JVM option:
> export IBM_JAVA_OPTIONS="-Xtrace:methods={org/apache/spark/*,py4j/*},print=mt"
>
> Prints were added in py4j-java/src/py4j/commands/CallCommand.java,
> specifically in the execute method; I built and replaced the existing class
> in the py4j 0.9 jar in my Spark assembly jar. Example output is:
>
> In execute for CallCommand, commandName: c
> target object id: o0
> methodName: get
>
> I'll launch the Spark application with:
> $SPARK_HOME/bin/spark-submit --master local[1] Adam.py > checkme.txt 2>&1
>
> I've quickly put together the following WIP diagram of what I think is
> happening: http://postimg.org/image/nihylmset/
>
> To summarise, I think:
>
> - We're heavily using reflection (as evidenced by Py4J's ReflectionEngine
>   and MethodInvoker classes) to invoke Spark's API in a JVM from Python
> - There's an agreed protocol (in Py4J's Protocol.java) for handling
>   commands: said commands are exchanged over a local socket between Python
>   and our JVM (the driver, according to the docs, not the master)
> - The Spark API is accessible by means of commands exchanged over that
>   socket using the agreed protocol
> - Commands are read/written using a BufferedReader/Writer
> - Type conversion is also performed from Python to Java (not looked at in
>   detail yet)
> - We keep track of objects: for example, o0 represents the first object we
>   know about
>
> Does this sound correct?
>
> I've only checked the trace output in local mode, and I'm curious what
> happens when we're running in standalone mode (I didn't see a Python
> interpreter appearing on all workers to process partitions of data, so I
> assume in standalone mode we use Python solely as an orchestrator - the
> driver - and not as an executor for distributed computing?).
>
> Happy to provide the full trace output on request (I omitted timestamps and
> logging info and added spacing). I expect there's an OpenJDK method-tracing
> equivalent so the above can easily be reproduced regardless of Java vendor.
>
> Cheers,
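To make the call path described above concrete, the same Py4J gateway can be poked at directly from a PySpark driver. A minimal sketch follows (sc._jvm is an internal PySpark attribute, used here purely for illustration): each attribute access and method call is turned into a Py4J command - such as the CallCommand ("c") seen in the trace - written over the local driver socket, and the JVM side registers each target object under an id like o0.

    # Minimal sketch, assuming a local PySpark installation; sc._jvm is an
    # internal attribute exposing the driver's Py4J gateway.
    from pyspark import SparkContext

    sc = SparkContext("local[1]", "py4j demo")

    # Walking the package tree and calling the constructor goes through
    # Py4J's reflection machinery on the JVM side; the new instance is
    # registered there under an object id (o0, o1, ...).
    jvm_random = sc._jvm.java.util.Random()

    # The call below is serialised as a CallCommand ("c") carrying the
    # target object id, the method name, and the converted argument; the
    # returned int is converted back into a Python int.
    print(jvm_random.nextInt(10))

    sc.stop()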