One clarification: there *are* Python interpreters running on executors so that Python UDFs and RDD API code can be executed. Some slightly-outdated but mostly-correct reference material for this can be found at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.
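As a concrete illustration (a minimal sketch, assuming a local PySpark installation), the lambda below is a Python closure, so it cannot be evaluated inside the JVM; each executor launches a Python worker process (python/pyspark/worker.py) to run it over its partitions:

    # Minimal sketch, assuming a local PySpark installation.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "python-worker demo")   # app name is arbitrary
    rdd = sc.parallelize(range(10))
    # The lambda is pickled on the driver, shipped to the executors, and
    # evaluated there by Python worker processes, not by the JVM.
    doubled = rdd.map(lambda x: x * 2)
    print(doubled.collect())   # [0, 2, 4, ..., 18]
    sc.stop()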
See also: search the Spark codebase for PythonRDD and look at python/pyspark/worker.py

On Tue, Mar 29, 2016 at 8:21 AM Adam Roberts <arobe...@uk.ibm.com> wrote:

> Hi, I'm interested in figuring out how the Python API for Spark works.
> I've come to the following conclusions and want to share them with the
> community; they could be of use in the PySpark docs here
> <https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals>,
> specifically the "Execution and pipelining" part.
>
> Any sanity checking would be much appreciated; here's the trivial Python
> example I've traced:
>
> from pyspark import SparkContext
> sc = SparkContext("local[1]", "Adam test")
> sc.setCheckpointDir("foo checkpoint dir")
>
> I added this JVM option:
> export IBM_JAVA_OPTIONS="-Xtrace:methods={org/apache/spark/*,py4j/*},print=mt"
>
> Prints were added in py4j-java/src/py4j/commands/CallCommand.java,
> specifically in the execute method; I built and replaced the existing class
> in the py4j 0.9 jar in my Spark assembly jar. Example output is:
>
> In execute for CallCommand, commandName: c
> target object id: o0
> methodName: get
>
> I'll launch the Spark application with:
> $SPARK_HOME/bin/spark-submit --master local[1] Adam.py > checkme.txt 2>&1
>
> I've quickly put together the following WIP diagram of what I think is
> happening: http://postimg.org/image/nihylmset/
>
> To summarise, I think:
>
> - We're heavily using reflection (as evidenced by Py4J's ReflectionEngine
>   and MethodInvoker classes) to invoke Spark's API in a JVM from Python
> - There's an agreed protocol (in Py4J's Protocol.java) for handling
>   commands: said commands are exchanged over a local socket between Python
>   and our JVM (the driver, according to the docs, not the master)
> - The Spark API is accessible by means of commands exchanged over that
>   socket using the agreed protocol
> - Commands are read/written using a BufferedReader/Writer
> - Type conversion is also performed from Python to Java (not looked at in
>   detail yet)
> - We keep track of objects: for example, o0 represents the first object we
>   know about
>
> Does this sound correct?
>
> I've only checked the trace output in local mode, and I'm curious what
> happens when we're running in standalone mode (I didn't see a Python
> interpreter appearing on all workers to process partitions of data, so I
> assume in standalone mode we use Python solely as an orchestrator - the
> driver - and not as an executor for distributed computing?).
>
> Happy to provide the full trace output on request (I omitted timestamps and
> logging info and added spacing). I expect there's an OpenJDK method-tracing
> equivalent so the above can easily be reproduced regardless of Java vendor.
>
> Cheers,
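To make the call path described above concrete, the same Py4J gateway can be poked at directly from a PySpark driver. A minimal sketch follows (sc._jvm is an internal PySpark attribute, used here purely for illustration): each attribute access and method call is turned into a Py4J command - such as the CallCommand ("c") seen in the trace - written over the local driver socket, and the JVM side registers each target object under an id like o0.

    # Minimal sketch, assuming a local PySpark installation; sc._jvm is an
    # internal attribute exposing the driver's Py4J gateway.
    from pyspark import SparkContext

    sc = SparkContext("local[1]", "py4j demo")

    # Walking the package tree and calling the constructor goes through
    # Py4J's reflection machinery on the JVM side; the new instance is
    # registered there under an object id (o0, o1, ...).
    jvm_random = sc._jvm.java.util.Random()

    # The call below is serialised as a CallCommand ("c") carrying the
    # target object id, the method name, and the converted argument; the
    # returned int is converted back into a Python int.
    print(jvm_random.nextInt(10))

    sc.stop()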