You can read 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals 
For the PySpark data flow on worker nodes, you can read the source code of 
PythonRDD.scala. Python worker processes communicate with Spark executors over 
sockets, not pipes.

> On Jul 7, 2016, at 15:49, Amit Rana <amitranavs...@gmail.com> wrote:
> 
> Hi all,
> 
> I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA on 
> Windows 7.
> I submitted a Python job as follows:
> --master local[4] <path to pyspark job> <arguments to the job>
> 
> I have made the following observations after running the above command in debug 
> mode:
> -> Locally, when the PySpark interpreter starts, it also starts a JVM with 
> which it communicates through a socket.
> -> Py4J is used to handle this communication.
> -> This JVM acts as the actual Spark driver and loads a JavaSparkContext, 
> which communicates with the Spark executors in the cluster.
> 
> In the cluster, I have read that the data flow between Spark executors and the 
> Python interpreter happens using pipes, but I am not able to trace that data flow.
> 
> Please correct me if my understanding is wrong. It would be very helpful if 
> someone could help me understand the code flow for data transfer between the 
> JVM and the Python workers.
> 
> Thanks,
> Amit Rana
> 
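
For completeness, the driver-side Py4J link you describe is easy to poke at from 
a PySpark shell. A minimal sketch follows; note that the underscore-prefixed 
attributes (_gateway, _jsc, _jvm) are PySpark internals, not public API, and may 
change between versions:

    # Hedged sketch of the driver-side Py4J link; relies on PySpark internals.
    from pyspark import SparkContext

    sc = SparkContext(master="local[4]", appName="py4j-peek")

    # _gateway is the py4j JavaGateway the Python driver talks to over a socket.
    print(type(sc._gateway))

    # _jsc is the JVM-side JavaSparkContext proxied through that gateway; calls
    # on it are forwarded to the JVM, which is what talks to the executors.
    print(sc._jsc.version())

    # _jvm exposes arbitrary JVM classes through the same gateway.
    print(sc._jvm.java.lang.System.getProperty("java.version"))

    sc.stop()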
