Hi all,

Did anyone get a chance to look into this? Any guidance would be much appreciated.
Thanks,
Amit Rana

On 7 Jul 2016 14:28, "Amit Rana" <amitranavs...@gmail.com> wrote:

> As mentioned in the documentation:
> PythonRDD objects launch Python subprocesses and communicate with them
> using pipes, sending the user's code and the data to be processed.
>
> I am trying to understand how this data transfer is implemented using
> pipes. Can anyone please guide me along that line?
>
> Thanks,
> Amit Rana
>
> On 7 Jul 2016 13:44, "Sun Rui" <sunrise_...@163.com> wrote:
>
>> You can read
>> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
>> For PySpark data flow on worker nodes, you can read the source code of
>> PythonRDD.scala. Python worker processes communicate with Spark executors
>> via sockets instead of pipes.
>>
>> On Jul 7, 2016, at 15:49, Amit Rana <amitranavs...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA
>> on Windows 7.
>> I submitted a Python job as follows:
>> --master local[4] <path to pyspark job> <arguments to the job>
>>
>> I made the following observations after running the above command in
>> debug mode:
>> -> Locally, when the pyspark interpreter starts, it also starts a JVM
>> with which it communicates through a socket.
>> -> py4j is used to handle this communication.
>> -> This JVM acts as the actual Spark driver and loads a JavaSparkContext,
>> which communicates with the Spark executors in the cluster.
>>
>> I have read that in the cluster the data flow between Spark executors
>> and the Python interpreter happens using pipes, but I am not able to
>> trace that data flow.
>>
>> Please correct me if my understanding is wrong. It would be very helpful
>> if someone could help me understand the code flow for data transfer
>> between the JVM and the Python workers.
>>
>> Thanks,
>> Amit Rana
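
For what it's worth, here is a minimal, self-contained sketch (not the actual PySpark code) of the kind of length-prefixed framing over a socket that PythonRDD.scala and pyspark/serializers.py use between an executor and a Python worker: each record is written as a 4-byte length followed by pickled bytes. The names send_record/recv_record and the socketpair stand-in for the executor<->worker connection are purely illustrative.

    import pickle
    import socket
    import struct

    # Toy illustration of length-prefixed framing over a socket.
    # This is NOT the real PySpark protocol, just a minimal sketch of the idea.

    def send_record(sock, obj):
        # 4-byte big-endian length header, then the pickled payload.
        payload = pickle.dumps(obj)
        sock.sendall(struct.pack(">i", len(payload)) + payload)

    def recv_record(sock):
        header = sock.recv(4)
        if len(header) < 4:
            return None  # peer closed its write side: end of stream
        (length,) = struct.unpack(">i", header)
        data = b""
        while len(data) < length:
            chunk = sock.recv(length - len(data))
            if not chunk:
                raise EOFError("socket closed mid-record")
            data += chunk
        return pickle.loads(data)

    if __name__ == "__main__":
        # socketpair() stands in for the executor<->worker connection.
        executor_side, worker_side = socket.socketpair()

        # "Executor" pushes a small partition of data to the "worker"...
        for x in [1, 2, 3]:
            send_record(executor_side, x)
        executor_side.shutdown(socket.SHUT_WR)

        # ...the "worker" applies the user's function and collects results.
        results = []
        while True:
            rec = recv_record(worker_side)
            if rec is None:
                break
            results.append(rec * 10)
        print(results)  # [10, 20, 30]

Roughly speaking, the real worker loop in pyspark/worker.py is driven the same way: it reads framed, serialized records from the socket the executor opened, applies the deserialized user function to the input iterator, and streams framed results back to the executor.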