Re: Understanding pyspark data flow on worker nodes

2016-07-08 Thread Adam Roberts
Quoting Reynold Xin's reply: You can look into its source code: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana wrote:

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Reynold Xin
You can look into its source code: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana wrote: > Hi all, > > Did anyone get a chance to look into it? > Any sort of guidance will be much appreciated.

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Amit Rana
Hi all, Did anyone get a chance to look into it? Any sort of guidance will be much appreciated. Thanks, Amit Rana On 7 Jul 2016 14:28, "Amit Rana" wrote: > As mentioned in the documentation: > PythonRDD objects launch Python subprocesses and communicate with them > using pipes, sending the user's code and the data to be processed.

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Amit Rana
As mentioned in the documentation: PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed. I am trying to understand how this data transfer over pipes is implemented. Can anyone please guide me?
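[Editor's note] To make the pattern that documentation sentence describes concrete, here is a minimal, self-contained sketch in plain Python: a parent process launches a Python subprocess and exchanges pickled, length-prefixed records with it over its stdin/stdout pipes. The framing, the doubling function, and every name below are illustrative assumptions, not Spark's actual protocol; the real implementation lives in PythonRDD.scala, referenced elsewhere in this thread.

import pickle
import struct
import subprocess
import sys

# Code run inside the child process: read length-prefixed pickled records
# from stdin, apply a stand-in "user function", write results back to stdout.
# This is a simplified illustration, not Spark's worker protocol.
WORKER_CODE = r"""
import pickle, struct, sys
inp, out = sys.stdin.buffer, sys.stdout.buffer
while True:
    header = inp.read(4)
    if len(header) < 4:                      # parent closed the pipe
        break
    (length,) = struct.unpack(">i", header)
    record = pickle.loads(inp.read(length))
    result = pickle.dumps(record * 2)        # stand-in for the user's function
    out.write(struct.pack(">i", len(result)) + result)
    out.flush()
"""

def run_through_worker(records):
    # Launch the Python subprocess and exchange one framed record at a time
    # over its stdin/stdout pipes.
    proc = subprocess.Popen([sys.executable, "-c", WORKER_CODE],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    results = []
    for record in records:
        payload = pickle.dumps(record)
        proc.stdin.write(struct.pack(">i", len(payload)) + payload)
        proc.stdin.flush()
        (length,) = struct.unpack(">i", proc.stdout.read(4))
        results.append(pickle.loads(proc.stdout.read(length)))
    proc.stdin.close()
    proc.wait()
    return results

print(run_through_worker([1, 2, 3]))   # prints [2, 4, 6]

Spark's real exchange is considerably richer (it also ships the pickled user function, broadcast variables, and accumulator updates), but the "serialize, length-prefix, write to the child, read the framed result back" idea is the part the quoted sentence is describing.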

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Sun Rui
You can read https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. For pySpark data flow on worker nodes, you can read the source code of PythonRDD.scala. Python worker processes communicate with Spark executors
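[Editor's note] For anyone who wants to trace that data flow themselves, a small PySpark job like the one below (an illustrative example, not something from this thread) exercises the path in question when run against a local master: each map task hands the pickled lambda and its partition of data to a Python worker process, which makes it a convenient starting point while reading PythonRDD.scala.

from pyspark import SparkContext

# Run with a local master; each of the two partitions is processed by a
# Python worker process launched (or reused) by its executor.
sc = SparkContext("local[2]", "pyspark-data-flow-demo")
doubled = sc.parallelize(range(10), numSlices=2).map(lambda x: x * 2).collect()
print(doubled)   # [0, 2, 4, ..., 18]
sc.stop()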