Hi all,

Did anyone get a chance to look into this? Any guidance would be much appreciated.
Thanks,
Amit Rana

On 7 Jul 2016 14:28, "Amit Rana" <amitranavs...@gmail.com> wrote:

> As mentioned in the documentation:
> PythonRDD objects launch Python subprocesses and communicate with them
> using pipes, sending the user's code and the data to be processed.
>
> I am trying to understand how this data transfer is implemented using
> pipes. Can anyone please guide me along that line?
>
> Thanks,
> Amit Rana
>
> On 7 Jul 2016 13:44, "Sun Rui" <sunrise_...@163.com> wrote:
>
>> You can read
>> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
>> For PySpark data flow on worker nodes, you can read the source code of
>> PythonRDD.scala. Python worker processes communicate with Spark executors
>> via sockets instead of pipes.
>>
>> On Jul 7, 2016, at 15:49, Amit Rana <amitranavs...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA
>> on Windows 7.
>> I submitted a Python job as follows:
>> --master local[4] <path to pyspark job> <arguments to the job>
>>
>> I made the following observations after running the above command in
>> debug mode:
>> -> Locally, when the pyspark interpreter starts, it also starts a JVM
>> with which it communicates through a socket.
>> -> py4j is used to handle this communication.
>> -> This JVM acts as the actual Spark driver and loads a JavaSparkContext,
>> which communicates with the Spark executors in the cluster.
>>
>> I have read that in the cluster the data flow between Spark executors
>> and the Python interpreter happens using pipes, but I am not able to
>> trace that data flow.
>>
>> Please correct me if my understanding is wrong. It would be very helpful
>> if someone could help me understand the code flow for data transfer
>> between the JVM and the Python workers.
>>
>> Thanks,
>> Amit Rana
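
For what it's worth, here is a minimal, self-contained sketch (not the actual PySpark code) of the kind of length-prefixed framing over a socket that PythonRDD.scala and pyspark/serializers.py use between an executor and a Python worker: each record is written as a 4-byte length followed by pickled bytes. The names send_record/recv_record and the socketpair stand-in for the executor<->worker connection are purely illustrative.

    import pickle
    import socket
    import struct

    # Toy illustration of length-prefixed framing over a socket.
    # This is NOT the real PySpark protocol, just a minimal sketch of the idea.

    def send_record(sock, obj):
        # 4-byte big-endian length header, then the pickled payload.
        payload = pickle.dumps(obj)
        sock.sendall(struct.pack(">i", len(payload)) + payload)

    def recv_record(sock):
        header = sock.recv(4)
        if len(header) < 4:
            return None  # peer closed its write side: end of stream
        (length,) = struct.unpack(">i", header)
        data = b""
        while len(data) < length:
            chunk = sock.recv(length - len(data))
            if not chunk:
                raise EOFError("socket closed mid-record")
            data += chunk
        return pickle.loads(data)

    if __name__ == "__main__":
        # socketpair() stands in for the executor<->worker connection.
        executor_side, worker_side = socket.socketpair()

        # "Executor" pushes a small partition of data to the "worker"...
        for x in [1, 2, 3]:
            send_record(executor_side, x)
        executor_side.shutdown(socket.SHUT_WR)

        # ...the "worker" applies the user's function and collects results.
        results = []
        while True:
            rec = recv_record(worker_side)
            if rec is None:
                break
            results.append(rec * 10)
        print(results)  # [10, 20, 30]

Roughly speaking, the real worker loop in pyspark/worker.py is driven the same way: it reads framed, serialized records from the socket the executor opened, applies the deserialized user function to the input iterator, and streams framed results back to the executor.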