Re: [Spark Core]: Python and Scala generate different DAGs for identical code

Holden Karau Wed, 10 May 2017 09:18:30 -0700

In PySpark, the filter and map steps are combined into a single
transformation from the JVM's point of view, which is why the lineage shows
one PythonRDD instead of separate filter and map RDDs. This lets us avoid
copying the data back to the JVM between the filter and the map steps. The
debugging experience is certainly much harder in PySpark, and I think it is
an interesting area for anyone interested in contributing :)
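
To make the pipelining idea concrete, here is a rough pure-Python sketch of
how two per-partition functions can be fused into one before the JVM ever
sees them. This is a simplified model, not the actual PySpark source: the
helper names (make_filter, make_map, pipeline) and the sample log lines are
made up for illustration.

```python
# Simplified model of pipelining Python transformations (not PySpark code).
# Each transformation is a function of (split_index, iterator) -> iterator.

def make_filter(pred):
    # Hypothetical helper: per-partition function for a filter step.
    def func(split_index, iterator):
        return (x for x in iterator if pred(x))
    return func

def make_map(f):
    # Hypothetical helper: per-partition function for a map step.
    def func(split_index, iterator):
        return (f(x) for x in iterator)
    return func

def pipeline(prev_func, next_func):
    # Compose two per-partition functions into one, so the data flows
    # through both steps without leaving the Python worker in between.
    def pipelined(split_index, iterator):
        return next_func(split_index, prev_func(split_index, iterator))
    return pipelined

# Made-up sample data standing in for one partition of log.txt.
lines = ["INFO\tboot\tok", "WARN\tdisk\tfull", "ERROR\tdisk\tfail"]

keep = make_filter(lambda x: "INFO" not in x)
to_pair = make_map(lambda x: (x.split("\t")[1], 1))

# One fused function now stands for both steps.
fused = pipeline(keep, to_pair)
result = list(fused(0, iter(lines)))
print(result)
```

Because only the fused function is handed over, the JVM side has a single
node to show for the whole chain, which is roughly why the debug string
cannot list the filter and the map separately.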


On Wed, May 10, 2017 at 7:33 AM pklemenkov <[email protected]> wrote:

> This Scala code:
> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>      | filter(x => !x.contains("INFO")).
>      | map(x => (x.split("\t")(1), 1)).
>      | reduceByKey((x, y) => x + y)
>
> generated obvious lineage:
>
> (2) ShuffledRDD[4] at reduceByKey at <console>:27 []
>  +-(2) MapPartitionsRDD[3] at map at <console>:26 []
>     |  MapPartitionsRDD[2] at filter at <console>:25 []
>     |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
>     |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
>
> But Python code:
>
> logs = sc.textFile("../log.txt")\
>          .filter(lambda x: 'INFO' not in x)\
>          .map(lambda x: (x.split('\t')[1], 1))\
>          .reduceByKey(lambda x, y: x + y)
>
> generated something strange which is hard to follow:
>
> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>  +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>     |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>     |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
>     |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
>
> Why is that? Does pyspark do some optimizations under the hood? This debug
> string is really useless for debugging.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Core-Python-and-Scala-generate-different-DAGs-for-identical-code-tp28674.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
