So this Python-side pipelining happens in a lot of places, which can make debugging extra challenging. Some people work around it by calling persist, which breaks the pipelining during debugging. If you're interested in more general Python debugging, I've got a YouTube video on the topic which could be a good intro (of course I'm pretty biased about that).
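As a rough sketch (untested, reusing the file and lambdas from the thread below), persisting the filtered RDD should keep the following map from being pipelined into the same PythonRDD, so toDebugString shows the intermediate step again:

    logs = sc.textFile("../log.txt") \
        .filter(lambda x: 'INFO' not in x)
    logs.persist()  # a cached RDD is not pipelined into the next Python transformation
    counts = logs.map(lambda x: (x.split('\t')[1], 1)) \
        .reduceByKey(lambda x, y: x + y)
    print(counts.toDebugString())  # may come back as bytes in some PySpark versions

The trade-off is that persist changes the job (extra caching), so it's a debugging aid rather than something you'd leave in production code.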
On Wed, May 10, 2017 at 9:42 AM Pavel Klemenkov <pklemen...@gmail.com> wrote:

> Thanks for the quick answer, Holden!
>
> Are there any other tricks with PySpark which are hard to debug using UI
> or toDebugString?
>
> On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> In PySpark the filter and then map steps are combined into a single
>> transformation from the JVM point of view. This allows us to avoid copying
>> the data back to Scala in between the filter and the map steps. The
>> debugging experience is certainly much harder in PySpark and I think it is an
>> interesting area for those interested in contributing :)
>>
>> On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com> wrote:
>>
>>> This Scala code:
>>>
>>> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>>>      |   filter(x => !x.contains("INFO")).
>>>      |   map(x => (x.split("\t")(1), 1)).
>>>      |   reduceByKey((x, y) => x + y)
>>>
>>> generated an obvious lineage:
>>>
>>> (2) ShuffledRDD[4] at reduceByKey at <console>:27 []
>>>  +-(2) MapPartitionsRDD[3] at map at <console>:26 []
>>>     |  MapPartitionsRDD[2] at filter at <console>:25 []
>>>     |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
>>>     |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
>>>
>>> But this Python code:
>>>
>>> logs = sc.textFile("../log.txt")\
>>>     .filter(lambda x: 'INFO' not in x)\
>>>     .map(lambda x: (x.split('\t')[1], 1))\
>>>     .reduceByKey(lambda x, y: x + y)
>>>
>>> generated something strange which is hard to follow:
>>>
>>> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>>>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>>>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>>>  +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>     |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>     |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
>>>     |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
>>>
>>> Why is that? Does PySpark do some optimizations under the hood? This debug
>>> string is really useless for debugging.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Core-Python-and-Scala-generate-different-DAGs-for-identical-code-tp28674.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
> --
> Yours faithfully, Pavel Klemenkov.
>
--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau