I think it's this video: https://www.youtube.com/watch?v=LQHMMCf2ZWY

On Wed, May 10, 2017 at 8:04 PM, lucas.g...@gmail.com <lucas.g...@gmail.com>
wrote:

> Any chance of a link to that video :)
>
> Thanks!
>
> G
>
> On 10 May 2017 at 09:49, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> So this Python-side pipelining happens in a lot of places, which can make
>> debugging extra challenging. Some people work around this with persist,
>> which breaks the pipelining during debugging, but if you're interested in
>> more general Python debugging I've got a YouTube video on the topic which
>> could be a good intro (of course I'm pretty biased about that).
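>>
>> A minimal sketch of that persist workaround, reusing the pipeline from the
>> question below (the path and variable names are just placeholders, and the
>> exact behavior can vary by Spark version):
>>
>>     filtered = sc.textFile("../log.txt").filter(lambda x: 'INFO' not in x)
>>     filtered.persist()  # persisting here breaks the Python-side pipelining,
>>                         # so the filter shows up as its own RDD in the lineage
>>     counts = filtered.map(lambda x: (x.split('\t')[1], 1)) \
>>                      .reduceByKey(lambda x, y: x + y)
>>     filtered.take(5)    # peek at the intermediate records while debugging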
>>
>> On Wed, May 10, 2017 at 9:42 AM Pavel Klemenkov <pklemen...@gmail.com>
>> wrote:
>>
>>> Thanks for the quick answer, Holden!
>>>
>>> Are there any other PySpark tricks which are hard to debug using the UI
>>> or toDebugString?
>>>
>>> On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> In PySpark the filter and then map steps are combined into a single
>>>> transformation from the JVM point of view. This allows us to avoid copying
>>>> the data back to Scala in between the filter and the map steps. The
>>>> debugging experience is certainly much harder in PySpark, and I think it is
>>>> an interesting area for those interested in contributing :)
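>>>>
>>>> A quick way to see that fusion, as a minimal sketch (assuming an existing
>>>> SparkContext named sc; the exact lineage text varies by Spark version):
>>>>
>>>>     rdd = sc.parallelize(range(10)) \
>>>>             .filter(lambda x: x % 2 == 0) \
>>>>             .map(lambda x: x * 10)
>>>>     # The filter and the map appear as a single PythonRDD in the lineage,
>>>>     # because the Python functions are pipelined and run together over
>>>>     # each partition instead of round-tripping through the JVM.
>>>>     print(rdd.toDebugString())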
>>>>
>>>> On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com>
>>>> wrote:
>>>>
>>>>> This Scala code:
>>>>> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>>>>>      | filter(x => !x.contains("INFO")).
>>>>>      | map(x => (x.split("\t")(1), 1)).
>>>>>      | reduceByKey((x, y) => x + y)
>>>>>
>>>>> generated an obvious lineage:
>>>>>
>>>>> (2) ShuffledRDD[4] at reduceByKey at <console>:27 []
>>>>>  +-(2) MapPartitionsRDD[3] at map at <console>:26 []
>>>>>     |  MapPartitionsRDD[2] at filter at <console>:25 []
>>>>>     |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
>>>>>     |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
>>>>>
>>>>> But this Python code:
>>>>>
>>>>> logs = sc.textFile("../log.txt")\
>>>>>          .filter(lambda x: 'INFO' not in x)\
>>>>>          .map(lambda x: (x.split('\t')[1], 1))\
>>>>>          .reduceByKey(lambda x, y: x + y)
>>>>>
>>>>> generated something strange that is hard to follow:
>>>>>
>>>>> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>>>>>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>>>>>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>>>>>  +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>>>     |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>>>     |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
>>>>>     |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
>>>>>
>>>>> Why is that? Does PySpark do some optimizations under the hood? This
>>>>> debug string is really useless for debugging.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>> Cell : 425-233-8271
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>> Yours faithfully, Pavel Klemenkov.
>>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


-- 
Yours faithfully, Pavel Klemenkov.
