This video, I think: https://www.youtube.com/watch?v=LQHMMCf2ZWY
On Wed, May 10, 2017 at 8:04 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:

> Any chance of a link to that video :)
>
> Thanks!
>
> G
>
> On 10 May 2017 at 09:49, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> So this Python-side pipelining happens in a lot of places, which can make
>> debugging extra challenging. Some people work around this with persist,
>> which breaks the pipelining during debugging, but if you're interested in
>> more general Python debugging I've got a YouTube video on the topic which
>> could be a good intro (of course I'm pretty biased about that).
>>
>> On Wed, May 10, 2017 at 9:42 AM Pavel Klemenkov <pklemen...@gmail.com> wrote:
>>
>>> Thanks for the quick answer, Holden!
>>>
>>> Are there any other tricks in PySpark which are hard to debug using the UI
>>> or toDebugString?
>>>
>>> On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>
>>>> In PySpark the filter and then map steps are combined into a single
>>>> transformation from the JVM point of view. This allows us to avoid copying
>>>> the data back to Scala between the filter and the map steps. The debugging
>>>> experience is certainly much harder in PySpark, and I think it's an
>>>> interesting area for those interested in contributing :)
>>>>
>>>> On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com> wrote:
>>>>
>>>>> This Scala code:
>>>>>
>>>>> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>>>>>      |   filter(x => !x.contains("INFO")).
>>>>>      |   map(x => (x.split("\t")(1), 1)).
>>>>>      |   reduceByKey((x, y) => x + y)
>>>>>
>>>>> generated an obvious lineage:
>>>>>
>>>>> (2) ShuffledRDD[4] at reduceByKey at <console>:27 []
>>>>>  +-(2) MapPartitionsRDD[3] at map at <console>:26 []
>>>>>     |  MapPartitionsRDD[2] at filter at <console>:25 []
>>>>>     |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
>>>>>     |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
>>>>>
>>>>> But this Python code:
>>>>>
>>>>> logs = sc.textFile("../log.txt")\
>>>>>     .filter(lambda x: 'INFO' not in x)\
>>>>>     .map(lambda x: (x.split('\t')[1], 1))\
>>>>>     .reduceByKey(lambda x, y: x + y)
>>>>>
>>>>> generated something strange which is hard to follow:
>>>>>
>>>>> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>>>>>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>>>>>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>>>>>  +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>>>     |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>>>     |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
>>>>>     |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
>>>>>
>>>>> Why is that? Does PySpark do some optimization under the hood? This debug
>>>>> string is really useless for debugging.
>>>>>
>>>>> --
>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Core-Python-and-Scala-generate-different-DAGs-for-identical-code-tp28674.html
>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
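
For readers following along, below is a minimal sketch of the persist workaround Holden mentions, assuming a recent PySpark; the SparkContext app name and the show_lineage helper are illustrative, the log path is taken from the thread, and the exact lineage text varies by Spark version. Persisting the intermediate RDD marks it as cached, so the later map cannot be fused into the same Python-side pipeline, and the filter and map surface as separate PythonRDDs in toDebugString.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="lineage-debug-sketch")  # illustrative app name

    def show_lineage(rdd):
        # toDebugString returns bytes in PySpark; decode for readable output
        dbg = rdd.toDebugString()
        print(dbg.decode("utf-8") if isinstance(dbg, bytes) else dbg)

    # Pipelined version: the filter and map are fused into one PythonRDD,
    # so the lineage shows a single opaque Python stage (as in the thread).
    pipelined = (sc.textFile("../log.txt")
                   .filter(lambda x: 'INFO' not in x)
                   .map(lambda x: (x.split('\t')[1], 1))
                   .reduceByKey(lambda x, y: x + y))
    show_lineage(pipelined)

    # Workaround: persist the intermediate RDD. A cached RDD is not extended
    # by further Python-side pipelining, so the map starts a new PythonRDD
    # and the filter and map appear as separate steps in the lineage.
    filtered = sc.textFile("../log.txt").filter(lambda x: 'INFO' not in x)
    filtered.persist(StorageLevel.MEMORY_ONLY)
    counts = (filtered
              .map(lambda x: (x.split('\t')[1], 1))
              .reduceByKey(lambda x, y: x + y))
    show_lineage(counts)

Note that persisting purely for lineage readability costs memory, so it is something to do while debugging rather than in the final job.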