So this Python-side pipelining happens in a lot of places, which can make debugging extra challenging. Some people work around it by calling persist, which breaks the pipelining during debugging. If you're interested in more general Python debugging, I've got a YouTube video on the topic which could be a good intro (of course I'm pretty biased about that).
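As a rough sketch (untested, reusing the file and lambdas from the thread below), persisting the filtered RDD should keep the following map from being pipelined into the same PythonRDD, so toDebugString shows the intermediate step again:

    logs = sc.textFile("../log.txt") \
        .filter(lambda x: 'INFO' not in x)
    logs.persist()  # a cached RDD is not pipelined into the next Python transformation
    counts = logs.map(lambda x: (x.split('\t')[1], 1)) \
        .reduceByKey(lambda x, y: x + y)
    print(counts.toDebugString())  # may come back as bytes in some PySpark versions

The trade-off is that persist changes the job (extra caching), so it's a debugging aid rather than something you'd leave in production code.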
On Wed, May 10, 2017 at 9:42 AM Pavel Klemenkov <pklemen...@gmail.com> wrote:

> Thanks for the quick answer, Holden!
>
> Are there any other tricks with PySpark which are hard to debug using UI
> or toDebugString?
>
> On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> In PySpark the filter and then map steps are combined into a single
>> transformation from the JVM point of view. This allows us to avoid copying
>> the data back to Scala in between the filter and the map steps. The
>> debugging experience is certainly much harder in PySpark and I think it is an
>> interesting area for those interested in contributing :)
>>
>> On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com> wrote:
>>
>>> This Scala code:
>>>
>>> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>>>      |   filter(x => !x.contains("INFO")).
>>>      |   map(x => (x.split("\t")(1), 1)).
>>>      |   reduceByKey((x, y) => x + y)
>>>
>>> generated an obvious lineage:
>>>
>>> (2) ShuffledRDD[4] at reduceByKey at <console>:27 []
>>>  +-(2) MapPartitionsRDD[3] at map at <console>:26 []
>>>     |  MapPartitionsRDD[2] at filter at <console>:25 []
>>>     |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
>>>     |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
>>>
>>> But this Python code:
>>>
>>> logs = sc.textFile("../log.txt")\
>>>     .filter(lambda x: 'INFO' not in x)\
>>>     .map(lambda x: (x.split('\t')[1], 1))\
>>>     .reduceByKey(lambda x, y: x + y)
>>>
>>> generated something strange which is hard to follow:
>>>
>>> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>>>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>>>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>>>  +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>     |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>     |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
>>>     |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
>>>
>>> Why is that? Does PySpark do some optimizations under the hood? This debug
>>> string is really useless for debugging.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Core-Python-and-Scala-generate-different-DAGs-for-identical-code-tp28674.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
> --
> Yours faithfully, Pavel Klemenkov.
>
--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau