This Scala code:
scala> val logs = sc.textFile("big_data_specialization/log.txt").
| filter(x => !x.contains("INFO")).
| map(x => (x.split("\t")(1), 1)).
| reduceByKey((x, y) => x + y)
generated obvious lineage:
(2) ShuffledRDD[4] at reduceByKey at <console>:27 []
+-(2) MapPartitionsRDD[3] at map at <console>:26 []
| MapPartitionsRDD[2] at filter at <console>:25 []
| big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at
<console>:24 []
| big_data_specialization/log.txt HadoopRDD[0] at textFile at
<console>:24 []
But Python code:
logs = sc.textFile("../log.txt")\
.filter(lambda x: 'INFO' not in x)\
.map(lambda x: (x.split('\t')[1], 1))\
.reduceByKey(lambda x, y: x + y)
generated something strange which is hard to follow:
(2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
| ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
+-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
| PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
| ../log.txt MapPartitionsRDD[8] at textFile at
NativeMethodAccessorImpl.java:0 []
| ../log.txt HadoopRDD[7] at textFile at
NativeMethodAccessorImpl.java:0 []
Why is that? Does pyspark do some optimizations under the hood? This debug
string is really useless for debugging.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Core-Python-and-Scala-generate-different-DAGs-for-identical-code-tp28674.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]