Hello,

I am new to spark. For one of  job I am finding significant performance 
difference when run in pyspark vs scala. Could you please let me know if this 
is known and scala is preferred over python for writing spark jobs? Also DAG 
visualization shows completely different DAGs for scala and pyspark. I have 
pasted DAG for both using toDebugString() method. Let me know if you need any 
additional information.

Time for Job in scala : 52 secs

Time for job in pyspark : 4.2 min


Scala code in Zepplin:

val lines = sc.textFile("s3://[test-bucket]/output2/")
val words = lines.flatMap(line => line.split(" "))
val filteredWords = words.filter(word => word.equals("Gutenberg") || 
word.equals("flower") || word.equals("a"))
val wordMap = filteredWords.map(word => (word, 1)).reduceByKey(_ + _)
wordMap.collect()

pyspark code in Zepplin:

lines = sc.textFile("s3://[test-bucket]/output2/")
words = lines.flatMap(lambda x: x.split())
filteredWords = words.filter(lambda x: (x == "Gutenberg" or x == "flower" or x 
== "a"))
result = filteredWords.map(lambda x: (x, 1)).reduceByKey(lambda a,b: 
a+b).collect()
print result


Scala final RDD:

print wordMap.toDebugString()

 lines: org.apache.spark.rdd.RDD[String] = s3://[test-bucket]/output2/ 
MapPartitionsRDD[108] at textFile at <console>:30 words: 
org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[109] at flatMap at 
<console>:31 filteredWords: org.apache.spark.rdd.RDD[String] = 
MapPartitionsRDD[110] at filter at <console>:33 wordMap: 
org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[112] at reduceByKey at 
<console>:35 (10) ShuffledRDD[112] at reduceByKey at <console>:35 [] +-(10) 
MapPartitionsRDD[111] at map at <console>:35 [] | MapPartitionsRDD[110] at 
filter at <console>:33 [] | MapPartitionsRDD[109] at flatMap at <console>:31 [] 
| s3://[test-bucket]/output2/ MapPartitionsRDD[108] at textFile at <console>:30 
[] | s3://[test-bucket]/output2/ HadoopRDD[107] at textFile at <console>:30 []


PySpark final RDD:

println(wordMap.toDebugString)

(10) PythonRDD[119] at RDD at PythonRDD.scala:43 [] | 
s3://[test-bucket]/output2/ MapPartitionsRDD[114] at textFile at null:-1 [] | 
s3://[test-bucket]/output2/HadoopRDD[113] at textFile at null:-1 [] 
PythonRDD[120] at RDD at PythonRDD.scala:43


Thanks,

Pratik

Reply via email to