I'm running reduceByKey in Spark. My program is the simplest Spark example,
a word count:
val counts = textFile.flatMap(line => line.split(" ")).repartition(20000)
  .map(word => (word, 1))
  .reduceByKey(_ + _, 10000)
counts.saveAsTextFile("hdfs://...")
but it always runs out of memory.
I'm using 50 servers, with 35 executors per server and 140GB of memory per
server. The input is 8TB of documents: 20 billion documents and 1000 billion
words in total. After the reduce there will be about 100 million distinct words.
How should I configure Spark for this job? In particular, what values should
these parameters have?
1. The number of map tasks: 20000, for example?
2. The number of reduce tasks: 10000, for example?
3. Any other parameters?
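
For reference, here is a rough back-of-the-envelope sketch of the sizes implied by the numbers above. It assumes the 140GB per server is split evenly across the 35 executors, that 8TB means 8 * 10^12 bytes, and that data is spread evenly across partitions; none of these is guaranteed in practice. The object name SizingSketch is only for illustration.

```scala
// Rough sizing arithmetic for the job described above.
// Assumptions (not measured): even memory split per executor,
// 8TB = 8 * 10^12 bytes, and perfectly even partitioning.
object SizingSketch {
  val servers = 50
  val executorsPerServer = 35
  val memoryPerServerGB = 140.0
  val inputBytes = 8e12
  val mapPartitions = 20000      // value passed to repartition
  val reducePartitions = 10000   // value passed to reduceByKey
  val distinctWords = 100e6      // expected number of keys after the reduce

  // Total executors in the cluster.
  val totalExecutors: Int = servers * executorsPerServer
  // Memory available to each executor if the server memory is split evenly.
  val memoryPerExecutorGB: Double = memoryPerServerGB / executorsPerServer
  // Average input size handled by each map-side partition, in MB.
  val inputPerMapPartitionMB: Double = inputBytes / mapPartitions / 1e6
  // Average number of distinct keys landing in each reduce partition.
  val keysPerReducePartition: Double = distinctWords / reducePartitions

  def main(args: Array[String]): Unit = {
    println(s"executors in total:         $totalExecutors")
    println(s"memory per executor (GB):   $memoryPerExecutorGB")
    println(s"input per map partition MB: $inputPerMapPartitionMB")
    println(s"keys per reduce partition:  $keysPerReducePartition")
  }
}
```

Under these assumptions each executor has about 4GB, while each of the 20000 map-side partitions covers about 400MB of raw text.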
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/run-reduceByKey-on-huge-data-in-spark-tp23546.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.