Hi,
I have a piece of code that reads all the (csv) files in a folder. For each
file it parses every line, extracts the first 2 elements from each row,
groups the resulting tuples by key, and finally outputs the number of
unique values for each key.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("App")
val sc = new SparkContext(conf)

val user_time = sc.union(sc.textFile("/directory/*"))   // union of all files in the directory
  .map { line =>
    val fields = line.split(",")
    (fields(1), fields(0))                               // extract the first 2 elements of each row
  }
  .groupByKey()                                          // group by timestamp
  .map { case (timestamp, ids) => (timestamp, ids.toSet.size) }  // number of unique ids per timestamp
I have a lot of files in the directory (several hundred), and the program
takes a long time. I am not sure whether the union operation is preventing
the files from being processed in parallel. Is there a better way to
parallelize the above code? For example, the first two operations (reading
each file and extracting the first 2 columns) can be done in parallel, but
I am not sure if that is how Spark schedules the code above.
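For reference, here is one alternative I was sketching (reusing the same sc
as above). It assumes textFile already handles the glob on its own, so the
explicit union may be unnecessary, and it swaps groupByKey for distinct plus
reduceByKey so full id lists are never collected per key. I have not
benchmarked it, so I may well be missing something:

val unique_ids_per_ts = sc.textFile("/directory/*")      // the glob already reads every file in parallel
  .map { line =>
    val fields = line.split(",")
    (fields(1), fields(0))                               // (timestamp, id)
  }
  .distinct()                                            // drop duplicate (timestamp, id) pairs
  .map { case (timestamp, _) => (timestamp, 1) }
  .reduceByKey(_ + _)                                    // count unique ids per timestamp

Does that look like it would actually help, or does Spark already schedule
the original version the same way?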
thanks