Hi, I have a piece of code that reads all the (csv) files in a folder. For each file, it parses each line, extracts the first two elements from each row, groups the resulting tuples by key, and finally outputs the number of unique values for each key.
val conf = new SparkConf().setAppName("App")
val sc = new SparkContext(conf)

val user_time = sc.union(sc.textFile("/directory/*"))   // union of all files in the directory
  .map(line => {
    val fields = line.split(",")
    (fields(1), fields(0))                               // extract first 2 elements
  })
  .groupByKey                                            // group by timestamp
  .map(g => (g._1, g._2.toSet.size))                     // get the number of unique ids per timestamp

I have a lot of files in the directory (several hundred), and the program takes a long time. I am not sure if the union operation is preventing the files from being processed in parallel. Is there a better way to parallelize the above code? For example, the first two operations (reading each file and extracting the first two columns) can be done in parallel, but I am not sure if that is how Spark schedules the above code.

Thanks
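P.S. For what it is worth, here is a rough, untested sketch of the kind of alternative I have been considering (it assumes the same sc as above and that aggregateByKey is applicable to my case): the per-timestamp set of ids is built up during the shuffle instead of materializing every value with groupByKey.

val user_time = sc.textFile("/directory/*")              // one RDD over all files in the directory
  .map { line =>
    val fields = line.split(",")
    (fields(1), fields(0))                               // (timestamp, id)
  }
  .aggregateByKey(Set.empty[String])(                     // accumulate the set of unique ids per timestamp
    (ids, id) => ids + id,                                // add an id within a partition
    (a, b) => a ++ b                                      // merge partial sets across partitions
  )
  .mapValues(_.size)                                      // number of unique ids per timestamp

I do not know whether this actually changes how the file reads are scheduled, which is part of what I am asking.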