Hi,

I have a piece of code that reads all the CSV files in a folder. For each
file, it parses every line, extracts the first two elements from each row,
groups the resulting tuples by key, and finally outputs the number of
unique values for each key.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("App")
    val sc = new SparkContext(conf)

    // union of all files in the directory
    val user_time = sc.union(sc.textFile("/directory/*"))
      .map { line =>
        val fields = line.split(",")
        (fields(1), fields(0))              // extract the first 2 elements of each row
      }
      .groupByKey()                         // group by timestamp
      .map(g => (g._1, g._2.toSet.size))    // number of unique ids per timestamp

I have a lot of files in the directory (several hundred), and the program
takes a long time. I am not sure whether the union operation is preventing
the files from being processed in parallel. Is there a better way to
parallelize the above code? For example, the first two operations (reading
each file and extracting the first two columns) could be done in parallel,
but I am not sure if that is how Spark schedules this code.
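For what it's worth, below is a sketch of the kind of rewrite I was
considering: dropping the union (since the glob in textFile should already
cover all files) and replacing groupByKey with distinct + reduceByKey so
that full value lists are not shuffled. The variable name is just for
illustration, and I am not sure whether this would actually behave any
differently with respect to parallelism:

    // Sketch only -- same input layout assumed (id in column 0, timestamp in column 1)
    val unique_ids_per_time = sc.textFile("/directory/*")   // glob reads every file
      .map { line =>
        val fields = line.split(",")
        (fields(1), fields(0))                               // (timestamp, id)
      }
      .distinct()                                            // unique (timestamp, id) pairs
      .mapValues(_ => 1)
      .reduceByKey(_ + _)                                    // count of unique ids per timestamp

Would something like this be scheduled across the files any better than the
original version?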

thanks
  




