Hi,

I am trying to compute the number of unique users from a year's worth of
data. There are about 300 files, and each file is quite large (roughly a GB).
I first tried this without a loop by reading all the files in the directory
with a glob pattern: sc.textFile("dir/*"). But the tasks were stalling and I
was getting a "Too many open files" warning, even though I had increased the
nofile limit to 500K. More than 200K shuffle tasks were being created, and
they were all generating shuffle files. Setting consolidateFiles to true did
not help.
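
For what it's worth, one variation I was considering if I go back to the glob
read is to pass an explicit partition count to distinct(), so the reduce side
of the shuffle is capped instead of creating 200K+ tasks. Roughly like this
(the figure of 1000 is just a guess I would tune, not something I have tested):

val counts = sc.textFile("dir/*")
               .map(line => {
                    val fields = line.split("\t")
                    (fields(11), fields(6))    // (year-month, user_id)
                   })
               .distinct(1000)                 // cap the reduce side at 1000 partitions
               .countByKey()                   // unique users per year-month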

So now I am reading the files in a loop, as shown in the code below, and I no
longer run into the "Too many open files" issue. But countByKey is taking a
really long time (more than 15 hours and still ongoing). From the logs it
appears that this operation is happening on a single node, but I cannot tell
why it is taking so long. Each node has 16 GB of memory and the Mesos cluster
has 16 nodes. I have set spark.serializer to KryoSerializer, and I am not
running into any out-of-memory errors so far. Is there some way to improve
the performance? Thanks.

var ucount = 0L    // running total of unique users for the year

for (i <- 1 to 300) {
  val f = "file" + i                        // name of the i-th input file
  val user_time = sc.textFile(f)
                    .map(line => {
                         val fields = line.split("\t")
                         (fields(11), fields(6))    // extract (year-month, user_id)
                        })
                    .distinct()
                    .countByKey()           // unique users per year-month (a Map on the driver)

  // convert the Map back to an RDD in order to output the results
  val ut_rdd = sc.parallelize(user_time.toSeq)

  // collect back to an array to extract the count; need to find an easier way to do this
  val ar = ut_rdd.collect()

  // aggregate the count for the year (assumes each file covers a single year-month)
  ucount = ucount + ar(0)._2
}
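
One alternative I'm wondering about is doing the whole year in a single job
instead of the per-file loop, so the distinct/count work stays distributed
rather than funnelling through countByKey once per file. A rough, untested
sketch of what I mean (the partition count is again just a guess to tune):

// pass all 300 file names to textFile as a comma-separated list
val files = (1 to 300).map(i => "file" + i).mkString(",")

val per_month = sc.textFile(files)
                  .map(line => {
                       val fields = line.split("\t")
                       (fields(11), fields(6))    // (year-month, user_id)
                      })
                  .distinct(2000)                 // dedupe (year-month, user_id) pairs
                  .mapValues(_ => 1L)
                  .reduceByKey(_ + _)             // unique users per year-month, still an RDD

// sum the monthly counts for the yearly total (counts a user once per month,
// same as the loop above)
val ucount = per_month.values.sum()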



