Hi, I'm new to Spark. I've played with some data locally, but I'm starting to wonder if I'm going down the wrong track by using Scala collections inside RDDs.
I'm looking at a log file of events from mobile clients. One of the engagement metrics we're interested in is lifetime (not terribly interesting on its own, but reused in other metrics). To calculate this I've written the following:

    val events = csvFile.map(line => line.split(","))

    val sessionEventsByUsers = events
      .filter(e => "session.start".equals(e(0)) || "session.end".equals(e(0)))
      .groupBy(_(2))

This gives me an RDD[(String, Seq[Array[String]])], which is: user id -> Seq[Array(event, timestamp, userid, sessionid)].

    sessionEventsByUsers
      .map(x => (x._1, x._2.map(_(1).toLong).sorted))
      .map(x => (x._1, x._2.last - x._2.head))

This gives us our result: user id -> lifetime.

I'm wondering whether using Scala collections inside the RDD like this is a good idea; from a brief glance at the tuning guide I'm cautious. In this contrived case a user could potentially have thousands of events, and I'm wondering what happens behind the scenes: do the operations on the nested Scala collections (the inner map, sorted, last and head) get distributed to worker nodes, or does it all run on the driver node (i.e. not horizontally scalable)? I'm hopeful for the former and suspect that's the case (I found that seq.grouped(x) doesn't work due to scala.collection.Iterator not being serializable).

Is there a way to do something like this without using the nested Scala collections and transformations? I don't know any tricks yet :)

Thank you!
Peter
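P.S. The closest alternative I could sketch myself is a reduceByKey-based version that folds the min/max timestamp per user directly, so no nested Seq ever gets built. This is untested and assumes csvFile is an RDD[String] of "event,timestamp,userid,sessionid" lines, as above. Is this the kind of approach I should be aiming for?

    // Needed for pair-RDD operations like reduceByKey on the Spark version I'm using
    import org.apache.spark.SparkContext._

    val events = csvFile.map(_.split(","))

    // Emit (userid, (timestamp, timestamp)) for session boundary events only,
    // then fold the per-user min and max timestamps without collecting them into a Seq.
    val lifetimes = events
      .filter(e => "session.start".equals(e(0)) || "session.end".equals(e(0)))
      .map(e => (e(2), (e(1).toLong, e(1).toLong)))
      .reduceByKey((a, b) => (math.min(a._1, b._1), math.max(a._2, b._2)))
      .map { case (userId, (first, last)) => (userId, last - first) }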