Hi 

I'm new to Spark. I've been playing with some data locally, but I'm starting to wonder 
whether I'm going down the wrong track by using Scala collections inside RDDs. 

I'm looking at a log file of events from mobile clients. One of the engagement 
metrics we're interested in is lifetime (not terribly interesting on its own, 
but it's reused in other metrics). To calculate it I've written the following:

val events = csvFile.map(line => line.split(","))

val sessionEventsByUsers = events
  .filter(e => "session.start".equals(e(0)) || "session.end".equals(e(0)))
  .groupBy(_(2))


This gives me:
   RDD[(String, Seq[Array[String]])]  
which is:
   user id -> Seq[Array(event, timestamp, userid, sessionid)]

val lifetimes = sessionEventsByUsers
  .map(x => (x._1, x._2.map(_(1).toLong).sorted))
  .map(x => (x._1, x._2.last - x._2.head))


This gives us our result:
  user id -> lifetime 
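
For a quick sanity check on my local data I just take a handful of pairs from 
lifetimes and print them, e.g.:

lifetimes.take(10).foreach { case (user, lifetime) =>
  // print each user id alongside its computed lifetime
  println(user + " -> " + lifetime)
}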

I'm wondering if using Scala collections inside the RDD like this is a good 
idea; a brief glance at the tuning guide has made me cautious. In this contrived 
case a user could potentially have thousands of events, and I'm wondering what 
happens behind the scenes: do the operations on the nested Scala collections 
(the map/sorted calls above) get distributed to worker nodes, or does it all run 
on the driver node (i.e. not horizontally scalable)? I'm hopeful for the former 
and suspect that's the case (I found that seq.grouped(x) doesn't work due to 
scala.collection.Iterator not being serializable). 

Is there a way to do something like this without using the nested Scala 
collections & transformations? I don't really know any tricks yet :) The closest 
I've come up with is the rough sketch below.
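
This is just a guess (I haven't verified it's right, let alone idiomatic): 
reduce straight to (min, max) timestamps per user with the pair-RDD functions, 
instead of grouping whole event lists per user:

events
  .filter(e => "session.start".equals(e(0)) || "session.end".equals(e(0)))
  .map(e => (e(2), (e(1).toLong, e(1).toLong)))   // user id -> (min, max), seeded with the same timestamp
  .reduceByKey((a, b) => (math.min(a._1, b._1), math.max(a._2, b._2)))   // may need the SparkContext._ implicits in scope
  .mapValues(t => t._2 - t._1)                    // lifetime = max - min

The idea is to avoid materialising the per-user Seq at all, but I don't know 
whether that actually matters in practice, or whether the grouped version is fine.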

Thank you!
Peter
