I've got ~500 tab-delimited log files, roughly 25 GB each, where each line has a page name, the userId that viewed the page, and a timestamp.
I'm trying to build a basic Spark app that computes the unique visitors per page. I was able to get this working with Spark SQL by registering an RDD of a case class as a table and running a SELECT with COUNT(DISTINCT userId). Reading through the documentation, I found that an RDD can be partitioned, but it looks like it has to be in key/value form for that. I'd like to re-partition the RDD by grouping each page to its list of userIds (ignoring the timestamp for now), but I'm not clear on how to convert the RDD returned by sc.textFile("rawdata/").map(_.split("\t")) into key/value pairs. Can anyone please shed some light on the options I have for partitioning the RDD to get better performance? Simplified versions of what I have now and of what I'm attempting are below. Really appreciate any input, thanks!
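For context, this is roughly what I have today (I'm on Spark 1.1; PageView, the table name, and the app name are just names I picked, and the column order matches my data as described above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// One row per page view: page name, userId, timestamp (tab-delimited)
case class PageView(page: String, userId: String, timestamp: String)

val sc = new SparkContext(new SparkConf().setAppName("UniqueVisitors"))  // already provided as sc in spark-shell
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

val views = sc.textFile("rawdata/")
  .map(_.split("\t"))
  .map(f => PageView(f(0), f(1), f(2)))

views.registerTempTable("pageviews")  // registerAsTable on Spark 1.0

// Unique visitors per page via Spark SQL
val uniques = sqlContext.sql(
  "SELECT page, COUNT(DISTINCT userId) FROM pageviews GROUP BY page")
uniques.collect().foreach(println)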
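And this is the direction I'm attempting for the key/value conversion and re-partitioning; the partition count is an arbitrary guess on my part, and I'm not sure this is the right approach at all:

import org.apache.spark.HashPartitioner

// Turn each split line into a (page, userId) pair, i.e. key/value form
val pairs = sc.textFile("rawdata/")
  .map(_.split("\t"))
  .map(f => (f(0), f(1)))  // key = page, value = userId; timestamp ignored

// Re-partition so all the userIds for a page land in the same partition
// (200 partitions is just a guess)
val grouped = pairs.groupByKey(new HashPartitioner(200))  // page -> Iterable[userId]

// Unique visitors per page
val uniquesPerPage = grouped.mapValues(_.toSet.size)
uniquesPerPage.collect().foreach(println)

I realize grouping every userId per page could be heavy at this data size, which is part of why I'm asking what my options are here.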