I've got ~500 tab-delimited log files, ~25 GB each, where every row has a page
name, the userId that viewed the page, and a timestamp.

I'm trying to build a basic Spark app to get the unique visitors per page. I
was able to achieve this with Spark SQL by registering an RDD of a case class
as a table and running a select with count(distinct(userId)).
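For reference, the Spark SQL version I have looks roughly like this (the case
class, table name, and paths are just illustrative, and I'm typing from
memory):

```scala
// Sketch of the Spark SQL approach; names here are made up for illustration.
case class PageView(page: String, userId: String, ts: String)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

// Parse each tab-delimited line into a PageView.
val views = sc.textFile("rawdata/")
  .map(_.split("\t"))
  .map(f => PageView(f(0), f(1), f(2)))

views.registerTempTable("views")
val uniques = sqlContext.sql(
  "SELECT page, COUNT(DISTINCT userId) FROM views GROUP BY page")
```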

I was reading through the documentation and found that an RDD can be
partitioned, but it looks like it needs to be in key/value form (?). I want to
re-partition the RDD so that each page is grouped with its list of userIds
(ignoring the timestamp for now), but I'm not quite clear on how to convert
the RDD returned by sc.textFile("rawdata/").map(_.split("\t")) into key/value
pairs.
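Here's the direction I'm guessing at: map each split line to a (page, userId)
tuple so it becomes a pair RDD that can be partitioned by key. The partition
count below is just a placeholder, and I'm not sure this is the idiomatic way:

```scala
import org.apache.spark.HashPartitioner

// Turn each split line into a (page, userId) pair; a pair RDD can be
// partitioned by its key.
val pairs = sc.textFile("rawdata/")
  .map(_.split("\t"))
  .map(f => (f(0), f(1)))                 // (page, userId), timestamp dropped
  .partitionBy(new HashPartitioner(100))  // 100 partitions is a guess
  .persist()

// Unique visitors per page without Spark SQL: dedupe (page, userId)
// pairs, then count occurrences of each page.
val uniquesPerPage = pairs.distinct().countByKey()
```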

Can anyone please shed some light on the options I have for partitioning the
RDD to get better performance? Really appreciate any input, thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp12083.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
