If you are counting per time and per page, then you need to group by time and page, not just page. Something more like:

    csv.groupBy(r => (r(0), r(1))) ...

This gives the list of rows per (time, page). As Nick suggests, you then count the distinct users for each key. Each value is a whole row, so pull out the userId column first:

    ... .mapValues(_.map(_(2)).toSet.size)

If you can tolerate some approximation, then using countApproxDistinctByKey will be a lot faster. It works on (key, value) pairs directly, so map each row to ((time, page), userId) instead of grouping:

    csv.map(r => ((r(0), r(1)), r(2))).countApproxDistinctByKey()

On Tue, Jul 15, 2014 at 7:14 PM, buntu <[email protected]> wrote:
> Hi --
>
> New to Spark and trying to figure out how to generate unique counts per
> page by date given this raw data:
>
> timestamp,page,userId
> 1405377264,google,user1
> 1405378589,google,user2
> 1405380012,yahoo,user1
> ..
>
> I can do a groupBy a field and get the count:
>
> val lines=sc.textFile("data.csv")
> val csv=lines.map(_.split(","))
> // group by page
> csv.groupBy(_(1)).count
>
> But not able to see how to do count distinct on userId and also apply
> another groupBy on timestamp field. Please let me know how to handle such
> cases.
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
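
For reference, the approach in the reply above could be sketched end to end roughly as follows. This is only an illustration under some assumptions: the object name and the dayOf helper are hypothetical, the timestamp is taken to be epoch seconds and is bucketed into UTC days (the question asks for counts "by date"), and the three sample rows are inlined instead of read from data.csv. The exact count uses distinct() plus reduceByKey rather than grouping, which avoids holding every row for a key in memory at once.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// On Spark versions before 1.3, also: import org.apache.spark.SparkContext._

object DistinctUsersPerDayAndPage {
  // Hypothetical helper: epoch seconds -> days since 1970-01-01 (UTC).
  def dayOf(epochSeconds: String): Long = epochSeconds.trim.toLong / 86400L

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("distinct-users").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample rows from the question, inlined for the sketch.
      val lines = sc.parallelize(Seq(
        "1405377264,google,user1",
        "1405378589,google,user2",
        "1405380012,yahoo,user1"))

      // Key each row by (day, page) and keep only the userId as the value.
      val keyed = lines.map(_.split(",")).map(r => ((dayOf(r(0)), r(1)), r(2)))

      // Exact: drop duplicate ((day, page), userId) pairs, then count per key.
      val exact = keyed.distinct().mapValues(_ => 1L).reduceByKey(_ + _)
      exact.collect().foreach(println)

      // Approximate: HyperLogLog-based estimate, relative accuracy about 5%.
      val approx = keyed.countApproxDistinctByKey(0.05)
      approx.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

All three sample timestamps fall on the same UTC day, so the exact count should come out as 2 users for google and 1 for yahoo on that day. To key by the raw timestamp instead, as in the reply, replace dayOf(r(0)) with r(0).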
