If you are counting per time and per page, then you need to group by
time and page, not just page. Something more like:

csv.groupBy(csv => (csv(0),csv(1))) ...

This gives the rows for each (time,page) key. As Nick suggests, you then
pull out the userId and count the distinct values for each key:

... .mapValues(_.map(_(2)).toSet.size)
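
Putting the two steps together, here is a minimal self-contained sketch
of the exact count. The object name, the dayOf helper and the local
master are my own assumptions for illustration. So is truncating the
timestamp to a day: the question asks for counts "by date", and I am
assuming the timestamps are epoch seconds in UTC.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.2 and earlier

object DistinctUsersPerDayPage {
  // Assumption: the timestamp is epoch seconds; integer division by
  // 86400 yields a UTC day number, so all rows of one day share a key.
  def dayOf(ts: String): Long = ts.trim.toLong / 86400L

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("distinct-users").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The three sample rows from the question, minus the header line.
    val lines = sc.parallelize(Seq(
      "1405377264,google,user1",
      "1405378589,google,user2",
      "1405380012,yahoo,user1"))
    val csv = lines.map(_.split(","))

    val counts = csv
      .map(f => ((dayOf(f(0)), f(1)), f(2))) // key = (day, page), value = userId
      .groupByKey()                          // all userIds seen for each key
      .mapValues(_.toSet.size)               // exact number of distinct userIds

    counts.collect().foreach(println)
    sc.stop()
  }
}
```

All three sample timestamps fall on the same UTC day, so this yields 2
for google and 1 for yahoo. Keep in mind that groupByKey holds every
userId for a key in memory at once, which can hurt for very hot pages.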

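If you can tolerate some approximation, then using
countApproxDistinctByKey will be a lot faster. It counts the distinct
values per key, so key each row by (time,page) with the userId as the
value instead of grouping first:

csv.map(csv => ((csv(0),csv(1)), csv(2))).countApproxDistinctByKey()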
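
For the approximate route, a minimal sketch along the same lines. The
object name, the toPair helper and the local master are assumptions for
illustration; countApproxDistinctByKey itself is the pair-RDD method, and
0.05 is its default relative accuracy, written out here only to make it
visible.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.2 and earlier

object ApproxDistinctUsers {
  // Hypothetical helper: turn one split CSV row into ((time, page), userId).
  def toPair(f: Array[String]): ((String, String), String) = ((f(0), f(1)), f(2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("approx-distinct-users").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Sample rows from the question, minus the header line.
    val lines = sc.parallelize(Seq(
      "1405377264,google,user1",
      "1405378589,google,user2",
      "1405380012,yahoo,user1"))
    val csv = lines.map(_.split(","))

    // Estimates the number of distinct userIds per key without
    // collecting every userId for a key in one place.
    val approx = csv.map(toPair).countApproxDistinctByKey(0.05)

    approx.collect().foreach(println)
    sc.stop()
  }
}
```

A smaller relative accuracy value gives tighter estimates at the cost of
more memory per key, so it is worth tuning against your data volume.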

On Tue, Jul 15, 2014 at 7:14 PM, buntu <[email protected]> wrote:
> Hi --
>
> New to Spark and trying to figure out how to generate unique counts per
> page by date given this raw data:
>
> timestamp,page,userId
> 1405377264,google,user1
> 1405378589,google,user2
> 1405380012,yahoo,user1
> ..
>
> I can do a groupBy a field and get the count:
>
> val lines=sc.textFile("data.csv")
> val csv=lines.map(_.split(","))
> // group by page
> csv.groupBy(_(1)).count
>
> But not able to see how to do count distinct on userId and also apply
> another groupBy on timestamp field. Please let me know how to handle such
> cases.
>
> Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
