Thanks Sean!! That's what I was looking for -- group by on multiple fields.
I'm gonna play with it now. Thanks again!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9803.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
If you are counting per time and per page, then you need to group by
time and page, not just page. Something more like:
csv.groupBy(csv => (csv(0),csv(1))) ...
This gives a list of users per (time,page). As Nick suggests, then you
count the distinct values for each key:
... .mapValues(_.distinct.size)
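Put together, a minimal sketch of the whole pipeline. Plain Scala collections stand in for the RDD here (the analogous `groupBy`/`mapValues`/`distinct` calls exist on Spark's RDD and pair-RDD APIs), and the (time, page, user) rows are made up for illustration:

```scala
// Hypothetical parsed CSV rows: (time, page, user).
// A plain List stands in for the RDD.
val csv = List(
  Array("2014-07-15", "/home", "u1"),
  Array("2014-07-15", "/home", "u2"),
  Array("2014-07-15", "/home", "u1"),  // repeat visit by u1
  Array("2014-07-16", "/news", "u1")
)

// Group by the composite (time, page) key, then count the distinct
// users in each group.
val uniquesPerTimePage = csv
  .groupBy(row => (row(0), row(1)))
  .map { case (key, rows) => key -> rows.map(_(2)).distinct.size }

// uniquesPerTimePage(("2014-07-15", "/home")) == 2
```

On a real RDD the values in each group arrive as an `Iterable`, so you may need a `.toSeq` before `.distinct` depending on your Scala version.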
That's correct, Raffy. Assuming I convert the timestamp field to a date in
the required format, is it possible to report it by date?
> All I'm attempting is to report number of unique visitors per page by date.
But the way you are doing it currently, you will get a count per second. You
have to bucketize your dates by whatever time resolution you want.
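One way to bucketize, sketched with `java.time`. The `toDate` helper and the epoch-seconds input format are assumptions here; adapt it to whatever your timestamp column actually holds:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper: truncate an epoch-seconds timestamp down to a
// yyyy-MM-dd string, so the group-by key has day resolution instead
// of second resolution.
val dayFormat =
  DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

def toDate(epochSeconds: Long): String =
  dayFormat.format(Instant.ofEpochSecond(epochSeconds))

// Then key the groupBy on (toDate(ts), page) rather than (ts, page).
println(toDate(1405382400L))  // 2014-07-15
```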
-raffy
Thanks Nick.
All I'm attempting is to report number of unique visitors per page by date.
We have CDH 5.0.2, which doesn't include Spark SQL yet; it may only be
available in CDH 5.1, which is yet to be released.
If Spark SQL is the only option then I might need to hack around to add it
into the current CDH deployment, if that's possible.
Sounds like a job for Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html !
On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath
wrote:
> You can use .distinct.count on your user RDD.
>
> What are you trying to achieve with the time group by?
You can use .distinct.count on your user RDD.
What are you trying to achieve with the time group by?
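A minimal sketch of that, with a plain Scala collection standing in for the RDD (an RDD exposes the analogous `.distinct().count()`) and a hypothetical user column:

```scala
// Hypothetical user column; on an RDD this would be
// userRdd.distinct().count()
val users = List("u1", "u2", "u1", "u3")
val uniqueUsers = users.distinct.size  // 3
```

Note this ignores time and page entirely: it is the total number of unique users across the whole dataset.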
—
Sent from Mailbox
On Tue, Jul 15, 2014 at 8:14 PM, buntu wrote:
> Hi --
> New to Spark and trying to figure out how to generate unique counts per
> page by date given this raw data:
> ti