Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Sean!! That's what I was looking for -- group by on multiple fields. I'm gonna play with it now. Thanks again! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9803.html Sent from the Apache Spark User List mail

Re: Count distinct with groupBy usage

2014-07-15 Thread Sean Owen
If you are counting per time and per page, then you need to group by time and page, not just page. Something more like: csv.groupBy(csv => (csv(0),csv(1))) ... This gives a list of users per (time,page). As Nick suggests, you then count the distinct values for each key: ... .mapValues(_.distinct.
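A minimal sketch of the grouping described above, written against plain Scala collections rather than a Spark RDD so that it runs without a cluster. The column layout (date, page, user at indices 0, 1, 2), the sample rows, and the names `UniqueVisitors`, `sampleRows`, and `countPerDateAndPage` are assumptions for illustration and do not come from the thread.

```scala
object UniqueVisitors {
  // Hypothetical sample rows: Array(date, page, user).
  // The column positions are an assumption, not taken from the thread.
  val sampleRows: Seq[Array[String]] = Seq(
    Array("2014-07-15", "/home", "alice"),
    Array("2014-07-15", "/home", "bob"),
    Array("2014-07-15", "/home", "alice"), // repeat visit, counted once
    Array("2014-07-15", "/docs", "alice"),
    Array("2014-07-16", "/home", "bob")
  )

  // Group by the composite key (date, page), then count the distinct
  // users (column 2) inside each group.
  def countPerDateAndPage(rows: Seq[Array[String]]): Map[(String, String), Int] =
    rows
      .groupBy(row => (row(0), row(1)))
      .map { case (key, group) => key -> group.map(_(2)).distinct.size }
}
```

Calling `UniqueVisitors.countPerDateAndPage(UniqueVisitors.sampleRows)` yields one entry per (date, page) pair. On an RDD the same shape would presumably be a `groupBy` on the tuple followed by `mapValues`, as in the snippet quoted above.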

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
That is correct, Raffy. Assuming I convert the timestamp field to a date in the required format, is it possible to report it by date?

Re: Count distinct with groupBy usage

2014-07-15 Thread Raffael Marty
> All I'm attempting is to report number of unique visitors per page by date.
But the way you are doing it currently, you will get a count per second. You have to bucketize your dates by whatever time resolution you want. -raffy
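A minimal sketch of bucketing a timestamp to day resolution before grouping. The thread does not show the timestamp format, so this assumes epoch seconds interpreted in UTC; the names `DateBuckets` and `toDate` are hypothetical.

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

object DateBuckets {
  // Assumption: the raw timestamp is epoch seconds and days are cut in UTC.
  // Every second within the same UTC day maps to the same LocalDate, so
  // grouping on the result gives per-day rather than per-second counts.
  def toDate(epochSeconds: Long): LocalDate =
    Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).toLocalDate
}
```

Using `DateBuckets.toDate(ts)` in place of the raw timestamp in the grouping key collapses all visits on one day into a single bucket. A coarser or finer resolution (hour, week) would need a different truncation.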

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Nick. All I'm attempting is to report the number of unique visitors per page by date.

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
We have CDH 5.0.2, which doesn't include Spark SQL yet; it may only be available in CDH 5.1, which is yet to be released. If Spark SQL is the only option, then I might need to hack around to add it to the current CDH deployment, if that's possible.

Re: Count distinct with groupBy usage

2014-07-15 Thread Zongheng Yang
Sounds like a job for Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html
On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath wrote:
> You can use .distinct.count on your user RDD.
>
> What are you trying to achieve with the time group by?

Re: Count distinct with groupBy usage

2014-07-15 Thread Nick Pentreath
You can use .distinct.count on your user RDD. What are you trying to achieve with the time group by?
On Tue, Jul 15, 2014 at 8:14 PM, buntu wrote:
> Hi --
> New to Spark and trying to figure out how to generate unique counts per
> page by date given this raw data:
> ti