Re: SQL query in scala API

2014-12-06 Thread Arun Luthra
Thanks, I will try this. On Fri, Dec 5, 2014 at 1:19 AM, Cheng Lian wrote: > Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write > your own aggregation with aggregateByKey: > > users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) => > (count + 1, seen

Re: SQL query in scala API

2014-12-05 Thread Cheng Lian
Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write your own aggregation with aggregateByKey: users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) => (count + 1, seen + user) }, { case ((count0, seen0), (count1, seen1)) => (count0 + count1, seen0 ++
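
The snippet above is cut off in the archive, so here is a minimal sketch of the same idea rather than the original message's code. It assumes the accumulator is a pair of (row count, set of users seen), and it runs the two functions over plain Scala collections so that it works without a Spark cluster. The names `ZipCounts`, `seqOp`, `combOp`, and the sample data are hypothetical. In Spark, the same `zero`, `seqOp`, and `combOp` would be passed to `aggregateByKey` on the `users` pair RDD.

```scala
object ZipCounts {
  // Accumulator per zip: (number of rows, set of distinct users seen so far).
  type Acc = (Int, Set[String])

  val zero: Acc = (0, Set.empty[String])

  // seqOp: fold one user id into a partition-local accumulator.
  def seqOp(acc: Acc, user: String): Acc = (acc._1 + 1, acc._2 + user)

  // combOp: merge the accumulators produced by two partitions.
  def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 ++ b._2)

  // Hypothetical sample data: (zip, user) pairs split into two "partitions".
  val partitions: Seq[Seq[(String, String)]] = Seq(
    Seq(("94105", "alice"), ("94105", "bob"), ("10001", "carol")),
    Seq(("94105", "alice"), ("10001", "carol"), ("10001", "dave"))
  )

  // Local stand-in for aggregateByKey: apply seqOp within each partition...
  val perPartition: Seq[Map[String, Acc]] = partitions.map { part =>
    part.groupBy(_._1).map { case (zip, rows) =>
      (zip, rows.map(_._2).foldLeft(zero)(seqOp))
    }
  }

  // ...then apply combOp across partitions.
  val merged: Map[String, Acc] =
    perPartition.flatten.groupBy(_._1).map { case (zip, accs) =>
      (zip, accs.map(_._2).foldLeft(zero)(combOp))
    }

  // zip -> (COUNT(user), COUNT(DISTINCT user))
  val result: Map[String, (Int, Int)] =
    merged.map { case (zip, (count, seen)) => (zip, (count, seen.size)) }

  def main(args: Array[String]): Unit =
    result.foreach { case (zip, (count, distinct)) =>
      println(s"$zip count=$count distinct=$distinct")
    }
}
```

Keeping a `Set` per key gives an exact distinct count, but memory grows with the number of distinct users per zip, so this sketch suits keys with modest cardinality.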

Re: SQL query in scala API

2014-12-04 Thread Stéphane Verlet
Disclaimer: I am new to Spark. I did something similar in a prototype, which works but which I have not yet tested at scale: val agg = users.mapValues(_ => 1).aggregateByKey(new CustomAggregation())(CustomAggregation.sequenceOp, CustomAggregation.comboOp) class CustomAggregation() extends

Re: SQL query in scala API

2014-12-04 Thread Arun Luthra
Is that Spark SQL? I'm wondering if it's possible without Spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian wrote: > You may do this: > > table("users").groupBy('zip)('zip, count('user), countDistinct('user)) > > On 12/4/14 8:47 AM, Arun Luthra wrote: > > I'm wondering how to do this kind

Re: SQL query in scala API

2014-12-03 Thread Cheng Lian
You may do this: table("users").groupBy('zip)('zip, count('user), countDistinct('user)) On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark Scala API

SQL query in scala API

2014-12-03 Thread Arun Luthra
I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark Scala API, I can make an RDD (called "users") of key-value pairs where the keys are zip (as in ZIP code) and the values are user IDs. Then I ca