Re: Need some guidance

2015-04-14 Thread Victor Tso-Guillen
Thanks, yes. I was using Int for my V and didn't get the second parameter in the second closure right :)

On Mon, Apr 13, 2015 at 1:55 PM, Dean Wampler wrote:
> That appears to work, with a few changes to get the types correct:
> input.distinct().combineByKey((s: String) => 1, (agg: Int, s: String) ...

Re: Need some guidance

2015-04-13 Thread Dean Wampler
That appears to work, with a few changes to get the types correct:

input.distinct().combineByKey((s: String) => 1, (agg: Int, s: String) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2)

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
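[Editor's sketch] The three arguments to combineByKey are the createCombiner, mergeValue, and mergeCombiners functions. A minimal plain-Python simulation of that mechanism (no Spark; the two-partition split of the sample data is a hypothetical illustration, not anything from the thread):

```python
# Python equivalents of the three Scala closures in the corrected call.
create_combiner = lambda v: 1           # (s: String) => 1
merge_value = lambda agg, v: agg + 1    # (agg: Int, s: String) => agg + 1
merge_combiners = lambda a, b: a + b    # (agg1: Int, agg2: Int) => agg1 + agg2

def combine_by_key(partitions):
    """Simulate combineByKey over a list of partitions of (key, value) pairs."""
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            # First sighting of a key in a partition uses create_combiner;
            # later values for the same key are folded in with merge_value.
            acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
        per_partition.append(acc)
    # Cross-partition results for the same key are merged with merge_combiners.
    final = {}
    for acc in per_partition:
        for k, c in acc.items():
            final[k] = merge_combiners(final[k], c) if k in final else c
    return final

# Distinct pairs from the sample, split across two hypothetical partitions.
part1 = [(1, "alpha"), (1, "beta"), (2, "alpha")]
part2 = [(1, "foo"), (2, "bar"), (3, "foo")]
result = sorted(combine_by_key([part1, part2]).items())
print(result)  # [(1, 3), (2, 2), (3, 1)]
```

Because the values are already distinct, counting them with 1/+1 combiners yields the per-key unique count the original question asked for.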

Re: Need some guidance

2015-04-13 Thread Victor Tso-Guillen
How about this?

input.distinct().combineByKey((v: V) => 1, (agg: Int, x: Int) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2).collect()

On Mon, Apr 13, 2015 at 10:31 AM, Dean Wampler wrote:
> The problem with using collect is that it will fail for large data sets,
> as you'll attempt to copy ...

Re: Need some guidance

2015-04-13 Thread Dean Wampler
The problem with using collect is that it will fail for large data sets, as you'll attempt to copy the entire RDD to the memory of your driver program. The following works (Scala syntax, but similar to Python):

scala> val i1 = input.distinct.groupByKey
scala> i1.foreach(println)
(1,CompactBuffer(b ...
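[Editor's sketch] The distinct-then-groupByKey idea can be mimicked in plain Python on the sample data from the original question, without a Spark cluster:

```python
# Simulate input.distinct().groupByKey() followed by a per-key count,
# using ordinary Python collections (no Spark required).
data = [(1, "alpha"), (1, "beta"), (1, "foo"), (1, "alpha"),
        (2, "alpha"), (2, "alpha"), (2, "bar"), (3, "foo")]

distinct_pairs = set(data)      # distinct(): drop duplicate (key, value) pairs

grouped = {}                    # groupByKey(): collect values per key
for key, value in distinct_pairs:
    grouped.setdefault(key, []).append(value)

counts = sorted((k, len(vs)) for k, vs in grouped.items())
print(counts)  # [(1, 3), (2, 2), (3, 1)]
```

Note that on a real RDD, groupByKey materializes all values per key on one executor, which is why the thread later prefers combineByKey for the same result.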

Need some guidance

2015-04-13 Thread Marco Shaw
**Learning the ropes** I'm trying to grasp the concept of using the pipeline in pySpark... Simplified example:

>>> list = [(1,"alpha"),(1,"beta"),(1,"foo"),(1,"alpha"),(2,"alpha"),(2,"alpha"),(2,"bar"),(3,"foo")]

Desired outcome: [(1,3),(2,2),(3,1)]

Basically, for each key, I want the number of unique values ...
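[Editor's sketch] The desired outcome can be stated in plain Python as a dict of sets, which makes the "unique values per key" requirement concrete before bringing Spark into it:

```python
# Count distinct values per key with a dict of sets
# (pure Python restatement of the desired outcome; no Spark).
data = [(1, "alpha"), (1, "beta"), (1, "foo"), (1, "alpha"),
        (2, "alpha"), (2, "alpha"), (2, "bar"), (3, "foo")]

uniques = {}
for key, value in data:
    uniques.setdefault(key, set()).add(value)  # sets discard duplicates

result = sorted((k, len(vs)) for k, vs in uniques.items())
print(result)  # [(1, 3), (2, 2), (3, 1)]
```

The replies in the thread show how to express this same computation on an RDD with distinct plus groupByKey or combineByKey.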