Re: Segmented fold count

2014-08-20 Thread fil
> > Could I write groupCount() in Scala, and then use it from Pyspark? Care
> > to supply an example, I'm finding them hard to find :)
>
> It's doable, but not so convenient. If you really care about the
> performance difference, you should write your program in Scala.

Is it possible to wr
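For later readers, a rough sketch of the bridge route being asked about. GroupCount is a hypothetical Scala object compiled into a jar on the Spark classpath; sc._jvm and rdd._jrdd are private PySpark internals, shown here only to illustrate the mechanics, not a supported interface:

    # Hypothetical Scala side, compiled into a jar on the classpath:
    #
    #   object GroupCount {
    #     def groupCount(rdd: JavaRDD[Any]): JavaRDD[Any] = { ... }
    #   }
    #
    from pyspark import SparkContext

    sc = SparkContext(appName="scala-bridge-sketch")
    rdd = sc.parallelize([1, 1, 1, 2, 2, 3, 4, 4, 5, 1])

    # sc._jvm is the Py4J gateway into the driver JVM, and rdd._jrdd is
    # the JavaRDD underneath a Python RDD. Note the elements arrive on
    # the Scala side as pickled byte arrays, not ints, which is part of
    # why this route is "doable, but not so convenient".
    jresult = sc._jvm.GroupCount.groupCount(rdd._jrdd)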

Re: Segmented fold count

2014-08-18 Thread fil
fil wrote
> - Python functions like groupCount; these get reflected from their Python
>   AST and converted into a Spark DAG? Presumably if I try and do something
>   non-convertible this transformation process will throw an error? In other
>   words this runs in the JVM.

Further to thi
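A note for later readers: PySpark does not reflect over the Python AST or convert it to JVM code. The driver builds the operator DAG in the JVM, but the Python function itself is pickled and executed in Python worker processes on the executors, so there is no translation step that could reject "non-convertible" code. A minimal sketch of that model:

    from pyspark import SparkContext

    sc = SparkContext(appName="closure-shipping-sketch")

    # Arbitrary Python, closures included, is fine here: the lambda is
    # pickled on the driver and executed in Python worker processes on
    # the executors. Nothing is converted to JVM code, so there is no
    # "non-convertible" failure mode at definition time.
    offset = 10
    print(sc.parallelize([1, 2, 3]).map(lambda x: x + offset).collect())
    # [11, 12, 13]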

Re: Segmented fold count

2014-08-18 Thread fil
will happen frequently. So by this you mean scanning the results of mapPartitions()? Presumably I could eliminate adjacent duplicates, or specifically look for duplicates at the end/start of the different "batches" (what is the Spark term for these?) coming from different nodes in the cluster. What
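(The Spark term for those batches is partitions.) A sketch of the boundary-merge idea under discussion, counting runs inside each partition with itertools.groupby and then stitching split runs back together on the driver; the names run_counts and merge_runs are made up for illustration:

    from itertools import groupby
    from pyspark import SparkContext

    def run_counts(iterator):
        # Count runs of equal adjacent values within a single partition.
        yield [(key, sum(1 for _ in run)) for key, run in groupby(iterator)]

    def merge_runs(per_partition):
        # Stitch partitions back together in order, merging a run that
        # was split across a partition boundary with its continuation.
        merged = []
        for part in per_partition:
            for key, count in part:
                if merged and merged[-1][0] == key:
                    merged[-1] = (key, merged[-1][1] + count)
                else:
                    merged.append((key, count))
        return merged

    sc = SparkContext(appName="segmented-fold-count-sketch")
    rdd = sc.parallelize([1, 1, 1, 2, 2, 3, 4, 4, 5, 1], 3)

    # One list of runs per partition; collect() preserves partition order.
    print(merge_runs(rdd.mapPartitions(run_counts).collect()))
    # [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]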

Segmented fold count

2014-08-17 Thread fil
Can anyone assist with a scan of the following kind (Python preferred, but whatever..)? I'm looking for a kind of segmented fold count.

Input:  [1,1,1,2,2,3,4,4,5,1]
Output: [(1,3), (2,2), (3,1), (4,2), (5,1), (1,1)]

or preferably two output columns:

id:    [1,2,3,4,5,1]
count: [3,2,1,2,1,1]

I
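For the archive: on a single machine this is exactly what itertools.groupby gives; a minimal sketch producing both of the requested output shapes:

    from itertools import groupby

    data = [1, 1, 1, 2, 2, 3, 4, 4, 5, 1]

    # One (value, run-length) pair per maximal run of equal adjacent values.
    pairs = [(k, sum(1 for _ in g)) for k, g in groupby(data)]
    # pairs == [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]

    # Or as two parallel columns:
    ids = [k for k, _ in pairs]     # [1, 2, 3, 4, 5, 1]
    counts = [c for _, c in pairs]  # [3, 2, 1, 2, 1, 1]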