Thanks for the reply!

> def groupCount(l):
>    gs = itertools.groupby(l)
>    return map(lambda (n, it): (n, sum(1 for _ in it)), gs)
>
> If you have an RDD, you can use RDD.mapPartitions(groupCount).collect()
>

Yes, I am interested in RDDs - not pure Python :)

I am new to Spark, so can you explain a couple of things:

- Python functions like groupCount: are these reflected from their Python
AST and converted into a Spark DAG? Presumably if I try to do something
non-convertible, this transformation process will throw an error? In other
words, does this end up running in the JVM?
- I had understood that "partitions" were batches of distributable work,
and generally large. Presumably the approach above is OK with small groups
(e.g. average size < 10) and won't kill performance? (A rough sketch of
what I am actually running is included below.)
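
To make sure I am trying the right thing, here is roughly how I am running
the suggestion - just a sketch on my side, adapted to Python 3 syntax (the
tuple-unpacking lambda above is Python 2 only), with the itertools import
added, and with made-up toy data forced into two partitions so the boundary
effect shows up:

    import itertools

    from pyspark import SparkContext

    def groupCount(partition):
        # count runs of equal consecutive values within a single partition
        gs = itertools.groupby(partition)
        return [(value, sum(1 for _ in run)) for value, run in gs]

    sc = SparkContext(appName="segmented-fold-count")

    # toy data in 2 partitions, so the run of 2s straddles the boundary
    rdd = sc.parallelize([1, 1, 2, 2, 2, 3, 1, 1], numSlices=2)

    print(rdd.mapPartitions(groupCount).collect())
    # I would expect something like:
    #   [(1, 2), (2, 2), (2, 1), (3, 1), (1, 2)]
    # i.e. the run of three 2s comes back as (2, 2) and (2, 1) because it
    # is split across the two partitions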

On Mon, Aug 18, 2014 at 4:20 PM, Davies Liu <dav...@databricks.com> wrote:

> On Sun, Aug 17, 2014 at 11:07 PM, Andrew Ash <and...@andrewash.com> wrote:
> > What happens when a run of numbers is spread across a partition
> boundary?  I
> > think you might end up with two adjacent groups of the same value in that
> > situation.
>
> Yes, need another scan to combine this continuous groups with same value.
>

Yep - this will happen frequently. So by this do you mean scanning the
results coming back from mapPartitions()? Presumably I could eliminate
adjacent duplicates - or specifically look for duplicates at the end/start
of the different "batches" (what is the Spark term for these?) coming back
from different nodes in the cluster. What's the Spark'iest way to do this
efficiently? :)
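
Just to illustrate what I mean by scanning afterwards, my naive version
would be something like the below - merging adjacent equal-valued groups in
the driver after collect(), which I suspect is not the Spark'iest way
(mergeAdjacent is just a name I made up):

    def mergeAdjacent(counts):
        # collapse adjacent (value, count) pairs that share the same value,
        # i.e. the duplicates created at partition boundaries
        merged = []
        for value, count in counts:
            if merged and merged[-1][0] == value:
                merged[-1] = (value, merged[-1][1] + count)
            else:
                merged.append((value, count))
        return merged

    # e.g. fed with the collected output from the sketch above:
    counts = [(1, 2), (2, 2), (2, 1), (3, 1), (1, 2)]
    print(mergeAdjacent(counts))   # [(1, 2), (2, 3), (3, 1), (1, 2)]

This obviously does the combining on the driver, which is why I am asking
whether there is a more distributed way to do that last step.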

Regards, Fil.



