Thanks for the reply!

> def groupCount(l):
>     gs = itertools.groupby(l)
>     return map(lambda (n, it): (n, sum(1 for _ in it)), gs)
>
> If you have an RDD, you can use RDD.mapPartitions(groupCount).collect()
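For anyone reading along: the tuple-unpacking lambda in the quoted snippet is Python 2 only, so here is a minimal standalone sketch of the same run-length count in Python 3 syntax (the sample list is just made-up data, and the snake_case name is mine):

    import itertools

    def group_count(seq):
        # Collapse runs of equal values into (value, run_length) pairs.
        return [(value, sum(1 for _ in run)) for value, run in itertools.groupby(seq)]

    print(group_count([0, 0, 0, 1, 1, 2, 2, 2, 2]))
    # -> [(0, 3), (1, 2), (2, 4)]

Per the suggestion above, rdd.mapPartitions(group_count) would then apply this to each partition independently.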
Yes, I am interested in RDDs - not pure Python :)

I am new to Spark, so can you explain:

- Python functions like groupCount: are these reflected from their Python AST and converted into a Spark DAG? Presumably if I try to do something non-convertible, this transformation process will throw an error? In other words, does this end up running in the JVM?
- I had assumed that "partitions" were batches of distributable work, and generally large. Presumably the above is OK with small groups (e.g. average size < 10) - this won't kill performance?

On Mon, Aug 18, 2014 at 4:20 PM, Davies Liu <dav...@databricks.com> wrote:
> On Sun, Aug 17, 2014 at 11:07 PM, Andrew Ash <and...@andrewash.com> wrote:
> > What happens when a run of numbers is spread across a partition boundary?
> > I think you might end up with two adjacent groups of the same value in
> > that situation.
>
> Yes, we need another scan to combine these continuous groups with the same value.

Yep - this will happen frequently. So by this you mean scanning the resulting mapPartitions() results? Presumably I could eliminate adjacent duplicates - or specifically look for duplicates at the end/start of the different "batches" (what is the Spark term for this?) coming from different nodes in the cluster. What's the Spark'iest way to do this efficiently? :)

Regards,
Fil.
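As a footnote to the boundary question above: a minimal sketch of the "extra scan" Davies mentions, assuming the per-partition (value, count) pairs are small enough to collect to the driver; merge_adjacent and the surrounding names are illustrative, not an established Spark idiom:

    def merge_adjacent(pairs):
        # Merge neighbouring (value, count) pairs that share the same value,
        # i.e. the case where one run of values straddles a partition boundary.
        merged = []
        for value, count in pairs:
            if merged and merged[-1][0] == value:
                merged[-1] = (value, merged[-1][1] + count)
            else:
                merged.append((value, count))
        return merged

    # Sketch of the full pipeline (assumes an existing RDD of ordered values and
    # the group_count helper from earlier in the thread):
    # per_partition = rdd.mapPartitions(group_count).collect()
    # segment_counts = merge_adjacent(per_partition)

This keeps the heavy per-partition counting distributed and only does the cheap adjacent-duplicate merge on the already-reduced (value, count) list.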