On Sun, Aug 17, 2014 at 11:07 PM, Andrew Ash <and...@andrewash.com> wrote:
> What happens when a run of numbers is spread across a partition boundary? I
> think you might end up with two adjacent groups of the same value in that
> situation.
Yes, you need another scan to combine contiguous groups that have the same
value across partition boundaries (a rough sketch of that merge pass is
appended below the quoted thread).

On Mon, Aug 18, 2014 at 2:05 AM, Davies Liu <dav...@databricks.com> wrote:
>>
>> >>> import itertools
>> >>> l = [1,1,1,2,2,3,4,4,5,1]
>> >>> gs = itertools.groupby(l)
>> >>> map(lambda (n, it): (n, sum(1 for _ in it)), gs)
>> [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]
>>
>> def groupCount(l):
>>     gs = itertools.groupby(l)
>>     return map(lambda (n, it): (n, sum(1 for _ in it)), gs)
>>
>> If you have an RDD, you can use RDD.mapPartitions(groupCount).collect()
>>
>> On Sun, Aug 17, 2014 at 10:34 PM, fil <f...@pobox.com> wrote:
>> > Can anyone assist with a scan of the following kind (Python preferred,
>> > but whatever..)? I'm looking for a kind of segmented fold count.
>> >
>> > Input: [1,1,1,2,2,3,4,4,5,1]
>> > Output: [(1,3), (2, 2), (3, 1), (4, 2), (5, 1), (1,1)]
>> > or preferably two output columns:
>> > id: [1,2,3,4,5,1]
>> > count: [3,2,1,2,1,1]
>> >
>> > I can use a groupby/count, except for the fact that I just want to scan -
>> > not re-sort. Ideally this would be as low-level as possible and perform
>> > in a simple single scan. It also needs to retain the original sort order.
>> >
>> > Thoughts?
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Segmented-fold-count-tp12278.html
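For completeness, here is a minimal sketch of that second pass, assuming a
plain PySpark RDD and that the per-partition run lists are small enough to
collect to the driver; the names rdd, runs and mergeRuns are illustrative
placeholders, not part of any Spark API:

    import itertools

    # Per-partition segmented count: one (value, count) pair per run of equal
    # values, written without lambda tuple unpacking so it also runs on Python 3.
    def groupCount(iterator):
        return [(key, sum(1 for _ in group))
                for key, group in itertools.groupby(iterator)]

    # Second scan: if the last run of one partition has the same value as the
    # first run of the next partition, fold them into a single run.
    def mergeRuns(runs):
        merged = []
        for key, count in runs:
            if merged and merged[-1][0] == key:
                merged[-1] = (key, merged[-1][1] + count)
            else:
                merged.append((key, count))
        return merged

    # Hypothetical usage (rdd is any RDD of values, e.g.
    # sc.parallelize([1,1,1,2,2,3,4,4,5,1], 3)):
    #
    #   runs = rdd.mapPartitions(groupCount).collect()   # runs in partition order
    #   result = mergeRuns(runs)
    #   ids    = [k for k, _ in result]                  # [1, 2, 3, 4, 5, 1]
    #   counts = [c for _, c in result]                  # [3, 2, 1, 2, 1, 1]
    #
    # Runs that straddle a partition boundary are combined by mergeRuns, so the
    # original order is preserved and no shuffle or re-sort is needed.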