On Sun, Aug 17, 2014 at 11:07 PM, Andrew Ash <and...@andrewash.com> wrote:
> What happens when a run of numbers is spread across a partition boundary? I
> think you might end up with two adjacent groups of the same value in that
> situation.
Yes, you need another scan to combine contiguous groups that have the same
value across partition boundaries (a rough sketch of that merge pass is
appended below the quoted thread).

On Mon, Aug 18, 2014 at 2:05 AM, Davies Liu <dav...@databricks.com> wrote:
>>
>> >>> import itertools
>> >>> l = [1,1,1,2,2,3,4,4,5,1]
>> >>> gs = itertools.groupby(l)
>> >>> map(lambda (n, it): (n, sum(1 for _ in it)), gs)
>> [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]
>>
>> def groupCount(l):
>>     gs = itertools.groupby(l)
>>     return map(lambda (n, it): (n, sum(1 for _ in it)), gs)
>>
>> If you have an RDD, you can use RDD.mapPartitions(groupCount).collect()
>>
>> On Sun, Aug 17, 2014 at 10:34 PM, fil <f...@pobox.com> wrote:
>> > Can anyone assist with a scan of the following kind (Python preferred,
>> > but whatever..)? I'm looking for a kind of segmented fold count.
>> >
>> > Input: [1,1,1,2,2,3,4,4,5,1]
>> > Output: [(1,3), (2, 2), (3, 1), (4, 2), (5, 1), (1,1)]
>> > or preferably two output columns:
>> > id: [1,2,3,4,5,1]
>> > count: [3,2,1,2,1,1]
>> >
>> > I can use a groupby/count, except for the fact that I just want to scan -
>> > not re-sort. Ideally this would be as low-level as possible and perform
>> > in a simple single scan. It also needs to retain the original sort order.
>> >
>> > Thoughts?
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Segmented-fold-count-tp12278.html
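For completeness, here is a minimal sketch of that second pass, assuming a
plain PySpark RDD and that the per-partition run lists are small enough to
collect to the driver; the names rdd, runs and mergeRuns are illustrative
placeholders, not part of any Spark API:

    import itertools

    # Per-partition segmented count: one (value, count) pair per run of equal
    # values, written without lambda tuple unpacking so it also runs on Python 3.
    def groupCount(iterator):
        return [(key, sum(1 for _ in group))
                for key, group in itertools.groupby(iterator)]

    # Second scan: if the last run of one partition has the same value as the
    # first run of the next partition, fold them into a single run.
    def mergeRuns(runs):
        merged = []
        for key, count in runs:
            if merged and merged[-1][0] == key:
                merged[-1] = (key, merged[-1][1] + count)
            else:
                merged.append((key, count))
        return merged

    # Hypothetical usage (rdd is any RDD of values, e.g.
    # sc.parallelize([1,1,1,2,2,3,4,4,5,1], 3)):
    #
    #   runs = rdd.mapPartitions(groupCount).collect()   # runs in partition order
    #   result = mergeRuns(runs)
    #   ids    = [k for k, _ in result]                  # [1, 2, 3, 4, 5, 1]
    #   counts = [c for _, c in result]                  # [3, 2, 1, 2, 1, 1]
    #
    # Runs that straddle a partition boundary are combined by mergeRuns, so the
    # original order is preserved and no shuffle or re-sort is needed.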