The zero value here is None. Combining None with a row yields Some(row); after that, combining with any further row is a no-op, so the first value encountered per key wins.
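For example, a minimal self-contained sketch (the input data is made up, and the values are wrapped in Option first because foldByKey's zero value must have the same type as the values):

import org.apache.spark.{SparkConf, SparkContext}

object KeepFirstByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("keep-first").setMaster("local[*]"))

    // Made-up (key, value) pairs; duplicates share the key "a".
    val rdd = sc.parallelize(Seq(("a", "row1"), ("a", "row2"), ("b", "row3")))

    val deduped = rdd
      .mapValues(Option(_))              // wrap values so the zero value (None) type-checks
      .foldByKey(Option.empty[String]) { // zero value: None
        (acc, next) => acc.orElse(next)  // first Some wins; combining afterwards is a no-op
      }
      .flatMapValues(_.toSeq)            // unwrap; every surviving value is a Some

    deduped.collect().foreach(println)
    sc.stop()
  }
}

As the earlier message quoted below notes, which value ends up "first" still depends on the RDD's ordering.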
On Tue, Sep 22, 2015 at 4:27 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
> Hmm, I don't think that's what I want. There's no "zero value" in my use
> case.
>
> On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think foldByKey is much more what you want, as it has more a notion
>> of building up some result per key by encountering values serially.
>> You would take the first and ignore the rest. Note that "first"
>> depends on your RDD having an ordering to begin with, or else you rely
>> on however it happens to be ordered after whatever operations give you
>> a key-value RDD.
>>
>> On Tue, Sep 22, 2015 at 1:26 AM, Philip Weaver <philip.wea...@gmail.com>
>> wrote:
>> > I am processing a single file and want to remove duplicate rows by
>> > some key, always choosing the first row in the file for that key.
>> >
>> > The best solution I could come up with is to zip each row with the
>> > partition index and local index, like this:
>> >
>> > rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
>> >   rows.zipWithIndex.map { case (row, localIndex) =>
>> >     (row.key, ((partitionIndex, localIndex), row))
>> >   }
>> > }
>> >
>> > and then using reduceByKey with a min ordering on the
>> > (partitionIndex, localIndex) pair.
>> >
>> > First, can I count on SparkContext.textFile to read the lines in such
>> > a way that the partition indexes are always increasing, so that the
>> > above works?
>> >
>> > And is there a better way to accomplish the same effect?
>> >
>> > Thanks!
>> >
>> > - Philip
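For readers of the archive, here is a self-contained sketch of the zip-with-index approach from the quoted message. The Row type and sample data are made-up placeholders, and whether textFile's partition indexes always follow file order is the open question above:

import org.apache.spark.{SparkConf, SparkContext}

object FirstRowPerKey {
  // Hypothetical row type standing in for the real schema.
  case class Row(key: String, payload: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("first-row").setMaster("local[*]"))

    val rdd = sc.parallelize(
      Seq(Row("a", "first"), Row("a", "second"), Row("b", "only")))

    val firstPerKey = rdd
      .mapPartitionsWithIndex { case (partitionIndex, rows) =>
        // tag each row with its (partition, position-in-partition) index
        rows.zipWithIndex.map { case (row, localIndex) =>
          (row.key, ((partitionIndex, localIndex), row))
        }
      }
      // keep the row with the smallest (partitionIndex, localIndex) per key
      .reduceByKey { (a, b) => if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b }
      .mapValues(_._2) // drop the index, keeping only the row

    firstPerKey.collect().foreach(println)
    sc.stop()
  }
}

One reason to prefer reduceByKey over groupByKey here: it combines map-side, so only one (index, row) pair per key per partition is shuffled.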