Hmm, I don't think that's what I want. There's no "zero value" in my use case.
On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen <so...@cloudera.com> wrote:
> I think foldByKey is much more what you want, as it has more of a notion
> of building up some result per key by encountering values serially.
> You would take the first and ignore the rest. Note that "first"
> depends on your RDD having an ordering to begin with, or else you rely
> on however it happens to be ordered after whatever operations give you
> a key-value RDD.
>
> On Tue, Sep 22, 2015 at 1:26 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
> > I am processing a single file and want to remove duplicate rows by some key,
> > always choosing the first row in the file for that key.
> >
> > The best solution I could come up with is to zip each row with the partition
> > index and local index, like this:
> >
> > rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
> >   rows.zipWithIndex.map { case (row, localIndex) =>
> >     (row.key, ((partitionIndex, localIndex), row))
> >   }
> > }
> >
> > And then using reduceByKey with a min ordering on the
> > (partitionIndex, localIndex) pair.
> >
> > First, can I count on SparkContext.textFile to read the lines such that
> > the partition indexes are always increasing, so that the above works?
> >
> > And is there a better way to accomplish the same effect?
> >
> > Thanks!
> >
> > - Philip
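
For reference, here is a minimal, self-contained sketch of the approach described in the quoted message. It is not a definitive answer to the ordering question raised there; it assumes lines of the form "key,value", uses a hypothetical extractKey helper in place of row.key, and assumes textFile yields rows in file order with increasing partition indexes.

    import org.apache.spark.{SparkConf, SparkContext}

    object FirstRowPerKey {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("first-row-per-key"))

        // Hypothetical input and key extraction; extractKey stands in for
        // whatever row.key means in the snippet above.
        val rdd = sc.textFile("input.txt")
        def extractKey(row: String): String = row.split(",")(0)

        // Tag each row with (partitionIndex, localIndex) so the pair reflects
        // its position in the file, assuming textFile preserves file order and
        // assigns increasing partition indexes.
        val indexed = rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
          rows.zipWithIndex.map { case (row, localIndex) =>
            (extractKey(row), ((partitionIndex, localIndex), row))
          }
        }

        // For each key, keep the row with the smallest (partitionIndex, localIndex),
        // i.e. the first occurrence in file order.
        val positionOrdering = Ordering[(Int, Int)]
        val firstPerKey = indexed
          .reduceByKey((a, b) => if (positionOrdering.lteq(a._1, b._1)) a else b)
          .mapValues(_._2)

        firstPerKey.collect().foreach(println)
        sc.stop()
      }
    }

One point this illustrates about the foldByKey suggestion: reduceByKey only needs an associative, commutative merge of two values, so no "zero value" is required, which seems to be the concern in the reply at the top of the thread.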