I think foldByKey is much closer to what you want, since it has more of a
notion of building up a result per key by encountering values serially.
You would keep the first value and ignore the rest. Note that "first" only
means something if your RDD has an ordering to begin with; otherwise you
are relying on however it happens to be ordered after whatever operations
produced the key-value RDD.
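
For example, here is a minimal sketch of that idea, assuming string values
(the input pairs below are made up for illustration): wrap the values in
Option so there is a neutral zero element, then keep the first non-empty
value seen for each key.

import org.apache.spark.{SparkConf, SparkContext}

object FirstValuePerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("first-value-per-key").setMaster("local[*]"))

    // Hypothetical (key, value) pairs standing in for the real data.
    val pairs = sc.parallelize(Seq(("a", "row1"), ("b", "row2"), ("a", "row3")))

    val firstPerKey = pairs
      .mapValues(Option(_))                  // None is the neutral "zero" element
      .foldByKey(Option.empty[String]) {     // keep the first non-empty value per key
        (acc, v) => acc.orElse(v)
      }
      .mapValues(_.get)                      // safe: every key has at least one value

    firstPerKey.collect().foreach(println)
    sc.stop()
  }
}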

On Tue, Sep 22, 2015 at 1:26 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
> I am processing a single file and want to remove duplicate rows by some key
> by always choosing the first row in the file for that key.
>
> The best solution I could come up with is to zip each row with the partition
> index and local index, like this:
>
> rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
>   rows.zipWithIndex.map { case (row, localIndex) =>
>     (row.key, ((partitionIndex, localIndex), row))
>   }
> }
>
>
> And then using reduceByKey with a min ordering on the (partitionIndex,
> localIndex) pair.
>
> First, can I count on SparkContext.textFile to read the lines such that
> the partition indexes are always increasing, so that the above works?
>
> And, is there a better way to accomplish the same effect?
>
> Thanks!
>
> - Philip
>
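
For reference, here is a minimal, self-contained sketch of the approach
described in the quoted message: tag each row with (partitionIndex,
localIndex), then reduceByKey keeping the row whose tag is smallest. The
Record type and its key field are hypothetical stand-ins for the real row
structure.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical row type; the real rows' structure isn't shown in the thread.
case class Record(key: String, value: String)

object FirstRowPerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("first-row-per-key").setMaster("local[*]"))

    val rdd = sc.parallelize(
      Seq(Record("a", "first"), Record("b", "x"), Record("a", "second")))

    // Tag every row with (partitionIndex, localIndex) so duplicates can be
    // resolved in favor of the earliest occurrence.
    val tagged = rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
      rows.zipWithIndex.map { case (row, localIndex) =>
        (row.key, ((partitionIndex, localIndex), row))
      }
    }

    // Keep the row whose (partitionIndex, localIndex) tag is smallest.
    val firstPerKey = tagged
      .reduceByKey((a, b) => if (Ordering[(Int, Int)].lteq(a._1, b._1)) a else b)
      .mapValues(_._2)

    firstPerKey.collect().foreach(println)
    sc.stop()
  }
}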
