Remove duplicate keys by always choosing first in file.

Philip Weaver Mon, 21 Sep 2015 17:26:56 -0700

I am processing a single file and want to remove duplicate rows by some key
by always choosing the first row in the file for that key.


The best solution I could come up with is to zip each row with the
partition index and local index, like this:

rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
  rows.zipWithIndex.map { case (row, localIndex) =>
    (row.key, ((partitionIndex, localIndex), row))
  }
}


Then I use reduceByKey with a min ordering on the (partitionIndex,
localIndex) pair, so that the earliest row for each key is the one kept.
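
Put together, the whole approach might look roughly like the sketch below.
The helper name firstPerKey and its key parameter are hypothetical, chosen
only for illustration; they are not part of Spark. The sketch assumes
Spark 1.3 or later, where the pair-RDD implicits are picked up without an
extra import.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper (not a Spark API): for each key, keep the row with the
// smallest (partitionIndex, localIndex) position.
def firstPerKey[K: ClassTag, R: ClassTag](rdd: RDD[R])(key: R => K): RDD[R] = {
  // Tag every row with its partition index and its index within the partition.
  val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, rows) =>
    rows.zipWithIndex.map { case (row, localIndex) =>
      (key(row), ((partitionIndex, localIndex), row))
    }
  }

  tagged
    // Min by position: tuples compare on partition index first, then local index.
    .reduceByKey { (a, b) =>
      if (Ordering[(Int, Int)].lteq(a._1, b._1)) a else b
    }
    // Drop the key and the position, keeping only the surviving row.
    .map { case (_, (_, row)) => row }
}
```

For a text file this could be called as
firstPerKey(sc.textFile(path))(line => line.split(",")(0)), where path and
the comma-separated key are placeholders. Note that this only picks the
first row in the file if partition indexes follow file order, and that the
output RDD is no longer in the original row order.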

First, can I count on SparkContext.textFile to read the lines in such that
the partition indexes always increase with position in the file, so that
the above works?

And, is there a better way to accomplish the same effect?

Thanks!

- Philip
