I am processing a single file and want to remove duplicate rows by some key
by always choosing the first row in the file for that key.
The best solution I could come up with is to zip each row with the
partition index and local index, like this:
    rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
      rows.zipWithIndex.map { case (row, localIndex) =>
        (row.key, ((partitionIndex, localIndex), row))
      }
    }
I then use reduceByKey with a min ordering on the (partitionIndex,
localIndex) pair, so that the earliest row for each key is the one kept.
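For concreteness, here is a minimal sketch of the whole approach, including
the reduceByKey step. The Row case class, the sample data, and the local
master setting are assumptions made up for illustration; my real rows come
from parsing the file. Note that sc.parallelize stands in for textFile here,
so the sketch says nothing about how textFile assigns partition indexes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// On Spark versions before 1.3, reduceByKey also needs:
// import org.apache.spark.SparkContext._

// Hypothetical row type, for illustration only.
case class Row(key: String, value: String)

object DedupFirstByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dedup-first-by-key").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the parsed file contents (assumed sample data).
      val rdd = sc.parallelize(
        Seq(Row("a", "1"), Row("b", "2"), Row("a", "3"), Row("b", "4")), 2)

      // Tag every row with its (partitionIndex, localIndex) position.
      val tagged = rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
        rows.zipWithIndex.map { case (row, localIndex) =>
          (row.key, ((partitionIndex, localIndex), row))
        }
      }

      // For each key, keep the entry whose position pair is smallest.
      val firstPerKey = tagged
        .reduceByKey { (a, b) =>
          if (Ordering[(Int, Int)].lteq(a._1, b._1)) a else b
        }
        .map { case (_, (_, row)) => row }

      firstPerKey.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

This only picks the first row in the file if a smaller partition index
always means an earlier position in the file, which is exactly what I am
unsure about.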
First, can I count on SparkContext.textFile to read the lines such that
the partition indexes always increase with position in the file, so that
the above works?
Second, is there a better way to accomplish the same effect?
Thanks!
- Philip