By looking through the docs and source code, I think you can get away with
rdd.zipWithIndex to get the index of each line in the file, as long as you
define the parallelism upfront:
sc.textFile("README.md", 4)
You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I’m skimming
through some tuples, hopefully this is clear enough.
-adrian
From: Philip Weaver
Date: Tuesday, September 22, 2015 at 3:26 AM
To: user
Subject: Remove duplicate keys by always choosing first in file.
I am processing a single file and want to remove duplicate rows by some key by
always choosing the first row in the file for that key.
The best solution I could come up with is to zip each row with the partition
index and local index, like this:
rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
rows.zipWithIndex.map { case (row, localIndex) => (row.key, ((partitionIndex,
localIndex), row)) }
}
And then using reduceByKey with a min ordering on the (partitionIndex,
localIndex) pair.
First, can i count on SparkContext.textFile to read the lines in such that the
partition indexes are always increasing so that the above works?
And, is there a better way to accomplish the same effect?
Thanks!
- Philip