By looking through the docs and source code, I think you can get away with 
rdd.zipWithIndex to get the index of each line in the file, as long as you 
define the parallelism upfront:
sc.textFile("README.md", 4)

You can then just do .groupBy(…).mapValues(_.sortBy(…).head). I'm glossing over some tuple handling here, but hopefully the idea is clear enough; a rough sketch is below.
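
Roughly, a minimal sketch of that approach. It assumes the dedup key is the first comma-separated field of each line (extractKey is just a stand-in for whatever the real key is), and it uses minBy on each group rather than sorting and taking the head:

import org.apache.spark.{SparkConf, SparkContext}

object FirstPerKeyZipWithIndex {
  // Assumption: the dedup key is the first comma-separated field of each line.
  def extractKey(line: String): String = line.split(",", 2)(0)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("first-per-key"))

    val firstPerKey = sc.textFile("README.md", 4)
      .zipWithIndex()                                  // (line, globalIndex); indices follow file order
      .groupBy { case (line, _) => extractKey(line) }  // gather every occurrence of each key
      .mapValues(_.minBy(_._2)._1)                     // keep the line with the smallest index, i.e. first in the file

    firstPerKey.collect().foreach(println)
    sc.stop()
  }
}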

-adrian

From: Philip Weaver
Date: Tuesday, September 22, 2015 at 3:26 AM
To: user
Subject: Remove duplicate keys by always choosing first in file.

I am processing a single file and want to remove duplicate rows by some key by 
always choosing the first row in the file for that key.

The best solution I could come up with is to zip each row with the partition 
index and local index, like this:

rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
  rows.zipWithIndex.map { case (row, localIndex) =>
    (row.key, ((partitionIndex, localIndex), row))
  }
}

Then I use reduceByKey with a min ordering on the (partitionIndex, localIndex) pair.
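
For concreteness, here is a rough, self-contained sketch of that whole pipeline. Row and its key/value fields are hypothetical stand-ins for whatever the parsed rows actually look like; the reduce keeps whichever copy has the smaller (partitionIndex, localIndex), i.e. the one that appears earlier in the file:

import org.apache.spark.rdd.RDD

object DedupFirstInFile {
  // Hypothetical row type; "key" stands in for whatever field identifies duplicates.
  case class Row(key: String, value: String)

  def firstPerKey(rdd: RDD[Row]): RDD[(String, Row)] = {
    // Tag each row with its (partitionIndex, localIndex) position in the file.
    val keyed = rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
      rows.zipWithIndex.map { case (row, localIndex) =>
        (row.key, ((partitionIndex, localIndex), row))
      }
    }

    keyed
      .reduceByKey { (a, b) =>
        // Keep the row with the smaller position, i.e. the one earlier in the file.
        if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b
      }
      .mapValues { case (_, row) => row }
  }
}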

First, can I count on SparkContext.textFile to read the lines in such a way that the partition indexes are always increasing, so that the above works?

And, is there a better way to accomplish the same effect?

Thanks!

- Philip
