By looking through the docs and source code, I think you can get away with rdd.zipWithIndex to get the index of each line in the file, as long as you define the parallelism upfront:

    sc.textFile("README.md", 4)
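In code, that first step might look roughly like this (a minimal sketch; the local-mode SparkConf, app name, and file name are placeholders, and in the spark-shell sc already exists):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("dedup-sketch").setMaster("local[4]"))

    // Read with an explicit minimum partition count, then attach a global
    // line index: zipWithIndex numbers elements by partition index first
    // and by position within each partition second.
    val indexed = sc.textFile("README.md", 4).zipWithIndex()   // RDD[(String, Long)]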
You can then just do .groupBy(…).mapValues(_.sortBy(…).head) on the indexed RDD - I'm glossing over some tuple handling here, but hopefully this is clear enough.

-adrian

From: Philip Weaver
Date: Tuesday, September 22, 2015 at 3:26 AM
To: user
Subject: Remove duplicate keys by always choosing first in file.

I am processing a single file and want to remove duplicate rows by some key by always choosing the first row in the file for that key.

The best solution I could come up with is to zip each row with the partition index and local index, like this:

    rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
      rows.zipWithIndex.map { case (row, localIndex) =>
        (row.key, ((partitionIndex, localIndex), row))
      }
    }

and then use reduceByKey with a min ordering on the (partitionIndex, localIndex) pair.

First, can I count on SparkContext.textFile to read the lines such that the partition indexes are always increasing, so that the above works? And is there a better way to accomplish the same effect?

Thanks!

- Philip
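For concreteness, here is a minimal sketch of the approach described above (assuming the same SparkContext sc as in the earlier sketch; extractKey and the input path are hypothetical stand-ins for row.key and the real file):

    // Hypothetical key extraction: treat the first comma-separated field
    // of each line as the key.
    def extractKey(row: String): String = row.split(",", 2)(0)

    val rdd = sc.textFile("input.csv", 4)   // placeholder input path

    // Tag each row with its (partitionIndex, localIndex); rows that appear
    // earlier in the file get a smaller pair, provided the partition indexes
    // follow file order - which is exactly the question raised above.
    val tagged = rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
      rows.zipWithIndex.map { case (row, localIndex) =>
        (extractKey(row), ((partitionIndex, localIndex), row))
      }
    }

    // For each key, keep the row with the smallest position pair (the first
    // occurrence in the file), then drop the index bookkeeping.
    val firstPerKey = tagged
      .reduceByKey((a, b) => if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b)
      .mapValues { case (_, row) => row }

    firstPerKey.values.take(10).foreach(println)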