Thanks. If textFile can be used in a way that preserves order, then both the partition index and the index within each partition should be consistent, right?
I overcomplicated the question by asking about removing duplicates. Fundamentally I think my question is: how does one sort lines in a file by line number?

On Tue, Sep 22, 2015 at 6:15 AM, Adrian Tanase <atan...@adobe.com> wrote:

> By looking through the docs and source code, I think you can get away with
> rdd.zipWithIndex to get the index of each line in the file, as long as
> you define the parallelism upfront:
>
> sc.textFile("README.md", 4)
>
> You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I'm
> glossing over some tuples, hopefully this is clear enough.
>
> -adrian
>
> From: Philip Weaver
> Date: Tuesday, September 22, 2015 at 3:26 AM
> To: user
> Subject: Remove duplicate keys by always choosing first in file.
>
> I am processing a single file and want to remove duplicate rows by some
> key, by always choosing the first row in the file for that key.
>
> The best solution I could come up with is to zip each row with the
> partition index and local index, like this:
>
> rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
>   rows.zipWithIndex.map { case (row, localIndex) =>
>     (row.key, ((partitionIndex, localIndex), row))
>   }
> }
>
> And then using reduceByKey with a min ordering on the (partitionIndex,
> localIndex) pair.
>
> First, can I count on SparkContext.textFile to read the lines such that
> the partition indexes are always increasing, so that the above works?
>
> And, is there a better way to accomplish the same effect?
>
> Thanks!
>
> - Philip
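For reference, here is a minimal sketch of the zipWithIndex approach under discussion. The file path, app name, and the key extraction (first comma-separated field of each line) are placeholder assumptions; adjust them for your actual data:

import org.apache.spark.{SparkConf, SparkContext}

object DedupFirstByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedup-first-by-key"))

    // zipWithIndex numbers elements by partition index first, then by
    // position within each partition, so the resulting Long matches each
    // line's position in the file. (It runs one extra Spark job to count
    // the elements per partition.)
    val lines = sc.textFile("README.md", 4).zipWithIndex()

    // Placeholder key extraction: treat the first comma-separated field
    // as the dedup key. Replace with whatever defines the key for your rows.
    val byKey = lines.map { case (line, idx) => (line.split(",", 2)(0), (idx, line)) }

    // For each key, keep the row with the smallest line number, i.e. the
    // first occurrence in the file.
    val firstPerKey = byKey.reduceByKey((a, b) => if (a._1 <= b._1) a else b)

    // Optionally restore original file order among the surviving rows.
    val inFileOrder = firstPerKey.values.sortByKey().values

    inFileOrder.collect().foreach(println)
    sc.stop()
  }
}

This is equivalent to the manual (partitionIndex, localIndex) pair in the quoted message, since zipWithIndex already orders by partition and then by position within the partition; reduceByKey avoids materializing whole groups the way groupBy would.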