Hmm, I don't think that's what I want. There's no "zero value" in my use case.
On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen <so...@cloudera.com> wrote:
> I think foldByKey is much more what you want, as it has more of a notion
> of building up some result per key by encountering values serially.
> You would take the first and ignore the rest. Note that "first"
> depends on your RDD having an ordering to begin with, or else you rely
> on however it happens to be ordered after whatever operations give you
> a key-value RDD.
>
> On Tue, Sep 22, 2015 at 1:26 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
> > I am processing a single file and want to remove duplicate rows by some key,
> > always choosing the first row in the file for that key.
> >
> > The best solution I could come up with is to zip each row with the partition
> > index and local index, like this:
> >
> > rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
> >   rows.zipWithIndex.map { case (row, localIndex) =>
> >     (row.key, ((partitionIndex, localIndex), row))
> >   }
> > }
> >
> > And then using reduceByKey with a min ordering on the
> > (partitionIndex, localIndex) pair.
> >
> > First, can I count on SparkContext.textFile to read the lines such that
> > the partition indexes are always increasing, so that the above works?
> >
> > And is there a better way to accomplish the same effect?
> >
> > Thanks!
> >
> > - Philip
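
For reference, here is a minimal, self-contained sketch of the approach described in the quoted message. It is not a definitive answer to the ordering question raised there; it assumes lines of the form "key,value", uses a hypothetical extractKey helper in place of row.key, and assumes textFile yields rows in file order with increasing partition indexes.

    import org.apache.spark.{SparkConf, SparkContext}

    object FirstRowPerKey {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("first-row-per-key"))

        // Hypothetical input and key extraction; extractKey stands in for
        // whatever row.key means in the snippet above.
        val rdd = sc.textFile("input.txt")
        def extractKey(row: String): String = row.split(",")(0)

        // Tag each row with (partitionIndex, localIndex) so the pair reflects
        // its position in the file, assuming textFile preserves file order and
        // assigns increasing partition indexes.
        val indexed = rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
          rows.zipWithIndex.map { case (row, localIndex) =>
            (extractKey(row), ((partitionIndex, localIndex), row))
          }
        }

        // For each key, keep the row with the smallest (partitionIndex, localIndex),
        // i.e. the first occurrence in file order.
        val positionOrdering = Ordering[(Int, Int)]
        val firstPerKey = indexed
          .reduceByKey((a, b) => if (positionOrdering.lteq(a._1, b._1)) a else b)
          .mapValues(_._2)

        firstPerKey.collect().foreach(println)
        sc.stop()
      }
    }

One point this illustrates about the foldByKey suggestion: reduceByKey only needs an associative, commutative merge of two values, so no "zero value" is required, which seems to be the concern in the reply at the top of the thread.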