The zero value here is None. Combining None with a row yields Some(row); after that, combining with any further row is a no-op, so the first value encountered per key wins.
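For example, a minimal self-contained sketch (the input data is made up, and the values are wrapped in Option first because foldByKey's zero value must have the same type as the values):

import org.apache.spark.{SparkConf, SparkContext}

object KeepFirstByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("keep-first").setMaster("local[*]"))

    // Made-up (key, value) pairs; duplicates share the key "a".
    val rdd = sc.parallelize(Seq(("a", "row1"), ("a", "row2"), ("b", "row3")))

    val deduped = rdd
      .mapValues(Option(_))              // wrap values so the zero value (None) type-checks
      .foldByKey(Option.empty[String]) { // zero value: None
        (acc, next) => acc.orElse(next)  // first Some wins; combining afterwards is a no-op
      }
      .flatMapValues(_.toSeq)            // unwrap; every surviving value is a Some

    deduped.collect().foreach(println)
    sc.stop()
  }
}

As the earlier message quoted below notes, which value ends up "first" still depends on the RDD's ordering.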
On Tue, Sep 22, 2015 at 4:27 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
> Hmm, I don't think that's what I want. There's no "zero value" in my use
> case.
>
> On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think foldByKey is much more what you want, as it has more a notion
>> of building up some result per key by encountering values serially.
>> You would take the first and ignore the rest. Note that "first"
>> depends on your RDD having an ordering to begin with, or else you rely
>> on however it happens to be ordered after whatever operations give you
>> a key-value RDD.
>>
>> On Tue, Sep 22, 2015 at 1:26 AM, Philip Weaver <philip.wea...@gmail.com>
>> wrote:
>> > I am processing a single file and want to remove duplicate rows by
>> > some key, always choosing the first row in the file for that key.
>> >
>> > The best solution I could come up with is to zip each row with the
>> > partition index and local index, like this:
>> >
>> > rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
>> >   rows.zipWithIndex.map { case (row, localIndex) =>
>> >     (row.key, ((partitionIndex, localIndex), row))
>> >   }
>> > }
>> >
>> > and then using reduceByKey with a min ordering on the
>> > (partitionIndex, localIndex) pair.
>> >
>> > First, can I count on SparkContext.textFile to read the lines in such
>> > a way that the partition indexes are always increasing, so that the
>> > above works?
>> >
>> > And is there a better way to accomplish the same effect?
>> >
>> > Thanks!
>> >
>> > - Philip
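For readers of the archive, here is a self-contained sketch of the zip-with-index approach from the quoted message. The Row type and sample data are made-up placeholders, and whether textFile's partition indexes always follow file order is the open question above:

import org.apache.spark.{SparkConf, SparkContext}

object FirstRowPerKey {
  // Hypothetical row type standing in for the real schema.
  case class Row(key: String, payload: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("first-row").setMaster("local[*]"))

    val rdd = sc.parallelize(
      Seq(Row("a", "first"), Row("a", "second"), Row("b", "only")))

    val firstPerKey = rdd
      .mapPartitionsWithIndex { case (partitionIndex, rows) =>
        // tag each row with its (partition, position-in-partition) index
        rows.zipWithIndex.map { case (row, localIndex) =>
          (row.key, ((partitionIndex, localIndex), row))
        }
      }
      // keep the row with the smallest (partitionIndex, localIndex) per key
      .reduceByKey { (a, b) => if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b }
      .mapValues(_._2) // drop the index, keeping only the row

    firstPerKey.collect().foreach(println)
    sc.stop()
  }
}

One reason to prefer reduceByKey over groupByKey here: it combines map-side, so only one (index, row) pair per key per partition is shuffled.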