Re: Remove duplicate keys by always choosing first in file.

Philip Weaver Tue, 22 Sep 2015 09:49:57 -0700

The indices are definitely necessary. My first solution was just
reduceByKey { case (v, _) => v } and that didn't work. I needed to look at
both values and see which had the lower index.


On Tue, Sep 22, 2015 at 8:54 AM, Sean Owen <[email protected]> wrote:

> The point is that this only works if you already knew the file was
> presented in order within and across partitions, which was the
> original problem anyway. I don't think it is in general, but in
> practice, I do imagine it's already in the expected order from
> textFile. Maybe under the hood this ends up being ensured by
> TextInputFormat.
>
> So, adding the index and sorting on it doesn't add anything.
>
> On Tue, Sep 22, 2015 at 4:38 PM, Adrian Tanase <[email protected]> wrote:
> > just give zipWithIndex a shot, use it early in the pipeline. I think it
> > provides exactly the info you need, as the index is the original line
> number
> > in the file, not the index in the partition.
> >
> > Sent from my iPhone
> >
> > On 22 Sep 2015, at 17:50, Philip Weaver <[email protected]> wrote:
> >
> > Thanks. If textFile can be used in a way that preserves order, than both
> the
> > partition index and the index within each partition should be consistent,
> > right?
> >
> > I overcomplicated the question by asking about removing duplicates.
> > Fundamentally I think my question is, how does one sort lines in a file
> by
> > line number.
> >
> > On Tue, Sep 22, 2015 at 6:15 AM, Adrian Tanase <[email protected]>
> wrote:
> >>
> >> By looking through the docs and source code, I think you can get away
> with
> >> rdd.zipWithIndex to get the index of each line in the file, as long as
> you
> >> define the parallelism upfront:
> >> sc.textFile("README.md", 4)
> >>
> >> You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I’m
> >> skimming through some tuples, hopefully this is clear enough.
> >>
> >> -adrian
> >>
> >> From: Philip Weaver
> >> Date: Tuesday, September 22, 2015 at 3:26 AM
> >> To: user
> >> Subject: Remove duplicate keys by always choosing first in file.
> >>
> >> I am processing a single file and want to remove duplicate rows by some
> >> key by always choosing the first row in the file for that key.
> >>
> >> The best solution I could come up with is to zip each row with the
> >> partition index and local index, like this:
> >>
> >> rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
> >>   rows.zipWithIndex.map { case (row, localIndex) => (row.key,
> >> ((partitionIndex, localIndex), row)) }
> >> }
> >>
> >>
> >> And then using reduceByKey with a min ordering on the (partitionIndex,
> >> localIndex) pair.
> >>
> >> First, can i count on SparkContext.textFile to read the lines in such
> that
> >> the partition indexes are always increasing so that the above works?
> >>
> >> And, is there a better way to accomplish the same effect?
> >>
> >> Thanks!
> >>
> >> - Philip
> >>
> >
>

Re: Remove duplicate keys by always choosing first in file.

Reply via email to