Thanks. If textFile can be used in a way that preserves order, then both
the partition index and the index within each partition should be
consistent, right?

I overcomplicated the question by asking about removing duplicates.
Fundamentally, I think my question is: how does one sort the lines in a
file by line number?
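
Concretely, I'm imagining something like this (untested sketch; sc is an
existing SparkContext, and it assumes the order-preservation above holds
so that zipWithIndex assigns indexes in file order):

val lines = sc.textFile("README.md")
val numbered = lines.zipWithIndex()      // (line, global line number)
val sorted = numbered
  .sortBy { case (_, index) => index }   // sort back by line number
  .map { case (line, _) => line }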

On Tue, Sep 22, 2015 at 6:15 AM, Adrian Tanase <atan...@adobe.com> wrote:

> By looking through the docs and source code, I think you can get away with
> rdd.zipWithIndex to get the index of each line in the file, as long as
> you define the parallelism upfront:
> sc.textFile("README.md", 4)
>
> You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I’m
> glossing over some of the tuple plumbing, hopefully this is clear enough.
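>
> Spelled out with the tuples, a rough (untested) sketch, where keyOf
> stands in for whatever key extraction you need:
>
> val withIndex = sc.textFile("README.md", 4).zipWithIndex() // (line, index)
> val firstPerKey = withIndex
>   .groupBy { case (line, _) => keyOf(line) }               // group lines by key
>   .mapValues(_.minBy { case (_, index) => index })         // earliest occurrence per key
>   .map { case (_, (line, _)) => line }
>
> (minBy plays the role of the sortBy(…).head above.)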
>
> -adrian
>
> From: Philip Weaver
> Date: Tuesday, September 22, 2015 at 3:26 AM
> To: user
> Subject: Remove duplicate keys by always choosing first in file.
>
> I am processing a single file and want to remove duplicate rows by some
> key by always choosing the first row in the file for that key.
>
> The best solution I could come up with is to zip each row with the
> partition index and local index, like this:
>
> rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
>   rows.zipWithIndex.map { case (row, localIndex) =>
>     (row.key, ((partitionIndex, localIndex), row))
>   }
> }
>
>
> And then I use reduceByKey with a min ordering on the (partitionIndex,
> localIndex) pair.
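>
> Concretely, that step might look like this (untested; keyed is the RDD
> produced by the snippet above):
>
> val deduped = keyed
>   .reduceByKey { (a, b) =>
>     // keep the value with the smaller (partitionIndex, localIndex)
>     if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b
>   }
>   .mapValues { case (_, row) => row }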
>
> First, can I count on SparkContext.textFile to read the lines in such a
> way that the partition indexes are always increasing, so that the above works?
>
> And, is there a better way to accomplish the same effect?
>
> Thanks!
>
> - Philip
>
>
