Re: Remove duplicate keys by always choosing first in file.

Philip Weaver Tue, 22 Sep 2015 08:43:39 -0700

I have used the mapPartitionsWithIndex/zipWithIndex solution and so far it
has done the correct thing.
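
For reference, here is a minimal sketch of the zipWithIndex variant discussed below. It is illustrative only, not the exact code used in this thread. The names FirstPerKey, firstPerKey and keyOf are hypothetical, and it assumes the RDD's partition order matches the line order of the file (the thread's working assumption for textFile).

```scala
import org.apache.spark.rdd.RDD

object FirstPerKey {
  // Hypothetical helper: for each key, keep the line with the smallest
  // global index, i.e. the one that appears first in RDD order.
  // `keyOf` is an assumed key extractor supplied by the caller.
  def firstPerKey(lines: RDD[String])(keyOf: String => String): RDD[String] =
    lines
      .zipWithIndex()                                     // (line, index in RDD order)
      .map { case (line, idx) => (keyOf(line), (idx, line)) }
      .reduceByKey((a, b) => if (a._1 <= b._1) a else b)  // keep the smaller index
      .values
      .sortBy(_._1)                                       // restore original line order
      .map(_._2)
}
```

It could be called as FirstPerKey.firstPerKey(sc.textFile("data.csv"))(_.split(",")(0)), where "data.csv" and the comma-separated key are placeholders. reduceByKey combines values within each partition before the shuffle, so it avoids collecting every row for a key in one place the way groupBy does. Note that zipWithIndex itself runs a Spark job when the RDD has more than one partition.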


On Tue, Sep 22, 2015 at 8:38 AM, Adrian Tanase <atan...@adobe.com> wrote:

> Just give zipWithIndex a shot, and use it early in the pipeline. I think it
> provides exactly the info you need, as the index is the original line
> number in the file, not the index within the partition.
>
> On 22 Sep 2015, at 17:50, Philip Weaver <philip.wea...@gmail.com> wrote:
>
> Thanks. If textFile can be used in a way that preserves order, then both
> the partition index and the index within each partition should be
> consistent, right?
>
> I overcomplicated the question by asking about removing duplicates.
> Fundamentally, I think my question is: how does one sort lines in a file by
> line number?
>
> On Tue, Sep 22, 2015 at 6:15 AM, Adrian Tanase <atan...@adobe.com> wrote:
>
>> By looking through the docs and source code, I think you can get away
>> with rdd.zipWithIndex to get the index of each line in the file, as long
>> as you define the parallelism upfront:
>> sc.textFile("README.md", 4)
>>
>> You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I’m
>> glossing over some of the tuple handling, but hopefully this is clear enough.
>>
>> -adrian
>>
>> From: Philip Weaver
>> Date: Tuesday, September 22, 2015 at 3:26 AM
>> To: user
>> Subject: Remove duplicate keys by always choosing first in file.
>>
>> I am processing a single file and want to remove duplicate rows by some
>> key by always choosing the first row in the file for that key.
>>
>> The best solution I could come up with is to zip each row with the
>> partition index and local index, like this:
>>
>> rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
>>   rows.zipWithIndex.map { case (row, localIndex) =>
>>     (row.key, ((partitionIndex, localIndex), row))
>>   }
>> }
>>
>>
>> And then using reduceByKey with a min ordering on the (partitionIndex,
>> localIndex) pair.
>>
>> First, can I count on SparkContext.textFile to read the lines in such a way
>> that the partition indexes are always increasing, so that the above works?
>>
>> And, is there a better way to accomplish the same effect?
>>
>> Thanks!
>>
>> - Philip
>>
>>
>
