Re: Remove duplicate keys by always choosing first in file.

2015-09-24 Thread Philip Weaver
Oops, I didn't catch the suggestion to just use RDD.zipWithIndex, which I forgot existed (and I've discovered I actually used in another project!). I will use that instead of the mapPartitionsWithIndex/zipWithIndex solution that I posted originally.
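A minimal sketch of the zipWithIndex approach discussed in this thread, shown with plain Scala collections standing in for the RDD (the Spark calls have the same names, and `zipWithIndex` on an RDD likewise attaches a global index; the sample data and the comma-split key extraction are made up for illustration):

```scala
// Lines of the file, in order; zipWithIndex attaches each line's original
// position, just as RDD.zipWithIndex attaches a global index across partitions.
val lines = Seq("a,1", "b,2", "a,3", "b,4")

val firstPerKey = lines
  .zipWithIndex                                        // (line, index)
  .map { case (line, idx) => (line.split(",")(0), (line, idx)) }
  .groupBy(_._1)                                       // analogue of groupByKey
  .map { case (key, group) => key -> group.map(_._2).minBy(_._2)._1 }
```

Taking `minBy` on the index picks the first occurrence of each key in file order, which is exactly the "always choose first in file" semantics the thread is after.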

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
The indices are definitely necessary. My first solution was just reduceByKey { case (v, _) => v }, and that didn't work. I needed to look at both values and see which had the lower index.
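The fix described here is to carry the index inside the value and have the combiner keep whichever pair has the smaller index. Because that combiner is associative and order-insensitive, reduceByKey is then safe regardless of combining order. A sketch with plain Scala collections simulating reduceByKey (the sample pairs are made up):

```scala
// (key, (value, originalIndex)) pairs, deliberately out of file order.
val pairs = Seq(("a", ("x", 2)), ("b", ("y", 1)), ("a", ("w", 0)))

// Combiner: keep the (value, index) pair with the smaller original index.
// This does not depend on the order in which pairs are combined.
def keepLower(l: (String, Int), r: (String, Int)): (String, Int) =
  if (l._2 <= r._2) l else r

val first = pairs
  .groupBy(_._1)
  .map { case (k, vs) => k -> vs.map(_._2).reduce(keepLower) }
```

In Spark this would be `rdd.reduceByKey(keepLower)` on the indexed pair RDD; the collection version above only illustrates the combiner's semantics.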

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
The point is that this only works if you already knew the file was presented in order within and across partitions, which was the original problem anyway. I don't think that's guaranteed in general, but in practice I do imagine it's already in the expected order from textFile.

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
I have used the mapPartitionsWithIndex/zipWithIndex solution and so far it has done the correct thing.

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Adrian Tanase
Just give zipWithIndex a shot, and use it early in the pipeline. I think it provides exactly the info you need, as the index is the original line number in the file, not the index in the partition.

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
I don't know of a way to do this out of the box, without maybe digging into custom InputFormats. The RDD from textFile doesn't have an ordering. I can't imagine a world in which partitions weren't iterated in line order, of course, but there's also no real guarantee about ordering among partitions.

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
Thanks. If textFile can be used in a way that preserves order, then both the partition index and the index within each partition should be consistent, right? I overcomplicated the question by asking about removing duplicates. Fundamentally, I think my question is: how does one sort lines in a file?

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Adrian Tanase
By looking through the docs and source code, I think you can get away with rdd.zipWithIndex to get the index of each line in the file, as long as you define the parallelism upfront: sc.textFile("README.md", 4). You can then just do .groupBy(…).mapValues(_.sortBy(…).head).
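Adrian's groupBy/sortBy/head recipe can be sketched with Scala collections; in Spark it would run on the pair RDD produced after zipWithIndex (the records below, and the tuple shape `(key, (line, index))`, are assumptions for illustration):

```scala
// Each record is (key, (line, originalIndex)), e.g. after zipWithIndex
// and key extraction; groups are sorted by index and the head is kept.
val records = Seq(("a", ("a,3", 2)), ("a", ("a,1", 0)), ("b", ("b,2", 1)))

val firstSeen = records
  .groupBy(_._1)
  .map { case (k, vs) => k -> vs.map(_._2).sortBy(_._2).head._1 }
```

Note that sorting each whole group just to take its head does more work than necessary; the reduceByKey-with-indices combiner earlier in the thread achieves the same result without a per-group sort.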

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
Yes, that's right, though "in order" depends on the RDD having an ordering; but so does the zip-based solution. Actually, I'm going to walk that back a bit, since I don't see a guarantee that foldByKey behaves like foldLeft (the implementation underneath is combineByKey).

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
Hmm, ok, but I'm not seeing why foldByKey is more appropriate than reduceByKey. Specifically, is foldByKey guaranteed to walk the RDD in order, while reduceByKey is not?

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
The zero value here is None. Combining None with any row should yield Some(row). After that, combining is a no-op for other rows.

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
Hmm, I don't think that's what I want. There's no "zero value" in my use case.

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
I think foldByKey is much more what you want, as it has more of a notion of building up some result per key by encountering values serially. You would take the first and ignore the rest. Note that "first" depends on your RDD having an ordering to begin with, or else you rely on however it happens to be ordered.
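Sean's foldByKey idea, with None as the zero value, can be simulated per key with foldLeft over Scala collections (sample rows are made up). Note Sean himself later walks back the assumption that foldByKey combines values like foldLeft, so this only illustrates the intended semantics, not a guarantee Spark provides:

```scala
// Zero value is None; combining takes the first Some encountered and
// ignores every later row, so each key keeps its first value in order.
val rows = Seq(("a", "row1"), ("b", "row2"), ("a", "row3"))

val folded: Map[String, String] = rows
  .groupBy(_._1)
  .map { case (k, vs) =>
    val acc = vs.map(_._2).foldLeft(Option.empty[String]) {
      (acc, v) => acc.orElse(Some(v))
    }
    k -> acc.get  // safe: every group is non-empty, so acc is Some
  }
```

The Spark analogue would be `rdd.foldByKey(Option.empty[Row])((acc, v) => acc.orElse(Some(v)))` on an RDD of `(key, Option[Row])`, but as the thread notes, combineByKey underneath may not combine values serially in file order.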