Oops, I didn't catch the suggestion to just use RDD.zipWithIndex, which I
forgot existed (and I've discovered I actually used it in another project!). I
will use that instead of the mapPartitionsWithIndex/zipWithIndex solution
that I posted originally.
On Tue, Sep 22, 2015 at 9:07 AM, Philip Weaver wrote:
The indices are definitely necessary. My first solution was just
reduceByKey { case (v, _) => v } and that didn't work. I needed to look at
both values and see which had the lower index.
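The fix described above can be sketched locally: carry the index alongside each value and have the combine function keep whichever value has the lower index. `reduceByKeyLocal` below is a hypothetical stand-in for RDD.reduceByKey so the combine function can be exercised without Spark.

```scala
// Minimal local stand-in for RDD.reduceByKey (illustrative only).
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

// Values carry their original line index.
val keyed = Seq(("k", ("late", 7L)), ("k", ("first", 2L)), ("j", ("only", 5L)))

val firstPerKey = reduceByKeyLocal(keyed) { (a, b) =>
  if (a._2 <= b._2) a else b   // lower index wins, whatever order reduce visits
}
// firstPerKey("k") is ("first", 2L): correct even if combining is unordered
```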
On Tue, Sep 22, 2015 at 8:54 AM, Sean Owen wrote:
The point is that this only works if you already knew the file was
presented in order within and across partitions, which was the
original problem anyway. I don't think it is in general, but in
practice, I do imagine it's already in the expected order from
textFile. Maybe under the hood this ends up ...
I have used the mapPartitionsWithIndex/zipWithIndex solution and so far it
has done the correct thing.
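The mapPartitionsWithIndex/zipWithIndex combination mentioned above can be sketched with local collections: tag every line with (partitionIndex, indexWithinPartition). That pair sorts lexicographically, which reproduces file order as long as partitions are themselves in file order. (Spark's real zipWithIndex yields Long indices; the partition data here is invented for illustration.)

```scala
// Three "partitions" of lines, standing in for an RDD's partitions.
val partitions = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e"))

// Composite key: (partition index, index within the partition).
val tagged = partitions.zipWithIndex.flatMap { case (part, pIdx) =>
  part.zipWithIndex.map { case (line, i) => (line, (pIdx, i)) }
}

// Lexicographic ordering on the pair restores file order.
val inFileOrder = tagged.sortBy(_._2).map(_._1)
// inFileOrder: a, b, c, d, e
```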
On Tue, Sep 22, 2015 at 8:38 AM, Adrian Tanase wrote:
just give zipWithIndex a shot, use it early in the pipeline. I think it
provides exactly the info you need, as the index is the original line number in
the file, not the index in the partition.
Sent from my iPhone
On 22 Sep 2015, at 17:50, Philip Weaver wrote:
I don't know of a way to do this, out of the box, without maybe
digging into custom InputFormats. The RDD from textFile doesn't have
an ordering. I can't imagine a world in which partitions weren't
iterated in line order, of course, but there's also no real guarantee
about ordering among partitions.
Thanks. If textFile can be used in a way that preserves order, then both
the partition index and the index within each partition should be
consistent, right?
I overcomplicated the question by asking about removing duplicates.
Fundamentally I think my question is, how does one sort lines in a file ...
By looking through the docs and source code, I think you can get away with
rdd.zipWithIndex to get the index of each line in the file, as long as you
define the parallelism upfront:
sc.textFile("README.md", 4)
You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I’m skimming
through s...
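The groupBy(...).mapValues(_.sortBy(...).head) shape suggested above, again sketched with local collections standing in for the RDD (the key function and data are illustrative assumptions, not from the thread):

```scala
// Each element is (line, original index from zipWithIndex).
val indexed = Seq(("x", 3L), ("y", 1L), ("x", 0L))

// Group by the line itself, sort each group by index, keep the head.
val firstPerLine = indexed
  .groupBy(_._1)
  .view.mapValues(_.sortBy(_._2).head)   // lowest index first
  .toMap
// firstPerLine("x") is ("x", 0L): the earliest occurrence of "x"
```

Note that sorting whole groups just to take the head does more work than a min-by-index reduce, but it matches the suggestion as written.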
Yes, that's right, though "in order" depends on the RDD having an
ordering, but so does the zip-based solution.
Actually, I'm going to walk that back a bit, since I don't see a
guarantee that foldByKey behaves like foldLeft. The implementation
underneath, in combineByKey, appears that it will act ...
Hmm, ok, but I'm not seeing why foldByKey is more appropriate than
reduceByKey? Specifically, is foldByKey guaranteed to walk the RDD in
order, but reduceByKey is not?
On Mon, Sep 21, 2015 at 8:41 PM, Sean Owen wrote:
The zero value here is None. Combining None with any row should yield
Some(row). After that, combining is a no-op for other rows.
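Sean's foldByKey idea can be sketched locally: the zero value is None, combining None with a row yields Some(row), and combining Some(row) with anything else is a no-op, so each key keeps its first-encountered row. (This assumes left-to-right combining, which, as discussed elsewhere in the thread, foldByKey does not strictly guarantee; plain Scala foldLeft is used here so the semantics are explicit.)

```scala
val rows = Seq(("k", "row1"), ("k", "row2"), ("j", "only"))

// Per key: fold with zero None; the first row wins, later rows are ignored.
val firstByKey = rows.groupBy(_._1).map { case (k, vs) =>
  k -> vs.map(_._2).foldLeft(Option.empty[String]) {
    case (None, row) => Some(row)   // first row seen for this key
    case (some, _)   => some        // no-op for every later row
  }
}
// firstByKey("k") is Some("row1")
```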
On Tue, Sep 22, 2015 at 4:27 AM, Philip Weaver wrote:
Hmm, I don't think that's what I want. There's no "zero value" in my use
case.
On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen wrote:
I think foldByKey is much more what you want, as it has more a notion
of building up some result per key by encountering values serially.
You would take the first and ignore the rest. Note that "first"
depends on your RDD having an ordering to begin with, or else you rely
on however it happens to be ...