subject:"Union of 2 RDD's only returns the first one"

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim

-To: "user@spark.apache.org" Date: Wednesday, April 30, 2014 at 11:36 AM To: "user@spark.apache.org" Subject: Re: Union of 2 RDD's only returns the first one Which is what you shouldn't be doing as an API user, since that implementation code might change. The d

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mark Hamstra

Which is what you shouldn't be doing as an API user, since that implementation code might change. The documentation doesn't mention a row ordering guarantee, so none should be assumed. It is hard enough for us to correctly document all of the things that the API does do. We really shouldn't be f

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim

Okay, that makes sense. It’d be great if this can be better documented at some point, because the only way to find out about the resulting RDD row order is by looking at the code. Thanks for the discussion! Mingyu On 4/29/14, 11:59 PM, "Patrick Wendell" wrote: >I don't think we guarantee an

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Patrick Wendell

I don't think we guarantee anywhere that union(A, B) will behave by concatenating the partitions, it just happens to be an artifact of the current implementation. rdd1 = [1,2,3] rdd2 = [1,4,5] rdd1.union(rdd2) = [1,2,3,1,4,5] // how it is now rdd1.union(rdd2) = [1,4,5,1,2,3] // some day it could
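
To make the point above concrete, here is a minimal sketch in plain Python rather than Spark. It models an RDD as a list of partitions, and the helper names `union` and `flatten` are hypothetical, introduced only for illustration. It assumes union currently behaves like partition concatenation, as described in the message, and shows that both orderings hold the same rows, so code that needs a particular order should impose it explicitly.

```python
from collections import Counter

def union(a, b):
    # Model of the current behaviour only: partitions of a, then partitions of b.
    return a + b

def flatten(partitions):
    # Read rows partition by partition, the way a collect would in this model.
    return [row for part in partitions for row in part]

rdd1 = [[1, 2], [3]]   # hypothetical partitioning of [1, 2, 3]
rdd2 = [[1], [4, 5]]   # hypothetical partitioning of [1, 4, 5]

current = flatten(union(rdd1, rdd2))    # how it is now
possible = flatten(union(rdd2, rdd1))   # an ordering a future version could produce

# Both results contain the same rows with the same multiplicities.
same_contents = Counter(current) == Counter(possible)

# If order matters downstream, sort explicitly instead of relying on union.
ordered = sorted(current)
```

Comparing results as multisets, or sorting before comparing, keeps a caller correct under either ordering.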

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim

Yes, that’s what I meant. Sure, the numbers might not actually be sorted, but the order of rows is semantically kept throughout non-shuffling transforms. I’m on board with you on union as well. Back to the original question, then, why is it important to coalesce to a single partition? When you un

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell

If you call map() on an RDD it will retain the ordering it had before, but that is not necessarily a correct sort order for the new RDD. var rdd = sc.parallelize([2, 1, 3]); var sorted = rdd.map(x => (x, x)).sort(); // should be [1, 2, 3] var mapped = sorted.mapValues(x => 3 - x); // should be [2,
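
The distinction between "row order is retained" and "the result is sorted" can be sketched in plain Python, with ordinary lists standing in for RDDs. This is an illustration of the idea in the message, not Spark code, and the variable names are assumptions.

```python
# Key each value by itself and sort by key, mirroring the sorted pair RDD.
pairs = sorted((x, x) for x in [2, 1, 3])

# Transform only the values; row positions do not change.
mapped = [(k, 3 - v) for k, v in pairs]

keys = [k for k, _ in mapped]
values = [v for _, v in mapped]

# The rows are still in key order, but the new values are not ascending.
keys_still_sorted = keys == sorted(keys)
values_sorted = values == sorted(values)
```

The mapped rows keep their positions, yet treating the mapped values as sorted data would be wrong, which is why a sort order should not be assumed to survive a transformation.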

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim

Thanks for the quick response! To better understand it, the reason sorted RDD has a well-defined ordering is because sortedRDD.getPartitions() returns the partitions in the right order and each partition internally is properly sorted. So, if you have var rdd = sc.parallelize([2, 1, 3]); var sorte

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell

You are right, once you sort() the RDD, then yes it has a well defined ordering. But that ordering is lost as soon as you transform the RDD, including if you union it with another RDD. On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim wrote: > Hi Patrick, > > I'm a little confused about your comment

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim

Hi Patrick, I'm a little confused about your comment that RDDs are not ordered. As far as I know, RDDs keep list of partitions that are ordered and this is why I can call RDD.take() and get the same first k rows every time I call it and RDD.take() returns the same entries as RDD.map().take() beca

9 matches
