Re: Union of 2 RDD's only returns the first one

Mingyu Kim Tue, 29 Apr 2014 22:23:32 -0700

Hi Patrick,

I'm a little confused about your comment that RDDs are not ordered. As far
as I know, an RDD keeps an ordered list of partitions, which is why I can
call RDD.take() and get the same first k rows every time I call it. It is
also why RDD.map().take() returns the mapped versions of the same entries
as RDD.take(): map preserves the partition order. RDD order is also what
allows me to get the top k out of an RDD by doing RDD.sort().take().
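
A minimal sketch of the behavior described above, assuming Spark is on the
classpath, a local master, and a Spark version where RDD.sortBy is
available (it stands in for the "sort" step). The object and method names
are hypothetical, and a local run only illustrates the observation; it does
not prove an ordering guarantee.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TakeOrderSketch {
  // Returns (first k, first k after map, largest k) for a small range RDD.
  def run(sc: SparkContext, k: Int): (Seq[Int], Seq[Int], Seq[Int]) = {
    // 4 partitions, filled in order from the input range.
    val rdd = sc.parallelize(1 to 20, 4)
    // take(k) reads partitions in index order until it has k elements.
    val firstK = rdd.take(k).toSeq
    // map is applied per partition, so partition order is unchanged.
    val mappedFirstK = rdd.map(_ * 10).take(k).toSeq
    // Sort descending, then take the first k to get the top k.
    val topK = rdd.sortBy(x => x, ascending = false).take(k).toSeq
    (firstK, mappedFirstK, topK)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("take-order-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (firstK, mappedFirstK, topK) = run(sc, 3)
      println(s"take: $firstK, map+take: $mappedFirstK, sort+take: $topK")
    } finally {
      sc.stop()
    }
  }
}
```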


Am I misunderstanding it? Or is it only when an RDD is written to disk that
the order is not preserved? Thanks in advance!

Mingyu




On 1/22/14, 4:46 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:

>Ah somehow after all this time I've never seen that!
>
>On Wed, Jan 22, 2014 at 4:45 PM, Aureliano Buendia <buendia...@gmail.com>
>wrote:
>>
>>
>>
>> On Thu, Jan 23, 2014 at 12:37 AM, Patrick Wendell <pwend...@gmail.com>
>> wrote:
>>>
>>> What is the ++ operator here? Is this something you defined?
>>
>>
>> No, it's an alias for union defined in RDD.scala:
>>
>> def ++(other: RDD[T]): RDD[T] = this.union(other)
>>
>>>
>>>
>>> Another issue is that RDD's are not ordered, so when you union two
>>> together it doesn't have a well defined ordering.
>>>
>>> If you do want to do this you could coalesce into one partition, then
>>> call MapPartitions and return an iterator that first adds your header
>>> and then the rest of the file, then call saveAsTextFile. Keep in mind
>>> this will only work if you coalesce into a single partition.
>>
>>
>> Thanks! I'll give this a try.
>>
>>>
>>>
>>> myRdd.coalesce(1)
>>> .map(_.mkString(","))
>>> .mapPartitions(it => (Seq("col1,col2,col3") ++ it).iterator)
>>> .saveAsTextFile("out.csv")
>>>
>>> - Patrick
>>>
>>> On Wed, Jan 22, 2014 at 11:12 AM, Aureliano Buendia
>>> <buendia...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > I'm trying to find a way to create a csv header when using
>>> > saveAsTextFile,
>>> > and I came up with this:
>>> >
>>> > (sc.makeRDD(Array("col1,col2,col3"), 1) ++
>>> > myRdd.coalesce(1).map(_.mkString(",")))
>>> >       .saveAsTextFile("out.csv")
>>> >
>>> > But it only saves the header part. Why is it that the union method
>>> > does not return both RDD's?
>>
>>
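
The header step in the quoted coalesce/mapPartitions suggestion can be
sketched in plain Scala, without Spark. The names HeaderSketch and
withHeader are hypothetical. The function has the Iterator-to-Iterator
shape that mapPartitions expects, and it prepends the header lazily rather
than building a Seq of the whole partition first.

```scala
object HeaderSketch {
  // Prepend a single header line to an iterator of already formatted rows.
  // Iterator ++ Iterator is lazy, so the rows are not materialized here.
  def withHeader(header: String)(rows: Iterator[String]): Iterator[String] =
    Iterator(header) ++ rows

  def main(args: Array[String]): Unit = {
    // Stand-in for the contents of one partition after map(_.mkString(",")).
    val rows = Iterator(Seq("1", "2", "3"), Seq("4", "5", "6")).map(_.mkString(","))
    withHeader("col1,col2,col3")(rows).foreach(println)
  }
}
```

Because the function runs once per partition, every partition would get
its own header line, which is why the quoted advice insists on coalescing
into a single partition first.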
