Hi Patrick,

I'm a little confused about your comment that RDDs are not ordered. As far as I know, an RDD keeps an ordered list of partitions. This is why I can call RDD.take() and get the same first k rows every time, and why RDD.take() returns the same entries as RDD.map().take(): map preserves the partition order. RDD order is also what allows me to get the top k out of an RDD by doing RDD.sort().take().
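The ordering model described above can be sketched in plain Scala, without Spark. This is only a toy illustration: `ToyRdd` and its methods are hypothetical names, not Spark's API or implementation, and the sketch says nothing about what Spark guarantees after a shuffle or when output is written.

```scala
// Hypothetical toy model (not Spark): a dataset is an ordered list of
// partitions, and each partition is an ordered list of elements.
final case class ToyRdd[T](partitions: Vector[Vector[T]]) {

  // Apply f element by element, keeping partition and element order.
  def map[U](f: T => U): ToyRdd[U] = ToyRdd(partitions.map(_.map(f)))

  // Walk the partitions in order and return the first k elements.
  def take(k: Int): Vector[T] = partitions.flatten.take(k)

  // Sort all elements globally, then split them back into ordered chunks.
  def sorted(implicit ord: Ordering[T]): ToyRdd[T] = {
    val all = partitions.flatten.sorted
    val n = math.max(1, partitions.size)
    val chunk = math.max(1, (all.size + n - 1) / n)
    ToyRdd(all.grouped(chunk).toVector)
  }
}

val rdd = ToyRdd(Vector(Vector(5, 1), Vector(4, 2), Vector(3)))

rdd.take(3)             // Vector(5, 1, 4), the same on every call
rdd.map(_ * 10).take(3) // Vector(50, 10, 40), same positions as take(3)
rdd.sorted.take(2)      // Vector(1, 2)
```

In this model, `take` is deterministic only because the partition list itself has a fixed order, which is exactly the assumption being questioned.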
Am I misunderstanding it? Or, is it just when RDD is written to disk that the order is not well preserved?

Thanks in advance!

Mingyu

On 1/22/14, 4:46 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:

>Ah somehow after all this time I've never seen that!
>
>On Wed, Jan 22, 2014 at 4:45 PM, Aureliano Buendia <buendia...@gmail.com>
>wrote:
>>
>> On Thu, Jan 23, 2014 at 12:37 AM, Patrick Wendell <pwend...@gmail.com>
>> wrote:
>>>
>>> What is the ++ operator here? Is this something you defined?
>>
>> No, it's an alias for union defined in RDD.scala:
>>
>> def ++(other: RDD[T]): RDD[T] = this.union(other)
>>
>>>
>>> Another issue is that RDD's are not ordered, so when you union two
>>> together it doesn't have a well defined ordering.
>>>
>>> If you do want to do this you could coalesce into one partition, then
>>> call MapPartitions and return an iterator that first adds your header
>>> and then the rest of the file, then call saveAsTextFile. Keep in mind
>>> this will only work if you coalesce into a single partition.
>>
>> Thanks! I'll give this a try.
>>
>>>
>>> myRdd.coalesce(1)
>>> .map(_.mkString(",")))
>>> .mapPartitions(it => (Seq("col1,col2,col3") ++ it).iterator)
>>> .saveAsTextFile("out.csv")
>>>
>>> - Patrick
>>>
>>> On Wed, Jan 22, 2014 at 11:12 AM, Aureliano Buendia
>>> <buendia...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > I'm trying to find a way to create a csv header when using
>>> > saveAsTextFile,
>>> > and I came up with this:
>>> >
>>> > (sc.makeRDD(Array("col1,col2,col3"), 1) ++
>>> > myRdd.coalesce(1).map(_.mkString(",")))
>>> > .saveAsTextFile("out.csv")
>>> >
>>> > But it only saves the header part. Why is that the union method does not
>>> > return both RDD's?
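The header approach suggested in the quoted thread (coalesce to one partition, then prepend the header inside mapPartitions) can be sketched without Spark as a function over a single partition's iterator. `withHeader` is a hypothetical helper name, and the sample rows are made up for illustration.

```scala
// Hypothetical helper: emit the header first, then the partition's rows.
// Iterator ++ is lazy, so rows are not consumed until they are read.
def withHeader(header: String, rows: Iterator[String]): Iterator[String] =
  Iterator.single(header) ++ rows

// Stand-in for one partition of myRdd: each record is an array of fields.
val partition = Iterator(Array("1", "2", "3"), Array("4", "5", "6"))
val lines = withHeader("col1,col2,col3", partition.map(_.mkString(","))).toList
// lines == List("col1,col2,col3", "1,2,3", "4,5,6")

// Assumed Spark usage, given that myRdd holds arrays of fields:
//   myRdd.coalesce(1)
//     .map(_.mkString(","))
//     .mapPartitions(it => withHeader("col1,col2,col3", it))
//     .saveAsTextFile("out.csv")
```

Because the function runs once per partition, any dataset with more than one partition would get one header per partition, which is why the single-partition coalesce matters here.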