One approach would be to repartition the whole dataset into a single partition (a costly operation, but it will give you a single file). Alternatively, you could zipWithIndex the data before writing it out, so the global order is recorded explicitly with each record. For example:
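A minimal sketch of both suggestions in Scala; the data, paths, and app name are illustrative, not from this thread, and plain text output stands in for Avro/Parquet to keep the example short:

    import org.apache.spark.{SparkConf, SparkContext}

    object SingleFileWrite {
      def main(args: Array[String]): Unit = {
        // Local master only so the sketch runs out of the box.
        val sc = new SparkContext(
          new SparkConf().setAppName("single-file-write").setMaster("local[*]"))

        // Stand-in for a real sorted dataset.
        val sortedRdd = sc.parallelize(1 to 1000).map(i => (i, s"value-$i")).sortByKey()

        // Option 1: shuffle everything into one partition before writing.
        // Costly for large data, but yields a single part-00000 file.
        sortedRdd.repartition(1).saveAsTextFile("out/single-file")

        // Option 2: stamp each record with its global position before writing,
        // so the order can be recovered on read by sorting on the index.
        sortedRdd.zipWithIndex()
          .map { case ((k, v), idx) => (idx, (k, v)) }
          .saveAsTextFile("out/indexed")

        sc.stop()
      }
    }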
Thanks
Best Regards

On Sat, Mar 21, 2015 at 4:11 AM, Michael Albert
<m_albert...@yahoo.com.invalid> wrote:

> Greetings!
>
> I sorted a dataset in Spark and then wrote it out in Avro/Parquet.
>
> Then I wanted to check that it was sorted.
>
> It looks like each partition has been sorted, but when reading in, the
> first "partition" (i.e., as seen in the partition index of
> mapPartitionsWithIndex) is not the same as implied by the names of the
> Parquet files (even when the number of partitions in the RDD read back
> is the same as on disk).
>
> If I "take()" a few hundred values, they are sorted, but they are *not*
> the same as if I explicitly open "part-r-00000.parquet" and take values
> from that.
>
> It seems that when opening the RDD, the "partitions" of the RDD are not
> in the same order as implied by the data on disk (i.e.,
> "part-r-00000.parquet", "part-r-00001.parquet", etc.).
>
> So, how might one read the data so that one maintains the sort order?
>
> And while on the subject, after the "terasort", how did they check that
> the data was actually sorted correctly? (or did they :-) ?)
>
> Is there any way to read the data back in so as to preserve the sort, or
> do I need to "zipWithIndex" before writing it out, and write the index at
> that time? (I haven't tried the latter yet.)
>
> Thanks!
> -Mike
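On the verification question in the quoted message: a common way to validate a distributed sort (roughly what TeraSort-style validity checks do) is to confirm that each partition is internally sorted and that partition boundaries do not overlap. A hedged sketch, best run on the sorted RDD before writing (since, as noted above, read-back partition order may differ from file order); isGloballySorted and the (Long, String) record type are illustrative, not from the thread:

    import org.apache.spark.rdd.RDD

    // True iff each partition is internally sorted by key and the last key
    // of partition i is <= the first key of partition i+1.
    def isGloballySorted(rdd: RDD[(Long, String)]): Boolean = {
      // One record per non-empty partition:
      // (partitionIndex, firstKey, lastKey, sortedWithinPartition).
      val bounds = rdd.mapPartitionsWithIndex { (idx, iter) =>
        val part = iter.toArray
        if (part.isEmpty) Iterator.empty
        else {
          val sortedWithin = part.sliding(2).forall {
            case Array(a, b) => a._1 <= b._1
            case _           => true   // single-element window
          }
          Iterator((idx, part.head._1, part.last._1, sortedWithin))
        }
      }.collect().sortBy(_._1)

      bounds.forall(_._4) && bounds.sliding(2).forall {
        case Array(a, b) => a._3 <= b._2   // boundaries must not overlap
        case _           => true
      }
    }

Note that toArray materializes each partition in memory inside the checker; for very large partitions a single streaming pass over the iterator would be preferable.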