One approach would be to repartition the whole dataset into a single
partition (a costly operation, since all the data passes through one
task, but it will give you a single sorted file). You could also try
using zipWithIndex before writing it out, so that each record carries
its global position; both options are sketched below.
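
A minimal sketch of both options, assuming an RDD of (key, value)
pairs already sorted with sortByKey. The paths are placeholders, and
text/object files stand in for your avro/parquet writer just to keep
the example short:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SortPreservingWrite {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sort-write"))

    // Hypothetical input: (key, value) pairs, sorted by key.
    val sorted = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")), 4).sortByKey()

    // Option 1: merge everything into one partition before writing.
    // coalesce(1) avoids a shuffle, so the global sort order survives,
    // but a single task then writes all of the data (the costly part).
    sorted.coalesce(1).saveAsTextFile("/tmp/sorted-single")

    // Option 2: record each element's global position with zipWithIndex,
    // write the (position, record) pairs, and sort by position on read.
    val indexed = sorted.zipWithIndex().map { case (rec, i) => (i, rec) }
    indexed.saveAsObjectFile("/tmp/sorted-indexed")

    val restored = sc.objectFile[(Long, (Int, String))]("/tmp/sorted-indexed")
      .sortByKey()   // sorting by the saved index restores the original order
      .values

    restored.take(3).foreach(println)
    sc.stop()
  }
}

Note that a plain repartition(1) always shuffles and does not guarantee
the order of records in the resulting partition, whereas coalesce(1)
just concatenates the already-sorted partitions in index order, so it
is the safer way to get one sorted file.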

Thanks
Best Regards

On Sat, Mar 21, 2015 at 4:11 AM, Michael Albert <
m_albert...@yahoo.com.invalid> wrote:

> Greetings!
>
> I sorted a dataset in Spark and then wrote it out in avro/parquet.
>
> Then I wanted to check that it was sorted.
>
> It looks like each partition has been sorted, but when the data is
> read back in, the first "partition" (i.e., index 0 as seen in
> mapPartitionsWithIndex) is not the one implied by the names of the
> parquet files (even when the RDD that was read back has the same
> number of partitions as the data on disk).
>
> If I "take()" a few hundred values, they are sorted, but they are *not*
> the same as if I
> explicitly open "part-r-00000.parquet" and take values from that.
>
> It seems that when opening the rdd, the partitions of the rdd are not
> in the same order as implied by the data on disk (i.e.,
> part-r-00000.parquet, part-r-00001.parquet, etc.).
>
> So, how might one read the data so that one maintains the sort order?
>
> And while on the subject, after the "terasort", how did they check that
> the
> data was actually sorted correctly? (or did they :-) ? ).
>
> Is there any way to read the data back in so as to preserve the sort, or
> do I need to
> "zipWithIndex" before writing it out, and write the index at that time? (I
> haven't tried the
> latter yet).
>
> Thanks!
> -Mike
>
>
