Thanks for the information! (to all who responded)
The code below *seems* to work. Any hidden gotchas that anyone sees?
And still, in "terasort", how did they check that the data was actually sorted?
:-)
-Mike
class MyInputFormat[T] extends parquet.hadoop.ParquetInputFormat[T] {
  override def getSplits(jobContext: org.apache.hadoop.mapreduce.JobContext)
      : java.util.List[org.apache.hadoop.mapreduce.InputSplit] = {
    import scala.collection.JavaConversions._
    val splits = super.getSplits(jobContext)
    // Order splits by file name, then by start offset within the file.
    splits.sortBy {
      case fileSplit: org.apache.hadoop.mapreduce.lib.input.FileSplit =>
        (fileSplit.getPath.getName, fileSplit.getStart)
      case _ => ("", -1L)
    }
  }
}
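
One gotcha worth checking is the sort key itself: it compares file names as strings, so it only matches the write order when the part numbers are zero-padded (as in part-r-00000.parquet). A minimal sketch, using a hypothetical `SplitKey` case class as a stand-in for `FileSplit` so that it runs without Hadoop on the classpath:

```scala
// Hypothetical stand-in for the (file name, start offset) pair taken from a FileSplit.
case class SplitKey(name: String, start: Long)

object SplitOrderDemo {
  // Same key as in getSplits above: file name first, then offset within the file.
  def order(splits: Seq[SplitKey]): Seq[SplitKey] =
    splits.sortBy(s => (s.name, s.start))

  def main(args: Array[String]): Unit = {
    // Zero-padded names: string order agrees with numeric part order.
    val padded = Seq(
      SplitKey("part-r-00010.parquet", 0L),
      SplitKey("part-r-00002.parquet", 128L),
      SplitKey("part-r-00002.parquet", 0L))
    order(padded).foreach(println)

    // Without zero padding, string comparison puts "part-10" before "part-2".
    val unpadded = Seq(SplitKey("part-2", 0L), SplitKey("part-10", 0L))
    order(unpadded).foreach(println)
  }
}
```

The key also uses only the file name, not the full path, so files with the same name in different directories would be interleaved by offset.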
From: Sean Owen <[email protected]>
To: Michael Albert <[email protected]>
Cc: User <[email protected]>
Sent: Monday, March 23, 2015 7:31 AM
Subject: Re: How to check that a dataset is sorted after it has been written
out?
Data is not (necessarily) sorted when read from disk, no. A file might
have many blocks even, and while a block yields a partition in
general, the order in which those partitions appear in the RDD is not
defined. This is why you'd sort if you need the data sorted.
I think you could conceivably make some custom RDD or InputFormat that
reads blocks in a well-defined order and, assuming the data is sorted
in some knowable way on disk, then yields them in sorted order. I think
that's even been brought up.
Deciding whether the data is sorted is quite different. You'd have to
decide what ordering you expect (is part 0 before part 1? should it be
sorted in a part file?) and then just verify that externally.
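
Such an external check can be sketched in two steps: each part must be sorted internally, and the last key of each part must not exceed the first key of the next non-empty part. The sketch below uses plain Scala collections in place of an RDD, and `isGloballySorted` is a hypothetical name chosen for illustration. In Spark, the per-part results could be gathered with mapPartitionsWithIndex and compared on the driver. Hadoop's TeraValidate tool for terasort output works along similar lines: it checks that each output file is sorted and that keys do not go backwards across file boundaries.

```scala
object SortCheck {
  // True when every part is sorted and the parts, taken in the given order,
  // form one sorted sequence. Empty parts are skipped.
  def isGloballySorted[T](parts: Seq[Seq[T]])(implicit ord: Ordering[T]): Boolean = {
    val nonEmpty = parts.filter(_.nonEmpty)
    // Step 1: adjacent elements within each part must be in order.
    val eachSorted = nonEmpty.forall { p =>
      p.zip(p.tail).forall { case (a, b) => ord.lteq(a, b) }
    }
    // Step 2: the last key of a part must not exceed the first key of the next.
    val boundariesOk = nonEmpty.zip(nonEmpty.drop(1)).forall {
      case (left, right) => ord.lteq(left.last, right.head)
    }
    eachSorted && boundariesOk
  }

  def main(args: Array[String]): Unit = {
    println(isGloballySorted(Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(3, 7))))
    println(isGloballySorted(Seq(Seq(4, 5), Seq(1, 2))))
  }
}
```

The second call fails the check even though each part is sorted on its own, because the parts are presented in the wrong order; that is the situation described in the original question.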
On Fri, Mar 20, 2015 at 10:41 PM, Michael Albert
<[email protected]> wrote:
> Greetings!
>
> I sorted a dataset in Spark and then wrote it out in avro/parquet.
>
> Then I wanted to check that it was sorted.
>
> It looks like each partition has been sorted, but when reading in, the first
> "partition" (i.e., as seen in the partition index of mapPartitionsWithIndex)
> is not the same as implied by the names of the parquet files (even when the
> number of partitions is the same in the rdd which was read as on disk).
>
> If I "take()" a few hundred values, they are sorted, but they are *not* the
> same as if I explicitly open "part-r-00000.parquet" and take values from that.
>
> It seems that when opening the rdd, the "partitions" of the rdd are not in
> the same order as implied by the data on disk (i.e., "part-r-00000.parquet",
> "part-r-00001.parquet", etc.).
>
> So, how might one read the data so that one maintains the sort order?
>
> And while on the subject, after the "terasort", how did they check that the
> data was actually sorted correctly? (or did they :-) ? ).
>
> Is there any way to read the data back in so as to preserve the sort, or do
> I need to "zipWithIndex" before writing it out, and write the index at that
> time? (I haven't tried the latter yet).
>
> Thanks!
> -Mike
>