Re: How to check that a dataset is sorted after it has been written out?

Michael Albert Mon, 23 Mar 2015 11:32:39 -0700

Thanks for the information! (to all who responded)
The code below *seems* to work.Any hidden gotcha's that anyone sees?
And still, in "terasort", how did they check that the data was actually sorted? 
:-)
-Mike
class MyInputFormat[T]    extends parquet.hadoop.ParquetInputFormat[T]{    
override def getSplits(jobContext: org.apache.hadoop.mapreduce.JobContext)      
  :java.util.List[org.apache.hadoop.mapreduce.InputSplit] =    {        val 
splits = super.getSplits(jobContext)           import 
scala.collection.JavaConversions._        splits.sortBy{ split => split match { 
                        case fileSplit                            
:org.apache.hadoop.mapreduce.lib.input.FileSplit                                
        => (fileSplit.getPath.getName,                                          
   fileSplit.getStart)                         case _ => ("",-1L) } }    }}


      From: Sean Owen <[email protected]>
 To: Michael Albert <[email protected]> 
Cc: User <[email protected]> 
 Sent: Monday, March 23, 2015 7:31 AM
 Subject: Re: How to check that a dataset is sorted after it has been written 
out?
   
Data is not (necessarily) sorted when read from disk, no. A file might
have many blocks even, and while a block yields a partition in
general, the order in which those partitions appear in the RDD is not
defined. This is why you'd sort if you need the data sorted.

I think you could conceivably make some custom RDD or InputFormat that
reads blocks in a well-defined order and, assuming the data is sorted
in some knowable way on disk, then must have them sorted. I think
that's even been brought up.

Deciding whether the data is sorted is quite different. You'd have to
decide what ordering you expect (is part 0 before part 1? should it be
sorted in a part file?) and then just verify that externally.



On Fri, Mar 20, 2015 at 10:41 PM, Michael Albert
<[email protected]> wrote:
> Greetings!
>
> I sorted a dataset in Spark and then wrote it out in avro/parquet.
>
> Then I wanted to check that it was sorted.
>
> It looks like each partition has been sorted, but when reading in, the first
> "partition" (i.e., as
> seen in the partition index of mapPartitionsWithIndex) is not the same  as
> implied by
> the names of the parquet files (even when the number of partitions is the
> same in the
> rdd which was read as on disk).
>
> If I "take()" a few hundred values, they are sorted, but they are *not* the
> same as if I
> explicitly open "part-r-00000.parquet" and take values from that.
>
> It seems that when opening the rdd, the "partitions" of the rdd are not in
> the same
> order as implied by the data on disk (i.e., "part-r-00000.parquet,
> part-r-00001.parquet, etc).
>
> So, how might one read the data so that one maintains the sort order?
>
> And while on the subject, after the "terasort", how did they check that the
> data was actually sorted correctly? (or did they :-) ? ).
>
> Is there any way to read the data back in so as to preserve the sort, or do
> I need to
> "zipWithIndex" before writing it out, and write the index at that time? (I
> haven't tried the
> latter yet).
>
> Thanks!
> -Mike
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: How to check that a dataset is sorted after it has been written out?

Reply via email to