Rohini wrote:
> But the current distributed order by uses range
> partitioning and same keys can go to different reducers.

I am puzzled by this statement. I don't understand how partitioning would work if the same keys go to different reducers.

(I suppose I could understand if it were 'two' different reducers, because a specific key may be the max of one range and the min of the next. But even in that case I can't understand why it would be desirable to make both ends of the ranges 'inclusive'. I think I'm heading down the wrong path ... ?)

To help educate us newbies, please provide a more detailed explanation ...

Q: Is there a URL which could explain "range partitioning ... same keys go to different reducers" ?
and/or
Q: Can you provide a brief explanation of "range partitioning ... same keys go to different reducers" ?

Michael

On Mon, Jun 8, 2015 at 4:08 PM, Rohini Palaniswamy <[email protected]> wrote:

> If order by and distinct have the same key, it is possible to combine them
> into one mapreduce job. But the current distributed order by uses range
> partitioning and same keys can go to different reducers. Tagging along
> distinct to that will require more work and not something we are planning
> to do sometime soon.
>
> On Wed, Jun 3, 2015 at 11:14 AM, Mehmet Tepedelenlioglu <
> [email protected]> wrote:
>
> > Order and distinct are 2 very different operations. You order by
> > something, but you take the distinct over all the fields of a relation,
> > which is to say that the key/value structure is quite different for the
> > general case.
> >
> >
> > > On Jun 3, 2015, at 11:02 AM, <[email protected]> <
> > > [email protected]> wrote:
> > >
> > > Dear Pig users,
> > > Can Pig combine sorting and unique-ing into a single job?
> > > Doing this
> > > --define Components, then
> > > Sorted_0 = order Components by block_id parallel $par;
> > > Sorted = DISTINCT Sorted_0;
> > >
> > > causes one more MR job to be launched than simply doing this:
> > > --define Components, then
> > > Sorted = order Components by block_id parallel $par;
> > >
> > > It would seem there should be some way to do the distinct in the same
> > > pass as the sort, like 'sort -u'. But I can't see how. Any tips would be
> > > much appreciated!
> > >
> > > Thanks,
> > > Will
> > >
> > > William F Dowling
> > > Senior Technologist
> > > Thomson Reuters
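
One way to picture the "range partitioning ... same keys can go to different reducers" remark is a toy range partitioner. The sketch below is Python, not Pig's actual code, and every name in it (`boundaries`, `ToyRangePartitioner`, `run_sort`) is hypothetical. It assumes that split points are taken as quantiles of a sample, and that when one heavily repeated key fills several consecutive split points, that key is spread round-robin over the matching reducers so no single reducer gets all of it. As far as I understand, Pig's distributed order by handles skewed keys in roughly this spirit, but treat the details as assumptions.

```python
from bisect import bisect_left, bisect_right
from itertools import cycle


def boundaries(sample, num_reducers):
    """Pick num_reducers - 1 split points (quantiles) from a sample of keys."""
    s = sorted(sample)
    return [s[len(s) * i // num_reducers] for i in range(1, num_reducers)]


class ToyRangePartitioner:
    """Hypothetical partitioner: reducer i takes keys in (cut[i-1], cut[i]]."""

    def __init__(self, sample, num_reducers):
        self.cuts = boundaries(sample, num_reducers)
        self._round_robin = {}  # one cycling iterator per skewed key

    def reducers_for(self, key):
        lo = bisect_left(self.cuts, key)   # number of split points < key
        hi = bisect_right(self.cuts, key)  # number of split points <= key
        # A key equal to several split points may go to any of those reducers.
        return range(lo, max(hi, lo + 1))

    def partition(self, key):
        choices = self.reducers_for(key)
        if len(choices) == 1:
            return choices[0]
        it = self._round_robin.setdefault(key, cycle(choices))
        return next(it)


def run_sort(records, num_reducers):
    # Toy simplification: the "sample" is the whole input.
    part = ToyRangePartitioner(records, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in records:
        buckets[part.partition(key)].append(key)
    return [sorted(b) for b in buckets]  # each reducer sorts its own share


records = [7, 1, 7, 9, 7, 2, 7, 8, 7, 7]  # block_id 7 is heavily skewed
buckets = run_sort(records, 4)
print(buckets)  # [[1, 2, 7, 7], [7, 7], [7, 7], [8, 9]]

# Concatenating reducer outputs in order is still a total sort ...
assert [k for b in buckets for k in b] == sorted(records)

# ... but de-duplicating inside each reducer leaves 7 in three outputs.
per_reducer_distinct = [k for b in buckets for k in sorted(set(b))]
print(per_reducer_distinct)  # [1, 2, 7, 7, 7, 8, 9]
```

In this toy, the sort stays correct because every reducer holding key 7 sits between the reducers holding smaller and larger keys. A distinct, however, needs all copies of a key in one place, which would explain why it cannot simply be tagged onto the same reduce phase.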
