Rohini wrote:
> Sorry for the confusion.

Thanks for the prompt clarification!
Michael

On Mon, Jun 8, 2015 at 5:04 PM, Rohini Palaniswamy <[email protected]> wrote:

> Actually ignore what I said as DISTINCT is always on the alias and not on a
> key and what I said will apply only if there is only one field in the
> dataset. Sorry for the confusion. As Mehmet pointed out, your ORDER BY and
> DISTINCT are on different keys. Ordering is done on block_id, but distinct
> is done for the whole record. It is not possible to combine them into one
> mapreduce job as keys are different.
>
> Also in your example, you are trying to do DISTINCT after ORDER BY. Your
> results will not be sorted. Please put the DISTINCT before the ORDER BY.
>
> Regards,
> Rohini
>
> On Mon, Jun 8, 2015 at 1:29 PM, Michael Howard <[email protected]>
> wrote:
>
> > Rohini wrote:
> > > But the current distributed order by uses range
> > > partitioning and same keys can go to different reducers.
> >
> > I am puzzled by this statement.
> > I don't understand how partitioning would work if the same keys go to
> > different reducers.
> >
> > (I suppose I could understand if it were 'two' different reducers, because
> > a specific key may be the max of one range and the min of the next. But
> > even in that case I can't understand why it would be desirable to make
> > both ends of the ranges 'inclusive'. I think I'm heading down the wrong
> > path ... ?)
> >
> > To help educate us newbies, please provide a more detailed explanation ...
> >
> > Q: Is there a URL which could explain "range partitioning ... same keys go
> > to different reducers" ?
> >
> > and/or
> >
> > Q: Can you provide a brief explanation of "range partitioning ... same keys
> > go to different reducers" ?
> >
> >
> > Michael
> >
> >
> > On Mon, Jun 8, 2015 at 4:08 PM, Rohini Palaniswamy <
> > [email protected]>
> > wrote:
> >
> > > If order by and distinct have the same key, it is possible to combine them
> > > into one mapreduce job.
> > > But the current distributed order by uses range
> > > partitioning and same keys can go to different reducers. Tagging along
> > > distinct to that will require more work and not something we are planning
> > > to do sometime soon.
> > >
> > > On Wed, Jun 3, 2015 at 11:14 AM, Mehmet Tepedelenlioglu <
> > > [email protected]> wrote:
> > >
> > > > Order and distinct are 2 very different operations. You order by
> > > > something, but you take the distinct over all the fields of a relation,
> > > > which is to say that the key/value structure is quite different for the
> > > > general case.
> > > >
> > > >
> > > > > On Jun 3, 2015, at 11:02 AM, <[email protected]> <
> > > > > [email protected]> wrote:
> > > > >
> > > > > Dear Pig users,
> > > > > Can Pig combine sorting and unique-ing into a single job? Doing this
> > > > > --define Components, then
> > > > > Sorted_0 = order Components by block_id parallel $par;
> > > > > Sorted = DISTINCT Sorted_0;
> > > > >
> > > > > causes one more MR job to be launched than simply doing this:
> > > > > --define Components, then
> > > > > Sorted = order Components by block_id parallel $par;
> > > > >
> > > > > It would seem there should be some way to do the distinct in the same
> > > > > pass as the sort, like 'sort -u'. But I can't see how. Any tips would be
> > > > > much appreciated!
> > > > >
> > > > > Thanks,
> > > > > Will
> > > > >
> > > > > William F Dowling
> > > > > Senior Technologist
> > > > > Thomson Reuters
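
For reference, a minimal Pig Latin sketch of Rohini's advice in the thread above (DISTINCT first, then ORDER BY). It assumes Components is defined as in Will's script and that $par is supplied as a parameter; the alias Deduped is hypothetical and introduced here only for illustration. Per the thread, this does not collapse the work into a single MapReduce job, because the distinct operates on the whole record while the sort is keyed on block_id. The reordering only ensures that the final relation is the sorted one.

```pig
-- Components is assumed to be defined earlier, as in the original script.
-- Deduped is a hypothetical alias used only for this sketch.
Deduped = DISTINCT Components PARALLEL $par;

-- Sort last so that the final relation is ordered by block_id.
Sorted = ORDER Deduped BY block_id PARALLEL $par;
```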
