Thanks for your quick reply. Rank is a column that holds integer data. I am writing to a DynamoDB database, though, and I am not sure why only a single reducer is used. I will check the query with the EXPLAIN command again and report my findings. I will check your implementation too.
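The check I have in mind is roughly the following. The query is the one from my original message; the comments about what to look for are only my assumption about how the plan will be laid out, not something I have confirmed yet.

```sql
-- Print the execution plan instead of running the insert.
-- Assumption: the plan lists one map-reduce stage per job, so the
-- stage that applies the final LIMIT should show up separately from
-- the stage that does the per-reducer sort.
EXPLAIN
INSERT INTO TABLE ddb_table
SELECT * FROM data_dump
SORT BY rank DESC
LIMIT 1000000;
```

If the second stage is the one holding the final LIMIT, that would match the two jobs I am seeing, with the single reducer belonging to the limit step rather than to the sort itself.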
------------------------------
Binesh Gummadi

On Sun, Sep 2, 2012 at 4:01 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Sort by does not have the single reduce restriction. Not sure which rank
> you are using but any one should allow you to sort and rank if the query is
> written correctly. Our implementation on my github.com/edwardcapriolo allows
> this.
>
> On Sunday, September 2, 2012, Binesh Gummadi <binesh.gumm...@gmail.com>
> wrote:
> > I am trying to insert data into a table after selecting and sorting by a
> > column. What I really want is to order by a column and select the top
> > million rows. I am using Amazon EMR Hive to process the data.
> >
> > Here is my query:
> >
> > INSERT INTO TABLE ddb_table SELECT * FROM data_dump sort by rank desc
> > LIMIT 1000000;
> >
> > It creates two jobs. The first job runs rather quickly, and the second
> > job's reducer runs forever, as it is running with a single reducer. Here
> > is my question on Stack Overflow:
> > http://stackoverflow.com/questions/12233343/why-is-sort-by-always-using-single-reducer
> >
> > According to the docs, the "order by" clause has a limitation of 1
> > reducer. Does "sort by" have the same limitation? Are there any other
> > ways of solving the above requirement?
> >
> > ________________________________
> > Binesh Gummadi