Hi,

See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy#LanguageManualSortBy-DifferencebetweenSortByandOrderBy

Sort By gives you only partially sorted results if you have more than one
reducer: the output of each reducer is sorted, but the combined output is not.
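
To make the difference concrete, here is a rough Python sketch (a toy model, not Hive code). The function names are made up for illustration, and dealing rows to reducers round-robin is my simplifying assumption; Hive does not promise any particular distribution. The last function models what I believe happens when LIMIT follows Sort By (each reducer keeps its own top n and a single final step merges them), which would match the two jobs you see, but treat that as an assumption about the plan rather than something confirmed in this thread.

```python
def sort_by_desc(rows, num_reducers):
    """Toy model of SORT BY ... DESC: each reducer sorts only its own rows."""
    # Assumption: rows are dealt round-robin; real Hive does not promise this.
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket, reverse=True) for bucket in buckets]


def order_by_desc(rows):
    """Toy model of ORDER BY ... DESC: one reducer sees and sorts every row."""
    return sorted(rows, reverse=True)


def limit_after_sort_by(reducer_outputs, n):
    """Each reducer keeps its own top n, then a single final step merges them."""
    candidates = [row for output in reducer_outputs for row in output[:n]]
    return sorted(candidates, reverse=True)[:n]


ranks = [5, 1, 9, 3, 7, 2, 8, 6]
per_reducer = sort_by_desc(ranks, num_reducers=2)
concatenated = [row for output in per_reducer for row in output]

print(per_reducer)                          # each inner list is sorted
print(concatenated == order_by_desc(ranks)) # is the combined output sorted?
print(limit_after_sort_by(per_reducer, 3))  # top 3 after the final merge
```

In this model the concatenated reducer output is not globally ordered, yet the top-n result after the final merge still agrees with a full sort, because every global top-n row must also be in the top n of whichever reducer received it.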

Ruslan

On Mon, Sep 3, 2012 at 1:38 AM, Binesh Gummadi <binesh.gumm...@gmail.com> wrote:
> Thanks for your quick reply. Rank is a column which has integer data. I am
> writing to a DynamoDB database tho. Not sure why only a single reducer is used
> tho. I will check the SQL with the explain command again and will report my
> findings. I will check your implementation too.
>
> ________________________________
> Binesh Gummadi
>
> On Sun, Sep 2, 2012 at 4:01 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>>
>> Sort by does not have the single reducer restriction. Not sure which rank
>> you are using but any one should allow you to sort and rank if the query is
>> written correctly. Our implementation on my github.com/edwardcapriolo allows
>> this.
>>
>> On Sunday, September 2, 2012, Binesh Gummadi <binesh.gumm...@gmail.com>
>> wrote:
>> > I am trying to insert data into a table after selecting and sorting by a
>> > column. What I really want is to order by a column and select the top
>> > million rows. I am using Amazon EMR hive cloud to process data.
>> > Here is my query:
>> > INSERT INTO TABLE ddb_table SELECT * FROM data_dump sort by rank desc
>> > LIMIT 1000000;
>> > It creates two jobs. The first job runs rather quickly and the second job's
>> > reducer runs forever, as it is running with a single reducer. Here is my
>> > question on
>> > stackoverflow (http://stackoverflow.com/questions/12233343/why-is-sort-by-always-using-single-reducer).
>> > According to the docs, the "order by" clause has a limitation of 1 reducer.
>> > Does sort by have the same limitation? Are there any other ways of solving
>> > the above requirement?
>> > ________________________________
>> > Binesh Gummadi
>> >

--
Best Regards,
Ruslan Al-Fakikh