Thanks for your response.

My thinking was that by turning off hive.map.aggr, Hive would do the following:
col3 becomes the key in the map phase, and all rows with the same col3 go to
the same reducer. In the reducer the values (= col1, col2) are sorted by key
(= col3), and myUdf iterates over the values, with terminate() and init()
being called whenever the key changes.

If this is how it is implemented, then in what situation would merge() and
terminatePartial() be called? I have run this on 200+ million rows of data
(but with all groups less than 10k in size) so far without any calls to
merge() or terminatePartial().

Best, Koert
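
P.S. To make the assumption above concrete, here is a toy Python model of the flow I have in mind. It is only a sketch of my mental picture, not Hive's actual implementation: the class CountingUdaf, the function run_without_map_aggr, and the hash-based routing are all made up for illustration. Every row is routed to a reducer by col3, each reducer sorts by key, and a fresh evaluator is initialized and terminated at each key change.

```python
from collections import Counter
from itertools import groupby


class CountingUdaf:
    """Hypothetical stand-in for a UDAF; counts which lifecycle methods run."""

    calls = Counter()

    def init(self):
        CountingUdaf.calls["init"] += 1
        self.values = []

    def iterate(self, col1, col2):
        CountingUdaf.calls["iterate"] += 1
        self.values.append((col1, col2))

    def terminate(self):
        CountingUdaf.calls["terminate"] += 1
        return len(self.values)  # toy result: number of rows in the group

    def terminate_partial(self):
        CountingUdaf.calls["terminatePartial"] += 1
        return list(self.values)

    def merge(self, partial):
        CountingUdaf.calls["merge"] += 1
        self.values.extend(partial)


def run_without_map_aggr(rows, num_reducers=3):
    """Model of the assumed plan: no map-side aggregation, one reducer per key."""
    # Map phase: emit col3 as the key and route each row by the key's hash,
    # so all rows with the same col3 land in the same partition.
    partitions = [[] for _ in range(num_reducers)]
    for col1, col2, col3 in rows:
        partitions[hash(col3) % num_reducers].append((col3, (col1, col2)))

    results = {}
    for part in partitions:
        # Reduce phase: sort by key, then start a new evaluator at each key change.
        part.sort(key=lambda kv: kv[0])
        for key, group in groupby(part, key=lambda kv: kv[0]):
            udaf = CountingUdaf()
            udaf.init()
            for _, (col1, col2) in group:
                udaf.iterate(col1, col2)
            results[key] = udaf.terminate()
    return results
```

In this model merge() and terminatePartial() are never reached, because no group is ever split across evaluators. Whether Hive guarantees the same thing is exactly what I am asking.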

On Sun, Sep 4, 2011 at 11:10 PM, Huan Li <yrlih...@gmail.com> wrote:

> Setting hive.map.aggr to false will reduce the chance of terminatePartial()
> and merge() being called, though I don't think it will eliminate the
> possibility. If your data is large, it's still possible that a group of data
> is processed by multiple reducers and those two methods are needed.
>
> If you need to process the records in each group in a single method, you can
> first use collect_set to collect your group data and process them in a UDF.
>
>
> 2011/9/4 Koert Kuipers <ko...@tresata.com>
>
>> Hey, my question wasn't very clear. I have a UDAF that I apply per group.
>> The UDAF does not support terminatePartial() and merge(). So to do this I
>> run:
>>
>> set hive.map.aggr=false;
>> select myUdf(col1, col2) from table group by col3;
>>
>> Now this seems to work. But are my assumptions correct that this will
>> never call terminatePartial() or merge()?
>> Thanks, Koert
>>
>>
>> On Thu, Sep 1, 2011 at 11:59 PM, Huan Li <yrlih...@gmail.com> wrote:
>>
>>> Koert, not sure what you mean by "results can be merged between groups".
>>> A UDAF should be used to aggregate records by group. Why would you need
>>> to merge between groups?
>>>
>>> Can you give some examples of what kind of query you'd like to run?
>>>
>>>
>>> 2011/8/30 Koert Kuipers <ko...@tresata.com>
>>>
>>>> If I run my own UDAF with group by, can I be sure that a single UDAF
>>>> instance initialized once will process all members in a group? Or should I
>>>> code so as to take into account the situation where even within a group
>>>> multiple UDAFs could run, and I would have to deal with terminatePartial()
>>>> and merge() even within a group?
>>>> My problem is that my results within a group are not easily merged, but
>>>> between groups they are.