Hi All,

Greatly appreciate any feedback on this. Maybe this sounds infeasible; I just wanted to check with the experts. In any case, the problem of incremental data processing is a very interesting one if it can be accommodated.
Best Regards
Buddhika

On Wed, Jan 16, 2013 at 12:36 PM, buddhika chamith <chamibuddh...@gmail.com> wrote:

> Hi All,
>
> After digging into the code more I realized that GroupByOperator can be
> present on the map side of the computation as well, in which case it is
> doing partial aggregations. In that case the terminate() of the UDAF gets
> called on partial results. However, for the queries that I tried, the
> terminate() methods inside the UDAFs in the GroupByOperator on the reduce
> side of the operator tree finish with fully aggregated results, as
> expected. Can this behaviour be expected for any query? (i.e. the reduce
> side computing the fully aggregated result for any aggregation function)
>
> The problem I am having is that I need a point where previous aggregation
> results can be merged with the results of the current run. But since
> terminate() can behave a bit differently depending on whether it runs on
> the map side or the reduce side, would it make sense to selectively add
> this logic on the reduce side based on some configuration property? (I see
> the property mapred.task.is.map could be of potential use here.)
>
> Also, there needs to be some identifier to uniquely identify the
> aggregation UDAF in the operator tree, so that the previous aggregations
> can be fetched from the result cache using that identifier. Is there a way
> an aggregation function can be uniquely identified within the query?
>
> I realize this might be a long shot, but I am still up for it if it is
> feasible, albeit with some work. Any other possible ways to achieve this
> are highly appreciated.
>
> Regards
> Buddhika
>
>
> On Mon, Jan 14, 2013 at 8:16 PM, buddhika chamith <chamibuddh...@gmail.com> wrote:
>
>> Any suggestions on this are greatly appreciated. Does anyone see major
>> roadblocks with this?
>>
>> Regards
>> Buddhika
>>
>>
>> On Sat, Jan 12, 2013 at 10:31 AM, buddhika chamith <chamibuddh...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> In order to achieve the above, I am researching the feasibility of
>>> using a set of custom UDAFs for distributive aggregate operations
>>> (e.g. sum, count, etc.). The idea is to incorporate state persisted
>>> from earlier aggregations into the current aggregation value inside
>>> the merge of the UDAF. For distributing the state data I was thinking
>>> of utilizing the Hadoop distributed cache, but I am not sure how
>>> exactly UDAFs are executed at runtime. Would including the logic to
>>> add the persisted state to the current result in terminate() ensure
>>> that it is added only once? (Assuming all the aggregations fan in at
>>> terminate(); I may have gotten it all wrong here. :)) Or is there a
>>> better way of achieving the same?
>>>
>>> Regards
>>> Buddhika
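
P.S. To make the above a bit more concrete, below is a very rough sketch (against
Hive's GenericUDAFEvaluator API) of the kind of incremental SUM evaluator I have in
mind, a minimal sketch rather than a working implementation. Instead of reading
mapred.task.is.map, it keys off the Mode passed to init(), since terminate() only
sees the fully aggregated value in FINAL (reduce side) or COMPLETE (map-only) mode.
The PreviousResultCache helper and the "sum:my_column" identifier are purely
hypothetical placeholders for the persisted-state lookup discussed above; the
resolver class and null/overflow handling are omitted.

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.io.LongWritable;

// Sketch of an "incremental" SUM evaluator: normal sum semantics, plus a
// one-time merge of the previous run's persisted result in terminate().
public class IncrementalSumEvaluator extends GenericUDAFEvaluator {

  // Hypothetical stand-in for the result cache discussed above; a real version
  // might read a value persisted by the previous run (e.g. shipped through the
  // Hadoop distributed cache), keyed by some per-aggregation identifier.
  static final class PreviousResultCache {
    static long lookup(String aggregationId) {
      return 0L; // placeholder
    }
  }

  private PrimitiveObjectInspector inputOI;
  private boolean producesFinalResult; // true only in FINAL / COMPLETE mode

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
    super.init(m, parameters);
    inputOI = (PrimitiveObjectInspector) parameters[0];
    // terminate() sees the fully aggregated value only in FINAL (reduce side)
    // or COMPLETE (map-only) mode; in PARTIAL1/PARTIAL2 Hive calls
    // terminatePartial() instead. This plays the same role as checking
    // mapred.task.is.map, but without touching the job configuration.
    producesFinalResult = (m == Mode.FINAL || m == Mode.COMPLETE);
    return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
  }

  static class SumBuffer implements AggregationBuffer {
    long sum;
  }

  @Override
  public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    SumBuffer buf = new SumBuffer();
    reset(buf);
    return buf;
  }

  @Override
  public void reset(AggregationBuffer agg) throws HiveException {
    ((SumBuffer) agg).sum = 0L;
  }

  @Override
  public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
    if (parameters[0] != null) {
      ((SumBuffer) agg).sum += PrimitiveObjectInspectorUtils.getLong(parameters[0], inputOI);
    }
  }

  @Override
  public Object terminatePartial(AggregationBuffer agg) throws HiveException {
    // Partial result handed from the map side to the reduce side.
    return new LongWritable(((SumBuffer) agg).sum);
  }

  @Override
  public void merge(AggregationBuffer agg, Object partial) throws HiveException {
    if (partial != null) {
      ((SumBuffer) agg).sum += PrimitiveObjectInspectorUtils.getLong(partial, inputOI);
    }
  }

  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    long result = ((SumBuffer) agg).sum;
    if (producesFinalResult) {
      // Fold in the previous run's result exactly once, on the final value.
      // "sum:my_column" is a made-up identifier; how to derive a stable one
      // from the operator tree is exactly the open question above.
      result += PreviousResultCache.lookup("sum:my_column");
    }
    return new LongWritable(result);
  }
}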