Re: implementing moving average as a UDF

John Sichi Tue, 22 Feb 2011 14:59:26 -0800

Yes, your query makes sense and should already work as expected.  The idea of 
HIVE-1994 is that once the new annotation is available, we'll make a guarantee 
that your query as written below will continue to work in the face of any new 
optimizer changes (with the downside being that in some cases you won't be able 
to take advantage of such optimizer changes).


Each mapper or reducer gets its own instance of the UDF, so (a) you don't have 
to worry about any unwanted sharing between them and (b) you have to make sure 
that your DISTRIBUTE/SORT clauses are present and correct (Hive won't know 
anything about the dependency).
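
To make the per-instance state concrete, here is a minimal sketch in plain 
Java of the bookkeeping such a UDF might do. The class name 
MovingAverageSketch and the evaluate(key, value, windowSize) signature are 
hypothetical, chosen to mirror the mavg(product, price, 10) call in the query. 
The class deliberately does not extend Hive's UDF base class so that it 
compiles on its own; wiring it into Hive is left out. It assumes rows arrive 
grouped by key and in sorted order, which is exactly what the 
DISTRIBUTE/SORT clauses have to guarantee.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Objects;

// Hypothetical sketch: the state a moving-average UDF would keep per instance.
// Not a Hive class; a real UDF would wrap this logic in Hive's UDF API.
final class MovingAverageSketch {
    private boolean started = false;   // false until the first row is seen
    private String currentKey;         // key of the group being processed
    private final Deque<Double> window = new ArrayDeque<>(); // last N values
    private double sum = 0.0;          // running sum of the values in window

    // Returns the average of the last windowSize values seen for this key.
    // Assumes all rows for a key arrive consecutively and already sorted.
    double evaluate(String key, double value, int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        // A change of key means a new group has started: reset the state.
        if (!started || !Objects.equals(currentKey, key)) {
            started = true;
            currentKey = key;
            window.clear();
            sum = 0.0;
        }
        window.addLast(value);
        sum += value;
        // Drop the oldest value once the window is over capacity.
        if (window.size() > windowSize) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }
}
```

With a window of 2, feeding the rows (a, 10), (a, 20), (a, 30), (b, 4) in 
that order yields 10, 15, 25 and then 4, since the state resets when the key 
changes. If rows for the same key were interleaved with other keys, the reset 
would fire every time and the averages would be wrong, which is why the 
DISTRIBUTE/SORT clauses matter.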

Long term, an implementation of the SQL/OLAP frameworks would be preferable 
since it would allow Hive to fully understand the semantics and apply all 
relevant validations and optimizations transparently, but in the meantime, 
stateful UDFs will be the duct tape.

JVS

On Feb 22, 2011, at 11:55 AM, Igor Tatarinov wrote:

> Thank you, John.
> 
> It's not quite clear from the page whether my solution:
> 1. makes sense
> 2. works now
> 3. will work in the future if the issue is resolved/implemented
> 
> Could you elaborate?
> 
> Also, there is no mention of UDF object sharing (between mappers) in the 
> current implementation. Is this a problem? Do I need to use ThreadLocal or 
> something like that?
> 
> On Tue, Feb 22, 2011 at 11:42 AM, John Sichi <jsi...@fb.com> wrote:
> Please see the discussion in this JIRA issue:
> 
> https://issues.apache.org/jira/browse/HIVE-1994
> 
> JVS
> 
> On Feb 21, 2011, at 10:45 PM, Igor Tatarinov wrote:
> 
> > I would like to implement the moving average as a UDF (instead of a 
> > streaming reducer). Here is what I am thinking. Please let me know if I am 
> > missing something here:
> >
> > SELECT product, date, mavg(product, price, 10)
> > FROM (
> >   SELECT *
> >   FROM prices
> >   DISTRIBUTE BY product
> >   SORT BY product, date
> > )
> >
> > I have to pass the key to mavg() because it has to detect when one product 
> > grouping ends and another starts.
> >
> > Unfortunately, mavg will also need to maintain a state (moving sum and 
> > count). That's where I am worried that Hive (Hadoop?) will use a single 
> > instance of my UDF to process concurrent groupings and this idea won't work.
> >
> > Is that the main issue? Is there something I can do to fix that?
> >
> > Thanks!
> > igor
> >
> 
>
