Yes, your query makes sense and should already work as expected. The idea of HIVE-1994 is that once the new annotation is available, we'll make a guarantee that your query as written below will continue to work in the face of any new optimizer changes (with the downside being that in some cases you won't be able to take advantage of such optimizer changes).
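
To make the state concrete, here is a rough sketch of the bookkeeping mavg(product, price, 10) would need. It is plain Java with no Hive classes, and the class name, the evaluate signature, and the use of a deque for the window are all illustrative assumptions rather than anything defined in HIVE-1994. It also assumes rows arrive grouped by key and sorted within each key, which is what the DISTRIBUTE BY / SORT BY subquery provides.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: per-instance state for a moving average over rows
// that arrive grouped by key and sorted within each key.
public class MovingAverageSketch {
    private final Deque<Double> window = new ArrayDeque<>(); // last N prices
    private String currentKey = null;                        // key of the current group
    private double sum = 0.0;                                // running sum of the window

    public double evaluate(String key, double price, int windowSize) {
        // A change of key means a new product grouping has started: reset state.
        if (currentKey == null || !currentKey.equals(key)) {
            currentKey = key;
            window.clear();
            sum = 0.0;
        }
        window.addLast(price);
        sum += price;
        // Drop the oldest price once the window exceeds the requested size.
        if (window.size() > windowSize) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }
}
```

With a window of 2, feeding ("a", 10), ("a", 20), ("a", 30) gives 10, 15, 25, and a following ("b", 4) resets the state and gives 4. In an actual UDF the same logic would sit inside the evaluate method, and it only produces correct results if the input ordering assumption holds.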
Each mapper or reducer gets its own instance of the UDF, so (a) you don't have to worry about any unwanted sharing between them and (b) you have to make sure that your DISTRIBUTE/SORT clauses are present and correct (Hive won't know anything about the dependency).

Long term, an implementation of the SQL/OLAP frameworks would be preferable since it would allow Hive to fully understand the semantics and apply all relevant validations and optimizations transparently, but in the meantime, stateful UDF's will be the duct tape.

JVS

On Feb 22, 2011, at 11:55 AM, Igor Tatarinov wrote:

> Thank you, John.
>
> It's not quite clear from the page whether my solution:
> 1. makes sense
> 2. works now
> 3. will work in the future if the issue is resolved/implemented
>
> Could you elaborate?
>
> Also, there is no mentioning of UDF object sharing (between mappers) in the
> current implementation. Is this a problem? do I need to use ThreadLocal or
> something like that?
>
> On Tue, Feb 22, 2011 at 11:42 AM, John Sichi <jsi...@fb.com> wrote:
> Please see the discussion in this JIRA issue:
>
> https://issues.apache.org/jira/browse/HIVE-1994
>
> JVS
>
> On Feb 21, 2011, at 10:45 PM, Igor Tatarinov wrote:
>
> > I would like to implement the moving average as a UDF (instead of a
> > streaming reducer). Here is what I am thinking. Please let me know if I am
> > missing something here:
> >
> > SELECT product, date, mavg(product, price, 10)
> > FROM (
> >   SELECT *
> >   FROM prices
> >   DISTRIBUTE BY product
> >   SORT BY product, date
> > )
> >
> > I have to pass the key to mavg() because it has to detect when one product
> > grouping ends and another starts.
> >
> > Unfortunately, mavg will also need to maintain a state (moving sum and
> > count). That's where I am worried that Hive (Hadoop?) will use a single
> > instance of my UDF to process concurrent groupings and this idea won't work.
> >
> > Is that the main issue? Is there something I can do to fix that?
> >
> > Thanks!
> > igor