I would like to implement the moving average as a UDF (instead of a
streaming reducer). Here is what I am thinking. Please let me know if I am
missing something here:

SELECT product, date, mavg(product, price, 10)
FROM (
  SELECT *
  FROM prices
  DISTRIBUTE BY product
  SORT BY product, date
)

I have to pass the key to mavg() because it has to detect when one product
grouping ends and another starts.

Unfortunately, mavg will also need to maintain a state (moving sum and
count). That's where I am worried that Hive (Hadoop?) will use a single
instance of my UDF to process concurrent groupings and this idea won't
work.

Is that the main issue? Is there something I can do to fix that?

Thanks!
igor

Reply via email to