I would like to implement the moving average as a UDF (instead of a streaming reducer). Here is what I am thinking. Please let me know if I am missing something here:
SELECT product, date, mavg(product, price, 10) FROM ( SELECT * FROM prices DISTRIBUTE BY product SORT BY product, date ) I have to pass the key to mavg() because it has to detect when one product grouping ends and another starts. Unfortunately, mavg will also need to maintain a state (moving sum and count). That's where I am worried that Hive (Hadoop?) will use a single instance of my UDF to process concurrent groupings and this idea won't work. Is that the main issue? Is there something I can do to fix that? Thanks! igor