I suffer from "sql-itis", especially the SQL:2003 variant. I can barely get
a data-related programming line out of my fingers that doesn't include a
"partition by" or an "over order by".

Many of the data sets I analyze can be simplified for comprehension by
thinking of them as having only three columns: a descriptive columns
(ENTITY_NAME), a date column (20101611), and a numeric or double value
column (456.789). A time series. Much of what I do is partition the
descriptive column, order by the date column, and apply a moving aggregate
(like average) over a sliding window of the value column.

Example:

select entity_name, value_date, avg(value) over (partition by entity_name
order by value_date rows between 20 preceding and current row) as
twenty_period_moving_average
from table_with_a_billion_rows

I am aware that HQL does not currently include this type of SQL analytic
function. What are my options, if any, to accomplish the above with Hive?
I'm looking at UDAF, but I haven't yet grokked.

Thanks,

Michael

Michael Roessler, CFA
michael.roess...@keyevent.com

Reply via email to