Hi Cam,

I couldn't find a function that does precisely what you're looking for, but there is one that gets pretty close:
select id, collect_set(date_hour), collect_set(count), sum(count)
from test
group by id;

The problem with collect_set is that it removes duplicates, but if you're not too shy to modify the code a little, you can very easily change ql.udf.generic.GenericUDAFCollectSet to use an ArrayList instead of a HashSet and allow dupes, which will get you the results you're looking for:

Ended Job = job_201101281644_0007
OK
1    [2011310115,2011310116]    [3,1]    4
2    [2011310117,2011310118]    [1,1]    2
Time taken: 22.872 seconds

-Jon

On Mon, Jan 31, 2011 at 1:13 PM, Cam Bazz <camb...@gmail.com> wrote:
> Hello,
>
> After doing some aggregate counting, I now have data in a table like this:
>
> id  count  date_hour (this is a partition name)
> 1   3      2011310115
> 1   1      2011310116
> 2   1      2011310117
> 2   1      2011310118
>
> and I need to turn this into:
>
> 1  [2011310115,2011310116]  [3,1]  4
> 2  [2011310117,2011310118]  [1,1]  2
>
> Explanation: the first field is the id, the second field is a list of
> date_hours (a partition from the previous table), the third field is a
> list of counts, and the fourth field is the sum of the counts.
>
> Given that I used a date_hour key for the partition, how can I do this,
> or accomplish something similar?
>
> Currently I process data hourly, but I might also need another aggregate
> to find daily results, i.e. iterate over multiple partitions, maybe for
> a month's data, and generate the statistics daily instead of hourly.
>
> Any ideas / recommendations greatly appreciated.
>
> Best regards,
>
> -C.B.
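P.S. If it helps to see the set-vs-list difference outside Hive, here is a minimal Python sketch of the semantics using Cam's sample rows. This is only an illustration, not Hive code; the names `aggregate` and `dedupe_counts` are made up for this example. With set semantics the duplicate count of 1 for id=2 collapses to a single entry, while list semantics (what the ArrayList patch gives you) keeps both.

```python
# Illustrative simulation only -- not Hive's actual UDAF code.
# Simulates collect_set (deduplicating, HashSet-like) versus a
# duplicate-preserving collect (ArrayList-like) over the sample rows.
from collections import defaultdict

rows = [  # (id, count, date_hour)
    (1, 3, "2011310115"),
    (1, 1, "2011310116"),
    (2, 1, "2011310117"),
    (2, 1, "2011310118"),
]

def aggregate(rows, dedupe_counts):
    """Group by id; collect date_hours and counts, and sum the counts."""
    groups = defaultdict(lambda: ([], [], 0))
    for id_, count, date_hour in rows:
        hours, counts, total = groups[id_]
        hours.append(date_hour)
        # collect_set drops a count already seen for this id;
        # the list-based variant keeps every occurrence.
        if not dedupe_counts or count not in counts:
            counts.append(count)
        groups[id_] = (hours, counts, total + count)
    return dict(groups)

print(aggregate(rows, dedupe_counts=True))   # collect_set behavior
print(aggregate(rows, dedupe_counts=False))  # list-based behavior
```

Note that sum(count) is unaffected either way; only the collected list of counts differs.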