Hi Cam,

I couldn't find a function that achieved precisely what you were looking
for, but there is a function that gets pretty close to what you want.

select id, collect_set(date_hour), collect_set(count), sum(count) from test
group by id;

The problem with using collect_set is that it removes duplicates. But if
you're not too shy about modifying the code a little, you can very easily
change ql.udf.generic.GenericUDAFCollectSet to use an ArrayList instead of
a HashSet. That allows dupes and gets you the results you're looking for:

Ended Job = job_201101281644_0007
OK
1 [2011310115,2011310116] [3,1] 4
2 [2011310117,2011310118] [1,1] 2
Time taken: 22.872 seconds
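To see why the one-line swap works, here's a stand-alone sketch (plain Java, not Hive's actual UDAF code; the class and method names are made up for illustration). The aggregation buffer in collect_set is a HashSet, which silently drops repeated values; backing it with an ArrayList keeps every value in arrival order:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CollectDemo {
    // Mimics collect_set's buffer: a HashSet drops duplicates.
    static Set<Integer> collectAsSet(int[] values) {
        Set<Integer> out = new HashSet<>();
        for (int v : values) out.add(v);
        return out;
    }

    // Mimics the proposed variant: an ArrayList keeps duplicates.
    static List<Integer> collectAsList(int[] values) {
        List<Integer> out = new ArrayList<>();
        for (int v : values) out.add(v);
        return out;
    }

    public static void main(String[] args) {
        // Counts for one group, including a duplicate value.
        int[] counts = {3, 1, 1};
        System.out.println("set:  " + collectAsSet(counts));  // the duplicate 1 is dropped
        System.out.println("list: " + collectAsList(counts)); // all three values kept
    }
}
```

In the real UDAF the same idea applies inside the aggregation buffer's add/merge methods; everything else stays the same.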

-Jon

On Mon, Jan 31, 2011 at 1:13 PM, Cam Bazz <camb...@gmail.com> wrote:

> Hello,
>
> After doing some aggregate counting, I now have data in a table like this:
>
> id     count    date_hour (this is a partition name)
> 1       3       2011310115
> 1       1       2011310116
> 2       1       2011310117
> 2       1       2011310118
>
>
> and I need to turn this into:
>
> 1 [2011310115,2011310116] [3,1] 4
> 2 [2011310117,2011310118] [1,1] 2
>
> explanation: the first field is id, the second field is a list of
> date_hours (the partition from the previous table), the third field is a
> list of counts, and the fourth field is the sum of counts.
>
> given that I used a date_hour key for partition, how can I do this, or
> accomplish similar?
>
> currently I process data hourly, but I might need another aggregate to
> find daily results, i.e., iterate over multiple partitions (maybe a
> month's data) and generate the statistics daily instead of hourly.
>
> Any ideas / recommendation greatly appreciated.
>
> Best Regards,
>
> -C.B.
>