On Mon, Nov 19, 2012 at 4:02 PM, qiaoresearcher
<qiaoresearc...@gmail.com> wrote:
> The table format is something like:
>
> user_id    visiting_time      visiting_web_page
> user1          time11             page_string_11
> user1          time12             page_string_12
> user1          time13             page_string_13          <-- latest time
> stamp
> user1          time14             page_string_14          <-- also latest
> time stamp, same as the one above
> user1          time15             page_string_15
>   ...                ...                       .....
> user2          time21             page_string_21
> user2          time22             page_string_22
> user2          time23             page_string_23
> user2          time24             page_string_24
> user2          time25             page_string_25
>    ....              .....                       ....
>
> Assume time13 and time14 are the latest time stamp for user1, time22 and
> time25 are the latest time stamp for user2,
> how to obtain output like:
> user1        [page_string_13, page_string_14]
> user2        [page_string_22, page_string_25]
>
> ps: run {   select user, max(visiting_time) group by user  }    WILL return
> result like:
> user1, time13   (same as time 14)
> user2, time22   (same as time 25)
>
> many thanks in advance!
>

Part of the solution is to use hive collect_set UDF. This collects
group into a hive array type.

select userid,visitingtime, collect_Set(visiting_web_page) from table
group by userid,visitingtime;

Collect_set de-duplicates. If you do not want this use the collect_udf I wrote.

https://github.com/edwardcapriolo/hive-collect

If you want only the newest one you can use rank.

https://github.com/edwardcapriolo/hive-rank

Reply via email to