On Mon, Nov 19, 2012 at 4:02 PM, qiaoresearcher <qiaoresearc...@gmail.com> wrote: > The table format is something like: > > user_id visiting_time visiting_web_page > user1 time11 page_string_11 > user1 time12 page_string_12 > user1 time13 page_string_13 <-- latest time > stamp > user1 time14 page_string_14 <-- also latest > time stamp, same as the one above > user1 time15 page_string_15 > ... ... ..... > user2 time21 page_string_21 > user2 time22 page_string_22 > user2 time23 page_string_23 > user2 time24 page_string_24 > user2 time25 page_string_25 > .... ..... .... > > Assume time13 and time14 are the latest time stamp for user1, time22 and > time25 are the latest time stamp for user2, > how to obtain output like: > user1 [page_string_13, page_string_14] > user2 [page_string_22, page_string_25] > > ps: run { select user, max(visiting_time) group by user } WILL return > result like: > user1, time13 (same as time 14) > user2, time22 (same as time 25) > > many thanks in advance! >
Part of the solution is to use hive collect_set UDF. This collects group into a hive array type. select userid,visitingtime, collect_Set(visiting_web_page) from table group by userid,visitingtime; Collect_set de-duplicates. If you do not want this use the collect_udf I wrote. https://github.com/edwardcapriolo/hive-collect If you want only the newest one you can use rank. https://github.com/edwardcapriolo/hive-rank