Re: Hive Row number Use case

Brad Kaneyama Wed, 18 Jul 2018 01:13:00 -0700

As you noticed your partition by clause is improper since your row_number()
counts all of the members that are the same -- instead of sequencing a run
of the same values.


Here's a POC using a simplified version of your dataset:
You can adjust the grouping windows and change detection accordingly.

CREATE TABLE foo
(
  dt string,
  flag int
);

insert into foo
values
('2017-07-12 22:59:17',1),
('2017-07-12 23:02:14',0),
('2017-07-12 23:07:40',0),
('2017-07-12 23:12:41',0),
('2017-07-12 23:22:06',0),
('2017-07-12 23:38:35',0),
('2017-07-12 23:44:19',0),
('2017-07-12 23:47:49',1),
('2017-07-12 23:48:49',1),
('2017-07-12 23:53:31',0),
('2017-07-12 23:57:01',1),
('2017-07-13 00:03:10',1),
('2017-07-13 00:06:35',0),
('2017-07-13 00:07:29',1),
('2017-07-12 12:59:17',1),
('2017-07-12 13:02:14',0),
('2017-07-12 13:07:40',0),
('2017-07-12 13:12:41',0);

with cte
AS
(
  select
    dt,
    flag,
    lag(flag) over(order by dt) AS prev_flag,
    (case
      when flag <> lag(flag,1) over(order by dt) then 1
      else 0
    end) as changed_indicator
  from
    foo as x
),
cte2
AS
(
  select
    *,
    sum(changed_indicator) over(order by dt) AS grouping_key
  from
    cte
)
select
  *,
  row_number() over(partition by grouping_key order by dt) as seq
from
  cte2
order by
  dt;


+---------------------+------+-----------+-------------------+--------------+-----+
|         dt          | flag | prev_flag | changed_indicator | grouping_key
| seq |
+---------------------+------+-----------+-------------------+--------------+-----+
| 2017-07-12 12:59:17 |    1 | NULL      |                 0 |            0
|   1 |
| 2017-07-12 13:02:14 |    0 | 1         |                 1 |            1
|   1 |
| 2017-07-12 13:07:40 |    0 | 0         |                 0 |            1
|   2 |
| 2017-07-12 13:12:41 |    0 | 0         |                 0 |            1
|   3 |
| 2017-07-12 22:59:17 |    1 | 0         |                 1 |            2
|   1 |
| 2017-07-12 23:02:14 |    0 | 1         |                 1 |            3
|   1 |
| 2017-07-12 23:07:40 |    0 | 0         |                 0 |            3
|   2 |
| 2017-07-12 23:12:41 |    0 | 0         |                 0 |            3
|   3 |
| 2017-07-12 23:22:06 |    0 | 0         |                 0 |            3
|   4 |
| 2017-07-12 23:38:35 |    0 | 0         |                 0 |            3
|   5 |
| 2017-07-12 23:44:19 |    0 | 0         |                 0 |            3
|   6 |
| 2017-07-12 23:47:49 |    1 | 0         |                 1 |            4
|   1 |
| 2017-07-12 23:48:49 |    1 | 1         |                 0 |            4
|   2 |
| 2017-07-12 23:53:31 |    0 | 1         |                 1 |            5
|   1 |
| 2017-07-12 23:57:01 |    1 | 0         |                 1 |            6
|   1 |
| 2017-07-13 00:03:10 |    1 | 1         |                 0 |            6
|   2 |
| 2017-07-13 00:06:35 |    0 | 1         |                 1 |            7
|   1 |
| 2017-07-13 00:07:29 |    1 | 0         |                 1 |            8
|   1 |
+---------------------+------+-----------+-------------------+--------------+-----+


On Tue, Jul 17, 2018 at 11:08 PM Anup Tiwari <anupsdtiw...@gmail.com> wrote:

> Can someone look into this and revert if possible?
>
>
> Regards,
> Anup Tiwari
>
> On Sat, Jul 14, 2018 at 12:28 AM, Anup Tiwari <anupsdtiw...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> Can someone look into this and revert if possible?
>>
>> Thanks.
>>
>>
>> On Thu, 12 Jul 2018 12:56 Anup Tiwari, <anupsdtiw...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> We have a use case where we want to assign a row number to a table based
>>> on 3 column ( uid, update_date, flag) i.e. if value of any of the
>>> column gets changed, we want to reset this number. Please find below sample
>>> input data and expected output data.
>>>
>>> Also please note that we have tried row_number() over(partition by uid,
>>> update_date, flag order by update_time asc)  but due to this actual
>>> input ordering got break due to I believe partition by clause because it
>>> seems partition by creates group within column specified and then it start
>>> row number and due to this actual ordering is breaking. So i just wanted to
>>> know that is there any function available in hive which can give us below
>>> result OR we are missing something in window function?
>>>
>>>
>>> *Input Data :- *
>>>
>>> *uid* *update_date* *update_time* *flag*
>>> 468730 2017-07-12 12/07/2017 22:59:17 1
>>> 468730 2017-07-12 12/07/2017 23:02:14 0
>>> 468730 2017-07-12 12/07/2017 23:07:40 0
>>> 468730 2017-07-12 12/07/2017 23:12:41 0
>>> 468730 2017-07-12 12/07/2017 23:22:06 0
>>> 468730 2017-07-12 12/07/2017 23:38:35 0
>>> 468730 2017-07-12 12/07/2017 23:44:19 0
>>> 468730 2017-07-12 12/07/2017 23:47:49 1
>>> 468730 2017-07-12 12/07/2017 23:48:49 1
>>> 468730 2017-07-12 12/07/2017 23:53:31 0
>>> 468730 2017-07-12 12/07/2017 23:57:01 1
>>> 468730 2017-07-13 13/07/2017 00:03:10 1
>>> 468730 2017-07-13 13/07/2017 00:06:35 0
>>> 468730 2017-07-13 13/07/2017 00:07:29 1
>>> 468731 2017-07-13 12/07/2017 12:59:17 1
>>> 468731 2017-07-13 12/07/2017 13:02:14 0
>>> 468731 2017-07-13 12/07/2017 13:07:40 0
>>> 468731 2017-07-13 12/07/2017 13:12:41 0
>>>
>>>
>>> *Output Data :-*
>>>
>>> *uid* *update_date* *update_time* *flag* *required_row_num*
>>> 468730 2017-07-12 12/07/2017 22:59:17 1 1
>>> 468730 2017-07-12 12/07/2017 23:02:14 0 1
>>> 468730 2017-07-12 12/07/2017 23:07:40 0 2
>>> 468730 2017-07-12 12/07/2017 23:12:41 0 3
>>> 468730 2017-07-12 12/07/2017 23:22:06 0 4
>>> 468730 2017-07-12 12/07/2017 23:38:35 0 5
>>> 468730 2017-07-12 12/07/2017 23:44:19 0 6
>>> 468730 2017-07-12 12/07/2017 23:47:49 1 1
>>> 468730 2017-07-12 12/07/2017 23:48:49 1 2
>>> 468730 2017-07-12 12/07/2017 23:53:31 0 1
>>> 468730 2017-07-12 12/07/2017 23:57:01 1 1
>>> 468730 2017-07-13 13/07/2017 00:03:10 1 1
>>> 468730 2017-07-13 13/07/2017 00:06:35 0 1
>>> 468730 2017-07-13 13/07/2017 00:07:29 1 1
>>> 468731 2017-07-13 12/07/2017 12:59:17 1 1
>>> 468731 2017-07-13 12/07/2017 13:02:14 0 1
>>> 468731 2017-07-13 12/07/2017 13:07:40 0 2
>>> 468731 2017-07-13 12/07/2017 13:12:41 0 3
>>> *FYI :* We are one Hive 2.3.1.
>>>
>>
>

Re: Hive Row number Use case

Reply via email to