> A lot of our queries do the following style of simultaneous windowing ..

The windowing is not simultaneous unless they are all over the same window
- the following query has 3 different windows applied over the same rows
sequentially.

> SELECT
>    row_number() OVER( PARTITION BY app, user,
> type ORDER BY ts )as a_number,
>    row_number() OVER( PARTITION BY day, app, user,
> type ORDER BY ts )as type_rank,

> Since each OVER / PARTITION-By clause is independent they can the put
>into parallelized Reducer phases.

They are all over the same rows so they're done in sequence, so the final
row-set contains all values, causing multiple shuffles of the same rows.

> Is this something that Tez tries to do at all or an optimization that I
>can use to my benefit ?

Does your query only have row_numbers or does it have other columns in
them?

Cheers,
Gopal


Reply via email to