Re: Tez reducer parallelism ..

Gopal Vijayaraghavan Sat, 19 Mar 2016 04:26:26 -0700

> So you'r saying, since these windows  are part of a single SELECT
>projection they need to be serial?


Yes, with a full shuffle of the result so far for each new OVER().

>              row_number() OVER( PARTITION BY app, user, type ORDER BY ts
>) as a_number,
>              row_number() OVER( PARTITION BY day, app, user, type ORDER
>BY ts ) as type_rank,
>              row_number() OVER( PARTITION BY day, app, user   ORDER BY
>ts ) as  dau_rank,

You can write your own stateful windowing UDAF which does this over the
a subset window, provided your partition by is pretty wide.

compute_app_dau(app, day, user, type, ts) OVER(PARTITION BY app, user
ORDER BY day, type,
ts) as a_number
compute_type_dau(app, day, user, type, ts) OVER(PARTITION BY app, user
ORDER BY day, type, ts) as type_rank

etc.

It's upto your impl to remember the last row & take avantage of the
ordering to produce ranking along timestamps - hive will distribute by
app, user & sort by app, user, day, type, ts & feed into your UDAF.

If this is mostly done at the user+app tuple, the real skew should only be
the biggest user X app combo.

I'm guessing it can even be a single UDAF returning a Struct with all
numbers in one go, but I haven't dug deeper into
ISupportStreamingModeForWindowing yet.

Cheers,
Gopal

Re: Tez reducer parallelism ..

Reply via email to