Thanks for the explanation, Jan. If I understand correctly, the input will be read a single time and preprocessed in some form, and this intermediate data is then used for the subsequent group-bys. I'm not sure my scenario benefits from this single step, since the group-bys vary across very different entities.
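To make the query shape concrete, here is a stripped-down sketch of the kind of multi-group-by insert I mean (the table and column names below are made up, and only 2 of the 22 group-bys are shown):

    FROM (
        SELECT entity1_id, entity2_id, entity3_id, record_type, amount
        FROM fact_table
        WHERE record_type IN ('A', 'B')            -- shared "pre-selection" step
    ) src
    INSERT OVERWRITE TABLE agg_by_entity1
        SELECT entity1_id, SUM(amount)
        GROUP BY entity1_id                        -- group-by 1
    INSERT OVERWRITE TABLE agg_by_entity1_entity2
        SELECT entity1_id, entity2_id, COUNT(*)
        GROUP BY entity1_id, entity2_id;           -- group-by 2

If I follow your explanation, each INSERT still becomes its own job, but they all read from the pre-selected src rather than from the full fact table.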
If I were to implement the group-bys manually, we could generally club them together in a single program. Can I do better with Hive, with some hints/optimizations? Or is there a possibility that Pig might perform better in this case (assuming Pig would probably handle this in a single job)?

Thank You,
Prashant

P.S. Just in case the data below helps: in my scenario, # of entity1 = 500,000, # of entity2 = 500, and # of entity3 = 5. The fact table has 250M rows (entity1 * entity2). The current job has 22 group-bys, based on various combinations of the 3 entities and the fact table record types, and it produces 22M rows. It takes 3 hours on a 4-machine cluster with a good configuration.

On Mon, Jun 4, 2012 at 6:52 PM, Jan Dolinár <dolik....@gmail.com> wrote:
>
> On Fri, Jun 1, 2012 at 5:25 PM, shan s <mysub...@gmail.com> wrote:
>>
>>> I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
>>> which would club the group-bys together.
>>> However it is scheduling n jobs where n = number of group-bys.
>>> Could you please explain this behaviour.
>>>
>>
> No, it will result in at least as many jobs as there are group-bys. The
> efficiency lies not in lowering the number of jobs, but in the fact that
> the first job usually reduces the amount of data that the rest need to go
> through. E.g. if the FROM clause includes a subquery, or when the group-bys
> have similar WHERE clauses, then this "pre-selection" is executed first
> and the subsequent jobs operate on the results of the first instead of the
> entire table/partition and are therefore much faster.
>
> J. Dolinar