Thanks for the explanation, Jan. If I understand correctly, the input will be read a single time and preprocessed in some form, and this intermediate data is then used for the subsequent group-bys. I'm not sure my scenario benefits from this single step, since the group-bys vary across very different entities.
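To make the query shape concrete, here is a stripped-down sketch of the kind of multi-group-by insert I mean (the table and column names below are made up, and only 2 of the 22 group-bys are shown):

    FROM (
        SELECT entity1_id, entity2_id, entity3_id, record_type, amount
        FROM fact_table
        WHERE record_type IN ('A', 'B')            -- shared "pre-selection" step
    ) src
    INSERT OVERWRITE TABLE agg_by_entity1
        SELECT entity1_id, SUM(amount)
        GROUP BY entity1_id                        -- group-by 1
    INSERT OVERWRITE TABLE agg_by_entity1_entity2
        SELECT entity1_id, entity2_id, COUNT(*)
        GROUP BY entity1_id, entity2_id;           -- group-by 2

If I follow your explanation, each INSERT still becomes its own job, but they all read from the pre-selected src rather than from the full fact table.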
If I were to implement the group-bys manually, we could generally club them together in a single program. Can I do better with Hive, with some hints/optimizations? Or is there a possibility that Pig might perform better in this case (assuming Pig would probably handle this in a single job)?

Thank You,
Prashant

P.S. Just in case the data below helps: in my scenario, # of entity1 = 500,000, # of entity2 = 500, and # of entity3 = 5. The fact table has 250M rows (entity1 * entity2). The current job has 22 group-bys, based on various combinations of the 3 entities and the fact table record types, and it produces 22M rows. It takes 3 hours on a 4-machine cluster with a good configuration.

On Mon, Jun 4, 2012 at 6:52 PM, Jan Dolinár <dolik....@gmail.com> wrote:
>
> On Fri, Jun 1, 2012 at 5:25 PM, shan s <mysub...@gmail.com> wrote:
>>
>>> I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
>>> which would club the group-bys together.
>>> However it is scheduling n jobs where n = number of group-bys.
>>> Could you please explain this behaviour.
>>>
>>
> No, it will result in at least as many jobs as there are group-bys. The
> efficiency lies not in lowering the number of jobs, but in the fact that
> the first job usually reduces the amount of data that the rest need to go
> through. E.g. if the FROM clause includes a subquery, or when the group-bys
> have similar WHERE clauses, then this "pre-selection" is executed first
> and the subsequent jobs operate on the results of the first instead of the
> entire table/partition and are therefore much faster.
>
> J. Dolinar