grouping sets for a high number of grouping set keys

Namit Jain (JIRA) Wed, 28 Nov 2012 00:41:04 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Namit Jain updated HIVE-3552:
-----------------------------

    Description: 
This is a follow-up to HIVE-3433.

Had an offline discussion with Sambavi - she pointed out a scenario where the
implementation in HIVE-3433 will not scale. Assume that the user is performing
a cube on many columns, say 8 columns. Each input row would then generate
2^8 = 256 rows for the hash table, which may overwhelm the current group-by
implementation.
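
To make the 256 figure concrete, here is a minimal sketch (plain Python, not
Hive code) of the per-row expansion a cube implies. The helper names
cube_grouping_sets and expand_row are hypothetical, and masking excluded keys
with None is a simplification: it ignores how the real implementation tells a
masked key apart from a genuine NULL.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """Return every subset of `columns`, i.e. the grouping sets of a cube."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


def expand_row(row, columns):
    """Emit one copy of `row` per grouping set; excluded keys become None."""
    expanded = []
    for subset in cube_grouping_sets(columns):
        kept = set(subset)
        expanded.append(tuple(row[c] if c in kept else None for c in columns))
    return expanded


# Hypothetical 8-column input row: c0=0, c1=1, ..., c7=7.
columns = [f"c{i}" for i in range(8)]
row = {c: i for i, c in enumerate(columns)}

print(len(cube_grouping_sets(columns)))  # 2**8 grouping sets
print(len(expand_row(row, columns)))     # one output row per grouping set
```

Every additional cube column doubles the expansion, which is why the cost
grows quickly with the number of grouping keys.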

A better implementation would be to add an additional MR job. In the first
MR job, perform the group by as if there were no cube. Then add another MR job
that performs the cube. The assumption is that the group by will have
decreased the output data significantly, and that the rows will arrive in the
order of the grouping keys, which gives a higher probability of hitting the
hash table.
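
The two-job idea can be sketched as follows. This is an illustration under
stated assumptions, not the attached patch: it only handles COUNT (an
aggregate that can be combined from partial results), all function names are
hypothetical, and None again stands in for a masked key.

```python
from collections import Counter
from itertools import combinations


def grouping_sets(n):
    """All subsets of the key positions 0..n-1 (the cube's grouping sets)."""
    return [set(s) for size in range(n + 1) for s in combinations(range(n), size)]


def mask(key, kept):
    """Keep the key positions in `kept`; replace the others with None."""
    return tuple(v if i in kept else None for i, v in enumerate(key))


def cube_counts(weighted_keys, n):
    """Expand each (key, count) pair into all 2**n grouping sets, summing counts.

    Also returns how many expanded rows were produced, as a proxy for the
    load placed on the group-by hash table.
    """
    totals = Counter()
    expanded = 0
    for key, count in weighted_keys:
        for kept in grouping_sets(n):
            totals[mask(key, kept)] += count
            expanded += 1
    return totals, expanded


def single_job(rows, n):
    """Cube applied directly to the raw input rows."""
    return cube_counts(((r, 1) for r in rows), n)


def two_jobs(rows, n):
    """Job 1: plain group by on all keys. Job 2: cube over job 1's output."""
    grouped = Counter(rows)                         # job 1
    return cube_counts(sorted(grouped.items()), n)  # job 2, keys in sorted order


# Hypothetical input: 1500 rows but only 2 distinct key combinations.
rows = [("a", "x")] * 1000 + [("b", "y")] * 500
direct, n_direct = single_job(rows, 2)
staged, n_staged = two_jobs(rows, 2)
print(direct == staged, n_direct, n_staged)
```

Both paths produce the same counts, but the second expands only the distinct
key combinations left after the first group by. The saving depends entirely
on how much that first group by shrinks the data, which is the assumption
stated above.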

  was:
This is a follow up for HIVE-3433.

Had a offline discussion with Sambavi - she pointed out a scenario where the
implementation in HIVE-3433 will not scale. Assume that the user is performing
a cube on many columns, say '8' columns. So, each row would generate 256 rows
for the hash table, which may kill the current group by implementation.

A better implementation would be to add an additional stage - in the first 
stage perform the group by assuming there was no cube. Ad another stage, where
you would perform the cube. The assumption is that the group by would have 
decreased the output data significantly.

    
> HIVE-3552 performant manner for performing cubes/rollups/grouping sets for a 
> high number of grouping set keys
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-3552
>                 URL: https://issues.apache.org/jira/browse/HIVE-3552
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3552.1.patch
>
>
> This is a follow-up to HIVE-3433.
> Had an offline discussion with Sambavi - she pointed out a scenario where the
> implementation in HIVE-3433 will not scale. Assume that the user is performing
> a cube on many columns, say 8 columns. Each input row would then generate
> 2^8 = 256 rows for the hash table, which may overwhelm the current group-by
> implementation.
> A better implementation would be to add an additional MR job. In the first
> MR job, perform the group by as if there were no cube. Then add another MR job
> that performs the cube. The assumption is that the group by will have
> decreased the output data significantly, and that the rows will arrive in the
> order of the grouping keys, which gives a higher probability of hitting the
> hash table.

