[jira] [Created] (HIVE-5170) Sorted Bucketed Partitioned Insert hard-codes the reducer count == bucket count

Gopal V (JIRA) Thu, 29 Aug 2013 13:37:34 -0700

Gopal V created HIVE-5170:
-----------------------------

             Summary: Sorted Bucketed Partitioned Insert hard-codes the reducer 
count == bucket count
                 Key: HIVE-5170
                 URL: https://issues.apache.org/jira/browse/HIVE-5170
             Project: Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.12.0
         Environment: Ubuntu LXC
            Reporter: Gopal V

When performing a hive sorted-partitioned insert, the insert optimizer
hard-codes the number of output files to the actual bucket count of the table.

https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L4852

We need at least that many reducers or if limited, switch to multi-spray (as
implemented already), but more reducers is wasteful as long as the HiveKey only
contains the partition columns.

At this point, we're limited to reducers = n-bucket still, which is a problem
for partitioning requests which need to insert nearly a terabyte of data into a
single-digit bucket count and four-digit partition count.

Since that is routed by the hasCode of the HiveKey, we can ensure that works by
modifying the HiveKey to handle n-buckets internally.

Basically it should only generate hashCode = (sort_cols.hashCode() % n) routing
only to n reducers over-all, despite how many we spin up.

So far so good with the hard-coded reducer count.

But provided we fix the issues brought up by HIVE-5169, the insert becomes
friendlier to a higher reducer count as well.

At this juncture, we can modify the hashCode to be slightly more interesting.

hashCode = (part_cols.hashCode()*31 + (sort_cols.hashCode() % n))

This generates somewhere between n to partition_count * n unique hash-codes.

Since the sort-order & bucketing has to be maintained per-partition dir,
distributing this equally across any number of reducers will result in the
scale-out of the reducer count.

This will allow a reducer count that will allow for far faster inserts of ORC
data into a partitioned/sorted table.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (HIVE-5170) Sorted Bucketed Partitioned Insert hard-codes the reducer count == bucket count

Reply via email to