Prasanth J created HIVE-6455:
--------------------------------

             Summary: Scalable dynamic partitioning and bucketing optimization
                 Key: HIVE-6455
                 URL: https://issues.apache.org/jira/browse/HIVE-6455
             Project: Hive
          Issue Type: New Feature
          Components: Query Processor
    Affects Versions: 0.13.0
            Reporter: Prasanth J
            Assignee: Prasanth J


The current implementation of dynamic partitioning works by keeping at least one 
record writer open per dynamic partition directory. In the case of bucketing there 
can additionally be multi-spray file writers, which further adds to the number of 
open record writers. The record writers of column-oriented file formats (such as 
ORC, RCFile, etc.) keep in-memory buffers (value buffers or compression buffers) 
open the whole time to accumulate rows and compress them before flushing them to 
disk. Since these buffers are maintained on a per-column basis, the amount of 
memory required at runtime grows as the number of partitions and the number of 
columns per partition increases. This often leads to OutOfMemory (OOM) errors in 
mappers or reducers, depending on the number of open record writers. Users often 
have to tune the JVM heap size to get past such OOM issues.
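
As a rough illustration of the memory problem, here is a minimal Java sketch 
(hypothetical code, not Hive source; PartitionWriter and the class names are made 
up): with unsorted input, every distinct partition directory a task encounters 
keeps its own columnar writer, and that writer's per-column buffers, open until 
the task finishes.

{code:java}
// Hypothetical sketch, not Hive code: one open writer per partition directory.
import java.util.HashMap;
import java.util.Map;

interface PartitionWriter extends AutoCloseable {
  void write(Object[] row) throws Exception;  // buffers the row, column by column
  void close() throws Exception;              // compresses and flushes buffers to disk
}

class UnsortedDynamicPartitionSketch {
  private final Map<String, PartitionWriter> openWriters = new HashMap<>();

  void process(Object[] row, String partitionDir) throws Exception {
    // A writer stays open for every partition directory seen so far, so memory
    // grows roughly as (#open partitions) x (#columns) x (buffer size per column).
    PartitionWriter writer = openWriters.computeIfAbsent(partitionDir, this::createWriter);
    writer.write(row);
  }

  void closeAll() throws Exception {
    for (PartitionWriter writer : openWriters.values()) {
      writer.close();
    }
  }

  private PartitionWriter createWriter(String dir) {
    // Placeholder factory; a real task would open an ORC/RCFile writer for dir here.
    return new PartitionWriter() {
      public void write(Object[] row) { /* buffer columns in memory */ }
      public void close() { /* compress and flush to disk */ }
    };
  }
}
{code}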

With this optimization, rows are sorted on the dynamic partition columns and the 
bucketing columns (in the case of bucketed tables) before being fed to the 
reducers. Since the data arrives sorted on the partitioning and bucketing columns, 
each reducer needs to keep only one record writer open at any time, which reduces 
the memory pressure on the reducers. The optimization therefore scales well as the 
number of partitions and the number of columns per partition increases, at the 
cost of sorting on those columns.
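
For contrast, a hypothetical sketch of the sorted approach (again made-up code, 
not the actual patch): because rows reach a reducer already grouped by the 
partition (and bucket) key, the reducer only has to close the previous writer and 
open the next one when the key changes, so at most one writer's buffers are alive 
at a time.

{code:java}
// Hypothetical sketch, not Hive code: rows arrive sorted by partition key,
// so a single writer is open at any moment.
interface PartitionWriter extends AutoCloseable {
  void write(Object[] row) throws Exception;
  void close() throws Exception;
}

class SortedDynamicPartitionSketch {
  private String currentPartition = null;
  private PartitionWriter currentWriter = null;

  // Correct only if rows arrive grouped/sorted by partitionDir.
  void process(Object[] row, String partitionDir) throws Exception {
    if (!partitionDir.equals(currentPartition)) {
      if (currentWriter != null) {
        currentWriter.close();            // frees the previous partition's column buffers
      }
      currentWriter = createWriter(partitionDir);
      currentPartition = partitionDir;
    }
    currentWriter.write(row);             // only this writer holds buffers right now
  }

  void finish() throws Exception {
    if (currentWriter != null) {
      currentWriter.close();
    }
  }

  private PartitionWriter createWriter(String dir) {
    // Placeholder factory; a real reducer would open an ORC/RCFile writer for dir here.
    return new PartitionWriter() {
      public void write(Object[] row) { }
      public void close() { }
    };
  }
}
{code}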



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
