[ https://issues.apache.org/jira/browse/HIVE-11807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744025#comment-14744025 ]

Owen O'Malley commented on HIVE-11807:
--------------------------------------

OK, there are a couple of changes that I'd propose:
* Use the stripe size rather than the available memory. The stripe size is the 
constraint that matters, because the stripe is flushed as soon as the buffered 
data reaches the stripe size.
* Count all of the columns, not just the top-level ones.
* Most of the columns have at most 2 large streams, so budgeting roughly 20 
buffers per column gives a reasonable balance between internal fragmentation 
and throughput (a rough sketch of this heuristic follows below).
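
To make the proposal concrete, here is a minimal Java sketch of such a 
heuristic (hypothetical names, not the actual patch): divide the stripe size 
by 20 buffers per column, round the result down to a power of two, and never 
exceed the buffer size the user asked for.

    import static java.lang.Math.max;
    import static java.lang.Math.min;

    public class BufferSizeSketch {
      // Hypothetical helper: pick a buffer size from the stripe size and the
      // column count so that each column gets roughly 20 buffers (about 2
      // large streams times ~10 buffers each) before the stripe is flushed.
      static int estimateBufferSize(long stripeSize, int numColumns,
                                    int requestedBufferSize) {
        long perColumn = stripeSize / (20L * max(1, numColumns));
        int capped = (int) min(perColumn, (long) requestedBufferSize);
        // Round down to a power of two, with a 4K floor.
        int size = Integer.highestOneBit(max(capped, 4 * 1024));
        return min(size, requestedBufferSize);
      }

      public static void main(String[] args) {
        // With the numbers from this issue (64MB stripe, 54 columns, 256K
        // requested buffer size) this sketch picks 32K instead of 256K.
        System.out.println(estimateBufferSize(64L << 20, 54, 256 * 1024));
      }
    }

With a 32K buffer, the 64MB stripe holds roughly 2,000 buffers, so each of the 
~100 streams gets on the order of 20 buffers instead of one or two.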


> Set ORC buffer size in relation to set stripe size
> --------------------------------------------------
>
>                 Key: HIVE-11807
>                 URL: https://issues.apache.org/jira/browse/HIVE-11807
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> A customer produced ORC files with very small stripes (10k rows/stripe) by 
> setting a small 64MB stripe size and a 256K buffer size for a 54-column 
> table. At that size, each of the streams gets only a buffer or two before 
> the stripe size is reached. The current code uses the available memory 
> instead of the stripe size and thus doesn't shrink the buffer size when the 
> JVM has much more memory than the stripe size.
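
For a rough sense of the arithmetic: a 64MB stripe filled with 256K buffers 
holds only about 256 buffers, and with 54 columns at roughly two large streams 
apiece (over 100 streams), that works out to about two buffers per stream 
before the stripe flushes.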



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
