[ https://issues.apache.org/jira/browse/HIVE-11807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744025#comment-14744025 ]
Owen O'Malley commented on HIVE-11807:
--------------------------------------

Ok, there are a couple of changes that I'd propose:
* Use the stripe size rather than the available memory. This is more important because the stripe will be flushed when the buffered data reaches the stripe size.
* Count all of the columns, not just the top-level ones.
* Most of the columns have at most 2 large streams, so if we use 20 buffers per column, that will give us a reasonable balance between internal fragmentation and throughput.

> Set ORC buffer size in relation to set stripe size
> --------------------------------------------------
>
>                 Key: HIVE-11807
>                 URL: https://issues.apache.org/jira/browse/HIVE-11807
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> A customer produced ORC files with very small stripes (10k rows/stripe) by setting a small 64MB stripe size and a 256K buffer size for a 54-column table. At that size, each of the streams only gets a buffer or two before the stripe size is reached. The current code uses the available memory instead of the stripe size and thus doesn't shrink the buffer size if the JVM has much more memory than the stripe size.
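To make the proposed sizing rule concrete, here is a minimal sketch of how the buffer size could be derived from the stripe size and the total column count, assuming 2 large streams per column and roughly 10 buffers per stream (20 buffers per column). The class and method names (OrcBufferSizeEstimator, estimateBufferSize), the 4K floor, and the clamp against the configured buffer size are illustrative assumptions, not the actual Hive patch.

{code:java}
/**
 * Illustrative sketch (not the actual Hive patch) of the sizing rule proposed
 * above: derive the compression buffer size from the stripe size and the total
 * column count, budgeting ~20 buffers per column, and never exceed the buffer
 * size the user configured.
 */
public class OrcBufferSizeEstimator {

  // Assumption for illustration: 2 large streams per column, ~10 buffers each.
  private static final int BUFFERS_PER_COLUMN = 20;
  // Assumed floor so pathological inputs don't produce tiny buffers.
  private static final int MIN_BUFFER_SIZE = 4 * 1024;

  static int estimateBufferSize(long stripeSize, int totalColumnCount,
                                int configuredBufferSize) {
    // Spread the stripe across all columns (top-level and nested),
    // giving each column a budget of BUFFERS_PER_COLUMN buffers.
    long estimate = stripeSize / (BUFFERS_PER_COLUMN * (long) totalColumnCount);
    // Never exceed what the user asked for, never drop below the floor.
    estimate = Math.min(estimate, configuredBufferSize);
    return (int) Math.max(estimate, MIN_BUFFER_SIZE);
  }

  public static void main(String[] args) {
    // The scenario from the issue: 64MB stripe, 54 columns, 256K buffers.
    long stripeSize = 64L * 1024 * 1024;
    int columns = 54;            // would be larger once nested columns are counted
    int configuredBufferSize = 256 * 1024;
    System.out.println(estimateBufferSize(stripeSize, columns, configuredBufferSize));
    // prints 62137, i.e. ~61K buffers instead of the configured 256K
  }
}
{code}

With the numbers from the issue description (64MB stripe, 54 columns, 256K configured buffers) this comes out to roughly 61K per buffer, so each stream cycles through many buffers before the stripe is flushed instead of only one or two.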