Glad that presentation was useful to you :) hive.exec.orc.memory.pool is the fraction of heap memory that ORC writers are allowed to use. If your heap size is 1GB and hive.exec.orc.memory.pool is set to 0.5, then ORC writers can use a maximum of 500MB of memory. If there are multiple ORC writers and their combined memory requirement exceeds the available memory, the stripe size is scaled down to fit. Say we have 2 ORC writers and 500MB of memory for ORC: each writer gets roughly 250MB, which is about the default 256MB stripe size. With 4 writers, each writer gets only about 125MB, and the stripe size is scaled down to match. So when there are more dynamic partitions (or more columns) and less memory, the stripe size gets reduced, or the writer may throw an OOM exception.
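Back-of-the-envelope, that scaling looks roughly like the sketch below. It only illustrates the arithmetic described above; it is not the actual ORC MemoryManager code, and the heap size, pool fraction and writer counts are example numbers.

# Rough sketch of the scaling described above -- not the actual ORC
# MemoryManager code. Heap size, pool fraction and writer counts are
# example numbers only.

DEFAULT_STRIPE_SIZE_MB = 256  # ORC's default stripe size

def per_writer_budget(heap_mb, memory_pool, num_writers,
                      stripe_mb=DEFAULT_STRIPE_SIZE_MB):
    """Memory available to each ORC writer and the stripe size after scale-down."""
    orc_budget_mb = heap_mb * memory_pool        # total memory ORC writers may use
    per_writer_mb = orc_budget_mb / num_writers  # split across concurrent writers
    # When the per-writer budget drops below the configured stripe size,
    # stripes are flushed early, i.e. the effective stripe size shrinks.
    return per_writer_mb, min(stripe_mb, per_writer_mb)

if __name__ == "__main__":
    for writers in (2, 4, 8):
        budget, stripe = per_writer_budget(1024, 0.5, writers)
        print(f"{writers} writers -> {budget:.0f}MB each, effective stripe ~{stripe:.0f}MB")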
I am not sure why 16k with .5 is failing whereas 16k with .25 is successful. I expected the latter to fail as it had less memory.

Thanks
Prasanth Jayachandran

On Apr 28, 2014, at 4:45 AM, John Omernik <j...@omernik.com> wrote:

> Prasanth -
>
> This is easily the best and most complete explanation I've received to any online posted question ever. I know that sounds like an overstatement, but this answer is awesome. :) I really appreciate your insight on this. My only follow-up is asking how the memory.pool percentage plays a role in my success vs. fail, i.e. in my data, when I got down to 16k but had the default memory pool of .50 it failed; when I scaled that back to .25, it was successful at 16k. Thoughts?
>
> Thanks again for your research on this.
>
> On Sun, Apr 27, 2014 at 11:07 PM, Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:
> Hi John
>
> I prepared a presentation earlier that explains the impact of changing the compression buffer size on the overall size of an ORC file. It should help you understand all the questions that you had.
>
> In Hive 0.13, a new optimization was added that should avoid this OOM issue: https://issues.apache.org/jira/browse/HIVE-6455
> Unfortunately, Hive 0.12 does not support this optimization, hence reducing the compression size is the only option. As you can see from the PPT, reducing the compression buffer size does not have a significant impact on file size or query execution time.
>
> Thanks
> Prasanth Jayachandran
>
> On Apr 27, 2014, at 3:06 PM, John Omernik <j...@omernik.com> wrote:
>
>> So one more follow-up:
>>
>> The 16-.25-Success turns to a fail if I throw more data (and hence more partitions) at the problem. Could there be some sort of issue that rears its head based on the number of output dynamic partitions?
>>
>> Thanks all!
>>
>> On Sun, Apr 27, 2014 at 3:33 PM, John Omernik <j...@omernik.com> wrote:
>> Here is some testing. I focused on two variables (not really understanding what they do):
>> orc.compress.size (256k by default)
>> hive.exec.orc.memory.pool (0.50 by default)
>>
>> The job I am running is an admittedly complex job running through a Python transform script. However, as noted above, RCFile writes have NO issues. Another point... the results of this job end up being LOTs of dynamic partitions. I am not sure if that plays a role here, or could help in troubleshooting.
>>
>> So for these two I ran a bunch of tests; the results are in the format (compress.size in k - memory.pool - Success/Fail):
>> 256-0.50-Fail
>> 128-0.50-Fail
>> 64-0.50-Fail
>> 32-0.50-Fail
>> 16-0.50-Fail
>> 16-0.25-Success
>> 32-0.25-Fail
>> 16-0.35-Success
>> 16-0.45-Success
>>
>> So after doing this I have questions:
>> 1. On the memory.pool, what is happening when I change this? Is this affecting the written files on subsequent reads?
>> 2. Does the hive memory pool change the speed of things? (I'll take slower speed if it "works")
>> 3. On the compress.size, do I hurt subsequent reads with the smaller compress size?
>> 4. These two variables, changed by themselves, do not fix the problem, but together they seem to... lucky? Or are they related?
>> 5. Is there a better approach I can take on this?
>> 6. Any other variables I could look at?
>>
>> On Sun, Apr 27, 2014 at 11:56 AM, John Omernik <j...@omernik.com> wrote:
>> Hello all,
>>
>> I am working with Hive 0.12 right now on YARN. When I am writing a table that is admittedly quite "wide" (there are lots of columns, near 60, including one binary field that can get quite large), some tasks will fail on the ORC file write with Java heap space issues.
>>
>> I have confirmed that using RCFiles on the same data produces no failures.
>>
>> This led me down the path of experimenting with the table properties. Obviously, living on the cutting edge here means there is not tons of documentation on what these settings do; I have lots of slide shows showing me the settings that can be used to tune ORC, but not what they do or what the ramifications may be.
>>
>> For example, I've gone ahead and reduced the orc.compress.size to 64k. This seems to address lots of the failures (all other things being unchanged). But what does that mean for me in the long run? Larger files? More files? How is this negatively affecting me from a file perspective?
>>
>> In addition, would this be a good time to try SNAPPY over ZLIB as my default compression? I tried to find some direct memory comparisons but didn't see anything.
>>
>> So, given my data and the issues on write for my wide table, how would you recommend I address this? Is the compress.size the way to go? What are the long-term effects of this? Any thoughts would be welcome.
>>
>> Thanks!
>>
>> John
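For a rough sense of where the heap goes in a setup like John's: each open ORC writer holds compression buffers for its column streams, and a dynamic-partition insert keeps one writer open per partition being written, so the footprint grows with open partitions x columns x orc.compress.size. The sketch below only illustrates that multiplication; the per-column stream count and the number of open partitions are made-up numbers, not the writer's actual accounting.

# Very rough, assumption-laden estimate of why a wide table, many dynamic
# partitions and a large orc.compress.size add up to heap pressure. The real
# ORC writer's memory accounting is more involved; STREAMS_PER_COLUMN and the
# number of open partitions below are guesses for illustration only.

STREAMS_PER_COLUMN = 3  # assumed average number of buffered streams per column

def writer_buffer_mb(num_columns, compress_size_kb,
                     streams_per_column=STREAMS_PER_COLUMN):
    """Approximate compression-buffer footprint of one open ORC writer, in MB."""
    return num_columns * streams_per_column * compress_size_kb / 1024.0

def total_buffer_mb(open_partitions, num_columns, compress_size_kb):
    """A dynamic-partition insert keeps one writer open per partition it is writing."""
    return open_partitions * writer_buffer_mb(num_columns, compress_size_kb)

if __name__ == "__main__":
    # ~60 columns as in the thread; 50 open partitions is an arbitrary example.
    for size_kb in (256, 64, 16):
        mb = total_buffer_mb(open_partitions=50, num_columns=60,
                             compress_size_kb=size_kb)
        print(f"orc.compress.size={size_kb}k -> ~{mb:.0f}MB in compression buffers")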