Here is some testing. I focused on two variables (without really understanding
what they do):
orc.compress.size (256k by default)
hive.exec.orc.memory.pool (0.50 by default)
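
For reference, here is roughly how I set each one per run: compress.size as an
ORC table property and the memory pool as a session setting (at least, that is
my understanding of where each one lives; the table name and columns below are
just a stand-in for my real table).

    -- Session-level: fraction of the task heap that ORC writers may use (0.50 by default)
    SET hive.exec.orc.memory.pool=0.25;

    -- Table-level: ORC compression buffer size in bytes (262144 = 256k by default)
    CREATE TABLE orc_test (
      id STRING,
      payload STRING
    )
    PARTITIONED BY (day STRING)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress.size"="16384");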

The job I am running is an admittedly complex one that goes through a Python
Transform script.  However, as noted above, RCFile writes of the same data have
NO issues.  Another point: the results of this job end up in LOTS of dynamic
partitions.  I am not sure if that plays a role here, or could help in
troubleshooting.
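
For context, the job has roughly this shape (script and table names are made
up, but it is a TRANSFORM insert into dynamically partitioned ORC):

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    ADD FILE transform.py;   -- placeholder name for my Python script

    -- the last select column feeds the dynamic partition
    INSERT OVERWRITE TABLE orc_test PARTITION (day)
    SELECT TRANSFORM (id, payload, day)
      USING 'python transform.py'
      AS (id, payload, day)
    FROM source_table;       -- placeholder source table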

So for these two I ran a bunch of tests; the results are in the format
(compress.size in k - memory.pool - Success/Fail):
256-0.50-Fail
128-0.50-Fail
 64-0.50-Fail
 32-0.50-Fail
 16-0.50-Fail
 16-0.25-Success
 32-0.25-Fail
 16-0.35-Success
 16-0.45-Success
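
For each row above, I changed the two settings roughly like this before
re-running the insert (as far as I can tell, the ALTER only affects files
written after the change):

    -- e.g. the 16k / 0.25 combination that succeeded
    SET hive.exec.orc.memory.pool=0.25;
    ALTER TABLE orc_test SET TBLPROPERTIES ("orc.compress.size"="16384");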


So after doing this I have questions:
1. On memory.pool, what is happening when I change this? Does it affect the
written files on subsequent reads?
2. Does the Hive memory pool change the speed of things? (I'll take slower
speed if it "works".)
3. On compress.size, do I hurt subsequent reads with the smaller compress
size?
4. These two variables, changed by themselves, do not fix the problem, but
together they seem to... luck? Or are they related?
5. Is there a better approach I can take on this?
6. Any other variables I could look at?

On Sun, Apr 27, 2014 at 11:56 AM, John Omernik <j...@omernik.com> wrote:

> Hello all,
>
> I am working with Hive 0.12 right now on YARN.  When I write a table that
> is admittedly quite "wide" (lots of columns, nearly 60, including one
> binary field that can get quite large), some tasks will fail on the ORC
> file write with Java heap space errors.
>
> I have confirmed that using RCFiles on the same data produces no failures.
>
> This led me down the path of experimenting with the table properties.
> Obviously, living on the cutting edge here means there is not much
> documentation on what these settings do; I have lots of slide decks showing
> me the settings that can be used to tune ORC, but not what they do or what
> the ramifications may be.
>
> For example, I've gone ahead and reduced orc.compress.size to 64k. This
> seems to address lots of the failures (all other things being unchanged).
> But what does that mean for me in the long run? Larger files?  More files?
> How is this negatively affecting me from a file perspective?
>
> In addition, would this be a good time to try SNAPPY over ZLIB as my
> default compression? I tried to find some direct memory comparisons but
> didn't see anything.
>
> So, given my data and the issues on write for my wide table, how would you
> recommend I address this? Is the compress.size the way to go?  What are the
> long-term effects of this?  Any thoughts would be welcome.
>
> Thanks!
>
> John
>
