Hello all,

I am working with Hive 0.12 on YARN. When I write a table that is
admittedly quite "wide" (lots of columns, close to 60, including one binary
field that can get quite large), some tasks fail on the ORC file write with
Java heap space errors.

I have confirmed that using RCFiles on the same data produces no failures.

This led me down the path of experimenting with the table properties.
Obviously, living on the cutting edge means there is not much documentation
on what these settings do. I have found plenty of slide decks listing the
settings that can be used to tune ORC, but not what they do or what the
ramifications may be.

For example, I've gone ahead and reduced orc.compress.size to 64k. This
seems to address most of the failures (all other settings unchanged). But
what does that mean for me in the long run? Larger files? More files? How
does this affect me negatively from a file perspective?
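For reference, here is roughly the DDL I'm using, with the table and column
names simplified (the real table has close to 60 columns):

  CREATE TABLE wide_table_orc (
    id BIGINT,
    payload BINARY   -- the large binary field; the other columns are omitted here
  )
  STORED AS ORC
  TBLPROPERTIES (
    "orc.compress" = "ZLIB",        -- default codec
    "orc.compress.size" = "65536"   -- 64k compression buffer (down from the 256k default, I believe)
  );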

In addition, would this be a good time to try SNAPPY over ZLIB as my
default compression? I tried to find some direct memory comparisons but
didn't see anything.
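Concretely, I assume that would just mean changing the table property above
from

  "orc.compress" = "ZLIB"

to

  "orc.compress" = "SNAPPY"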

So, given my data and the write issues on my wide table, how would you
recommend I address this? Is lowering the compress.size the way to go? What
are the long-term effects of doing so? Any thoughts would be welcome.

Thanks!

John
