many columns

Gopal V Tue, 14 Oct 2014 10:35:56 -0700

On 10/13/14, 10:53 PM, Sean McNamara wrote:

I’ve found a condition where the MemoryManager will wait too long before
notifying writers to check their memory and flush.

...

This issue affects anyone who is writing a lot of columns, very large columns,
or worst of all: both. I have tested and confirmed this issue on hive 0.12,
0.13, and trunk.

Can you post the exact query? This OOM is on my list of already-fixed performance issues (HIVE-6455).

I have tested Hive-13 partitioned inserts with just "insert into table select *" for both 30 TB of data and 10,000 columns.

This issue happens in hive-12 and earlier, which keep too many ORC files open at the same time.

If you are on hive-13 or later, setting the config option hive.optimize.sort.dynamic.partition=true; should fix this issue.

This follows a path within the FileSinkOperator which keeps exactly 1 stripe open at any given time, so that this always works correctly, assuming the orc.stripe.size fits within memory.
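To make the difference concrete, here is a toy model in plain Java. It is not Hive code: the class and method names are made up for illustration, and "writer" stands in for one open ORC file per partition. It only counts how many writers are open at once when rows arrive in arbitrary partition order versus sorted by partition key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (hypothetical, not Hive code): peak number of open partition writers. */
final class OpenWriterSketch {

    /** Unsorted rows: a writer opens on first sight of a key and stays open until the end. */
    static int peakWritersUnsorted(List<String> partitionKeys) {
        Set<String> open = new HashSet<>();
        for (String key : partitionKeys) {
            open.add(key); // nothing can be closed, the key may show up again later
        }
        return open.size();
    }

    /** Rows sorted by partition key: the previous writer is closed when the key changes. */
    static int peakWritersSorted(List<String> partitionKeys) {
        List<String> sorted = new ArrayList<>(partitionKeys);
        Collections.sort(sorted);
        String current = null;
        int open = 0;
        int peak = 0;
        for (String key : sorted) {
            if (!key.equals(current)) {
                if (current != null) {
                    open--; // close the writer for the previous partition
                }
                current = key;
                open++; // open the writer for the new partition
                peak = Math.max(peak, open);
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "2014-10-03", "2014-10-01", "2014-10-02", "2014-10-01", "2014-10-03");
        System.out.println("unsorted peak: " + peakWritersUnsorted(keys));
        System.out.println("sorted peak:   " + peakWritersSorted(keys));
    }
}
```

In this model the unsorted peak grows with the number of distinct partitions, while the sorted peak stays at one no matter how many partitions there are.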

The issue is on line 50: ROWS_BETWEEN_CHECKS = 5000;

For large or many columns it’s easy to hit GC issues or OOM before 5k rows are written.

I believe that rows-between-checks should be made a configuration parameter
that can be passed in on the JobConf.

5000 rows is probably the wrong thing to check, for sure - but it is a sane default. Perhaps instead it could check between every stride index being written (which is every 10,000 rows) or some fraction of it.
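A rough back-of-the-envelope sketch of why the row count matters for wide tables. The 8 bytes per value is an assumed figure picked for illustration, the estimate ignores encoding and compression, and the class is hypothetical rather than part of ORC.

```java
/** Illustrative arithmetic only; the per-value size is an assumption, not a measurement. */
final class CheckIntervalSketch {

    /** Raw bytes one writer can buffer before the next memory check fires. */
    static long bytesBetweenChecks(long rowsBetweenChecks, long columns, long avgBytesPerValue) {
        return rowsBetweenChecks * columns * avgBytesPerValue;
    }

    public static void main(String[] args) {
        long mb = 1_000_000L;
        // 10,000 columns at an assumed 8 bytes per value, checked every 5000 rows
        System.out.println(bytesBetweenChecks(5_000, 10_000, 8) / mb + " MB");
        // the same table checked once per stride index (every 10,000 rows)
        System.out.println(bytesBetweenChecks(10_000, 10_000, 8) / mb + " MB");
    }
}
```

Under these assumptions a single writer buffers about 400 MB of raw values between checks at 5000 rows, and twice that at the stride interval, so any row-count interval is only as safe as the row width allows.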

But that check produces bad ORC files and still doesn't fix the actual issue - this is merely postponing the inevitable.

Let me describe the errors I hit before we had the sort.partition implementation.

At multiple terabyte scale, the next error you will hit will be an HDFS Lease Expired exception, then the system runs out of file handles and after that it runs out of stack for DFSOutputStream threads.

Even if you don't go that far, the memory manager doesn't slice memory all the way down to a single row. The minimum size of a stripe is num-cols * compress-size; we can't shrink the stripe size below that.
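For a sense of scale, the formula above can be evaluated directly. This sketch assumes a compression buffer of 256 KiB (262144 bytes); check the value your writers actually use. The class name is made up for illustration.

```java
/** Evaluates num-cols * compress-size; the 256 KiB buffer size is an assumption. */
final class MinStripeSketch {

    /** Smallest stripe a writer can be squeezed down to, per the formula above. */
    static long minStripeBytes(long columns, long compressBufferBytes) {
        return columns * compressBufferBytes;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long buffer = 256L * 1024L; // assumed compress-size
        System.out.println(minStripeBytes(100, buffer) / mib + " MiB");    // 100 columns
        System.out.println(minStripeBytes(10_000, buffer) / mib + " MiB"); // 10,000 columns
    }
}
```

With that buffer size, 100 columns give a floor of 25 MiB per open writer and 10,000 columns give 2500 MiB, which is why checking memory more often cannot help once enough writers are open.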

The trouble is that with tiny stripes of less than 1 MB, the read path suffers heavily, split generation becomes incredibly expensive and the inter-stripe padding becomes a significant fraction of the HDFS space used (up to 47% of space will be padding).

So you can submit a patch for the JobConf to work around this, but it will generate sub-optimal ORC files.

The scalable & logically correct fix is already there in Hive; you have to make sure the config option is on.

FYI, the Hive plan we generate corresponds to an MRv2 example which combines LazyOutputFormat with MultipleOutputs to produce similar results.


Not sure if a similar option exists in Pig.

Cheers,
Gopal

Re: GC/OOM fix when writing large/many columns
