Hello,

  My company has a product that is a data-processing YARN application. Because we 
essentially take the place of MapReduce, we use HCatalog for reading and 
writing Hive tables.

  We implemented our solution using the reader and writer as described here:

https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter


  This has worked more or less okay, but there are a couple of issues with it.


  First, some time back (I think in either 0.13 or 0.14), the ReaderContext 
interface changed, so we were no longer able to retrieve InputSplit objects 
from the ReaderContext via getSplits(). Now one must call getNumSplits() and 
retrieve individual splits by an id number.
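
  For reference, the read pattern we use now looks roughly like this (a 
simplified sketch along the lines of the wiki page above, exception handling 
elided; the table name is a placeholder, and the exact name of the split-count 
accessor may vary by version):

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.hive.hcatalog.data.HCatRecord;
    import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
    import org.apache.hive.hcatalog.data.transfer.HCatReader;
    import org.apache.hive.hcatalog.data.transfer.ReadEntity;
    import org.apache.hive.hcatalog.data.transfer.ReaderContext;

    // Master side: prepare the read and get a serializable context.
    ReadEntity entity = new ReadEntity.Builder().withTable("mytbl").build();
    Map<String, String> config = new HashMap<String, String>();
    HCatReader master = DataTransferFactory.getHCatReader(entity, config);
    ReaderContext cntxt = master.prepareRead();

    // Worker side: the context exposes only a split count, so each worker
    // pulls one split by its id number.
    for (int slaveNum = 0; slaveNum < cntxt.getNumSplits(); slaveNum++) {
        HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
        Iterator<HCatRecord> itr = reader.read();
        while (itr.hasNext()) {
            HCatRecord record = itr.next();
            // ... hand the record to our processing pipeline ...
        }
    }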

  This was a big problem for us, because we have our own load-balancing 
algorithms and need to know the locations and sizes of the splits. I managed to 
get around this by using reflection to call the internal getSplits(), but of 
course this is far from a good solution.
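
  For anyone curious, the workaround looks roughly like this (a sketch only; it 
assumes the concrete ReaderContext implementation still has an internal 
getSplits() method, so any Hive upgrade could break it):

    import java.lang.reflect.Method;
    import java.util.List;
    import org.apache.hadoop.mapreduce.InputSplit;

    // Reach past the public interface to the implementation's split list.
    Method getSplits = cntxt.getClass().getDeclaredMethod("getSplits");
    getSplits.setAccessible(true);
    @SuppressWarnings("unchecked")
    List<InputSplit> splits = (List<InputSplit>) getSplits.invoke(cntxt);

    // With the real splits in hand, we can feed locations and sizes into
    // our own load balancer.
    for (InputSplit split : splits) {
        String[] hosts = split.getLocations();  // declared to throw IOException/InterruptedException
        long bytes = split.getLength();
        // ... assign the split to a worker based on host and size ...
    }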

  Recently, we've been deploying on some very large clusters with very large 
Hive tables; in some cases, a single table produces tens of thousands of data 
splits.

  This causes out-of-memory errors in the JVM when this call is made:

ReaderContext cntxt = reader.prepareRead();

    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1078)
    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1105)
    org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:153)
    org.apache.hive.hcatalog.data.transfer.impl.HCatInputFormatReader.prepareRead(HCatInputFormatReader.java:68)

Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space


  Sometimes we've been able to deal with it by increasing the JVM heap 
(although the resulting slowdown in prepareRead is awful), but sometimes we 
can't seem to provide enough memory at all.
  I notice from perusing the code that each InputSplit contains a copy of the 
table schema, which is enormous in these cases.
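
  To put purely illustrative numbers on it: if the serialized schema for one of 
these tables runs to, say, 100 KB, then 50,000 splits each carrying its own 
copy works out to about 100 KB * 50,000 = ~5 GB of heap consumed inside 
prepareRead() before a single record is read.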

  My question to the community at large is: is HCatalog still the recommended 
way for a YARN application like ours to interface with Hive? HiveServer2 has 
most of the functionality we need, but no way to get information about the data 
splits. If HCatalog is the only game in town, how are we meant to deal with 
these memory errors?

thanks,

Nathan Bamford
