Hello,

  My company has a product that is a data-processing YARN application. Because we 
essentially take the place of MapReduce, we use HCatalog for reading and 
writing Hive tables.

  We implemented our solution using the reader and writer as described here:

https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter


  This has worked more or less okay, but there are a couple of issues with it.


  First, some time back (I think in either 0.13 or 0.14), the ReaderContext 
interface changed, so we were no longer able to retrieve InputSplit objects 
from the ReaderContext via getSplits(). Now one must call getNumSplits() and 
retrieve individual splits by an id number.
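
  For reference, the read pattern we use now looks roughly like this (a 
simplified sketch along the lines of the wiki page above, exception handling 
elided; the table name is a placeholder, and the exact name of the split-count 
accessor may vary by version):

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.hive.hcatalog.data.HCatRecord;
    import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
    import org.apache.hive.hcatalog.data.transfer.HCatReader;
    import org.apache.hive.hcatalog.data.transfer.ReadEntity;
    import org.apache.hive.hcatalog.data.transfer.ReaderContext;

    // Master side: prepare the read and get a serializable context.
    ReadEntity entity = new ReadEntity.Builder().withTable("mytbl").build();
    Map<String, String> config = new HashMap<String, String>();
    HCatReader master = DataTransferFactory.getHCatReader(entity, config);
    ReaderContext cntxt = master.prepareRead();

    // Worker side: the context exposes only a split count, so each worker
    // pulls one split by its id number.
    for (int slaveNum = 0; slaveNum < cntxt.getNumSplits(); slaveNum++) {
        HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
        Iterator<HCatRecord> itr = reader.read();
        while (itr.hasNext()) {
            HCatRecord record = itr.next();
            // ... hand the record to our processing pipeline ...
        }
    }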

  This was a big problem for us, because we have our own load-balancing 
algorithms and need to know the locations and sizes of the splits. I managed to 
get around this by using reflection to call the internal getSplits(), but of 
course this is far from a good solution.
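
  For anyone curious, the workaround looks roughly like this (a sketch only; it 
assumes the concrete ReaderContext implementation still has an internal 
getSplits() method, so any Hive upgrade could break it):

    import java.lang.reflect.Method;
    import java.util.List;
    import org.apache.hadoop.mapreduce.InputSplit;

    // Reach past the public interface to the implementation's split list.
    Method getSplits = cntxt.getClass().getDeclaredMethod("getSplits");
    getSplits.setAccessible(true);
    @SuppressWarnings("unchecked")
    List<InputSplit> splits = (List<InputSplit>) getSplits.invoke(cntxt);

    // With the real splits in hand, we can feed locations and sizes into
    // our own load balancer.
    for (InputSplit split : splits) {
        String[] hosts = split.getLocations();  // declared to throw IOException/InterruptedException
        long bytes = split.getLength();
        // ... assign the split to a worker based on host and size ...
    }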

  Recently, we've been deploying on some very large clusters with very large 
Hive tables; in some cases, a single table produces tens of thousands of data 
splits.

  This causes out-of-memory errors in the JVM when this call is made:

ReaderContext cntxt = reader.prepareRead();

    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1078)
    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1105)
    org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:153)
    org.apache.hive.hcatalog.data.transfer.impl.HCatInputFormatReader.prepareRead(HCatInputFormatReader.java:68)

Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space


  Sometimes we've been able to deal with it by increasing the JVM heap 
(although the resulting slowdown in prepareRead is awful), but sometimes we 
can't seem to provide enough memory at all.
  I notice from perusing the code that each InputSplit contains a copy of the 
table schema, which is enormous in these cases.
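
  To put purely illustrative numbers on it: if the serialized schema for one of 
these tables runs to, say, 100 KB, then 50,000 splits each carrying its own 
copy works out to about 100 KB * 50,000 = ~5 GB of heap consumed inside 
prepareRead() before a single record is read.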

  My question to the community at large is: is HCatalog still the recommended 
way for a YARN application like ours to interface with Hive? HiveServer2 has 
most of the functionality we need, but no way to get information about the data 
splits. If HCatalog is the only game in town, how are we meant to deal with 
these memory errors?

thanks,

Nathan Bamford
