Istvan Fajth created HIVE-21265:
-----------------------------------
Summary: Hive misuses the HBase HConnection object, which puts
high load on ZooKeeper
Key: HIVE-21265
URL: https://issues.apache.org/jira/browse/HIVE-21265
Project: Hive
Issue Type: Bug
Components: HBase Handler
Reporter: Istvan Fajth
When a Hive table is backed by an HBase table, the following access pattern
shows up many times in ZooKeeper, even for a simple query like
"SELECT * FROM table":
- A client connects to ZooKeeper
- The client checks whether the /hbase ZNode exists
- The client reads /hbase/hbaseid
- The client closes the connection
The number of these accesses depends on the amount of data; most likely it
correlates with the number of HBase regions.
The same access pattern can be seen in ZK when running the following Java code:
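{code}import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class Test {
  public static void main(String[] args) throws Exception {
    // Opening the connection produces the ZK accesses listed above.
    Connection c = ConnectionFactory.createConnection();
    // Closing it right away ends the short-lived ZK session.
    c.close();
  }
}{code}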
The problem is that for large tables this creates an enormous number of
sessions, and session creation is expensive in ZK. If the number of queries
against such a table is high, the ZK transaction log is written heavily, and
far more snapshots are created than otherwise, due to the number of
createSession and closeSession transactions in ZooKeeper. In this particular
case the ZooKeeper data directory grew to about 24GB and almost filled the
device that holds it. ~90% of the data written was createSession and
closeSession transactions.
I am not sure what logs I should provide, but reproducing the behaviour is easy
enough. If DEBUG level logging is enabled in ZooKeeper, the logs show what each
session reads. These sessions live for 1-5ms at most.
I imagine the solution is to share the connection object between the mappers
if possible: use one connection, as suggested in the API documentation of
ConnectionFactory, and request table/admin/any other objects from that one
connection. At the very least, use only one connection object per map/reduce
task and keep it alive for the whole lifetime of that task.
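To illustrate the "one long-lived connection per task" idea, here is a minimal sketch. It is not taken from the Hive code base: TaskScopedConnection and SharedConnectionDemo are hypothetical names, and the factory is a stand-in that only counts how often a connection would be opened. With HBase, the factory would presumably wrap ConnectionFactory.createConnection(conf), which is an assumption about how this would be wired in.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical holder (not part of Hive or HBase): creates the connection on
 * first use and keeps it open until the owning task calls close().
 */
final class TaskScopedConnection<C extends AutoCloseable> implements AutoCloseable {
    private final Callable<C> factory;
    private C connection; // null until first use, and again after close()

    TaskScopedConnection(Callable<C> factory) {
        this.factory = factory;
    }

    /** Returns the shared connection, opening it only on the first call. */
    synchronized C get() throws Exception {
        if (connection == null) {
            connection = factory.call();
        }
        return connection;
    }

    /** Closes the shared connection once, at the end of the task. */
    @Override
    public synchronized void close() throws Exception {
        if (connection != null) {
            connection.close();
            connection = null;
        }
    }
}

class SharedConnectionDemo {
    public static void main(String[] args) throws Exception {
        AtomicInteger opened = new AtomicInteger();
        // Stand-in factory that only counts how often a "connection" is opened.
        // Assumption: with HBase this would be
        // () -> ConnectionFactory.createConnection(conf).
        Callable<AutoCloseable> factory = () -> {
            opened.incrementAndGet();
            return () -> { };
        };
        try (TaskScopedConnection<AutoCloseable> shared =
                 new TaskScopedConnection<>(factory)) {
            for (int region = 0; region < 100; region++) {
                shared.get(); // every unit of work reuses the same object
            }
        }
        System.out.println("connections opened: " + opened.get());
    }
}
```

Running the demo prints "connections opened: 1": the 100 units of work share one connection, where the pattern described above would have opened and closed 100 ZK sessions.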