Mass Dosage created HIVE-15965:
----------------------------------

             Summary: Metastore incorrectly re-uses a broken database connection
                 Key: HIVE-15965
                 URL: https://issues.apache.org/jira/browse/HIVE-15965
             Project: Hive
          Issue Type: Bug
          Components: Metastore
    Affects Versions: storage-2.2.0
            Reporter: Mass Dosage


*Background*
In our setup we have a shared standalone MetaStore server running on EMR that 
is accessed by various clients (Hive CLI, HiveServer2, Spark etc.) and connects 
to an external MariaDB database for the MetaStore DB. It came to our attention 
that MetaStore (or rather the underlying DataNucleus / BoneCP combo) will keep 
re-using the same DB connections even when those get suddenly closed for a 
reason that renders them unusable.

For instance, due to a bug in the MariaDB JDBC driver v1.3.6 (see 
https://jira.mariadb.org/browse/CONJ-270), a huge query including over 8 
thousand parameter placeholders (e.g. partition IDs in case of a 
{{get_partitions_by_expr}} function call) will yield a 
{{java.nio.BufferOverflowException}} and cause the SQL connection to be closed 
by the driver itself.

This ultimately causes all further MetaStore Thrift calls to fail, because 
{{bonecp.ConnectionHandle.prepareStatement()}} fails on the closed connection.

Such failures are caught by DataNucleus and translated to an appropriate 
{{JDOException}}, only to be "ignored" by the MetaStore. 
{{RetryingHMSHandler}} will, of course, continue retrying the failing 
operation, but this is pointless by that time, since the retries will 
invariably fail as long as the SQL connection remains closed. Please see the 
attached MetaStore log [^hive.log] for details (captured from Hive 2.1.1 
running on Windows in Eclipse IDE).

*Proposed behavior*

We suggest that MetaStore should automatically renew the DB connection whenever:

* The connection gets closed by one of the underlying frameworks (DataNucleus, 
BoneCP, JDBC driver); or
* A query timeout is detected.

This feature should be optional and configurable (disabled by default for 
backward compatibility). Reconnection failures could probably be treated as 
fatal errors and cause the immediate termination of MetaStore.
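
To make the proposal more concrete, here is a minimal sketch of the renewal 
logic, assuming plain JDBC rather than the actual DataNucleus / BoneCP code 
paths. The class {{RenewingConnectionHolder}}, its {{ConnectionFactory}} 
interface and the {{renewEnabled}} flag are hypothetical names used only for 
illustration; none of them exist in Hive. The only real API calls are the 
standard {{java.sql.Connection}} methods {{isClosed()}}, {{isValid(int)}} and 
{{close()}}. Query timeout detection is not covered here.

```java
import java.sql.Connection;
import java.sql.SQLException;

/**
 * Hypothetical helper, not part of Hive, DataNucleus or BoneCP.
 * Hands out one shared JDBC connection and replaces it when it is found broken.
 */
final class RenewingConnectionHolder {

    /** Hypothetical factory that opens a fresh connection to the MetaStore DB. */
    interface ConnectionFactory {
        Connection open() throws SQLException;
    }

    private final ConnectionFactory factory;
    private final boolean renewEnabled;          // assumed option, off by default
    private final int validationTimeoutSeconds;  // passed to Connection.isValid(int)
    private Connection current;

    RenewingConnectionHolder(ConnectionFactory factory, boolean renewEnabled,
                             int validationTimeoutSeconds) {
        this.factory = factory;
        this.renewEnabled = renewEnabled;
        this.validationTimeoutSeconds = validationTimeoutSeconds;
    }

    /**
     * Returns the shared connection. With renewal enabled, a closed or invalid
     * connection is discarded and a new one is opened first. A failure to
     * reconnect propagates as SQLException, which the caller may treat as fatal.
     */
    synchronized Connection get() throws SQLException {
        if (current == null) {
            current = factory.open();
        } else if (renewEnabled && !isUsable(current)) {
            closeQuietly(current);
            current = null;            // stay empty if the reconnect below fails
            current = factory.open();
        }
        return current;
    }

    /** A connection counts as usable if it is open and passes driver validation. */
    private boolean isUsable(Connection c) {
        try {
            return !c.isClosed() && c.isValid(validationTimeoutSeconds);
        } catch (SQLException e) {
            return false;
        }
    }

    /** Best-effort close of a connection that is already considered broken. */
    private static void closeQuietly(Connection c) {
        try {
            c.close();
        } catch (SQLException ignored) {
            // nothing useful can be done with a failure here
        }
    }
}
```

With {{renewEnabled}} set to false the holder keeps returning the same 
connection regardless of its state, which corresponds to the current behavior. 
In the real MetaStore the equivalent check would presumably have to live in, 
or be configured on, the connection pool rather than in a standalone class 
like this one.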




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
