[ https://issues.apache.org/jira/browse/HIVE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170041#comment-14170041 ]
Sushanth Sowmyan commented on HIVE-7368:
----------------------------------------

Hi Selina,

Yes, I would agree that the connection pool (or JDBC driver, since I've since been able to see this happening a couple of times with DBCP as well) is probably raising some sort of internal error that DN incorrectly reads as normal operation, which results in a NoSuchObjectException (NSOE) from the Hive ObjectStore. I definitely agree that this is the underlying error that we need to reproduce and track down to fix.

In the case of a persistent remote metastore, I would agree that increasing the size of the connection pools makes sense, and should be the way to go. I generally do advise a larger pool, and always going through the metastore. But in the case of parallel hive fat clients, the embedded metastore is effectively single-threaded w.r.t. connections to the database, so I'm afraid I don't yet understand how having a larger pool would help in this case. Could you please expand on this bit?

(And yes, "datanucleus.connectionPool.testSQL=SELECT 1" is there to minimize the overhead of DN testing connectivity to the db. Without it, DN creates a bunch of deleteme* tables and drops them to test connectivity.)

> datanucleus sometimes returns an empty result instead of an error or data
> -------------------------------------------------------------------------
>
>                 Key: HIVE-7368
>                 URL: https://issues.apache.org/jira/browse/HIVE-7368
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 0.12.0
>            Reporter: Sushanth Sowmyan
>
> I investigated a scenario wherein a user needed to use a large number of
> concurrent hive clients doing simple DDL tasks, while not using a standalone
> metastore server. Say, e.g., each of them doing "drop table if exists
> tmp_blah_${i};"
> This would consistently fail stating that it could not create a db, which is
> a funny error to have when trying to drop a db "if exists".
> On digging in, it
> turned out that the error was a mistaken report, coming instead from an
> attempt by the embedded metastore to create a "default" db when it
> did not exist. The funny thing was that the default db did exist, and the
> getDatabase call would return empty, rather than returning an error or
> returning a result. We could disable hive.metastore.checkForDefaultDb and the
> number of these reports would drastically fall, but that only moved the
> problem, and we'd get phantom reports from time to time of various other
> databases that existed that were being reported as non-existent.
> On digging further, parallelism seemed to be an important factor in whether
> or not hive was able to perform getDatabases without error. With about 20
> simultaneous processes, there seemed to be no errors whatsoever. At about 40
> simultaneous processes, at least 1 would consistently fail. At about 200,
> about 15-20 would consistently fail, in addition to taking a long time to run.
> I wrote a sample JDBC ping (actually a get_database mimic) utility to see
> whether the issue was with connecting from that machine to the database
> server, and this had no errors whatsoever up to 400 simultaneous processes.
> The mysql server in question was configured to serve up to 650 connections,
> and it seemed to be serving responses quickly and did not seem overloaded. We
> also disabled connection pooling in case that was exacerbating a connection
> availability issue with that many concurrent processes, each with an embedded
> metastore. That, especially in conjunction with disabling schema checking,
> and specifying a "datanucleus.connectionPool.testSQL=SELECT 1" did a fair
> amount for performance in these scenarios, but the errors (or rather, the
> null-result-successes when there shouldn't have been one) continued.
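
For anyone trying to reproduce the parallelism effect described above, a minimal sketch of a concurrent lookup harness might look like the following. It is not the JDBC ping utility mentioned in the report. The names classify, run_probes and stub_probe are hypothetical, the probe is an in-memory stand-in rather than a real metastore or JDBC call, and it uses threads where the report used separate processes. The point it illustrates is tallying "data", "empty" and "error" outcomes separately, so that an empty result for a database known to exist shows up as its own category instead of being folded into success.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def classify(probe, name):
    """Run one lookup and label the outcome 'data', 'empty', or 'error'."""
    try:
        row = probe(name)
    except Exception:
        return "error"
    return "data" if row else "empty"


def run_probes(probe, name, workers):
    """Issue `workers` concurrent lookups of `name` and tally the outcomes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda _: classify(probe, name), range(workers)))
    return Counter(outcomes)


def stub_probe(name):
    # Hypothetical stand-in for a real get_database call: only "default" exists.
    known = {"default": ("default",)}
    return known.get(name)


if __name__ == "__main__":
    # Concurrency levels taken from the report; the stub never misbehaves,
    # so swap in a real probe to look for unexpected "empty" outcomes.
    for n in (20, 40, 200):
        print(n, dict(run_probes(stub_probe, "default", n)))
```

To use it against a real setup, replace stub_probe with a function that performs the actual lookup and returns the row (or None). Any non-zero "empty" count for a database that is known to exist would correspond to the null-result-successes described in the report.
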
> On checking through hive again, if we modified hive to have datanucleus
> simply return a connection, with which we did a direct sql get database,
> there would not be any error, but if we tried to use jdo on datanucleus to
> construct a db object, we would get an empty result, so the issue seems to
> crop up in the jdo mapping.
> One of the biggest issues with this investigation, for me, was the difficulty
> of reproducing it. When trying to reproduce in a lab, we were unable to
> create a similar enough environment that caused the issue. Even in the
> client's environment, moving from RHEL5 to RHEL6 made the issue go away.
> Thus, we still have work to do on determining the underlying issue. I'm
> logging this issue to collect information on similar issues we discover so we
> can work towards nailing down the issue and then fixing it (in DN if need be).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)