[ 
https://issues.apache.org/jira/browse/HIVE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170041#comment-14170041
 ] 

Sushanth Sowmyan commented on HIVE-7368:
----------------------------------------

Hi Selina,

Yes, I would agree that the connection pool (or the JDBC driver, since I've 
since seen this happen a couple of times with DBCP as well) is probably raising 
some sort of internal error that is being incorrectly read as normal operation 
by DN, which results in an NSOE (NoSuchObjectException) from the Hive 
ObjectStore. I definitely agree that this is the underlying error we need to 
reproduce and track down to fix.

In the case of a persistent remote metastore, I agree that increasing the 
connection pool size makes sense and should be the way to go. I generally do 
advise a larger pool, and always going through the metastore. 

But in the case of parallel Hive fat clients, the embedded metastore is 
effectively single-threaded w.r.t. connections to the database, so I'm afraid 
I don't yet understand how a larger pool would help in this case. Could you 
please expand on this bit?

(And yes, "datanucleus.connectionPool.testSQL=SELECT 1" is so that the overhead 
of DN testing connectivity to the db is minimized - without that, DN creates a 
bunch of deleteme* tables and drops them to test connectivity.)
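
For anyone trying to reproduce this, the DataNucleus knobs mentioned above can 
be set in hive-site.xml roughly like this. The testSQL value is the one from 
the discussion; the pooling property reflects the "pooling disabled" test runs 
and its exact value should be treated as illustrative:

```xml
<!-- hive-site.xml fragment; testSQL avoids DN's deleteme* table probes,
     connectionPoolingType=None matches the runs with pooling disabled -->
<property>
  <name>datanucleus.connectionPoolingType</name>
  <value>None</value>
</property>
<property>
  <name>datanucleus.connectionPool.testSQL</name>
  <value>SELECT 1</value>
</property>
```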


> datanucleus sometimes returns an empty result instead of an error or data
> -------------------------------------------------------------------------
>
>                 Key: HIVE-7368
>                 URL: https://issues.apache.org/jira/browse/HIVE-7368
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 0.12.0
>            Reporter: Sushanth Sowmyan
>
> I investigated a scenario wherein a user needed to use a large number of 
> concurrent hive clients doing simple DDL tasks, while not using a standalone 
> metastore server. Say, e.g., each of them doing "drop table if exists 
> tmp_blah_${i};"
> This would consistently fail, stating that it could not create a db, which is 
> a funny error to get when trying to drop a table "if exists". On digging in, it 
> turned out that the error was a mistaken report, coming instead from the 
> embedded metastore attempting to create a "default" db when it 
> did not exist. The funny thing is that the default db did exist, and the 
> getDatabase call would return empty rather than returning an error or a 
> result. We could disable hive.metastore.checkForDefaultDb, and the number of 
> these reports would drastically fall, but that only moved the problem: we'd 
> still get phantom reports from time to time of various other databases that 
> did exist being reported as non-existent.
> On digging further, parallelism seemed to be an important factor in whether 
> or not hive was able to perform getDatabases without error. With about 20 
> simultaneous processes, there seemed to be no errors whatsoever. At about 40 
> simultaneous processes, at least one would consistently fail. At about 200, 
> about 15-20 would consistently fail, in addition to taking a long time to run.
> I wrote a sample JDBC ping (actually a get_database mimic) utility to see 
> whether the issue was with connecting from that machine to the database 
> server, and this had no errors whatsoever up to 400 simultaneous processes. 
> The mysql server in question was configured to serve up to 650 connections, 
> and it seemed to be serving responses quickly and did not seem overloaded. We 
> also disabled connection pooling in case that was exacerbating a connection 
> availability issue with that many concurrent processes, each with an embedded 
> metastore. That, especially in conjunction with disabling schema checking 
> and specifying "datanucleus.connectionPool.testSQL=SELECT 1", did a fair 
> amount for performance in this scenario, but the errors (or rather, the 
> null-result successes where there shouldn't have been any) continued.
> On checking through Hive again: if we modified Hive to have DataNucleus 
> simply hand back a connection, on which we issued a direct SQL get-database 
> query, there would not be any error; but if we used JDO on DataNucleus to 
> construct a db object, we would get an empty result. So the issue seems to 
> crop up in the JDO mapping.
> One of the biggest issues with this investigation, for me, was the difficulty 
> of reproducibility. When trying to reproduce in a lab, we were unable to 
> create a similar enough environment that caused the issue. Even in the 
> client's environment, moving from RHEL5 to RHEL6 made the issue go away.
> Thus, while we still have work to do on determining the underlying issue, 
> I'm logging this issue to collect information on similar problems we 
> discover, so we can work towards nailing down the root cause and then fixing 
> it (in DN if need be).
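
A rough sketch of the kind of standalone JDBC ping described above, mimicking 
get_database with a direct query. The URL/credentials handling is illustrative, 
and the DBS/NAME identifiers follow the stock metastore schema; this is an 
assumption-laden reconstruction, not the actual utility:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MetastorePing {
    // Direct query against the metastore backing database, mimicking
    // get_database; DBS/NAME follow the stock Hive metastore schema.
    static final String GET_DB_SQL = "SELECT NAME FROM DBS WHERE NAME = ?";

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            // No connection details supplied; just show the query we'd run.
            System.out.println(GET_DB_SQL);
            return;
        }
        // e.g. jdbc:mysql://dbhost:3306/metastore, plus user and password
        String url = args[0], user = args[1], pass = args[2];
        try (Connection conn = DriverManager.getConnection(url, user, pass);
             PreparedStatement ps = conn.prepareStatement(GET_DB_SQL)) {
            ps.setString(1, "default");
            try (ResultSet rs = ps.executeQuery()) {
                // An empty result here with no exception would mirror the
                // phantom "database does not exist" failures seen via JDO.
                System.out.println(rs.next() ? "FOUND" : "EMPTY RESULT");
            }
        }
    }
}
```

Run a few hundred of these in parallel against the same server to compare raw 
JDBC behavior with what the embedded metastore sees through JDO.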



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
