could you file a Jira for this please? From: Eugene Koifman <ekoif...@hortonworks.com<mailto:ekoif...@hortonworks.com>> Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>> Date: Wednesday, September 23, 2015 at 8:15 AM To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>> Subject: Re: Transaction deadlocks
that’s indeed a problem. From: Steve Howard <stevedhow...@gmail.com<mailto:stevedhow...@gmail.com>> Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>> Date: Tuesday, September 22, 2015 at 6:48 PM To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>> Subject: Re: Transaction deadlocks Am I missing it, or do you reset the deadlock count to 0 every time the lock method is taken, so the ten count is never reached in checkRetryable()? public LockResponse lock(LockRequest rqst) throws NoSuchTxnException, TxnAbortedException, MetaException { >>> this.deadlockCnt = 0; <<< try { Connection dbConn = null; try { dbConn = getDbConn(8); return lock(dbConn, rqst, true); } catch (SQLException e) { LOG.debug("Going to rollback"); rollbackDBConn(dbConn); checkRetryable(dbConn, e, "lock(" + rqst + ")"); throw new MetaException("Unable to update transaction database " + StringUtils.stringifyException(e)); } finally { closeDbConn(dbConn); } return lock(rqst); } catch (RetryException e) {} } On Tue, Sep 22, 2015 at 9:24 PM, Steve Howard <stevedhow...@gmail.com<mailto:stevedhow...@gmail.com>> wrote: Hi Eugene, This is HDP 2.3, 1.2.1.2.3.0.0-2557. I see the deadlock count is 10, and we haven't overriden ConfVars.HMSHANDLERATTEMPTS, which also defaults to 10. I am at a loss why the call stack would be that large (I think it is 1,000), as that indicates to me it keeps failing to get the lock and keeps calling itself. protected void checkRetryable(Connection conn, SQLException e, String caller) throws TxnHandler.RetryException, MetaException { if ((this.dbProduct == null) && (conn != null)) { determineDatabaseProduct(conn); } if (((e instanceof SQLTransactionRollbackException)) || (((this.dbProduct != DatabaseProduct.MYSQL) && (this.dbProduct != DatabaseProduct.POSTGRES) && (this.dbProduct != DatabaseProduct.SQLSERVER)) || ((e.getSQLState().equals("40001")) || ((this.dbProduct == DatabaseProduct.POSTGRES) && (e.getSQLState().equals("40P01"))) || ((this.dbProduct == DatabaseProduct.ORACLE) && ((e.getMessage().contains("deadlock detected")) || (e.getMessage().contains("can't serialize access for this transaction"))))))) { if (this.deadlockCnt++ < 10) { long waitInterval = this.deadlockRetryInterval * this.deadlockCnt; LOG.warn("Deadlock detected in " + caller + ". Will wait " + waitInterval + "ms try again up to " + (10 - this.deadlockCnt + 1) + " times."); try { Thread.sleep(waitInterval); } catch (InterruptedException ie) {} throw new RetryException(); } LOG.error("Too many repeated deadlocks in " + caller + ", giving up."); this.deadlockCnt = 0; } else if (isRetryable(e)) { if (this.retryNum++ < this.retryLimit) { LOG.warn("Retryable error detected in " + caller + ". Will wait " + this.retryInterval + "ms and retry up to " + (this.retryLimit - this.retryNum + 1) + " times. Error: " + getMessage(e)); try { Thread.sleep(this.retryInterval); } catch (InterruptedException ex) {} throw new RetryException(); } LOG.error("Fatal error. Retry limit (" + this.retryLimit + ") reached. Last error: " + getMessage(e)); this.retryNum = 0; } else { this.deadlockCnt = 0; this.retryNum = 0; } } Thanks, Steve On Tue, Sep 22, 2015 at 6:22 PM, Eugene Koifman <ekoif...@hortonworks.com<mailto:ekoif...@hortonworks.com>> wrote: What version of Hive are you running? In the current codebase at least, TxnHandler.checkRetryable() keeps track of how many retries it allowed – it will not retry forever. From: Steve Howard <stevedhow...@gmail.com<mailto:stevedhow...@gmail.com>> Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>> Date: Tuesday, September 22, 2015 at 1:32 PM To: "user@hive.apache.org<mailto:user@hive.apache.org>" <user@hive.apache.org<mailto:user@hive.apache.org>> Subject: Transaction deadlocks Thread A… "HiveServer2-Background-Pool: Thread-35" #35 prio=5 os_prio=0 tid=0x00007fd150e40000 nid=0x2c97 runnable [0x00007fd146e0a000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at oracle.net.ns.Packet.receive(Packet.java:300) at oracle.net.ns.DataPacket.receive(DataPacket.java:106) at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:315) at oracle.net.ns.NetInputStream.read(NetInputStream.java:260) at oracle.net.ns.NetInputStream.read(NetInputStream.java:185) at oracle.net.ns.NetInputStream.read(NetInputStream.java:102) at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:124) at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:80) at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1137) at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:290) at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192) at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531) at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:193) at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1033) at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1329) at oracle.jdbc.driver.OracleStatement.executeUpdateInternal(OracleStatement.java:1838) at oracle.jdbc.driver.OracleStatement.executeUpdate(OracleStatement.java:1803) - locked <0x00000000c09fcda0> (a oracle.jdbc.driver.T4CConnection) at oracle.jdbc.driver.OracleStatementWrapper.executeUpdate(OracleStatementWrapper.java:294) at org.apache.commons.dbcp.DelegatingStatement.executeUpdate(DelegatingStatement.java:228) at org.apache.commons.dbcp.DelegatingStatement.executeUpdate(DelegatingStatement.java:228) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:1432) - locked <0x00000000c09fcc28> (a java.lang.Object) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:422) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) …blocks Thread B, resulting in a deadlock message being written to the logs over and over again… "HiveServer2-Background-Pool: Thread-51" #51 prio=5 os_prio=0 tid=0x000000000279f000 nid=0x7227 waiting for monitor entry [0x00007fd146a47000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:1361) - waiting to lock <0x00000000c09fcc28> (a java.lang.Object) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:422) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:433) …but my guess is the bigger issue is that eventually the stack will be exhausted, as the method below recursively calls itself (notice the stack above)… public LockResponse lock(LockRequest rqst) throws NoSuchTxnException, TxnAbortedException, MetaException { this.deadlockCnt = 0; try { Connection dbConn = null; try { dbConn = getDbConn(8); return lock(dbConn, rqst, true); } catch (SQLException e) { LOG.debug("Going to rollback"); rollbackDBConn(dbConn); checkRetryable(dbConn, e, "lock(" + rqst + ")"); throw new MetaException("Unable to update transaction database " + StringUtils.stringifyException(e)); } finally { closeDbConn(dbConn); } return lock(rqst); } catch (RetryException e) {} } This effectively freezes all access to a table. Any ideas?