[ https://issues.apache.org/jira/browse/HIVE-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658546#comment-14658546 ]
Alan Gates commented on HIVE-11317:
-----------------------------------

Why did you decide to go with a separate thread rather than integrating this with the initiator or the cleaner?  The functionality here is pretty simple and it seems like it would be easy to integrate with either of those.

TxnHandler line 1730 (in heartbeatTxn): you added code to check if the heartbeat failed because the txn was already committed.  A comment to make clear what you're checking for here would be helpful.

TxnHandler, new method performTimeouts: you run a query with a hard-coded limit (of 2500) and then have a do{}while loop to add those values to the list to be deleted until you've reached your batch size.  Once you reach the batch size you call abortTxns, and then go rerun the query.  So why the limit clause and the do/while loop?  Why not just ask up front for the number of entries in a batch with the limit clause?

Tests in general: I have found tests that rely on sleeps to be flaky.  They will usually work locally, but placed on an EC2 box as part of the auto-patch testing they fail because the box is so busy the timeouts are no longer large enough.  In the other compactor threads I've put in flags to make sure the thread ran once rather than relying on timeouts.  This has produced much more reliable results.

> ACID: Improve transaction Abort logic due to timeout
> ----------------------------------------------------
>
>                 Key: HIVE-11317
>                 URL: https://issues.apache.org/jira/browse/HIVE-11317
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore, Transactions
>    Affects Versions: 1.0.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>              Labels: triage
>         Attachments: HIVE-11317.2.patch, HIVE-11317.patch
>
>
> the logic to Abort transactions that have stopped heartbeating is in
> TxnHandler.timeOutTxns()
> This is only called when DbTxnManager.getValidTxns() is called.
> So if there are a lot of txns that need to be timed out and there are no
> SQL clients talking to the system, there is nothing to abort dead
> transactions, and thus compaction can't clean them up so garbage accumulates
> in the system.
> Also, streaming api doesn't call DbTxnManager at all.
> Need to move this logic into Initiator (or some other metastore side thread).
> Also, make sure it is broken up into multiple small(er) transactions against
> metastore DB.
> Also move timeOutLocks() there as well.
> see about adding TXNS.COMMENT field which can be used for "Auto aborted due
> to timeout" for example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)