[ 
https://issues.apache.org/jira/browse/HIVE-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658546#comment-14658546
 ] 

Alan Gates commented on HIVE-11317:
-----------------------------------

Why did you decide to go with a separate thread rather than integrating this 
with the initiator or the cleaner?  The functionality here is pretty simple and 
it seems like it would be easy to integrate with either of those.

TxnHandler line 1730 (in heartbeatTxn) you added code to check if the heartbeat 
failed because the txn was already committed.  A comment to make clear what 
you're checking for here would be helpful.

TxnHandler, new method performTimeouts.  You run a query with a hard-coded 
limit (of 2500) and then have a do{}while loop that adds those values to the 
list to be deleted until you've reached your batch size.  Once you reach the 
batch size you call abortTxns, and then go rerun the query.  So why both the 
limit clause and the do/while loop?  Why not just ask up front for the number 
of entries in a batch with the limit clause?
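
To make the suggestion concrete, here is a rough sketch of what I mean, not 
the code in the patch.  All names are hypothetical: fetchTimedOut stands in 
for the query with the limit clause set to the batch size, and abortTxns for 
the existing abort call.  It assumes aborted txns no longer match the query, 
otherwise the loop would never end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntFunction;

/** Hypothetical sketch; none of these names are taken from the patch. */
final class TimeoutBatchSketch {

    /**
     * Fetches at most batchSize timed-out txn ids per query and aborts them,
     * repeating until the query returns nothing.
     * Assumes aborted txns stop matching the query on the next pass.
     *
     * @return the number of batches that were aborted
     */
    static int abortTimedOut(IntFunction<List<Long>> fetchTimedOut,
                             Consumer<List<Long>> abortTxns,
                             int batchSize) {
        int batches = 0;
        while (true) {
            // The limit passed to the query IS the batch size, so no inner loop is needed.
            List<Long> batch = fetchTimedOut.apply(batchSize);
            if (batch.isEmpty()) {
                return batches;
            }
            abortTxns.accept(batch);
            batches++;
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the open, timed-out txns in the metastore DB.
        List<Long> open = new ArrayList<>();
        for (long id = 1; id <= 7; id++) {
            open.add(id);
        }
        int batches = abortTimedOut(
            limit -> new ArrayList<>(open.subList(0, Math.min(limit, open.size()))),
            open::removeAll,
            3);
        System.out.println(batches + " batches, " + open.size() + " left");
    }
}
```

Each pass is then one bounded query plus one abort call, which also keeps the 
transactions against the metastore DB small.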

Tests in general:  I have found tests that rely on sleeps to be flaky.  They 
will usually work locally, but placed on an EC2 box as part of the auto-patch 
testing they fail because the box is so busy the timeouts are no longer large 
enough.  In the other compactor threads I've put in flags to make sure the 
thread ran once rather than relying on timeouts.  This has produced much more 
reliable results.
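
As an illustration of the flag idea (a minimal sketch with hypothetical names, 
not the actual compactor code): the worker takes a stop flag and a looped 
flag.  A test sets stop before running the worker, so exactly one pass is 
executed and the test never has to sleep and hope the thread got scheduled.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a periodic worker that tests can drive for exactly one pass. */
final class OnePassWorkerSketch implements Runnable {
    private final Runnable work;
    private final AtomicBoolean stop;
    private final AtomicBoolean looped;
    private final long intervalMs;

    OnePassWorkerSketch(Runnable work, AtomicBoolean stop, AtomicBoolean looped, long intervalMs) {
        this.work = work;
        this.stop = stop;
        this.looped = looped;
        this.intervalMs = intervalMs;
    }

    @Override
    public void run() {
        do {
            work.run();
            // Signals that at least one full pass has completed.
            looped.set(true);
            if (stop.get()) {
                break; // skip the sleep entirely when asked to stop
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        } while (!stop.get());
    }

    public static void main(String[] args) {
        AtomicInteger passes = new AtomicInteger();
        AtomicBoolean stop = new AtomicBoolean(true);   // pre-set: run one pass, then exit
        AtomicBoolean looped = new AtomicBoolean(false);
        // Run in the calling thread; the long interval is never slept on.
        new OnePassWorkerSketch(passes::incrementAndGet, stop, looped, 60_000L).run();
        System.out.println("passes=" + passes.get() + " looped=" + looped.get());
    }
}
```

Because the pass count is controlled by the flag rather than by wall-clock 
time, the result does not depend on how loaded the test box is.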



> ACID: Improve transaction Abort logic due to timeout
> ----------------------------------------------------
>
>                 Key: HIVE-11317
>                 URL: https://issues.apache.org/jira/browse/HIVE-11317
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore, Transactions
>    Affects Versions: 1.0.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>              Labels: triage
>         Attachments: HIVE-11317.2.patch, HIVE-11317.patch
>
>
> The logic to abort transactions that have stopped heartbeating is in
> TxnHandler.timeOutTxns().
> This is only called when DbTxnManager.getValidTxns() is called.
> So if there are a lot of txns that need to be timed out and there are no 
> SQL clients talking to the system, there is nothing to abort dead 
> transactions, and thus compaction can't clean them up, so garbage accumulates 
> in the system.
> Also, streaming api doesn't call DbTxnManager at all.
> Need to move this logic into Initiator (or some other metastore side thread).
> Also, make sure it is broken up into multiple small(er) transactions against 
> metastore DB.
> Also move the timeOutLocks() logic there as well.
> See about adding a TXNS.COMMENT field which can be used for "Auto aborted due 
> to timeout", for example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
