[ 
https://issues.apache.org/jira/browse/HIVE-27332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-27332:
----------------------------------
    Labels: pull-request-available  (was: )

> Add retry backoff mechanism for abort cleanup
> ---------------------------------------------
>
>                 Key: HIVE-27332
>                 URL: https://issues.apache.org/jira/browse/HIVE-27332
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sourabh Badhya
>            Assignee: Sourabh Badhya
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> HIVE-27019 and HIVE-27020 added the ability to clean the data directories 
> of aborted transactions directly, without using the Initiator and Worker. 
> However, when cleanup fails repeatedly, the retry is initiated every single 
> time. We need a retry backoff mechanism that controls how long to wait 
> before the next retry, instead of retrying continuously.
> There are broadly 3 cases in which retry of abort cleanup is affected - 
> *1. Abort cleanup on the table failed + Compaction on the table failed.*
> *2. Abort cleanup on the table failed + Compaction on the table passed*
> *3. Abort cleanup on the table failed + No compaction on the table.*
> *Solution -* 
> *We create a new table called TXN_CLEANUP_QUEUE with the following fields 
> to store the retry metadata -* 
> CREATE TABLE TXN_CLEANUP_QUEUE (
> TCQ_DATABASE varchar(128) NOT NULL, 
> TCQ_TABLE varchar(256) NOT NULL,
> TCQ_PARTITION varchar(767), 
> TCQ_RETRY_RETENTION bigint NOT NULL DEFAULT 0, 
> TCQ_ERROR_MESSAGE mediumtext -- mediumtext in MySQL; clob in Derby and 
> -- Oracle; text in PostgreSQL; varchar(max) in MSSQL
> );
> *Advantage: Keeps the abort cleanup metadata separate from the compaction 
> metadata. This eliminates the chance of breaking compaction when modifying 
> abort cleanup metadata, and vice versa, and makes failures easier to debug.*
> *Actions performed by TaskHandler in the case of failure -* 
> *AbortTxnCleaner -* 
> Action: Add the retry details to the queue table when abort cleanup fails.
> *CompactionCleaner -* 
> Action: If compaction on the same table succeeds, delete the retry entry in 
> markCleaned when removing the TXN_COMPONENTS entries, except when there are 
> no uncompacted aborts. We do not want to end up with a queue entry for a 
> table that has no associated record in TXN_COMPONENTS.
> *Advantage: No performance issues are expected with this approach, since in 
> most cases we delete only 1 record for the associated table/partition.*
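
The handler actions described above can be sketched with an in-memory stand-in for TXN_CLEANUP_QUEUE. This is only an illustration under stated assumptions, not the code from the pull request: the class name, the method names, the doubling policy, the initial and maximum retention values, and the lastFailureMs field are all hypothetical. The description specifies only that retry metadata such as TCQ_RETRY_RETENTION and TCQ_ERROR_MESSAGE is stored per table/partition. The "no uncompacted aborts" exception in the CompactionCleaner action is not modeled here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of TXN_CLEANUP_QUEUE; not Hive's actual code.
class TxnCleanupRetrySketch {
    // Assumed values, chosen only for illustration.
    static final long INITIAL_RETENTION_MS = 60_000L;
    static final long MAX_RETENTION_MS = 3_600_000L;

    static final class Entry {
        long retryRetentionMs; // plays the role of TCQ_RETRY_RETENTION
        long lastFailureMs;    // hypothetical; not part of the proposed schema
        String errorMessage;   // plays the role of TCQ_ERROR_MESSAGE
    }

    private final Map<String, Entry> queue = new HashMap<>();

    // Builds a key from TCQ_DATABASE, TCQ_TABLE and the nullable TCQ_PARTITION.
    static String key(String db, String table, String partition) {
        return db + "/" + table + "/" + (partition == null ? "" : partition);
    }

    // Assumed backoff policy: start at the initial value, then double up to a cap.
    static long nextRetention(long currentMs) {
        if (currentMs <= 0) {
            return INITIAL_RETENTION_MS;
        }
        return Math.min(currentMs * 2, MAX_RETENTION_MS);
    }

    // AbortTxnCleaner action: record the retry details when abort cleanup fails.
    void onAbortCleanupFailure(String db, String table, String partition,
                               String error, long nowMs) {
        Entry e = queue.computeIfAbsent(key(db, table, partition), k -> new Entry());
        e.retryRetentionMs = nextRetention(e.retryRetentionMs);
        e.lastFailureMs = nowMs;
        e.errorMessage = error;
    }

    // A retry is allowed if there is no entry or the retention period has elapsed.
    boolean isReadyForRetry(String db, String table, String partition, long nowMs) {
        Entry e = queue.get(key(db, table, partition));
        return e == null || nowMs - e.lastFailureMs >= e.retryRetentionMs;
    }

    // CompactionCleaner action: drop the retry entry after a successful compaction.
    void onCompactionCleaned(String db, String table, String partition) {
        queue.remove(key(db, table, partition));
    }

    // Returns the stored retention, or 0 when no entry exists.
    long retention(String db, String table, String partition) {
        Entry e = queue.get(key(db, table, partition));
        return e == null ? 0L : e.retryRetentionMs;
    }
}
```

With these assumed values, a first failure delays the next attempt by one minute, a second consecutive failure by two minutes, and so on up to the cap. A successful compaction on the same table/partition removes the entry, so abort cleanup is no longer held back.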



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
