[ https://issues.apache.org/jira/browse/HIVE-27332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-27332:
----------------------------------
    Labels: pull-request-available  (was: )

> Add retry backoff mechanism for abort cleanup
> ---------------------------------------------
>
>                 Key: HIVE-27332
>                 URL: https://issues.apache.org/jira/browse/HIVE-27332
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sourabh Badhya
>            Assignee: Sourabh Badhya
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> HIVE-27019 and HIVE-27020 added the ability to clean data directories of
> aborted transactions directly, without going through the Initiator and Worker.
> However, when cleanup keeps failing, a retry is initiated every single time.
> We need a retry backoff mechanism that controls how long to wait before the
> next retry, instead of retrying continuously.
>
> There are broadly 3 cases in which retry after an abort cleanup failure is
> affected -
> *1. Abort cleanup on the table failed + Compaction on the table failed.*
> *2. Abort cleanup on the table failed + Compaction on the table passed.*
> *3. Abort cleanup on the table failed + No compaction on the table.*
>
> *Solution -*
> *We create a new table called TXN_CLEANUP_QUEUE with the following fields to
> store the retry metadata -*
> CREATE TABLE TXN_CLEANUP_QUEUE (
>     TCQ_DATABASE varchar(128) NOT NULL,
>     TCQ_TABLE varchar(256) NOT NULL,
>     TCQ_PARTITION varchar(767),
>     TCQ_RETRY_RETENTION bigint NOT NULL DEFAULT 0,
>     TCQ_ERROR_MESSAGE mediumtext in MySQL / clob in Derby and Oracle / text in
>     Postgres / varchar(max) in MSSQL
> );
> *Advantage: Separates the flow of metadata. It also eliminates the chance of
> breaking compaction when modifying abort cleanup metadata, and vice versa.
> Debugging is easier in case of failures.*
>
> *Actions performed by TaskHandler in the case of failure -*
> *AbortTxnCleaner -*
> Action: Just add the retry details to the queue table when abort cleanup fails.
> *CompactionCleaner -*
> Action: If compaction on the same table is successful, delete the retry entry
> in markCleaned when removing any TXN_COMPONENTS entries, except when there are
> no uncompacted aborts. We do not want to end up with a queue entry for a table
> that has no associated record in TXN_COMPONENTS.
> *Advantage: No performance issues are expected with this approach, since in
> most cases we delete only 1 record for the associated table/partition.*
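
To make the backoff idea concrete, here is a minimal sketch of how a retry retention value such as TCQ_RETRY_RETENTION could grow after each failed abort cleanup. The issue does not specify the backoff policy, so the doubling strategy, the millisecond unit, the cap, and every name below (AbortCleanupBackoffSketch, nextRetentionMs, isRetryDue) are assumptions made for illustration and are not taken from the Hive code base.

```java
/**
 * Hypothetical helper illustrating a capped, doubling retry backoff.
 * Not part of Hive; names, units and policy are assumptions.
 */
final class AbortCleanupBackoffSketch {

    private AbortCleanupBackoffSketch() {
    }

    /**
     * Returns the retention to store after a failed cleanup attempt.
     *
     * @param currentMs retention currently stored (0 means no failure yet,
     *                  mirroring the DEFAULT 0 of TCQ_RETRY_RETENTION)
     * @param baseMs    retention applied after the first failure (assumed setting)
     * @param maxMs     upper bound for the retention (assumed setting)
     */
    static long nextRetentionMs(long currentMs, long baseMs, long maxMs) {
        if (baseMs <= 0 || maxMs < baseMs) {
            throw new IllegalArgumentException("require 0 < baseMs <= maxMs");
        }
        if (currentMs <= 0) {
            return baseMs;           // first failure: start at the base value
        }
        if (currentMs > maxMs / 2) {
            return maxMs;            // doubling would exceed the cap (also avoids overflow)
        }
        return currentMs * 2;        // later failures: double the wait
    }

    /**
     * Returns true once the retention period has elapsed since the last failure,
     * i.e. the cleaner may try this table/partition again.
     */
    static boolean isRetryDue(long lastFailureMs, long retentionMs, long nowMs) {
        return nowMs - lastFailureMs >= retentionMs;
    }
}
```

With a base of 1000 ms and a cap of 8000 ms, successive failures would yield retentions of 1000, 2000, 4000, 8000 and then stay at 8000. A cleaner following this sketch would skip a table/partition until isRetryDue returns true, and the queue entry would be removed once cleanup or compaction succeeds, as described above.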