[ 
https://issues.apache.org/jira/browse/HIVE-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-29644:
----------------------------------
    Labels: pull-request-available  (was: )

> HMS hang/deadlock during ACID replication: compaction enqueue incorrectly 
> runs inside replTableWriteIdState transaction
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29644
>                 URL: https://issues.apache.org/jira/browse/HIVE-29644
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Shreenidhi
>            Assignee: Shreenidhi
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Problem
> During large Hive ACID bootstrap replication on the target (DR) cluster, HMS 
> can become unresponsive. Queries stall at compile time waiting to open 
> transactions. Recovery requires an HMS restart.
> Postgres {{pg_stat_activity}} shows multiple {{idle in transaction}} 
> connections on:
>  * {{AUX_TABLE}} ({{SELECT ... FOR UPDATE}} for {{CompactionScheduler}} 
> mutex)
>  * {{COMPACTION_QUEUE}} / {{NEXT_COMPACTION_QUEUE_ID}}
> HMS logs show cross-node blocking between:
>  * HMS running replication ({{ReplTableWriteIdStateFunction}} / 
> {{repl_tbl_writeid_state}})
>  * HMS running compaction initiator ({{CompactFunction}} via 
> {{TxnHandler.compact}})
> ----
> h3. Root cause
> When replication applies ACID write-ID state for tables with aborted write 
> IDs, HMS schedules major compaction for each partition to clean aborted delta 
> files.
> Before HIVE-27481, {{TxnHandler.replTableWriteIdState}} worked correctly:
>  # Apply write-ID state in one DB transaction
>  # Commit
>  # Call separate {{compact()}} per partition (each with its own transaction)
> After HIVE-27481 ({{TxnHandler cleanup}}), the logic moved to 
> {{ReplTableWriteIdStateFunction}}, which runs inside a single 
> {{@Transactional(POOL_TX)}} method. 
> Compaction enqueue via {{CompactFunction}} was incorrectly inlined in the 
> same transaction as the write-ID apply:
> @Transactional(POOL_TX) replTableWriteIdState()
> ├── apply aborted write IDs, insert NEXT_WRITE_ID
> ├── for each partition:
> │     CompactFunction.execute() // mutex (POOL_MUTEX) + NCQ lock (POOL_TX)
> └── commit (only at end)
> This causes:
>  * {{NEXT_COMPACTION_QUEUE_ID}} row lock held across all partition enqueues 
> in one long transaction
>  * Repeated acquisition of {{CompactionScheduler}} mutex across loop 
> iterations
>  * Cross-connection lock contention / AB-BA deadlock with concurrent 
> {{compact()}} (initiator, another replication job, or manual compact)
> Manual {{ALTER TABLE ... COMPACT 'major'}} does not exhibit this because each 
> {{compact()}} is a separate {{@Transactional(POOL_TX)}} call that commits 
> immediately — same as pre-HIVE-27481 behavior.
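
The difference between the two transaction layouts described above can be sketched with a toy event log. This is only an illustration, not Hive code: the class {{TxnScopeSketch}}, its method names, and the "BEGIN"/"enqueue"/"COMMIT" event strings are all hypothetical. It counts how many compaction enqueues share one transaction, which stands in for how long the {{NEXT_COMPACTION_QUEUE_ID}} row lock would be held.

```java
import java.util.ArrayList;
import java.util.List;

public class TxnScopeSketch {

    // Layout described for the refactored code: every enqueue runs inside
    // the single replication transaction, so nothing commits until the end.
    static List<String> inlined(List<String> partitions) {
        List<String> log = new ArrayList<>();
        log.add("BEGIN");
        log.add("apply write-ID state");
        for (String p : partitions) {
            log.add("enqueue " + p);
        }
        log.add("COMMIT");
        return log;
    }

    // Layout described for the earlier code: commit the write-ID state first,
    // then give each partition's enqueue its own short transaction.
    static List<String> separate(List<String> partitions) {
        List<String> log = new ArrayList<>();
        log.add("BEGIN");
        log.add("apply write-ID state");
        log.add("COMMIT");
        for (String p : partitions) {
            log.add("BEGIN");
            log.add("enqueue " + p);
            log.add("COMMIT");
        }
        return log;
    }

    // Largest number of enqueues that share one transaction. A row lock taken
    // by the first enqueue is held until that transaction commits.
    static int maxEnqueuesPerTxn(List<String> log) {
        int current = 0;
        int max = 0;
        for (String event : log) {
            if (event.equals("BEGIN")) {
                current = 0;
            } else if (event.startsWith("enqueue ")) {
                current++;
                max = Math.max(max, current);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> parts = List.of("p=1", "p=2", "p=3");
        System.out.println(maxEnqueuesPerTxn(inlined(parts)));  // prints 3
        System.out.println(maxEnqueuesPerTxn(separate(parts))); // prints 1
    }
}
```

In the inlined layout the count grows with the number of partitions, while in the separate layout it stays at one regardless of table size.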
> ----
> h3. Locking details
> Compaction enqueue uses two DB connections:
> ||Connection||Lock||Purpose||
> |POOL_MUTEX|{{AUX_TABLE}} CompactionScheduler|Serialize compaction scheduling|
> |POOL_TX|{{NEXT_COMPACTION_QUEUE_ID}} FOR UPDATE|Generate unique compaction queue ID|
> Deadlock/contention occurs when:
>  * Thread A holds the {{NEXT_COMPACTION_QUEUE_ID}} (NCQ) row lock (long repl 
> txn) and waits for the mutex (next partition iteration)
>  * Thread B holds the mutex (inside {{CompactFunction}}) and waits for the 
> NCQ lock
> Disabling the compactor initiator on DR reduces but does not eliminate the 
> risk: concurrent replication jobs alone can trigger the same pattern.
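
The AB-BA pattern between Thread A and Thread B can be reproduced in miniature with two in-process locks. This is a sketch under stated assumptions, not the HMS code path: the two {{ReentrantLock}} objects merely stand in for the {{NEXT_COMPACTION_QUEUE_ID}} row lock and the {{CompactionScheduler}} mutex, and the class and method names are hypothetical. Timed {{tryLock}} calls are used so the demo reports the blocked threads instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class AbBaSketch {

    // Returns how many of the two threads could not get their second lock.
    static int blockedThreads() throws InterruptedException {
        ReentrantLock ncqRowLock = new ReentrantLock();     // stand-in for the NCQ row lock
        ReentrantLock schedulerMutex = new ReentrantLock(); // stand-in for the scheduler mutex
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        // Thread A: holds NCQ, then wants the mutex (next partition iteration).
        Thread a = new Thread(() ->
            holdThenTry(ncqRowLock, schedulerMutex, bothHoldFirst, bothTried, blocked));
        // Thread B: holds the mutex, then wants NCQ (concurrent compact()).
        Thread b = new Thread(() ->
            holdThenTry(schedulerMutex, ncqRowLock, bothHoldFirst, bothTried, blocked));

        a.start();
        b.start();
        a.join();
        b.join();
        return blocked.get();
    }

    static void holdThenTry(ReentrantLock first, ReentrantLock second,
                            CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                            AtomicInteger blocked) {
        first.lock();
        try {
            // Wait until both threads hold their first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // A real database session would wait here indefinitely; the timeout
            // only exists so that this demo terminates.
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            // Keep the first lock until the other thread has also given up.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(blockedThreads()); // prints 2
    }
}
```

Both threads time out because each holds the lock the other needs. Releasing the first lock before requesting the second, which is what a per-partition commit achieves, removes the cycle.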
> ----
> h3. Regression introduced by
> HIVE-27481 — {{TxnHandler cleanup}} (Dec 2023)
> File: {{ReplTableWriteIdStateFunction.java}} — inlined {{CompactFunction}} 
> loop inside {{@Transactional(POOL_TX)}} {{replTableWriteIdState}}.
> Pre-HIVE-27481 code explicitly committed write-ID state first, then called 
> {{compact()}} separately per partition.
> ----
> h3. Proposed fix
> Restore pre-HIVE-27481 behavior in the refactored code



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
