[
https://issues.apache.org/jira/browse/HIVE-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HIVE-29644:
----------------------------------
Labels: pull-request-available (was: )
> HMS hang/deadlock during ACID replication: compaction enqueue incorrectly
> runs inside replTableWriteIdState transaction
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29644
> URL: https://issues.apache.org/jira/browse/HIVE-29644
> Project: Hive
> Issue Type: Bug
> Reporter: Shreenidhi
> Assignee: Shreenidhi
> Priority: Major
> Labels: pull-request-available
>
> h3. Problem
> During large Hive ACID bootstrap replication on the target (DR) cluster, HMS
> can become unresponsive. Queries stall at compile time while waiting to open
> transactions, and recovery requires an HMS restart.
> Postgres {{pg_stat_activity}} shows multiple {{idle in transaction}}
> connections on:
> * {{AUX_TABLE}} ({{SELECT ... FOR UPDATE}} for the {{CompactionScheduler}}
> mutex)
> * {{COMPACTION_QUEUE}} / {{NEXT_COMPACTION_QUEUE_ID}}
> HMS logs show cross-node blocking between:
> * HMS running replication ({{ReplTableWriteIdStateFunction}} /
> {{repl_tbl_writeid_state}})
> * HMS running compaction initiator ({{CompactFunction}} via
> {{TxnHandler.compact}})
> ----
> h3. Root cause
> When replication applies ACID write-ID state for tables with aborted write
> IDs, HMS schedules major compaction for each partition to clean aborted delta
> files.
> Before HIVE-27481, {{TxnHandler.replTableWriteIdState}} worked correctly:
> # Apply write-ID state in one DB transaction
> # Commit
> # Call {{compact()}} separately for each partition (each call in its own
> transaction)
> After HIVE-27481 ({{TxnHandler cleanup}}), this logic moved into
> {{ReplTableWriteIdStateFunction}}, which runs inside a single
> {{@Transactional(POOL_TX)}} method.
> The compaction enqueue via {{CompactFunction}} was incorrectly inlined into
> the same transaction as the write-ID apply:
> @Transactional(POOL_TX) replTableWriteIdState()
>   ├── apply aborted write IDs, insert NEXT_WRITE_ID
>   ├── for each partition:
>   │     CompactFunction.execute() // mutex (POOL_MUTEX) + NCQ lock (POOL_TX)
>   └── commit (only at end)
> This causes:
> * {{NEXT_COMPACTION_QUEUE_ID}} row lock held across all partition enqueues
> in one long transaction
> * Repeated acquisition of {{CompactionScheduler}} mutex across loop
> iterations
> * Cross-connection lock contention / AB-BA deadlock with concurrent
> {{compact()}} (initiator, another replication job, or manual compact)
> Manual {{ALTER TABLE ... COMPACT 'major'}} does not exhibit this because each
> {{compact()}} is a separate {{@Transactional(POOL_TX)}} call that commits
> immediately — same as pre-HIVE-27481 behavior.
> ----
> h3. Locking details
> Compaction enqueue uses two DB connections:
> ||Connection||Lock||Purpose||
> |POOL_MUTEX|{{AUX_TABLE}} CompactionScheduler|Serialize compaction scheduling|
> |POOL_TX|{{NEXT_COMPACTION_QUEUE_ID}} FOR UPDATE|Generate unique compaction queue ID|
> Deadlock/contention occurs when:
> * Thread A holds the NCQ ({{NEXT_COMPACTION_QUEUE_ID}}) lock in a long
> replication transaction and waits for the mutex (next partition iteration)
> * Thread B holds the mutex (inside {{CompactFunction}}) and waits for the NCQ
> lock
> Disabling the compactor initiator on DR reduces the risk but does not
> eliminate it: concurrent replication jobs alone can trigger the same pattern.
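
The opposite-order acquisition described above can be reproduced in miniature. The following is only a toy sketch, not HMS code: two in-process `ReentrantLock` objects stand in for the two database locks, and the class name `AbBaLockDemo`, the method `countTimeouts`, and the use of `tryLock` timeouts are assumptions made for illustration. In the real scenario the waits are database row locks with no such timeout, which is why the hang persists.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Toy model only: in-process locks stand in for the two database locks.
class AbBaLockDemo {

    /**
     * Runs two threads that take the two locks in opposite order and returns
     * how many of them timed out while waiting for their second lock.
     */
    static int countTimeouts(long timeoutMillis) {
        ReentrantLock ncqRowLock = new ReentrantLock();     // stands in for NEXT_COMPACTION_QUEUE_ID FOR UPDATE
        ReentrantLock schedulerMutex = new ReentrantLock(); // stands in for the AUX_TABLE CompactionScheduler mutex
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        AtomicInteger timedOut = new AtomicInteger();

        // "Thread A": long replication transaction, NCQ lock first, then mutex.
        Thread replication = new Thread(() -> lockInOrder(
                ncqRowLock, schedulerMutex, bothHoldFirst, bothAttempted, timeoutMillis, timedOut));
        // "Thread B": concurrent compact(), mutex first, then NCQ lock.
        Thread initiator = new Thread(() -> lockInOrder(
                schedulerMutex, ncqRowLock, bothHoldFirst, bothAttempted, timeoutMillis, timedOut));

        replication.start();
        initiator.start();
        try {
            replication.join();
            initiator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return timedOut.get();
    }

    private static void lockInOrder(ReentrantLock first, ReentrantLock second,
                                    CountDownLatch bothHoldFirst, CountDownLatch bothAttempted,
                                    long timeoutMillis, AtomicInteger timedOut) {
        first.lock();
        try {
            // Wait until the other thread also holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            if (second.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                timedOut.incrementAndGet(); // the other thread still holds this lock
            }
            // Keep the first lock until both threads have finished their attempt.
            bothAttempted.countDown();
            bothAttempted.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads that timed out: " + countTimeouts(200));
    }
}
```

Because each thread keeps its first lock until both have given up on the second, neither acquisition can succeed in this model. Taking the locks in one consistent order, or never holding one across the acquisition of the other, removes the cycle.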
> ----
> h3. Regression introduced by
> HIVE-27481 ({{TxnHandler cleanup}}, Dec 2023)
> File: {{ReplTableWriteIdStateFunction.java}}, which inlined the
> {{CompactFunction}} loop inside the {{@Transactional(POOL_TX)}}
> {{replTableWriteIdState}} method.
> The pre-HIVE-27481 code explicitly committed the write-ID state first, then
> called {{compact()}} separately per partition.
> ----
> h3. Proposed fix
> Restore pre-HIVE-27481 behavior in the refactored code
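
As a rough illustration of the transaction shape the pre-HIVE-27481 code is described as having, the sketch below records begin/commit events in order. It is not the actual patch: `ReplWriteIdFixSketch`, `TxRunner`, and `inNewTransaction` are hypothetical names, and the "transactions" are simulated by appending to a list rather than by talking to a database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the intended ordering; no real HMS or JDBC calls.
class ReplWriteIdFixSketch {

    /** Stand-in for "run this work in its own DB transaction, then commit". */
    interface TxRunner {
        void inNewTransaction(String label, Runnable work);
    }

    /** Returns the ordered event log for applying write-ID state and enqueueing compactions. */
    static List<String> replTableWriteIdState(List<String> partitions) {
        List<String> log = new ArrayList<>();
        TxRunner tx = (label, work) -> {
            log.add("begin " + label);
            work.run();
            log.add("commit " + label);
        };

        // Steps 1 and 2: apply the write-ID state in one transaction and commit it.
        tx.inNewTransaction("write-id-state", () -> log.add("apply aborted write IDs"));

        // Step 3: enqueue compaction per partition, each in its own short transaction,
        // so no lock taken for one enqueue is still held during the next one.
        for (String partition : partitions) {
            tx.inNewTransaction("compact " + partition,
                    () -> log.add("enqueue major compaction for " + partition));
        }
        return log;
    }

    public static void main(String[] args) {
        replTableWriteIdState(List.of("p=1", "p=2")).forEach(System.out::println);
    }
}
```

The point of the ordering is that the write-ID transaction commits before any compaction lock is requested, and each enqueue releases its locks at its own commit.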