[PR] [Enhancement](recyclebin) Optimize lock granularity in CatalogRecycleBin [doris]

via GitHub Sun, 15 Mar 2026 21:35:01 -0700


morrySnow opened a new pull request, #61366:
URL: https://github.com/apache/doris/pull/61366


   ## Problem
   
   Every method in `CatalogRecycleBin.java` is `synchronized` on a single 
monitor lock, so lock granularity is extremely coarse. When `erasePartition()` 
runs slowly over many partitions, every other `synchronized` method blocks 
waiting for that lock. Callers such as `recyclePartition()` hold the table 
write lock while they wait, causing cascading blocking that can bring down the 
entire Doris metadata service.
   
   ## Solution
   
   Two complementary optimizations:
   
   ### 1. Replace `synchronized` with `ReentrantReadWriteLock`
   - **Lock-free** (8 methods): Simple ConcurrentHashMap lookups 
(`isRecyclePartition`, `getRecycleTimeById`, etc.)
   - **Read lock** (4 methods): Read-only iterations 
(`allTabletsInRecycledStatus`, `getInfo`, `write`, etc.)
   - **Write lock** (11 methods): Map mutations 
(`recycleDatabase/Table/Partition`, `recover*`, `clearAll`)
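
The three tiers above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `RecycleBinSketch`, its single map, and the method names `isRecycled`, `listRecycledIds`, and `recycle` are hypothetical stand-ins for the real `CatalogRecycleBin` members.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for CatalogRecycleBin; all names are illustrative.
final class RecycleBinSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // ConcurrentHashMap makes single-key lookups safe without taking the lock.
    private final Map<Long, Long> idToRecycleTime = new ConcurrentHashMap<>();

    // Lock-free tier: a single ConcurrentHashMap lookup.
    boolean isRecycled(long id) {
        return idToRecycleTime.containsKey(id);
    }

    // Read-lock tier: iteration that should not interleave with mutations.
    List<Long> listRecycledIds() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(idToRecycleTime.keySet());
        } finally {
            lock.readLock().unlock();
        }
    }

    // Write-lock tier: map mutation, exclusive against readers and writers.
    void recycle(long id, long recycleTimeMs) {
        lock.writeLock().lock();
        try {
            idToRecycleTime.put(id, recycleTimeMs);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Multiple threads can hold the read lock at once, so read-only iterations no longer serialize against each other as they did under `synchronized`.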
   
   ### 2. Microbatch Erase Pattern (Critical)
   Refactored all 12 erase methods to process items **one at a time** with lock 
release between items:
   - **Inside write lock (per item)**: cleanup RPCs + map removal + edit log 
write
   - **Release lock between items**: other operations can proceed
   
   This reduces lock hold time per acquisition from **O(N × T)** (all N items, 
each taking time T) to **O(T)** (one item).
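
A minimal sketch of the microbatch pattern, under stated assumptions: `MicrobatchEraseSketch`, `eraseExpired`, and the cutoff-based expiry check are hypothetical, and the cleanup RPCs and edit log write are reduced to a comment. The read-locked snapshot and the re-check under the write lock are one way to structure this, not necessarily how the PR does it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration of per-item locking; not the actual Doris code.
final class MicrobatchEraseSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<Long, Long> idToRecycleTime = new ConcurrentHashMap<>();

    void recycle(long id, long recycleTimeMs) {
        lock.writeLock().lock();
        try {
            idToRecycleTime.put(id, recycleTimeMs);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Erases every entry recycled at or before cutoffMs, one item per lock hold.
    int eraseExpired(long cutoffMs) {
        // Step 1: snapshot candidate IDs under the read lock.
        List<Long> candidates = new ArrayList<>();
        lock.readLock().lock();
        try {
            for (Map.Entry<Long, Long> e : idToRecycleTime.entrySet()) {
                if (e.getValue() <= cutoffMs) {
                    candidates.add(e.getKey());
                }
            }
        } finally {
            lock.readLock().unlock();
        }

        // Step 2: take the write lock once per item, releasing it in between
        // so that other readers and writers can make progress.
        int erased = 0;
        for (long id : candidates) {
            lock.writeLock().lock();
            try {
                // Re-check: the item may have been recovered or erased
                // by another thread since the snapshot was taken.
                Long recycleTime = idToRecycleTime.get(id);
                if (recycleTime == null || recycleTime > cutoffMs) {
                    continue;
                }
                // Placeholder: cleanup RPCs and the edit log write would go here.
                idToRecycleTime.remove(id);
                erased++;
            } finally {
                lock.writeLock().unlock();
            }
        }
        return erased;
    }

    int size() {
        return idToRecycleTime.size();
    }
}
```

The trade-off is that the erase pass as a whole is no longer atomic, which is why each item is re-validated after the write lock is reacquired.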
   
   ## Data Structure Changes
   
   Changed 4 internal maps from `HashMap` to `ConcurrentHashMap` to enable 
lock-free reads.
   
   ## Bug Fixes (found during self-review)
   
   1. **NPE in `getIdListToEraseByRecycleTime`**: Used `getOrDefault` to handle 
stale IDs that may be concurrently removed between snapshot and processing
   2. **DdlException in cascade erase**: Added try-catch in 
`eraseDatabaseInstantly`/`eraseTableInstantly` for partitions/tables 
concurrently erased by daemon
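
The first fix can be illustrated with a small sketch. The helper `idsToErase` and the choice of `Long.MAX_VALUE` as the default are assumptions made for this example; the PR does not state which default value `getIdListToEraseByRecycleTime` uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper showing why getOrDefault avoids the NPE.
final class StaleIdSketch {
    static List<Long> idsToErase(List<Long> snapshotIds,
                                 Map<Long, Long> idToRecycleTime,
                                 long cutoffMs) {
        List<Long> result = new ArrayList<>();
        for (Long id : snapshotIds) {
            // With get(id), an ID removed after the snapshot yields null, and
            // unboxing null into a long throws NullPointerException.
            // Assumed default: Long.MAX_VALUE, so a missing ID never looks expired.
            long recycleTime = idToRecycleTime.getOrDefault(id, Long.MAX_VALUE);
            if (recycleTime <= cutoffMs) {
                result.add(id);
            }
        }
        return result;
    }
}
```

With this default, an ID that disappeared between snapshot and processing is skipped silently instead of aborting the whole pass.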
   
   ## Testing
   
   - All 24 existing unit tests pass
   - Added 3 new concurrency tests:
     - `testConcurrentReadsDoNotBlock` — 10 concurrent reader threads
     - `testConcurrentRecycleAndRead` — writer + 5 readers simultaneously
     - `testMicrobatchEraseReleasesLockBetweenItems` — verifies 
recyclePartition succeeds during erase
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [x] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [x] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [x] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


