Hastyshell opened a new pull request, #61380:
URL: https://github.com/apache/doris/pull/61380

   ## Problem
   
   In cloud mode, schema change on MOW (Merge-on-Write) tables intermittently fails with:
   
   ```
   task type: ALTER, status_code: INTERNAL_ERROR, status_message:
   [(BE_IP)[INTERNAL_ERROR]failed to start tablet job:
   meta_service_job.cpp could not perform compaction on expired tablet cache.
   req_base_compaction_cnt=0, base_compaction_cnt=0,
   req_cumulative_compaction_cnt=8, cumulative_compaction_cnt=9]
   ```
   
   ## Root Cause
   
   Schema change on a MOW table calls `_process_delete_bitmap()`, which registers a `STOP_TOKEN` compaction job via `CloudCompactionStopToken::do_register()`. The `STOP_TOKEN` is **not a real compaction**; it is a lock marker that blocks concurrent compactions during delete bitmap recalculation.
   
   However, `start_compaction_job()` in the meta-service applies the stale tablet cache check **unconditionally to all compaction types**, including `STOP_TOKEN`. If a concurrent compaction on another BE node advances `cumulative_compaction_cnt` in the meta-service while the schema change BE still holds its old cached value, the `STOP_TOKEN` registration is rejected with `STALE_TABLET_CACHE`. This is the situation in the error above: the request carries `req_cumulative_compaction_cnt=8` while the meta-service already records `cumulative_compaction_cnt=9`. The error propagates back to the FE as a fatal ALTER task failure.
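
   The race can be replayed with a small standalone model. This is only a sketch: `JobType`, `CompactionRequest`, `TabletStats`, and `is_rejected_as_stale_before_fix()` are hypothetical stand-ins rather than the real meta-service types, and only the comparison mirrors the pre-fix condition. The counter values are the ones from the error message.

   ```cpp
   #include <cassert>
   #include <cstdint>

   // Hypothetical stand-ins for the meta-service types (not the Doris definitions).
   enum class JobType { BASE, CUMULATIVE, STOP_TOKEN };

   struct CompactionRequest {
       JobType type;
       int64_t base_compaction_cnt;        // value cached on the requesting BE
       int64_t cumulative_compaction_cnt;  // value cached on the requesting BE
   };

   struct TabletStats {
       int64_t base_compaction_cnt;        // value recorded in the meta-service
       int64_t cumulative_compaction_cnt;  // value recorded in the meta-service
   };

   // Pre-fix condition: the job type is never consulted, so a lagging counter
   // rejects every request, STOP_TOKEN included.
   bool is_rejected_as_stale_before_fix(const CompactionRequest& req,
                                        const TabletStats& stats) {
       return req.base_compaction_cnt < stats.base_compaction_cnt ||
              req.cumulative_compaction_cnt < stats.cumulative_compaction_cnt;
   }

   // Replays the sequence of events using the counters from the error message.
   void replay_failure() {
       TabletStats stats{0, 8};
       CompactionRequest stop_token{JobType::STOP_TOKEN, 0, 8};  // BE cached cnt=8

       // Cache and meta-service agree: the registration would pass.
       assert(!is_rejected_as_stale_before_fix(stop_token, stats));

       // A compaction on another BE commits and bumps the counter to 9.
       stats.cumulative_compaction_cnt = 9;

       // The same STOP_TOKEN request is now treated as stale.
       assert(is_rejected_as_stale_before_fix(stop_token, stats));
   }
   ```

   Calling `replay_failure()` from a `main()` exercises both states; the second assertion corresponds to the rejected ALTER task.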
   
   ## Fix
   
   Skip the stale tablet cache check when the compaction job type is `STOP_TOKEN`. Since `STOP_TOKEN` does not read or compact any rowsets, verifying the freshness of the cached compaction counts serves no purpose for it.
   
   ```cpp
   // Before
   if (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
       compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt()) {

   // After
   if (compaction.type() != TabletCompactionJobPB::STOP_TOKEN &&
       (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
        compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt())) {
   ```
   
   ## Testing
   
   Added regression test `StopTokenSkipsStaleTabletCacheCheck` in `cloud/test/meta_service_job_test.cpp` that:
   1. Sets up a tablet with `cumulative_compaction_cnt=9` on the meta-service side
   2. Verifies that a regular `CUMULATIVE` compaction with a stale count of 8 is still correctly rejected with `STALE_TABLET_CACHE`
   3. Verifies that a `STOP_TOKEN` with the same stale count of 8 succeeds with `OK`
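
   The three steps can be approximated outside the meta-service test harness. The sketch below is not the actual test: `JobType`, `Code`, `Request`, `Stats`, and `start_job()` are hypothetical simplifications, and only the condition inside `start_job()` follows the patched check.

   ```cpp
   #include <cassert>
   #include <cstdint>

   // Hypothetical simplified types; the real test drives the meta-service RPC.
   enum class JobType { CUMULATIVE, STOP_TOKEN };
   enum class Code { OK, STALE_TABLET_CACHE };

   struct Request {
       JobType type;
       int64_t base_cnt;  // base compaction count cached by the BE
       int64_t cumu_cnt;  // cumulative compaction count cached by the BE
   };

   struct Stats {
       int64_t base_cnt;  // base compaction count in the meta-service
       int64_t cumu_cnt;  // cumulative compaction count in the meta-service
   };

   // Patched condition: STOP_TOKEN bypasses the staleness comparison.
   Code start_job(const Request& req, const Stats& stats) {
       if (req.type != JobType::STOP_TOKEN &&
           (req.base_cnt < stats.base_cnt || req.cumu_cnt < stats.cumu_cnt)) {
           return Code::STALE_TABLET_CACHE;
       }
       return Code::OK;
   }

   void stop_token_skips_stale_check() {
       // Step 1: the meta-service side is at cumulative_compaction_cnt=9.
       const Stats stats{0, 9};

       // Step 2: a regular CUMULATIVE job with stale count 8 is still rejected.
       const Request cumulative{JobType::CUMULATIVE, 0, 8};
       assert(start_job(cumulative, stats) == Code::STALE_TABLET_CACHE);

       // Step 3: a STOP_TOKEN with the same stale count is accepted.
       const Request stop_token{JobType::STOP_TOKEN, 0, 8};
       assert(start_job(stop_token, stats) == Code::OK);
   }
   ```

   Keeping the `CUMULATIVE` case next to the `STOP_TOKEN` case guards against a fix that accidentally disables the check for every job type.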


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

