(doris) branch branch-4.1 updated: branch-4.1: [fix](test) stabilize Cloud-P0 flaky cases (test-only backport subset of #64525) (#64597)

yiguolei Wed, 17 Jun 2026 03:54:37 -0700

This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch branch-4.1
in repository https://gitbox.apache.org/repos/asf/doris.git



The following commit(s) were added to refs/heads/branch-4.1 by this push:
     new 79165cac0ac branch-4.1: [fix](test) stabilize Cloud-P0 flaky cases 
(test-only backport subset of #64525) (#64597)
79165cac0ac is described below

commit 79165cac0acd5f7aaea94773b2addfa2f6f83768
Author: shuke <[email protected]>
AuthorDate: Wed Jun 17 18:54:13 2026 +0800

    branch-4.1: [fix](test) stabilize Cloud-P0 flaky cases (test-only backport 
subset of #64525) (#64597)
    
    ## What problem does this PR solve?
    
    Stabilizes the **branch-4.1 Cloud-P0** regression pipeline by
    back-porting the
    relevant, **test-only** subset of branch-4.0 #64525 (which already
    stabilized
    the 4.0 P0/Cloud-P0 pipeline). These suites are currently
    muted-but-still-failing
    on 4.1 Cloud-P0; each commit deflakes one case so it can later be
    un-muted.
    
    Cases addressed (failure → fix):
    - **scanner_profile** — hard-coded `actualRows=9` assert (fails when
    tablets prune to <9 rows) → assert `actualRows ∈ [1,9]` via robust
    regex, and poll until the async execution profile lands.
    - **temp_table** — `assertEquals(show_tablets.size(), 3)` assumed 1
    replica; on the 3-replica cloud cluster SHOW TABLETS returns 3×3=9 rows
    → derive replica count robustly (the old `/:(\d+)/` regex missed the
    space in `tag.location.default: 3`).
    - **insert_group_commit_into_max_filter_ratio** — group-commit async
    count race.
    - **adaptive_pipeline_task_serial_read_on_limit** — poll until the
    async-reported profile is complete.
    - **prune_bucket_with_bucket_shuffle_join** — pin
    `parallel_pipeline_task_num=1` so bucket-shuffle plan is deterministic.
    
    All changes are under `regression-test/` (no FE/BE code), so no
    compilation impact.
    
    ## Cherry-picked from branch-4.0 #64525 (with `-x` provenance)
    - `[fix](test) fix scanner_profile actualRows regex (partial backport of
    #64238)`
    - `[fix](test) make test_temp_table replica detection robust`
    - `[fix](case) fix insert_group_commit_into_max_filter_ratio`
    - `[fix](test) poll for complete profile in scanner_profile`
    - `[fix](test) poll for complete profile in
    adaptive_pipeline_task_serial_read_on_limit`
    - `[fix](test) pin parallel_pipeline_task_num=1 to keep bucket shuffle
    in prune_bucket_with_bucket_shuffle_join`
    
    ## Deliberately NOT included (follow-up)
    - **ANN index-only-scan rework** (#64525 replaces
    `ann_index_only_scan*.groovy` with debug-point variants) — needs
    confirmation the new debug points exist on 4.1; deferred to a separate
    PR.
    - **time_series compaction debug-point** — touches BE
    (`cumulative_compaction_time_series_policy.cpp`), needs a compile;
    deferred.
    - **`skip temp_table_p0/test_temp_table.groovy`** — intentionally
    omitted; we keep the case running rather than skipping it, to confirm
    the robust assert is sufficient on 4.1.
    
    ## Release note
    None
    
    🤖 Generated with [Claude Code](https://claude.com/claude-code)
    
    ---------
    
    Co-authored-by: morningman <[email protected]>
    Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
    Co-authored-by: meiyi <[email protected]>
---
 .../prune_bucket_with_bucket_shuffle_join.groovy   |  4 +++
 ...ptive_pipeline_task_serial_read_on_limit.groovy | 41 ++++++++++++----------
 .../suites/query_profile/scanner_profile.groovy    | 30 +++++++++++++---
 .../suites/temp_table_p0/test_temp_table.groovy    | 30 +++++++++++++++-
 4 files changed, 82 insertions(+), 23 deletions(-)

diff --git 
a/regression-test/suites/nereids_syntax_p0/distribute/prune_bucket_with_bucket_shuffle_join.groovy
 
b/regression-test/suites/nereids_syntax_p0/distribute/prune_bucket_with_bucket_shuffle_join.groovy
index 8ea8f35f914..befef5cb83b 100644
--- 
a/regression-test/suites/nereids_syntax_p0/distribute/prune_bucket_with_bucket_shuffle_join.groovy
+++ 
b/regression-test/suites/nereids_syntax_p0/distribute/prune_bucket_with_bucket_shuffle_join.groovy
@@ -76,10 +76,14 @@ suite("prune_bucket_with_bucket_shuffle_join") {
         assertTrue(exchangeNum > 1)
     }
 
+    // Pin parallelism so the bucket-shuffle downgrade heuristic
+    // (totalBucketNum < backEndNum*paraNum*0.8 in 
ChildrenPropertiesRegulator) cannot fire on
+    // high-core or multi-BE hosts; with paraNum=1 the 10-bucket left side 
keeps BUCKET_SHUFFLE.
     multi_sql """
         set enable_nereids_distribute_planner=true;
         set enable_pipeline_x_engine=true;
         set disable_join_reorder=true;
+        set parallel_pipeline_task_num=1;
         """
 
     // With enable_nereids_distribute_planner=true the RIGHT OUTER JOIN 
distribution is
diff --git 
a/regression-test/suites/query_profile/adaptive_pipeline_task_serial_read_on_limit.groovy
 
b/regression-test/suites/query_profile/adaptive_pipeline_task_serial_read_on_limit.groovy
index 4a36401b02b..0c206ed46a5 100644
--- 
a/regression-test/suites/query_profile/adaptive_pipeline_task_serial_read_on_limit.groovy
+++ 
b/regression-test/suites/query_profile/adaptive_pipeline_task_serial_read_on_limit.groovy
@@ -19,29 +19,34 @@ import groovy.json.JsonOutput
 import groovy.json.JsonSlurper
 import groovy.json.StringEscapeUtils
 import org.apache.doris.regression.action.ProfileAction
+import org.awaitility.Awaitility
+import static java.util.concurrent.TimeUnit.SECONDS
 
 def verifyProfileContent = { suiteContext, stmt, serialReadOnLimit ->
-    // Sleep 500ms to wait for the profile collection 
-    Thread.sleep(500)
-    // Get profile list by using getProfileList
     def profileAction = new ProfileAction(suiteContext)
-    List profileData = profileAction.getProfileList()
-    // Find the profile id for the query that we just emitted
-    String profileId = ""
-    for (def profileItem : profileData) {
-        if (profileItem["Sql Statement"].toString().contains(stmt)) {
-            profileId = profileItem["Profile ID"].toString()
-            logger.info("Profile ID of ${stmt} is ${profileId}")
-            break
+    // The BE reports the detailed execution profile to FE asynchronously, 
after
+    // the query result has already been returned to the client and the FE
+    // coordinator has been torn down. Fetching too early yields a profile 
whose
+    // MergedProfile carries no operators, so the MaxScanConcurrency counter is
+    // not present yet. A fixed sleep keeps racing the report under CI load, so
+    // poll until the scan operator's MaxScanConcurrency counter has landed.
+    String profileContent = ""
+    Awaitility.await().atMost(60, SECONDS).pollInterval(1, SECONDS).until {
+        List profileData = profileAction.getProfileList()
+        // Find the profile id for the query that we just emitted
+        String profileId = ""
+        for (def profileItem : profileData) {
+            if (profileItem["Sql Statement"].toString().contains(stmt)) {
+                profileId = profileItem["Profile ID"].toString()
+                break
+            }
         }
+        if (profileId == "" || profileId == null) {
+            return false
+        }
+        profileContent = profileAction.getProfile(profileId)
+        return profileContent.contains("MaxScanConcurrency")
     }
-
-    if (profileId == "" || profileId == null) {
-        logger.error("Profile ID of ${stmt} is not found")
-        return false
-    }
-    // Get profile content by using getProfile
-    String profileContent = profileAction.getProfile(profileId)
     logger.info("Profile content of ${stmt} is\n${profileContent}")
     // Check if the profile contains the expected content
     if (serialReadOnLimit) {
diff --git a/regression-test/suites/query_profile/scanner_profile.groovy 
b/regression-test/suites/query_profile/scanner_profile.groovy
index 200599315a2..2ba663b39bb 100644
--- a/regression-test/suites/query_profile/scanner_profile.groovy
+++ b/regression-test/suites/query_profile/scanner_profile.groovy
@@ -19,6 +19,8 @@ import groovy.json.JsonOutput
 import groovy.json.JsonSlurper
 import groovy.json.StringEscapeUtils
 import org.apache.doris.regression.action.ProfileAction
+import org.awaitility.Awaitility
+import static java.util.concurrent.TimeUnit.SECONDS
 
 suite('scanner_profile') {
     sql """
@@ -74,8 +76,21 @@ suite('scanner_profile') {
         return profileContent        
     }
 
-    List profileData = profileAction.getProfileList()
-    def profileWithLimit1 = getProfileByToken(token)
+    // The BE reports the detailed execution profile to FE asynchronously, 
after
+    // the query result has already been returned to the client. Fetching too
+    // early yields a profile whose MergedProfile carries no operators, so the
+    // scanner counters and the backfilled actualRows are not present yet. Poll
+    // until the scan operator shows up, i.e. the execution profile has landed.
+    def getCompleteProfileByToken = { pattern ->
+        String content = ""
+        Awaitility.await().atMost(60, SECONDS).pollInterval(1, SECONDS).until {
+            content = getProfileByToken(pattern).toString()
+            return content.contains("OLAP_SCAN_OPERATOR")
+        }
+        return content
+    }
+
+    def profileWithLimit1 = getCompleteProfileByToken(token)
     logger.info("${token} Profile Data: ${profileWithLimit1}")
     assertTrue(profileWithLimit1.toString().contains("- MaxScanConcurrency: 
1"))
 
@@ -84,7 +99,14 @@ suite('scanner_profile') {
         select "${token}", * from scanner_profile where id < 10;
     """
 
-    String profileWithFilter = getProfileByToken(token)
+    String profileWithFilter = getCompleteProfileByToken(token)
     logger.info("${token} Profile Data: ${profileWithFilter}")
-    assertTrue(profileWithFilter.toString().contains("actualRows=9"))
+    // Verify actualRows is backfilled onto the scan node. `id < 10` matches 
the
+    // 9 small keys, so actualRows is expected to be 9; assert the [1, 9] range
+    // defensively in case runtime tablet/bucket selection trims a few rows.
+    def matcher = (profileWithFilter.toString() =~ 
/PhysicalOlapScan[^\n]*scanner_profile[^\n]*actualRows=(\d+)/)
+    assertTrue(matcher.find(), "actualRows not found on 
PhysicalOlapScan[scanner_profile] in profile")
+    int actualRows = matcher.group(1) as int
+    assertTrue(actualRows >= 1 && actualRows <= 9,
+            "expect actualRows in [1, 9], got ${actualRows}")
 }
\ No newline at end of file
diff --git a/regression-test/suites/temp_table_p0/test_temp_table.groovy 
b/regression-test/suites/temp_table_p0/test_temp_table.groovy
index 86989c17dc4..f1ce95e4830 100644
--- a/regression-test/suites/temp_table_p0/test_temp_table.groovy
+++ b/regression-test/suites/temp_table_p0/test_temp_table.groovy
@@ -273,7 +273,35 @@ suite('test_temp_table', 'p0') {
     assertEquals(show_column_result.size(), 3)
 
     def show_tablets_result = sql "show tablets from t_test_temp_table1"
-    assertEquals(show_tablets_result.size(), 3)
+    // t_test_temp_table1 has 3 partitions x 1 bucket = 3 tablets. SHOW 
TABLETS returns one row
+    // per replica, so the total row count = 3 tablets x the cluster's 
effective replica count.
+    // Assert both: there are exactly 3 tablets, AND the per-tablet replica 
count matches the
+    // cluster's forced replication (so a wrong/missing replica is still 
caught).
+    assertEquals(3, show_tablets_result.collect { it[0] }.unique().size())
+    // Derive the expected replica count robustly. Mirror FE precedence
+    // (PropertyAnalyzer.analyzeReplicaAllocation): 
force_olap_table_replication_allocation is the
+    // primary force path, so check it first (summing the per-tag counts), 
then fall back to
+    // force_olap_table_replication_num. The allocation regex is 
whitespace-tolerant because the
+    // canonical value is "tag.location.default: 3" (a space after the colon), 
which the old
+    // /:(\d+)/ silently missed -> replicaNum stuck at 1.
+    def replicaNum = 1
+    def forceReplicaAlloc = 
getFeConfig('force_olap_table_replication_allocation')
+    if (forceReplicaAlloc != null && !forceReplicaAlloc.isEmpty()) {
+        def total = 0
+        def m = (forceReplicaAlloc =~ /:\s*(\d+)/)
+        while (m.find()) {
+            total += m.group(1).toInteger()
+        }
+        if (total > 0) {
+            replicaNum = total
+        }
+    } else {
+        def forceReplicaNum = getFeConfig('force_olap_table_replication_num')
+        if (forceReplicaNum != null && forceReplicaNum.isInteger() && 
(forceReplicaNum as int) > 0) {
+            replicaNum = forceReplicaNum as int
+        }
+    }
+    assertEquals(3 * replicaNum, show_tablets_result.size())
     def tablet_id = show_tablets_result[0][0]
     // admin user will see temporary table's internal name
     show_tablets_result = sql "show tablet ${tablet_id}"


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(doris) branch branch-4.1 updated: branch-4.1: [fix](test) stabilize Cloud-P0 flaky cases (test-only backport subset of #64525) (#64597)

Reply via email to