This is an automated email from the ASF dual-hosted git repository.
yiguolei pushed a commit to branch branch-4.1
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/branch-4.1 by this push:
new 79165cac0ac branch-4.1: [fix](test) stabilize Cloud-P0 flaky cases
(test-only backport subset of #64525) (#64597)
79165cac0ac is described below
commit 79165cac0acd5f7aaea94773b2addfa2f6f83768
Author: shuke <[email protected]>
AuthorDate: Wed Jun 17 18:54:13 2026 +0800
branch-4.1: [fix](test) stabilize Cloud-P0 flaky cases (test-only backport
subset of #64525) (#64597)
## What problem does this PR solve?
Stabilizes the **branch-4.1 Cloud-P0** regression pipeline by
back-porting the
relevant, **test-only** subset of branch-4.0 #64525 (which already
stabilized
the 4.0 P0/Cloud-P0 pipeline). These suites are currently
muted-but-still-failing
on 4.1 Cloud-P0; each commit deflakes one case so it can later be
un-muted.
Cases addressed (failure → fix):
- **scanner_profile** — hard-coded `actualRows=9` assert (fails when
tablets prune to <9 rows) → assert `actualRows ∈ [1,9]` via robust
regex, and poll until the async execution profile lands.
- **temp_table** — `assertEquals(show_tablets.size(), 3)` assumed 1
replica; on the 3-replica cloud cluster SHOW TABLETS returns 3×3=9 rows
→ derive replica count robustly (the old `/:(\d+)/` regex missed the
space in `tag.location.default: 3`).
- **insert_group_commit_into_max_filter_ratio** — group-commit async
count race.
- **adaptive_pipeline_task_serial_read_on_limit** — poll until the
async-reported profile is complete.
- **prune_bucket_with_bucket_shuffle_join** — pin
`parallel_pipeline_task_num=1` so bucket-shuffle plan is deterministic.
All changes are under `regression-test/` (no FE/BE code), so no
compilation impact.
## Cherry-picked from branch-4.0 #64525 (with `-x` provenance)
- `[fix](test) fix scanner_profile actualRows regex (partial backport of
#64238)`
- `[fix](test) make test_temp_table replica detection robust`
- `[fix](case) fix insert_group_commit_into_max_filter_ratio`
- `[fix](test) poll for complete profile in scanner_profile`
- `[fix](test) poll for complete profile in
adaptive_pipeline_task_serial_read_on_limit`
- `[fix](test) pin parallel_pipeline_task_num=1 to keep bucket shuffle
in prune_bucket_with_bucket_shuffle_join`
## Deliberately NOT included (follow-up)
- **ANN index-only-scan rework** (#64525 replaces
`ann_index_only_scan*.groovy` with debug-point variants) — needs
confirmation the new debug points exist on 4.1; deferred to a separate
PR.
- **time_series compaction debug-point** — touches BE
(`cumulative_compaction_time_series_policy.cpp`), needs a compile;
deferred.
- **`skip temp_table_p0/test_temp_table.groovy`** — intentionally
omitted; we keep the case running rather than skipping it, to confirm
the robust assert is sufficient on 4.1.
## Release note
None
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: morningman <[email protected]>
Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
Co-authored-by: meiyi <[email protected]>
---
.../prune_bucket_with_bucket_shuffle_join.groovy | 4 +++
...ptive_pipeline_task_serial_read_on_limit.groovy | 41 ++++++++++++----------
.../suites/query_profile/scanner_profile.groovy | 30 +++++++++++++---
.../suites/temp_table_p0/test_temp_table.groovy | 30 +++++++++++++++-
4 files changed, 82 insertions(+), 23 deletions(-)
diff --git
a/regression-test/suites/nereids_syntax_p0/distribute/prune_bucket_with_bucket_shuffle_join.groovy
b/regression-test/suites/nereids_syntax_p0/distribute/prune_bucket_with_bucket_shuffle_join.groovy
index 8ea8f35f914..befef5cb83b 100644
---
a/regression-test/suites/nereids_syntax_p0/distribute/prune_bucket_with_bucket_shuffle_join.groovy
+++
b/regression-test/suites/nereids_syntax_p0/distribute/prune_bucket_with_bucket_shuffle_join.groovy
@@ -76,10 +76,14 @@ suite("prune_bucket_with_bucket_shuffle_join") {
assertTrue(exchangeNum > 1)
}
+ // Pin parallelism so the bucket-shuffle downgrade heuristic
+ // (totalBucketNum < backEndNum*paraNum*0.8 in
ChildrenPropertiesRegulator) cannot fire on
+ // high-core or multi-BE hosts; with paraNum=1 the 10-bucket left side
keeps BUCKET_SHUFFLE.
multi_sql """
set enable_nereids_distribute_planner=true;
set enable_pipeline_x_engine=true;
set disable_join_reorder=true;
+ set parallel_pipeline_task_num=1;
"""
// With enable_nereids_distribute_planner=true the RIGHT OUTER JOIN
distribution is
diff --git
a/regression-test/suites/query_profile/adaptive_pipeline_task_serial_read_on_limit.groovy
b/regression-test/suites/query_profile/adaptive_pipeline_task_serial_read_on_limit.groovy
index 4a36401b02b..0c206ed46a5 100644
---
a/regression-test/suites/query_profile/adaptive_pipeline_task_serial_read_on_limit.groovy
+++
b/regression-test/suites/query_profile/adaptive_pipeline_task_serial_read_on_limit.groovy
@@ -19,29 +19,34 @@ import groovy.json.JsonOutput
import groovy.json.JsonSlurper
import groovy.json.StringEscapeUtils
import org.apache.doris.regression.action.ProfileAction
+import org.awaitility.Awaitility
+import static java.util.concurrent.TimeUnit.SECONDS
def verifyProfileContent = { suiteContext, stmt, serialReadOnLimit ->
- // Sleep 500ms to wait for the profile collection
- Thread.sleep(500)
- // Get profile list by using getProfileList
def profileAction = new ProfileAction(suiteContext)
- List profileData = profileAction.getProfileList()
- // Find the profile id for the query that we just emitted
- String profileId = ""
- for (def profileItem : profileData) {
- if (profileItem["Sql Statement"].toString().contains(stmt)) {
- profileId = profileItem["Profile ID"].toString()
- logger.info("Profile ID of ${stmt} is ${profileId}")
- break
+ // The BE reports the detailed execution profile to FE asynchronously,
after
+ // the query result has already been returned to the client and the FE
+ // coordinator has been torn down. Fetching too early yields a profile
whose
+ // MergedProfile carries no operators, so the MaxScanConcurrency counter is
+ // not present yet. A fixed sleep keeps racing the report under CI load, so
+ // poll until the scan operator's MaxScanConcurrency counter has landed.
+ String profileContent = ""
+ Awaitility.await().atMost(60, SECONDS).pollInterval(1, SECONDS).until {
+ List profileData = profileAction.getProfileList()
+ // Find the profile id for the query that we just emitted
+ String profileId = ""
+ for (def profileItem : profileData) {
+ if (profileItem["Sql Statement"].toString().contains(stmt)) {
+ profileId = profileItem["Profile ID"].toString()
+ break
+ }
}
+ if (profileId == "" || profileId == null) {
+ return false
+ }
+ profileContent = profileAction.getProfile(profileId)
+ return profileContent.contains("MaxScanConcurrency")
}
-
- if (profileId == "" || profileId == null) {
- logger.error("Profile ID of ${stmt} is not found")
- return false
- }
- // Get profile content by using getProfile
- String profileContent = profileAction.getProfile(profileId)
logger.info("Profile content of ${stmt} is\n${profileContent}")
// Check if the profile contains the expected content
if (serialReadOnLimit) {
diff --git a/regression-test/suites/query_profile/scanner_profile.groovy
b/regression-test/suites/query_profile/scanner_profile.groovy
index 200599315a2..2ba663b39bb 100644
--- a/regression-test/suites/query_profile/scanner_profile.groovy
+++ b/regression-test/suites/query_profile/scanner_profile.groovy
@@ -19,6 +19,8 @@ import groovy.json.JsonOutput
import groovy.json.JsonSlurper
import groovy.json.StringEscapeUtils
import org.apache.doris.regression.action.ProfileAction
+import org.awaitility.Awaitility
+import static java.util.concurrent.TimeUnit.SECONDS
suite('scanner_profile') {
sql """
@@ -74,8 +76,21 @@ suite('scanner_profile') {
return profileContent
}
- List profileData = profileAction.getProfileList()
- def profileWithLimit1 = getProfileByToken(token)
+ // The BE reports the detailed execution profile to FE asynchronously,
after
+ // the query result has already been returned to the client. Fetching too
+ // early yields a profile whose MergedProfile carries no operators, so the
+ // scanner counters and the backfilled actualRows are not present yet. Poll
+ // until the scan operator shows up, i.e. the execution profile has landed.
+ def getCompleteProfileByToken = { pattern ->
+ String content = ""
+ Awaitility.await().atMost(60, SECONDS).pollInterval(1, SECONDS).until {
+ content = getProfileByToken(pattern).toString()
+ return content.contains("OLAP_SCAN_OPERATOR")
+ }
+ return content
+ }
+
+ def profileWithLimit1 = getCompleteProfileByToken(token)
logger.info("${token} Profile Data: ${profileWithLimit1}")
assertTrue(profileWithLimit1.toString().contains("- MaxScanConcurrency:
1"))
@@ -84,7 +99,14 @@ suite('scanner_profile') {
select "${token}", * from scanner_profile where id < 10;
"""
- String profileWithFilter = getProfileByToken(token)
+ String profileWithFilter = getCompleteProfileByToken(token)
logger.info("${token} Profile Data: ${profileWithFilter}")
- assertTrue(profileWithFilter.toString().contains("actualRows=9"))
+ // Verify actualRows is backfilled onto the scan node. `id < 10` matches
the
+ // 9 small keys, so actualRows is expected to be 9; assert the [1, 9] range
+ // defensively in case runtime tablet/bucket selection trims a few rows.
+ def matcher = (profileWithFilter.toString() =~
/PhysicalOlapScan[^\n]*scanner_profile[^\n]*actualRows=(\d+)/)
+ assertTrue(matcher.find(), "actualRows not found on
PhysicalOlapScan[scanner_profile] in profile")
+ int actualRows = matcher.group(1) as int
+ assertTrue(actualRows >= 1 && actualRows <= 9,
+ "expect actualRows in [1, 9], got ${actualRows}")
}
\ No newline at end of file
diff --git a/regression-test/suites/temp_table_p0/test_temp_table.groovy
b/regression-test/suites/temp_table_p0/test_temp_table.groovy
index 86989c17dc4..f1ce95e4830 100644
--- a/regression-test/suites/temp_table_p0/test_temp_table.groovy
+++ b/regression-test/suites/temp_table_p0/test_temp_table.groovy
@@ -273,7 +273,35 @@ suite('test_temp_table', 'p0') {
assertEquals(show_column_result.size(), 3)
def show_tablets_result = sql "show tablets from t_test_temp_table1"
- assertEquals(show_tablets_result.size(), 3)
+ // t_test_temp_table1 has 3 partitions x 1 bucket = 3 tablets. SHOW
TABLETS returns one row
+ // per replica, so the total row count = 3 tablets x the cluster's
effective replica count.
+ // Assert both: there are exactly 3 tablets, AND the per-tablet replica
count matches the
+ // cluster's forced replication (so a wrong/missing replica is still
caught).
+ assertEquals(3, show_tablets_result.collect { it[0] }.unique().size())
+ // Derive the expected replica count robustly. Mirror FE precedence
+ // (PropertyAnalyzer.analyzeReplicaAllocation):
force_olap_table_replication_allocation is the
+ // primary force path, so check it first (summing the per-tag counts),
then fall back to
+ // force_olap_table_replication_num. The allocation regex is
whitespace-tolerant because the
+ // canonical value is "tag.location.default: 3" (a space after the colon),
which the old
+ // /:(\d+)/ silently missed -> replicaNum stuck at 1.
+ def replicaNum = 1
+ def forceReplicaAlloc =
getFeConfig('force_olap_table_replication_allocation')
+ if (forceReplicaAlloc != null && !forceReplicaAlloc.isEmpty()) {
+ def total = 0
+ def m = (forceReplicaAlloc =~ /:\s*(\d+)/)
+ while (m.find()) {
+ total += m.group(1).toInteger()
+ }
+ if (total > 0) {
+ replicaNum = total
+ }
+ } else {
+ def forceReplicaNum = getFeConfig('force_olap_table_replication_num')
+ if (forceReplicaNum != null && forceReplicaNum.isInteger() &&
(forceReplicaNum as int) > 0) {
+ replicaNum = forceReplicaNum as int
+ }
+ }
+ assertEquals(3 * replicaNum, show_tablets_result.size())
def tablet_id = show_tablets_result[0][0]
// admin user will see temporary table's internal name
show_tablets_result = sql "show tablet ${tablet_id}"
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]