This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new 732aad49040 [fix](regression-test) stabilize 2 muted external_table_p0 tests (#63646)
732aad49040 is described below
commit 732aad490404823c306e0c5a0a300be1eb5a5526
Author: Mingyu Chen (Rayner) <[email protected]>
AuthorDate: Wed May 27 17:28:14 2026 +0800
[fix](regression-test) stabilize 2 muted external_table_p0 tests (#63646)
## Summary
Both tests have been muted on the External Regression pipeline due to
long-standing flakiness (analysis based on TeamCity build #92687 / id
953050). Neither is a real product bug — both are test-side robustness
issues.
### `test_file_cache_query_limit` (~50% pass rate)
After `POST /api/file_cache?op=clear&sync=true` the test waited exactly
one `file_cache_background_monitor_interval_ms` window and then asserted
`normal_queue_curr_size == 0` once. The counters surfaced by
`information_schema.file_cache_statistics` are republished by the
background monitor on its own cadence, so a single fixed-time wait races
the refresh and the assert fails roughly half the time even when the
cache really is empty.
- Replace the four wait-then-assert blocks (`size == 0` after clear,
`size > 0` after a query) with `Awaitility`-based polling (already
imported) on the relevant metric until the predicate holds, with a
`max(30s, 6 × monitor_interval)` timeout.
- The original `assertFalse(...)` calls with their metric-specific
messages are kept as the final guard, so real failures still surface a
precise reason.
- The two waits for BE config propagation
(`enable_file_cache_query_limit` flip) are left untouched — not in the
failure path.
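
The polling approach described above can be sketched without the test framework. This is a minimal stdlib-only illustration, not the code in this commit (which uses `Awaitility` from Groovy); the class `MetricPollSketch`, the methods `pollTimeoutSeconds` and `pollMetric`, and the fake metric source are all hypothetical names introduced for the example.

```java
import java.util.function.DoublePredicate;
import java.util.function.Supplier;

final class MetricPollSketch {
    /** Timeout rule from the summary: max(30s, 6 x monitor interval). */
    static long pollTimeoutSeconds(long monitorIntervalSeconds) {
        return Math.max(30L, monitorIntervalSeconds * 6L);
    }

    /**
     * Re-reads a metric every pollMillis until the predicate holds or the
     * deadline passes. Returns the last observed value (null if nothing was
     * read) instead of throwing, so the caller can still run its own
     * assertion with a metric-specific message.
     */
    static Double pollMetric(Supplier<Double> readMetric, DoublePredicate predicate,
                             long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Double last = null;
        while (true) {
            last = readMetric.get();
            if (last != null && predicate.test(last)) {
                return last;                        // predicate satisfied
            }
            if (System.nanoTime() >= deadline) {
                return last;                        // timed out; caller asserts
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return last;
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the statistics table: the counter stays
        // stale for two reads, then reports the cleared value.
        int[] reads = {0};
        Supplier<Double> fakeMetric = () -> reads[0]++ < 2 ? 4096.0 : 0.0;

        Double size = pollMetric(fakeMetric, v -> v == 0.0, 2_000, 10);
        System.out.println("size=" + size + ", reads=" + reads[0]);
        System.out.println("timeout for a 10s interval: " + pollTimeoutSeconds(10) + "s");
    }
}
```

Returning the last value on timeout, rather than throwing, mirrors the intent stated above: the original `assertFalse(...)` stays the final guard and reports the precise reason.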
### `test_hive_query_cache` (~20–25% fail rate)
The `test { sql ...; time 20000 }` block at L122 ran TPC-H Q9 against
containerized hive parquet with `enable_sql_cache=false` set above, so
the 20s upper bound was timing a cold 6-table join, not a cache hit. The
query routinely exceeds 20s under cluster load.
- Drop the time guard; the `qt_tpch_1sf_q09` above already validates
correctness, and the cache behavior is exercised in the subsequent
blocks that explicitly enable sql cache.
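
A toy model may help show why the removed time guard could not have been measuring a cache hit. This is an illustrative sketch only, not Doris's SQL cache: `TimeGuardSketch`, `cacheEnabled`, and `coldRuns` are hypothetical names, and the "query" is a stub. It counts cold executions instead of timing them so that the result is deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class TimeGuardSketch {
    private final Map<String, String> cache = new HashMap<>();
    boolean cacheEnabled;   // stands in for enable_sql_cache
    int coldRuns = 0;       // how many times the full query actually executed

    /** Runs the simulated query, consulting the cache only when it is enabled. */
    String run(String sql, Supplier<String> coldExecution) {
        if (cacheEnabled && cache.containsKey(sql)) {
            return cache.get(sql);          // cache hit: no cold execution
        }
        coldRuns++;                         // pays the full cost of the query
        String result = coldExecution.get();
        if (cacheEnabled) {
            cache.put(sql, result);
        }
        return result;
    }

    public static void main(String[] args) {
        TimeGuardSketch s = new TimeGuardSketch();
        Supplier<String> coldQ9 = () -> "rows";  // stub for the cold join

        s.cacheEnabled = false;
        s.run("q9", coldQ9);
        s.run("q9", coldQ9);
        System.out.println("cold runs with cache off: " + s.coldRuns);

        s.cacheEnabled = true;
        s.run("q9", coldQ9);
        s.run("q9", coldQ9);
        System.out.println("cold runs after enabling cache: " + s.coldRuns);
    }
}
```

With the cache off, every repetition is a cold run, so any wall-clock bound on it reflects cluster load rather than cache behavior.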
## Test plan
- [ ] Run External Regression pipeline on this PR and confirm both cases
pass.
- [ ] After 5+ consecutive green runs, follow up to unmute these cases
in TeamCity.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <[email protected]>
---
.../cache/test_file_cache_query_limit.groovy | 64 ++++++++++++----------
.../hive/test_hive_query_cache.groovy | 11 ++--
2 files changed, 41 insertions(+), 34 deletions(-)
diff --git a/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy b/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy
index cc6a4a15bf3..a69d2eefb25 100644
--- a/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy
+++ b/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy
@@ -71,6 +71,26 @@ suite("test_file_cache_query_limit", "p0,external") {
String hms_port = context.config.otherConfigs.get(hivePrefix + "HmsPort")
int queryCacheCapacity
+ // Poll a file_cache_statistics metric until predicate holds, or until timeout.
+ // file_cache_statistics is refreshed by the background monitor on its own cadence,
+ // so waiting a single fixed interval (the previous behavior) races the refresh and
+ // makes assertions flaky. On timeout we swallow the exception so the caller's
+ // assertFalse below can surface its own metric-specific message.
+ def pollFileCacheMetric = { String metricName, Closure predicate, long timeoutSeconds ->
+     try {
+         Awaitility.await()
+             .atMost(timeoutSeconds, TimeUnit.SECONDS)
+             .pollInterval(1, TimeUnit.SECONDS)
+             .until {
+                 def r = sql """select METRIC_VALUE from information_schema.file_cache_statistics
+                     where METRIC_NAME = '${metricName}' limit 1;"""
+                 return r.size() > 0 && predicate(Double.valueOf(r[0][0]))
+             }
+     } catch (org.awaitility.core.ConditionTimeoutException ignored) {
+         // fall through; the caller's assert will surface the precise failure
+     }
+ }
+
sql """drop catalog if exists ${catalog_name} """
sql """CREATE CATALOG ${catalog_name} PROPERTIES (
@@ -147,14 +167,13 @@ suite("test_file_cache_query_limit", "p0,external") {
def totalWaitTime = (fileCacheBackgroundMonitorIntervalMsResult[0][3].toLong() / 1000) as int
def interval = 1
def iterations = totalWaitTime / interval
+ long pollTimeoutSeconds = Math.max(30L, (long) totalWaitTime * 6L)
- // Waiting for file cache clearing
- (1..iterations).each { count ->
- Thread.sleep(interval * 1000)
- def elapsedSeconds = count * interval
- def remainingSeconds = totalWaitTime - elapsedSeconds
- logger.info("Waited for file cache clearing ${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining")
- }
+ // Poll until the cache clear has drained the LRU queue. The HTTP clear endpoint with sync=true
+ // deletes blocks synchronously, but the queue counters are republished by the background monitor
+ // thread on its own cadence, so a single fixed-time wait can race the refresh.
+ pollFileCacheMetric('normal_queue_curr_size', { it == 0.0 }, pollTimeoutSeconds)
+ pollFileCacheMetric('normal_queue_curr_elements', { it == 0.0 }, pollTimeoutSeconds)
def initialNormalQueueCurrSizeResult = sql """select METRIC_VALUE from information_schema.file_cache_statistics
where METRIC_NAME = 'normal_queue_curr_size' limit 1;"""
@@ -162,7 +181,6 @@ suite("test_file_cache_query_limit", "p0,external") {
assertFalse(initialNormalQueueCurrSizeResult.size() == 0 ||
Double.valueOf(initialNormalQueueCurrSizeResult[0][0]) != 0.0,
INITIAL_NORMAL_QUEUE_CURR_SIZE_NOT_ZERO_MSG)
- // Check normal queue current elements
def initialNormalQueueCurrElementsResult = sql """select METRIC_VALUE from information_schema.file_cache_statistics
where METRIC_NAME = 'normal_queue_curr_elements' limit 1;"""
logger.info("normal_queue_curr_elements result: " + initialNormalQueueCurrElementsResult)
@@ -199,13 +217,9 @@ suite("test_file_cache_query_limit", "p0,external") {
// load the table into file cache
sql query_sql
- // Waiting for file cache statistics update
- (1..iterations).each { count ->
- Thread.sleep(interval * 1000)
- def elapsedSeconds = count * interval
- def remainingSeconds = totalWaitTime - elapsedSeconds
- logger.info("Waited for file cache statistics update ${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining")
- }
+ // Poll until the query has populated the cache.
+ pollFileCacheMetric('normal_queue_curr_elements', { it > 0.0 }, pollTimeoutSeconds)
+ pollFileCacheMetric('normal_queue_curr_size', { it > 0.0 }, pollTimeoutSeconds)
def baseNormalQueueCurrElementsResult = sql """select METRIC_VALUE from information_schema.file_cache_statistics
where METRIC_NAME = 'normal_queue_curr_elements' limit 1;"""
@@ -247,13 +261,9 @@ suite("test_file_cache_query_limit", "p0,external") {
logger.info("File cache clear command output: ${output.toString()}")
assertTrue(exitCode == 0, "File cache clear failed with exit code ${exitCode}. Error: ${errorOutput.toString()}")
- // Waiting for file cache clearing
- (1..iterations).each { count ->
- Thread.sleep(interval * 1000)
- def elapsedSeconds = count * interval
- def remainingSeconds = totalWaitTime - elapsedSeconds
- logger.info("Waited for file cache clearing ${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining")
- }
+ // Poll until the file cache is fully cleared again.
+ pollFileCacheMetric('normal_queue_curr_size', { it == 0.0 }, pollTimeoutSeconds)
+ pollFileCacheMetric('normal_queue_curr_elements', { it == 0.0 }, pollTimeoutSeconds)
// ===== Normal Queue Metrics Check =====
// Check normal queue current size
@@ -337,13 +347,9 @@ suite("test_file_cache_query_limit", "p0,external") {
// load the table into file cache
sql query_sql
- // Waiting for file cache statistics update
- (1..iterations).each { count ->
- Thread.sleep(interval * 1000)
- def elapsedSeconds = count * interval
- def remainingSeconds = totalWaitTime - elapsedSeconds
- logger.info("Waited for file cache statistics update ${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining")
- }
+ // Poll until the query has populated the cache under the new file_cache_query_limit.
+ pollFileCacheMetric('normal_queue_curr_size', { it > 0.0 }, pollTimeoutSeconds)
+ pollFileCacheMetric('normal_queue_curr_elements', { it > 0.0 }, pollTimeoutSeconds)
// Get updated value of normal queue current elements and max elements after cache operations
def updatedNormalQueueCurrSizeResult = sql """select METRIC_VALUE from information_schema.file_cache_statistics
diff --git a/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy b/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy
index 0e9407fdfc0..492d6502aa7 100644
--- a/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy
+++ b/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy
@@ -118,11 +118,12 @@ suite("test_hive_query_cache", "p0,external") {
sql """use `tpch1_parquet`"""
qt_tpch_1sf_q09 "${tpch_1sf_q09}"
sql "${tpch_1sf_q09}"
-
- test {
- sql "${tpch_1sf_q09}"
- time 20000
- }
+ // NOTE: enable_sql_cache=false is set above, so a `test { ... time 20000 }` block here is
+ // NOT testing SQL cache; it is timing a cold TPC-H Q9 over containerized hive parquet,
+ // which routinely exceeds 20s under load. Run the query without the time guard; the qt_
+ // above already validates correctness. Cache behavior is verified in the blocks below
+ // that explicitly set enable_sql_cache=true.
+ sql "${tpch_1sf_q09}"
// test sql cache with empty result
try {