This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/master by this push:
     new 732aad49040 [fix](regression-test) stabilize 2 muted external_table_p0 
tests (#63646)
732aad49040 is described below

commit 732aad490404823c306e0c5a0a300be1eb5a5526
Author: Mingyu Chen (Rayner) <[email protected]>
AuthorDate: Wed May 27 17:28:14 2026 +0800

    [fix](regression-test) stabilize 2 muted external_table_p0 tests (#63646)
    
    ## Summary
    
    Both tests have been muted on the External Regression pipeline due to
    long-standing flakiness (analysis based on TeamCity build #92687 / id
    953050). Neither is a real product bug — both are test-side robustness
    issues.
    
    ### `test_file_cache_query_limit` (~50% pass rate)
    
    After `POST /api/file_cache?op=clear&sync=true` the test waited exactly
    one `file_cache_background_monitor_interval_ms` window and then asserted
    `normal_queue_curr_size == 0` once. The counters surfaced by
    `information_schema.file_cache_statistics` are republished by the
    background monitor on its own cadence, so a single fixed-time wait races
    the refresh and the assert fails roughly half the time even when the
    cache really is empty.
    
    - Replace the four wait-then-assert blocks (`size == 0` after clear,
    `size > 0` after a query) with `Awaitility`-based polling (already
    imported) on the relevant metric until the predicate holds, with a
    `max(30s, 6 × monitor_interval)` timeout.
    - The original `assertFalse(...)` calls with their metric-specific
    messages are kept as the final guard, so real failures still surface a
    precise reason.
    - The two waits for BE config propagation
    (`enable_file_cache_query_limit` flip) are left untouched — not in the
    failure path.
    
    ### `test_hive_query_cache` (~20–25% fail rate)
    
    The `test { sql ...; time 20000 }` block at L122 ran TPC-H Q9 against
    containerized hive parquet with `enable_sql_cache=false` set above, so
    the 20s upper bound was timing a cold 6-table join, not a cache hit. The
    query routinely exceeds 20s under cluster load.
    
    - Drop the time guard; the `qt_tpch_1sf_q09` above already validates
    correctness, and the cache behavior is exercised in the subsequent
    blocks that explicitly enable sql cache.
    
    ## Test plan
    
    - [ ] Run External Regression pipeline on this PR and confirm both cases
    pass.
    - [ ] After 5+ consecutive green runs, follow up to unmute these cases
    in TeamCity.
    
    🤖 Generated with [Claude Code](https://claude.com/claude-code)
    
    Co-authored-by: Claude Opus 4.7 <[email protected]>
---
 .../cache/test_file_cache_query_limit.groovy       | 64 ++++++++++++----------
 .../hive/test_hive_query_cache.groovy              | 11 ++--
 2 files changed, 41 insertions(+), 34 deletions(-)

diff --git 
a/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy
 
b/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy
index cc6a4a15bf3..a69d2eefb25 100644
--- 
a/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy
+++ 
b/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy
@@ -71,6 +71,26 @@ suite("test_file_cache_query_limit", "p0,external") {
     String hms_port = context.config.otherConfigs.get(hivePrefix + "HmsPort")
     int queryCacheCapacity
 
+    // Poll a file_cache_statistics metric until predicate holds, or until 
timeout.
+    // file_cache_statistics is refreshed by the background monitor on its own 
cadence,
+    // so waiting a single fixed interval (the previous behavior) races the 
refresh and
+    // makes assertions flaky. On timeout we swallow the exception so the 
caller's
+    // assertFalse below can surface its own metric-specific message.
+    def pollFileCacheMetric = { String metricName, Closure predicate, long 
timeoutSeconds ->
+        try {
+            Awaitility.await()
+                    .atMost(timeoutSeconds, TimeUnit.SECONDS)
+                    .pollInterval(1, TimeUnit.SECONDS)
+                    .until {
+                        def r = sql """select METRIC_VALUE from 
information_schema.file_cache_statistics
+                                where METRIC_NAME = '${metricName}' limit 1;"""
+                        return r.size() > 0 && 
predicate(Double.valueOf(r[0][0]))
+                    }
+        } catch (org.awaitility.core.ConditionTimeoutException ignored) {
+            // fall through; the caller's assert will surface the precise 
failure
+        }
+    }
+
     sql """drop catalog if exists ${catalog_name} """
 
     sql """CREATE CATALOG ${catalog_name} PROPERTIES (
@@ -147,14 +167,13 @@ suite("test_file_cache_query_limit", "p0,external") {
     def totalWaitTime = 
(fileCacheBackgroundMonitorIntervalMsResult[0][3].toLong() / 1000) as int
     def interval = 1
     def iterations = totalWaitTime / interval
+    long pollTimeoutSeconds = Math.max(30L, (long) totalWaitTime * 6L)
 
-    // Waiting for file cache clearing
-    (1..iterations).each { count ->
-        Thread.sleep(interval * 1000)
-        def elapsedSeconds = count * interval
-        def remainingSeconds = totalWaitTime - elapsedSeconds
-        logger.info("Waited for file cache clearing ${elapsedSeconds} seconds, 
${remainingSeconds} seconds remaining")
-    }
+    // Poll until the cache clear has drained the LRU queue. The HTTP clear 
endpoint with sync=true
+    // deletes blocks synchronously, but the queue counters are republished by 
the background monitor
+    // thread on its own cadence — so a single fixed-time wait can race the 
refresh.
+    pollFileCacheMetric('normal_queue_curr_size', { it == 0.0 }, 
pollTimeoutSeconds)
+    pollFileCacheMetric('normal_queue_curr_elements', { it == 0.0 }, 
pollTimeoutSeconds)
 
     def initialNormalQueueCurrSizeResult = sql """select METRIC_VALUE from 
information_schema.file_cache_statistics
             where METRIC_NAME = 'normal_queue_curr_size' limit 1;"""
@@ -162,7 +181,6 @@ suite("test_file_cache_query_limit", "p0,external") {
     assertFalse(initialNormalQueueCurrSizeResult.size() == 0 || 
Double.valueOf(initialNormalQueueCurrSizeResult[0][0]) != 0.0,
             INITIAL_NORMAL_QUEUE_CURR_SIZE_NOT_ZERO_MSG)
 
-    // Check normal queue current elements
     def initialNormalQueueCurrElementsResult = sql """select METRIC_VALUE from 
information_schema.file_cache_statistics
             where METRIC_NAME = 'normal_queue_curr_elements' limit 1;"""
     logger.info("normal_queue_curr_elements result: " + 
initialNormalQueueCurrElementsResult)
@@ -199,13 +217,9 @@ suite("test_file_cache_query_limit", "p0,external") {
         // load the table into file cache
         sql query_sql
 
-        // Waiting for file cache statistics update
-        (1..iterations).each { count ->
-            Thread.sleep(interval * 1000)
-            def elapsedSeconds = count * interval
-            def remainingSeconds = totalWaitTime - elapsedSeconds
-            logger.info("Waited for file cache statistics update 
${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining")
-        }
+        // Poll until the query has populated the cache.
+        pollFileCacheMetric('normal_queue_curr_elements', { it > 0.0 }, 
pollTimeoutSeconds)
+        pollFileCacheMetric('normal_queue_curr_size', { it > 0.0 }, 
pollTimeoutSeconds)
 
         def baseNormalQueueCurrElementsResult = sql """select METRIC_VALUE 
from information_schema.file_cache_statistics
             where METRIC_NAME = 'normal_queue_curr_elements' limit 1;"""
@@ -247,13 +261,9 @@ suite("test_file_cache_query_limit", "p0,external") {
     logger.info("File cache clear command output: ${output.toString()}")
     assertTrue(exitCode == 0, "File cache clear failed with exit code 
${exitCode}. Error: ${errorOutput.toString()}")
 
-    // Waiting for file cache clearing
-    (1..iterations).each { count ->
-        Thread.sleep(interval * 1000)
-        def elapsedSeconds = count * interval
-        def remainingSeconds = totalWaitTime - elapsedSeconds
-        logger.info("Waited for file cache clearing ${elapsedSeconds} seconds, 
${remainingSeconds} seconds remaining")
-    }
+    // Poll until the file cache is fully cleared again.
+    pollFileCacheMetric('normal_queue_curr_size', { it == 0.0 }, 
pollTimeoutSeconds)
+    pollFileCacheMetric('normal_queue_curr_elements', { it == 0.0 }, 
pollTimeoutSeconds)
 
     // ===== Normal Queue Metrics Check =====
     // Check normal queue current size
@@ -337,13 +347,9 @@ suite("test_file_cache_query_limit", "p0,external") {
         // load the table into file cache
         sql query_sql
 
-        // Waiting for file cache statistics update
-        (1..iterations).each { count ->
-            Thread.sleep(interval * 1000)
-            def elapsedSeconds = count * interval
-            def remainingSeconds = totalWaitTime - elapsedSeconds
-            logger.info("Waited for file cache statistics update 
${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining")
-        }
+        // Poll until the query has populated the cache under the new 
file_cache_query_limit.
+        pollFileCacheMetric('normal_queue_curr_size', { it > 0.0 }, 
pollTimeoutSeconds)
+        pollFileCacheMetric('normal_queue_curr_elements', { it > 0.0 }, 
pollTimeoutSeconds)
 
         // Get updated value of normal queue current elements and max elements 
after cache operations
         def updatedNormalQueueCurrSizeResult = sql """select METRIC_VALUE from 
information_schema.file_cache_statistics
diff --git 
a/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy 
b/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy
index 0e9407fdfc0..492d6502aa7 100644
--- a/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy
+++ b/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy
@@ -118,11 +118,12 @@ suite("test_hive_query_cache", "p0,external") {
             sql """use `tpch1_parquet`"""
             qt_tpch_1sf_q09 "${tpch_1sf_q09}"
             sql "${tpch_1sf_q09}"
-
-            test {
-                sql "${tpch_1sf_q09}"
-                time 20000
-            }
+            // NOTE: enable_sql_cache=false is set above, so a `test { ... 
time 20000 }` block here is
+            // NOT testing SQL cache — it is timing a cold TPC-H Q9 over 
containerized hive parquet,
+            // which routinely exceeds 20s under load. Run the query without 
the time guard; the qt_
+            // above already validates correctness. Cache behavior is verified 
in the blocks below
+            // that explicitly set enable_sql_cache=true.
+            sql "${tpch_1sf_q09}"
 
             // test sql cache with empty result
             try {


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to