zclllyybb commented on issue #63694:
URL: https://github.com/apache/doris/issues/63694#issuecomment-4544186906

   Breakwater-GitHub-Analysis-Slot: slot_75b5f7a31eb3
   
   Initial triage: this looks like a valid performance/caching enhancement for 
the HMS external table planning path, not a correctness bug.
   
   I checked current `origin/master` locally at `4483daf9f03`. The relevant 
flow matches the report:
   
   - `ExternalTable.getRowCount()` initializes the table and loads the 
row-count cache through `ExternalRowCountCache.getCachedRowCount(...)`.
   - `ExternalRowCountCache.RowCountCacheLoader` calls `table.fetchRowCount()`.
   - `HMSExternalTable.fetchRowCount()` first tries HMS table parameters, then 
falls back to `getRowCountFromFileList()` when the row count is unknown and 
`enable_get_row_count_from_file_list` is enabled.
   - In that fallback, `getFilesForPartitions(...)` explicitly uses 
`cache.getAllPartitionsWithoutCache(...)` and then 
`cache.getFilesByPartitions(..., withCache=false, ...)`.
   - Later scan planning for the same HMS table uses the normal cached path: 
`HiveScanNode.getPartitions()` calls `getAllPartitionsWithCache(...)`, and file 
split generation calls `getFilesByPartitions(...)` with `withCache = 
Config.max_external_file_cache_num > 0`.
   
   So the same Hive partition/file metadata can indeed be fetched once for 
row-count estimation and then fetched again for scan planning in one planning 
process. The issue description is consistent with the current code.
   
   Suggested implementation direction:
   
   - Add an explicit row-count loading mode, for example `QUERY_PLANNING` vs 
`DISPLAY_ONLY`, rather than inferring intent inside `HMSExternalTable`.
   - Use the planning mode from `ExternalTable.getRowCount()` / optimizer stats 
paths, and allow the HMS file-list row-count fallback to populate or reuse the 
normal Hive partition/file caches in that mode.
   - Keep display-oriented paths such as `getCachedRowCount()`, `SHOW TABLE 
STATUS`, `SHOW STATS`, and `information_schema.tables` lightweight. If the 
intent is strictly "cached value only", consider wiring them to 
`ExternalRowCountCache.getCachedRowCountIfPresent(...)` or equivalent 
non-loading behavior; the method already exists but is not currently used.
   - For HMS row-count estimation, make the partition and file metadata APIs 
mode-dependent: cached APIs for query planning, non-cached APIs for 
display/background-only loading.
   
   Recommended validation:
   
   - Add an FE unit test or mock-cache test proving that planning-time 
row-count estimation for an HMS table uses the cached partition/file path.
   - Add a companion test proving that display-only calls do not populate heavy 
Hive partition/file caches just to render row count.
   - If possible, include a simple before/after planning profile or FE log 
snippet for a partitioned HMS table without `numRows` in HMS parameters and 
with `enable_get_row_count_from_file_list=true`, showing that partition/file 
listing happens once after the change.
   
   No additional user logs are required to accept this as an enhancement, but a 
concrete reproduction table shape would help measure impact: Doris 
commit/version, catalog properties, partition count, file count, 
`hive_stats_partition_sample_size`, `max_external_file_cache_num`, 
`fetch_hive_row_count_sync`, and a representative `EXPLAIN` or planning profile.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to