zclllyybb commented on issue #64122:
URL: https://github.com/apache/doris/issues/64122#issuecomment-4622731256

   I checked the current code path on upstream master and the 3.0 / 4.0 / 4.1 
branches. This report looks like a real sample-statistics validation bug, but 
the fix should avoid persisting a contradictory `ndv=0` + non-null min/max 
statistic as-is.
   
   What is confirmed from the code:
   
   - `OlapAnalysisTask.doSample()` first calls `collectMinMax()`, and that 
`BASIC_STATS_TEMPLATE` scans the table/index without the sample tablet hint. So 
min/max can come from the full table.
   - The sampled stats SQL then uses `${rowCount}` from the OLAP table/index 
row-count metadata, while `null_count` is computed from sampled rows and 
scaled, for example `ROUND(SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) * 
${scaleFactor})`.
   - `BaseAnalysisTask.runQuery()` builds `ColStatsData` and calls 
`ColStatsData.isValid()` unconditionally, without knowing whether the row came 
from full analyze or sample analyze.
   - `ColStatsData.isValid()` then rejects any row matching `ndv == 0 && 
min/max is not all null && nullCount != count`. That check encodes a full-scan 
invariant, which does not hold for this mixed-provenance sampled row.
   
   So the failing shape is plausible: a sampled OLAP analyze can have 
full-table min/max proving at least one non-NULL value exists, while the 
sampled NDV path can still produce `0`, and the scaled `null_count` is not an 
exact value that must equal the metadata `count`. Rejecting the whole collected 
statistic at this point matches the warning and the fallback to unknown/default 
column stats.
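
   The rejection can be reproduced in isolation. The sketch below is a minimal 
stand-in, not the Doris source: `SampledRow`, `violatesFullStatInvariant()` and 
`scaledNullCount()` are hypothetical names, the condition is transcribed from 
the description above, and the numbers are invented for illustration.

```java
// Hypothetical stand-in for the fields discussed above; NOT the real ColStatsData.
final class SampledRow {
    final long count;      // taken from table/index row-count metadata (${rowCount})
    final long ndv;        // estimated from the sampled rows only
    final long nullCount;  // sampled NULLs multiplied by ${scaleFactor}
    final String min;      // from the unsampled min/max scan
    final String max;

    SampledRow(long count, long ndv, long nullCount, String min, String max) {
        this.count = count;
        this.ndv = ndv;
        this.nullCount = nullCount;
        this.min = min;
        this.max = max;
    }

    // Mirrors ROUND(SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) * scaleFactor).
    static long scaledNullCount(long sampledNulls, double scaleFactor) {
        return Math.round(sampledNulls * scaleFactor);
    }

    // Rejection condition as described in this comment:
    // ndv == 0 && min/max is not all null && nullCount != count.
    boolean violatesFullStatInvariant() {
        boolean minMaxAllNull = (min == null && max == null);
        return ndv == 0 && !minMaxAllNull && nullCount != count;
    }

    public static void main(String[] args) {
        // Assumed numbers: metadata says 1,000,000 rows; the sample saw 3,000
        // rows, all NULL; the scale factor is 333.0; the full-table scan
        // found a single non-NULL value "7".
        long nulls = scaledNullCount(3_000L, 333.0);
        SampledRow row = new SampledRow(1_000_000L, 0L, nulls, "7", "7");
        System.out.println("scaled nullCount = " + row.nullCount);
        System.out.println("rejected = " + row.violatesFullStatInvariant());
    }
}
```

   With these assumed inputs the scaled `nullCount` is 999000, which differs 
from the metadata `count` of 1000000, so the row trips the condition even 
though every individual input is plausible for a sampled analyze.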
   
   Suggested fix direction:
   
   - Split validation by statistic provenance: keep the strict invariant for 
full analyze rows where count/null-count/ndv/min/max come from the same scan, 
but do not use `nullCount != count` as a hard rejection condition for sampled 
rows.
   - For sampled rows where full-table min/max is non-null but sampled NDV is 
`0`, normalize the representation before caching/persisting it, for example by 
clamping NDV to at least `1` or by clearing min/max for that sampled-zero-NDV 
case. A blind bypass is risky because `StatsCalculator.checkNdvValidation()` 
also treats `ndv == 0 && min/max != null` as invalid and can disable join 
reorder later.
   - Add a regression test for an OLAP sample analyze where rare non-NULL 
values are missed or under-estimated by the sample while full-table min/max is 
non-null.
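
   A rough sketch of the first two points, assuming the caller can pass the 
analyze method down to the validation step. Everything here is hypothetical 
(`StatsValidationSketch`, `Provenance`, `Row`, `validate()`); it illustrates 
the clamp-NDV option only and is not a proposed patch against the real classes.

```java
import java.util.Optional;

// Hypothetical sketch of provenance-aware validation; names are illustrative.
final class StatsValidationSketch {
    enum Provenance { FULL, SAMPLE }

    static final class Row {
        final long count;
        final long ndv;
        final long nullCount;
        final String min;
        final String max;

        Row(long count, long ndv, long nullCount, String min, String max) {
            this.count = count;
            this.ndv = ndv;
            this.nullCount = nullCount;
            this.min = min;
            this.max = max;
        }
    }

    // Returns the row to cache/persist, or empty if it must be rejected.
    static Optional<Row> validate(Row r, Provenance p) {
        boolean zeroNdvWithMinMax = r.ndv == 0 && (r.min != null || r.max != null);
        if (!zeroNdvWithMinMax) {
            return Optional.of(r);
        }
        if (p == Provenance.FULL) {
            // All fields come from one scan, so the strict check stays.
            return r.nullCount == r.count ? Optional.of(r) : Optional.empty();
        }
        // SAMPLE: min/max prove a non-NULL value exists, so clamp NDV to 1
        // instead of rejecting on the estimated nullCount.
        return Optional.of(new Row(r.count, 1L, r.nullCount, r.min, r.max));
    }
}
```

   The alternative from the list above, clearing min/max instead of clamping 
NDV, would replace only the last `return`. Either way the stored row no longer 
has the `ndv == 0 && min/max != null` shape that later checks treat as invalid.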
   
   Information that would still help validate the final patch:
   
   - Exact Doris build/commit for the reported 3.x/4.x deployments.
   - The `ANALYZE` statement or auto-analyze settings, especially sample 
rows/percent and whether tablet sampling or `LIMIT` was used.
   - A small reproducible table layout: partition/tablet count, approximate row 
counts, and where the rare non-NULL values are located.
   - FE logs around the analyze job, plus the generated analyze SQL if debug 
logging is available.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

