Biao Geng created FLINK-37315:
---------------------------------

Summary: Improve SelectivityEstimator#estimateEquals for columns in VARCHAR type
Key: FLINK-37315
URL: https://issues.apache.org/jira/browse/FLINK-37315
Project: Flink
Issue Type: Improvement
Components: Table SQL / Planner
Reporter: Biao Geng
Currently in the planner, we use *columnInterval* to estimate the percentage of rows matching an equality (=) predicate (see the [link|https://github.com/apache/flink/blob/05a3e9c578e9efe8755058d1f7f1b8e71c456645/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/metadata/SelectivityEstimator.scala#L546] for details). While this approach is effective for numeric columns (e.g., {{INT}}, {{BIGINT}}), it has limitations for {{VARCHAR}} columns. Since character-based columns lack a native concept of an "interval", the existing implementation falls back to returning {{defaultEqualsSelectivity}} for these cases.

However, we can still use the ndv (number of distinct values) statistic to obtain a better estimate for {{VARCHAR}} columns: assuming values are roughly uniformly distributed, the selectivity of an equality predicate is 1 / ndv. This is also what Spark does in its [planner|https://github.com/apache/spark/blob/0a102327d52712f3947cb90da671ffb18be62626/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L338].

We verified this change on a 10TB TPC-DS benchmark dataset and observed the following improvements:
 * *30% reduction* in execution time for queries #13, #48, #85
 * *10% reduction* for queries #67, #95
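A minimal sketch of the proposed fallback, in plain Java. The class and method names here are illustrative, not the actual Flink planner API, and the default selectivity constant is an assumed placeholder:

```java
import java.util.OptionalLong;

public class NdvSelectivitySketch {
    // Placeholder for the planner's defaultEqualsSelectivity (value assumed).
    static final double DEFAULT_EQUALS_SELECTIVITY = 0.15;

    // For columns with no usable interval (e.g., VARCHAR), fall back to
    // 1 / ndv under a uniform-distribution assumption; if ndv is unknown,
    // return the default selectivity as before.
    static double estimateEquals(OptionalLong ndv) {
        if (ndv.isPresent() && ndv.getAsLong() > 0) {
            return 1.0 / ndv.getAsLong();
        }
        return DEFAULT_EQUALS_SELECTIVITY;
    }

    public static void main(String[] args) {
        // A VARCHAR column with 100 distinct values: selectivity 1/100.
        System.out.println(estimateEquals(OptionalLong.of(100)));
        // No ndv statistic available: unchanged default behavior.
        System.out.println(estimateEquals(OptionalLong.empty()));
    }
}
```

The key point is that the 1 / ndv estimate needs only the distinct-value count from column statistics, so it applies to any type for which an interval cannot be derived.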