[ https://issues.apache.org/jira/browse/FLINK-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15815100#comment-15815100 ]
ASF GitHub Bot commented on FLINK-5394:
---------------------------------------

Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3058#discussion_r95372931

    --- Diff: flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/dataset/DataSetSort.scala ---
    @@ -71,6 +72,21 @@ class DataSetSort(
         )
       }

    +  override def estimateRowCount(metadata: RelMetadataQuery): Double = {
    +    val inputRowCnt = metadata.getRowCount(this.getInput)
    +    if (inputRowCnt == null) {
    +      inputRowCnt
    +    } else {
    +      val rowCount = Math.max(inputRowCnt - limitStart, 0D)
    --- End diff --

    Returning a cardinality estimate of `0` is not a good idea because all remaining operations might appear to have no costs at all. Rather be conservative and return `1`, which is still low but does not invalidate any subsequent costs.


> the estimateRowCount method of DataSetCalc didn't work
> ------------------------------------------------------
>
>                 Key: FLINK-5394
>                 URL: https://issues.apache.org/jira/browse/FLINK-5394
>             Project: Flink
>          Issue Type: Bug
>          Components: Table API & SQL
>            Reporter: zhangjing
>            Assignee: zhangjing
>
> The estimateRowCount method of DataSetCalc currently does not work.
> If I run the following code,
> {code}
> Table table = tableEnv
>   .fromDataSet(data, "a, b, c")
>   .groupBy("a")
>   .select("a, a.avg, b.sum, c.count")
>   .where("a == 1");
> {code}
> the cost of every node in the optimized node tree is:
> {code}
> DataSetAggregate(groupBy=[a], select=[a, AVG(a) AS TMP_0, SUM(b) AS TMP_1, COUNT(c) AS TMP_2]): rowcount = 1000.0, cumulative cost = {3000.0 rows, 5000.0 cpu, 28000.0 io}
>   DataSetCalc(select=[a, b, c], where=[=(a, 1)]): rowcount = 1000.0, cumulative cost = {2000.0 rows, 2000.0 cpu, 0.0 io}
>     DataSetScan(table=[[_DataSetTable_0]]): rowcount = 1000.0, cumulative cost = {1000.0 rows, 1000.0 cpu, 0.0 io}
> {code}
> We expect the input rowcount of DataSetAggregate to be less than 1000; however, the actual input rowcount is still 1000 because the estimateRowCount method of DataSetCalc is never used.
> There are two reasons for this:
> 1. No custom metadataProvider is provided yet, so when DataSetAggregate calls RelMetadataQuery.getRowCount(DataSetCalc) to estimate its input rowcount, the call is dispatched to RelMdRowCount.
> 2. DataSetCalc is a subclass of SingleRel, so that call matches getRowCount(SingleRel rel, RelMetadataQuery mq), which never uses DataSetCalc.estimateRowCount.
> The same problem affects all Flink RelNodes that are subclasses of SingleRel.
> I plan to resolve this problem by adding a FlinkRelMdRowCount that contains specific getRowCount methods for Flink RelNodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
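
The review comment above suggests flooring the estimate at `1` rather than `0`. A minimal, standalone sketch of that clamping follows; it is not the code from the pull request. The class `SortRowCountSketch` and its helper are hypothetical names used only for illustration, and `limitStart` is modelled as a plain `long`.

```java
public final class SortRowCountSketch {

    // Hypothetical helper, not Flink code: estimates how many rows remain
    // after skipping `limitStart` rows of the input.
    static Double estimateRowCount(Double inputRowCnt, long limitStart) {
        if (inputRowCnt == null) {
            // Input cardinality is unknown, so the result is unknown too.
            return null;
        }
        // Floor the estimate at 1 instead of 0 so that downstream
        // operators do not appear to be free in the cost model.
        return Math.max(inputRowCnt - limitStart, 1.0);
    }

    public static void main(String[] args) {
        System.out.println(estimateRowCount(1000.0, 10)); // 990.0
        System.out.println(estimateRowCount(5.0, 10));    // 1.0 (clamped)
        System.out.println(estimateRowCount(null, 10));   // null
    }
}
```

With a floor of `0`, the second call would report an empty input, and every operator above it could then be costed as processing no rows.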