> On July 30, 2018, 6:38 p.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
> > Line 354 (original), 355 (patched)
> > <https://reviews.apache.org/r/68109/diff/1/?file=2065346#file2065346line357>
> >
> >     add a comment here:
> >     We assume columns are uncorrelated. That is, filters on different
> >     columns will result in filtering out different rows. So we scale down the
> >     ndv of a column only when the row count is decreased by its own filter. Under
> >     the correlated assumption, we would have scaled down the ndv of every column for
> >     every filter condition. We don't do that.
> >     This makes our estimate more conservative than it needs to be, which is good:
> >     when we are wrong we overestimate, but we avoid the OOMs we would risk
> >     had we chosen the other assumption. In future, we need to capture
> >     correlatedness of columns in metadata so that we can account for it.

added a comment about it.
yes; I agree capturing correlations between different columns would be good - but there are around `|columns|**2` of them... I think Calcite has some tools for this.
But I currently feel that the current calculation is too numRows-centric, which makes it a little hard to keep track of columns and provide correct estimation logic for them...


> On July 30, 2018, 6:38 p.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
> > Line 2633 (original), 2647 (patched)
> > <https://reviews.apache.org/r/68109/diff/1/?file=2065346#file2065346line2653>
> >
> >     Add assert newNDV <= newNumRows.

actually that assert would fail; added code to clamp NDV to maxRows


- Zoltan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68109/#review206607
-----------------------------------------------------------


On July 30, 2018, 4:17 p.m., Zoltan Haindrich wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68109/
> -----------------------------------------------------------
>
> (Updated July 30, 2018, 4:17 p.m.)
>
>
> Review request for hive and Ashutosh Chauhan.
>
>
> Bugs: HIVE-20260
>     https://issues.apache.org/jira/browse/HIVE-20260
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> * keep track of used columns, and only rescale affected columns
> * much more conservative than old logic - possibly too much...
> * wip patch
>
>
> Diffs
> -----
>
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/AnnotateStatsProcCtx.java 47ee949fbcfa9391c640719a57fab39279c009db
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java 3c2b0854269d5426153958096a8b5b5ad3612c0f
>   ql/src/test/queries/clientpositive/stat_estimate_drill.q PRE-CREATION
>   ql/src/test/queries/clientpositive/stat_estimate_related_col.q 52da2f759a009daa372a53446e2f0fd4a88152be
>   ql/src/test/results/clientpositive/stat_estimate_drill.q.out PRE-CREATION
>   ql/src/test/results/clientpositive/stat_estimate_related_col.q.out 669adafda3a45f7846face3d99817cd1b9cb3664
>
>
> Diff: https://reviews.apache.org/r/68109/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Zoltan Haindrich
>
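
For readers following the thread, here is a rough sketch of the two ideas discussed above: rescale the NDV only for columns that the filter actually references (the uncorrelated-columns assumption), and clamp every NDV to the new row count instead of asserting `newNDV <= newNumRows`. This is not the code from the patch. The class `NdvRescaleSketch`, its method names, and the linear scaling by row-count ratio are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Hive.
final class NdvRescaleSketch {

    // Scale an NDV by the row-count ratio (an assumed, simple model),
    // then clamp it so it never exceeds the new row count.
    static long scaleAndClampNdv(long oldNdv, long oldNumRows, long newNumRows) {
        if (oldNumRows <= 0 || newNumRows <= 0) {
            return 0; // no rows left, so no distinct values either
        }
        double ratio = Math.min(1.0, (double) newNumRows / oldNumRows);
        long scaled = (long) Math.ceil(oldNdv * ratio);
        return Math.max(1, Math.min(scaled, newNumRows));
    }

    // Uncorrelated assumption: only columns referenced by the filter are rescaled.
    // Untouched columns keep their NDV, but are still clamped to the row count.
    static Map<String, Long> rescale(Map<String, Long> ndvByColumn,
                                     Set<String> filteredColumns,
                                     long oldNumRows,
                                     long newNumRows) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> e : ndvByColumn.entrySet()) {
            long ndv = e.getValue();
            if (filteredColumns.contains(e.getKey())) {
                ndv = scaleAndClampNdv(ndv, oldNumRows, newNumRows);
            } else {
                ndv = Math.min(ndv, Math.max(newNumRows, 0));
            }
            result.put(e.getKey(), ndv);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> ndv = new HashMap<>();
        ndv.put("a", 100L); // column used in the filter
        ndv.put("b", 800L); // column not used in the filter
        // A filter on "a" halves the row count from 1000 to 500.
        System.out.println(rescale(ndv, Collections.singleton("a"), 1000L, 500L));
    }
}
```

Under the correlated assumption, `b` would be scaled by the same ratio as `a`. Here it is only clamped, which keeps the estimate on the high side, in line with the trade-off described in the first review comment.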