> I feel there is not much activity on the user/dev hive mailing lists or at
> least not much support in answering my questions.

Happy Holidays! ☺

> I am wondering if it's possible to estimate the number of distinct keys and
> their distribution in a way or another.
>
> More concretely, for every stage, it is possible to determine the number of
> distinct keys and for each key the number of values before the data is
> actually processed?

Hive doesn’t do internal sorts or unique checks of any kind - the execution engine can pull that info as data moves internally within a query.

Tez publishes those counters during shuffle, yes - but not when reading off the map tasks.

You can see how they are used to automatically detect skew issues in Tez's job analyzer:

https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/SkewAnalyzer.java#L121

But have you seen "explain analyze <query>;" in the Hive 2.2.x branch yet? It may help in understanding how to collect statistics from within operators.

https://issues.apache.org/jira/browse/HIVE-14362
+
https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/tez/explainanalyze_4.q.out

Cheers,
Gopal
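
For a rough sense of what "distinct keys and the number of values per key" looks like on a sample of rows pulled outside of Hive, here is a minimal Python sketch. It is not what Tez's SkewAnalyzer or Hive's "explain analyze" does internally; the function names and the 3x-median threshold are assumptions made up purely for illustration.

```python
from collections import Counter
from statistics import median


def key_distribution(records):
    """Count how many values each key has, given (key, value) pairs."""
    return Counter(key for key, _ in records)


def skewed_keys(counts, factor=3.0):
    """Return keys whose count exceeds factor * median count.

    The median-based threshold is a hypothetical heuristic, not Tez's rule.
    """
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(k for k, c in counts.items() if c > factor * typical)


# Small made-up sample: key "a" carries most of the values.
sample = [("a", i) for i in range(7)] + [("b", 1), ("c", 1), ("c", 2)]
counts = key_distribution(sample)

print("distinct keys:", len(counts))
print("values per key:", dict(counts))
print("skewed keys:", skewed_keys(counts))
```

On real data you would run something like this over a sample of the shuffle input, since counting every key exactly means processing the data anyway.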