> I feel there is not much activity on the user/dev hive mailing lists or at 
> least not much support in answering my questions. 

Happy Holidays! ☺

> I am wondering if it's possible to estimate the number of distinct keys and 
> their distribution in a way or another. 
> 
> More concretely, for every stage, it is possible to determine the number of 
> distinct keys and for each key the number of values  before the data is 
> actually processed?

Hive itself doesn’t do internal sorts or uniqueness checks of any kind - the 
execution engine can pull that info as data moves between stages within a 
query. 

Tez publishes those counters during shuffle, yes - but not when reading off the 
map tasks.

You can see how that is used to automatically detect skew issues in the Tez 
master branch:

https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/SkewAnalyzer.java#L121
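
To give a rough idea of what you can do with those counters once the shuffle 
has run, here is a minimal sketch in Python. It is not the SkewAnalyzer code 
itself, just the general idea: compare the number of distinct keys 
(REDUCE_INPUT_GROUPS) against the number of records (REDUCE_INPUT_RECORDS) per 
reducer task. The TaskCounters class, the function names and the 0.2 threshold 
are all made up for illustration, and I'm assuming you have already pulled the 
counter values out of the task reports yourself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskCounters:
    """Hypothetical holder for two per-task counter values."""
    task_id: str
    reduce_input_groups: int   # distinct keys seen by the reducer task
    reduce_input_records: int  # total records (values) seen by the task


def keys_to_records_ratio(c: TaskCounters) -> float:
    """Distinct keys divided by records; close to 0 means few keys, many values."""
    if c.reduce_input_records == 0:
        return 1.0  # an empty task cannot be skewed
    return c.reduce_input_groups / c.reduce_input_records


def find_skewed_tasks(tasks: List[TaskCounters], min_ratio: float = 0.2) -> List[str]:
    """Return ids of tasks whose ratio falls below min_ratio (assumed threshold)."""
    return [
        t.task_id
        for t in tasks
        if t.reduce_input_records > 0 and keys_to_records_ratio(t) < min_ratio
    ]


if __name__ == "__main__":
    sample = [
        TaskCounters("reducer_0", 1000, 1200),   # roughly one value per key
        TaskCounters("reducer_1", 3, 900000),    # three keys carry everything
    ]
    print(find_skewed_tasks(sample))
```

Note that this only works after the fact: the counters exist once the reducer 
has consumed its input, not before the data is processed.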

But have you seen "explain analyze <query>;" in the Hive 2.2.x branch yet? It 
may be helpful in understanding how to collect statistics from within 
operators.

https://issues.apache.org/jira/browse/HIVE-14362
+
https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/tez/explainanalyze_4.q.out

Cheers,
Gopal

