multiple count distinct in SQL/DataFrame?

Reynold Xin Tue, 06 Oct 2015 17:52:35 -0700

The current implementation of multiple count distinct in a single query is
very inferior in terms of performance and robustness, and it is also hard
to guarantee correctness of the implementation in some of the refactorings
for Tungsten. Supporting a better version of it is possible in the future,
but will take a lot of engineering efforts. Most other Hadoop-based SQL
systems (e.g. Hive, Impala) don't support this feature.


As a result, we are considering removing support for multiple count
distinct in a single query in the next Spark release (1.6). If you use this
feature, please reply to this email. Thanks.

Note that if you don't care about null values, it is relatively easy to
reconstruct a query using joins to support multiple distincts.

multiple count distinct in SQL/DataFrame?

Reply via email to