> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
> …
> 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]
I'm hoping this is not rewriting to approx_distinct() in Presto.

> I got similar performance with Hive + LLAP too.

This is a logical plan issue, so I don't know if LLAP helps a lot.

A count + a count(distinct) is planned as a full shuffle of 100% of rows.

Run with

set hive.tez.exec.print.summary=true;

and see the output row-count of Map 1.

> What can be done to get the hive query to run faster in hive?

Try with (see if it generates a Reducer 2 + Reducer 3, which is what the speedup comes from).

set hive.optimize.distinct.rewrite=true;

or try a rewrite

select id from accounts group by id having count(1) > 1;

Both approaches enable full-speed vectorization for the query.

Cheers,
Gopal
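
For reference, here is a rough sketch of how the suggestions above could be tried in one session. The two settings and the accounts/id names come from this thread. The last query is not from the original reply: it is an assumed hand-written equivalent of the two-stage plan, using the hypothetical aliases t, c, total_rows and distinct_ids. Compare its result against the original query before relying on it.

```sql
-- Print the per-vertex summary so the output row-count of Map 1 is visible.
set hive.tez.exec.print.summary=true;

-- Ask the planner to rewrite the distinct aggregate.
set hive.optimize.distinct.rewrite=true;

-- Check whether the plan now shows a Reducer 2 + Reducer 3.
EXPLAIN
SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;

-- Assumed manual equivalent (not from the original reply):
-- stage 1 groups by id, so partial aggregation can shrink the shuffle;
-- stage 2 adds up the per-id counts and counts the groups.
SELECT SUM(t.c)    AS total_rows,    -- matches COUNT(*), NULL ids included
       COUNT(t.id) AS distinct_ids   -- skips the NULL group, like COUNT(DISTINCT id)
FROM (
  SELECT id, COUNT(1) AS c
  FROM accounts
  GROUP BY id
) t;
```

If the goal is only to check whether id is unique, the having count(1) > 1 rewrite above is enough, since it returns just the duplicated ids.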