> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
…
> 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]

I'm hoping Presto is not rewriting this to approx_distinct().

> I got similar performance with Hive + LLAP too.

This is a logical plan issue, so I don't know if LLAP helps a lot.

A count(*) plus a count(distinct) is planned as a full shuffle of 100% of the rows.

Run the query with

set hive.tez.exec.print.summary=true;

Then check the output row count of Map 1: if it is close to the 8.59M rows in the table, every row is going through the shuffle.
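
As a sketch, the whole check is just the setting followed by the query from the question, run in the same Hive session on Tez (nothing beyond the original query and table name is assumed here):

```sql
-- Print the per-vertex execution summary once the query finishes.
set hive.tez.exec.print.summary=true;

-- The query from the question; look at the Map 1 line in the summary.
SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
```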

> What can be done to get the hive query to run faster in hive?

Try with the setting below, and check whether the plan now has a Reducer 2 + Reducer 3, which is where the speedup comes from.

set hive.optimize.distinct.rewrite=true;

or try a rewrite that lists the ids that occur more than once

select id from accounts group by id having count(1) > 1;

Both approaches enable full-speed vectorization for the query.
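
If you need the two counts themselves rather than a list of repeated ids, a manual rewrite along these lines aggregates per id first and then counts the groups. This is only a sketch: the names t, cnt, total_rows and distinct_ids are mine, and I have not checked the plan it produces on your version.

```sql
-- Inner query: one row per id with its number of occurrences.
-- Outer query: SUM(cnt) recovers COUNT(*); COUNT(id) counts the
-- non-NULL groups, which matches COUNT(DISTINCT id).
SELECT SUM(cnt)  AS total_rows,
       COUNT(id) AS distinct_ids
FROM (
  SELECT id, COUNT(1) AS cnt
  FROM accounts
  GROUP BY id
) t;
```

One difference to be aware of: on an empty table SUM(cnt) returns NULL where COUNT(*) returns 0.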

Cheers,
Gopal

