Silly question… What about using COUNT() and a GROUP BY() instead?
I’m going from memory…. this may or may not work. Since you want the row_id only in order to de-dupe, right? On Jun 12, 2017, at 3:59 PM, Premal Shah <premal.j.s...@gmail.com<mailto:premal.j.s...@gmail.com>> wrote: Thanx Gopal. Sorry, took me a few days to respond. Here are some findings. hive.optimize.distinct.rewrite is True by default I do see Reducer 2 + 3. However, this might be worth mentioning. The distinct query on an ORC table takes a ton of time. I created a table with the TEXTFILE format from the ORC table and ran the same distinct query on it. That query runs in a few seconds. On running the orcfiledump utility, I see that the column on which I want to run the distinct query is encoded with a DIRECT encoding. When I run distinct on other columns in the table that are encoded with the dictionary encoding, the query runs quickly. This is the schema of another table CREATE TABLE `ip_table`( `ip` string, `id` string, `name` string, `domain` string, `country` string) CLUSTERED BY (ip) INTO 16 BUCKETS ; I've created a gist with the query execution to maintain formatting https://gist.github.com/premal/81b1b00dfffcc8280e04ec177334cb0f Running a count(distinct) query on master id took 3+ hours. It looks like the CPU was busy when running this query. Running a count(distinct) on the name and domain columns took few seconds. This is the output of the ORC File Dump for 1 of the files Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DICTIONARY_V2[245849] Encoding column 3: DICTIONARY_V2[199608] Encoding column 4: DICTIONARY_V2[161352] Encoding column 5: DICTIONARY_V2[188] The table has a total of 66,768,600 rows These are the count distinct values per column ip - 66,768,600 id - 4,291,106 name - 3,007,034 domain - 2,245,715 country - 212 One thing to note is that the id is kinda a fixed length string. It's either 16 chars or 31. The other columns have more variety in the field lengths. Not sure if that matters. Thanx in advance. On Tue, Apr 4, 2017 at 3:27 PM, Gopal Vijayaraghavan <gop...@apache.org<mailto:gop...@apache.org>> wrote: > SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts; … > 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s] I'm hoping this is not rewriting to the approx_distinct() in Presto. > I got similar performance with Hive + LLAP too. This is a logical plan issue, so I don't know if LLAP helps a lot. A count + a count(distinct) is planned as a full shuffle of 100% of rows. Run with set hive.tez.exec.print.summary=true; And see the output row-count of Map 1. > What can be done to get the hive query to run faster in hive? Try with (see if it generates a Reducer 2 + Reducer 3, which is what the speedup comes from). set hive.optimize.distinct.rewrite=true; or try a rewrite select id from accounts group by id having count(1) > 1; Both approaches enable full-speed vectorization for the query. Cheers, Gopal -- Regards, Premal Shah.