Silly question…

What about using COUNT() and a GROUP BY() instead?

I’m going from memory…. this may or may not work.  Since you want the row_id 
only in order to de-dupe, right?

On Jun 12, 2017, at 3:59 PM, Premal Shah 
<premal.j.s...@gmail.com<mailto:premal.j.s...@gmail.com>> wrote:

Thanx Gopal.
Sorry, took me a few days to respond. Here are some findings.

hive.optimize.distinct.rewrite is True by default

I do see Reducer 2 + 3.

However, this might be worth mentioning. The distinct query on an ORC table 
takes a ton of time. I created a table with the TEXTFILE format from the ORC 
table and ran the same distinct query on it.  That query runs in a few seconds.

On running the orcfiledump utility, I see that the column on which I want to 
run the distinct query is encoded with a DIRECT encoding.  When I run distinct 
on other columns in the table that are encoded with the dictionary encoding, 
the query runs quickly.

This is the schema of another table

CREATE TABLE `ip_table`(
  `ip` string,
  `id` string,
  `name` string,
  `domain` string,
  `country` string)
CLUSTERED BY (ip)
INTO 16 BUCKETS
;

I've created a gist with the query execution to maintain formatting

https://gist.github.com/premal/81b1b00dfffcc8280e04ec177334cb0f

Running a count(distinct) query on master id took 3+ hours. It looks like the 
CPU was busy when running this query.
Running a count(distinct) on the name and domain columns took few seconds.

This is the output of the ORC File Dump for 1 of the files

    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DICTIONARY_V2[245849]
    Encoding column 3: DICTIONARY_V2[199608]
    Encoding column 4: DICTIONARY_V2[161352]
    Encoding column 5: DICTIONARY_V2[188]

The table has a total of 66,768,600 rows

These are the count distinct values per column

ip                - 66,768,600
id                - 4,291,106
name          - 3,007,034
domain       - 2,245,715
country       - 212

One thing to note is that the id is kinda a fixed length string. It's either 16 
chars or 31. The other columns have more variety in the field lengths. Not sure 
if that matters.

Thanx in advance.


On Tue, Apr 4, 2017 at 3:27 PM, Gopal Vijayaraghavan 
<gop...@apache.org<mailto:gop...@apache.org>> wrote:
> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
…
> 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]

I'm hoping this is not rewriting to the approx_distinct() in Presto.

> I got similar performance with Hive + LLAP too.

This is a logical plan issue, so I don't know if LLAP helps a lot.

A count + a count(distinct) is planned as a full shuffle of 100% of rows.

Run with

set hive.tez.exec.print.summary=true;

And see the output row-count of Map 1.

> What can be done to get the hive query to run faster in hive?

Try with (see if it generates a Reducer 2 + Reducer 3, which is what the 
speedup comes from).

set hive.optimize.distinct.rewrite=true;

or try a rewrite

select id from accounts group by id having count(1) > 1;

Both approaches enable full-speed vectorization for the query.

Cheers,
Gopal





--
Regards,
Premal Shah.

Reply via email to