Thanx Gopal.
Sorry, took me a few days to respond. Here are some findings.

hive.optimize.distinct.rewrite is True by default

I do see Reducer 2 + 3.

However, this might be worth mentioning. The distinct query on an ORC table
takes a ton of time. I created a table with the TEXTFILE format from the
ORC table and ran the same distinct query on it.  That query runs in a few
seconds.

On running the orcfiledump utility, I see that the column on which I want
to run the distinct query is encoded with a DIRECT encoding.  When I run
distinct on other columns in the table that are encoded with the dictionary
encoding, the query runs quickly.

This is the schema of another table

CREATE TABLE `ip_table`(
  `ip` string,
  `id` string,
  `name` string,
  `domain` string,
  `country` string)
CLUSTERED BY (ip)
INTO 16 BUCKETS
;

I've created a gist with the query execution to maintain formatting

https://gist.github.com/premal/81b1b00dfffcc8280e04ec177334cb0f

Running a count(distinct) query on master id took 3+ hours. It looks like
the CPU was busy when running this query.
Running a count(distinct) on the name and domain columns took few seconds.

This is the output of the ORC File Dump for 1 of the files

    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DICTIONARY_V2[245849]
    Encoding column 3: DICTIONARY_V2[199608]
    Encoding column 4: DICTIONARY_V2[161352]
    Encoding column 5: DICTIONARY_V2[188]

The table has a total of 66,768,600 rows

These are the count distinct values per column

ip                - 66,768,600
id                - 4,291,106
name          - 3,007,034
domain       - 2,245,715
country       - 212

One thing to note is that the id is kinda a fixed length string. It's
either 16 chars or 31. The other columns have more variety in the field
lengths. Not sure if that matters.

Thanx in advance.


On Tue, Apr 4, 2017 at 3:27 PM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

> > SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
> …
> > 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]
>
> I'm hoping this is not rewriting to the approx_distinct() in Presto.
>
> > I got similar performance with Hive + LLAP too.
>
> This is a logical plan issue, so I don't know if LLAP helps a lot.
>
> A count + a count(distinct) is planned as a full shuffle of 100% of rows.
>
> Run with
>
> set hive.tez.exec.print.summary=true;
>
> And see the output row-count of Map 1.
>
> > What can be done to get the hive query to run faster in hive?
>
> Try with (see if it generates a Reducer 2 + Reducer 3, which is what the
> speedup comes from).
>
> set hive.optimize.distinct.rewrite=true;
>
> or try a rewrite
>
> select id from accounts group by id having count(1) > 1;
>
> Both approaches enable full-speed vectorization for the query.
>
> Cheers,
> Gopal
>
>
>


-- 
Regards,
Premal Shah.

Reply via email to