Hi Gopal,
Thanx for the reply.

I just want to clarify a few things.

1. The count distinct ip query runs fast and so it's not a problem
2. I would not expect the ip column to use DICTIONARY encoding too
3. I am more concerned about the count distinct id or count distinct
master_id column which if you look at the gist took 3+ hours. I'll
summarize the stats from the gist

SELECT COUNT(DISTINCT *ip*) FROM table - 71 seconds
SELECT COUNT(DISTINCT *id*) FROM table - 12,399 seconds
SELECT COUNT(DISTINCT *domain*) FROM table - 41 seconds
SELECT COUNT(DISTINCT *name*) FROM table - 116 seconds

While ip is more unique that id, ip runs many times faster than id.

How can I debug this ?


On Mon, Jun 12, 2017 at 4:42 PM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

> Hi,
>
> I think this is worth fixing because this seems to be triggered by the
> data quality itself - so let me dig in a bit into a couple more scenarios.
>
> > hive.optimize.distinct.rewrite is True by default
>
> FYI, we're tackling the count(1) + count(distinct col) case in the
> Optimizer now (which came up after your original email).
>
> https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#
> ConfigurationProperties-hive.optimize.countdistinct
>
> > On running the orcfiledump utility, I see that the column on which I
> want to run the distinct query is encoded with a DIRECT encoding.  When I
> run distinct on other columns in the table that are encoded with the
> dictionary encoding, the query runs quickly.
>
> So the cut-off for dictionary encoding is that the value repeats at least
> ~2x in each stripe - so very unique patterns won't trigger this.
>
> If the total # of rows of IP == total IP values, I don't expect it to be
> encoded as a dictionary.
>
> Also interesting detail - I prefer to now store IPs as 2 bigint cols.
>
> bigint ip1, bigint ip2
>
> This was primarily driven by the crazy math required to join different
> contractions of the IPv6 formatting.
>
> The two colon contractions are crazy when you want to joins across
> different data sources, if you store as a text string. Maybe 2017 is the
> year of IPv6 :D.
>
> > CLUSTERED BY (ip) INTO 16 BUCKETS
>
> This is something that completely annoys me - CLUSTERED BY does not
> cluster, but that doesn't help you here since IP is unique.
>
> You need SORTED BY (ip) to properly generate clusters in Hive.
>
> > Running a count(distinct) query on master id took 3+ hours. It looks
> like the CPU was busy when running this query.
>
> Can you do me a favour and run some intermediate state data exploratory
> queries, because some part of the slowness is probably triggered due to the
> failure tolerance checkpoints.
>
> count(distinct hash(ip)) from the table?
>
> count count(1) as collisions, hash(ip) from table group by hash(ip) order
> by collisions desc limit 10;
>
> And, if those show many collisions
>
> set tez.runtime.io.sort.mb=640;
> set hive.map.aggr=false;
> set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance
> (i.e retries are more expensive, happy path is faster)
>
> select count(distinct ip) from ip_table;
>
> Cheers,
> Gopal
>
>
>
>


-- 
Regards,
Premal Shah.

Reply via email to