Hi Gopal, Thanx for the reply. I just want to clarify a few things.
1. The count distinct ip query runs fast and so it's not a problem 2. I would not expect the ip column to use DICTIONARY encoding too 3. I am more concerned about the count distinct id or count distinct master_id column which if you look at the gist took 3+ hours. I'll summarize the stats from the gist SELECT COUNT(DISTINCT *ip*) FROM table - 71 seconds SELECT COUNT(DISTINCT *id*) FROM table - 12,399 seconds SELECT COUNT(DISTINCT *domain*) FROM table - 41 seconds SELECT COUNT(DISTINCT *name*) FROM table - 116 seconds While ip is more unique that id, ip runs many times faster than id. How can I debug this ? On Mon, Jun 12, 2017 at 4:42 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote: > Hi, > > I think this is worth fixing because this seems to be triggered by the > data quality itself - so let me dig in a bit into a couple more scenarios. > > > hive.optimize.distinct.rewrite is True by default > > FYI, we're tackling the count(1) + count(distinct col) case in the > Optimizer now (which came up after your original email). > > https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties# > ConfigurationProperties-hive.optimize.countdistinct > > > On running the orcfiledump utility, I see that the column on which I > want to run the distinct query is encoded with a DIRECT encoding. When I > run distinct on other columns in the table that are encoded with the > dictionary encoding, the query runs quickly. > > So the cut-off for dictionary encoding is that the value repeats at least > ~2x in each stripe - so very unique patterns won't trigger this. > > If the total # of rows of IP == total IP values, I don't expect it to be > encoded as a dictionary. > > Also interesting detail - I prefer to now store IPs as 2 bigint cols. > > bigint ip1, bigint ip2 > > This was primarily driven by the crazy math required to join different > contractions of the IPv6 formatting. > > The two colon contractions are crazy when you want to joins across > different data sources, if you store as a text string. Maybe 2017 is the > year of IPv6 :D. > > > CLUSTERED BY (ip) INTO 16 BUCKETS > > This is something that completely annoys me - CLUSTERED BY does not > cluster, but that doesn't help you here since IP is unique. > > You need SORTED BY (ip) to properly generate clusters in Hive. > > > Running a count(distinct) query on master id took 3+ hours. It looks > like the CPU was busy when running this query. > > Can you do me a favour and run some intermediate state data exploratory > queries, because some part of the slowness is probably triggered due to the > failure tolerance checkpoints. > > count(distinct hash(ip)) from the table? > > count count(1) as collisions, hash(ip) from table group by hash(ip) order > by collisions desc limit 10; > > And, if those show many collisions > > set tez.runtime.io.sort.mb=640; > set hive.map.aggr=false; > set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance > (i.e retries are more expensive, happy path is faster) > > select count(distinct ip) from ip_table; > > Cheers, > Gopal > > > > -- Regards, Premal Shah.