With ORC tables have you tried set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true; SET hive.exec.parallel=true; -- set hive.optimize.ppd=true;
HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On 21 June 2017 at 18:24, Premal Shah <premal.j.s...@gmail.com> wrote: > Gopal, > Thanx for the debugging steps. > > Here's the output > > *hive> select count(1) as collisions, hash(ip) from table group by > hash(ip) order by collisions desc limit 10;* > > 4 -1432955330 > 4 -317748560 > 4 -1460629578 > 4 1486313154 > 4 -320519155 > 4 1875999753 > 4 -1410139032 > 4 1596671554 > 4 503687909 > 4 989075923 > > > *hive> select count(1) as collisions, hash(id) from table group by > hash(id) order by collisions desc limit 10;* > > 1711647 -1032220119 > 1439738 -1316837863 > 851204 -330948739 > 838145 535385402 > 512621 165206418 > 308968 -36549075 > 306190 -1568034366 > 302932 -1386594327 > 284935 -1991768757 > 218979 922811836 > > > *hive> select count(1) as collisions, hash(name) from table group by > hash(name) order by collisions desc limit 10;* > > 1712041 -1906315012 > 1439738 -1583171535 > 512630 193448621 > 340485 2094529 > 308988 68745436 > 306240 79997099 > 289465 -1824055323 > 218263 1074334059 > 216464 -466945424 > > > Turning off map side aggregations definitely helped the query on *id . *The > query time went to 1 minute from the earlier 3+ hours. > > Based on the output above, both id and name have a lot of collisions, but > the name query was fast earlier too which is interesting. > > > > > On Wed, Jun 14, 2017 at 10:34 AM, Gopal Vijayaraghavan <gop...@apache.org> > wrote: > >> >> > SELECT COUNT(DISTINCT ip) FROM table - 71 seconds >> > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds >> >> Ok, I misunderstood your gist. >> >> > While ip is more unique that id, ip runs many times faster than id. >> > >> > How can I debug this ? >> >> Nearly the same way - just replace "ip" with "id" in my exploratory >> queries. >> >> count(distinct hash(id)) from the table? >> >> count count(1) as collisions, hash(id) from table group by hash(id) order >> by collisions desc limit 10; >> >> And, if those show many collisions >> >> set tez.runtime.io.sort.mb=640; >> set hive.map.aggr=false; >> set tez.runtime.pipelined.shuffle=true; // this reduces failure >> tolerance (i.e retries are more expensive, happy path is faster) >> >> select count(distinct id) from ip_table; >> >> Java's hashCode() implementation is pretty horrible (& Hive defaults to >> using it). If you're seeing a high collision count, I think I might know >> what's happening here. >> >> Cheers, >> Gopal >> >> >> > > > -- > Regards, > Premal Shah. >