With ORC tables have you tried

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
SET hive.exec.parallel=true;
--
set hive.optimize.ppd=true;

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 June 2017 at 18:24, Premal Shah <premal.j.s...@gmail.com> wrote:

> Gopal,
> Thanx for the debugging steps.
>
> Here's the output
>
> *hive> select count(1) as collisions, hash(ip) from table group by
> hash(ip) order by collisions desc limit 10;*
>
> 4       -1432955330
> 4       -317748560
> 4       -1460629578
> 4       1486313154
> 4       -320519155
> 4       1875999753
> 4       -1410139032
> 4       1596671554
> 4       503687909
> 4       989075923
>
>
> *hive> select count(1) as collisions, hash(id) from table group by
> hash(id) order by collisions desc limit 10;*
>
> 1711647 -1032220119
> 1439738 -1316837863
> 851204  -330948739
> 838145  535385402
> 512621  165206418
> 308968  -36549075
> 306190  -1568034366
> 302932  -1386594327
> 284935  -1991768757
> 218979  922811836
>
>
> *hive> select count(1) as collisions, hash(name) from table group by
> hash(name) order by collisions desc limit 10;*
>
> 1712041 -1906315012
> 1439738 -1583171535
> 512630  193448621
> 340485  2094529
> 308988  68745436
> 306240  79997099
> 289465  -1824055323
> 218263  1074334059
> 216464  -466945424
>
>
> Turning off map side aggregations definitely helped the query on *id . *The
> query time went to 1 minute from the earlier 3+ hours.
>
> Based on the output above, both id and name have a lot of collisions, but
> the name query was fast earlier too which is interesting.
>
>
>
>
> On Wed, Jun 14, 2017 at 10:34 AM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>>
>> > SELECT COUNT(DISTINCT ip) FROM table - 71 seconds
>> > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds
>>
>> Ok, I misunderstood your gist.
>>
>> > While ip is more unique that id, ip runs many times faster than id.
>> >
>> > How can I debug this ?
>>
>> Nearly the same way - just replace "ip" with "id" in my exploratory
>> queries.
>>
>> count(distinct hash(id)) from the table?
>>
>> count count(1) as collisions, hash(id) from table group by hash(id) order
>> by collisions desc limit 10;
>>
>> And, if those show many collisions
>>
>> set tez.runtime.io.sort.mb=640;
>> set hive.map.aggr=false;
>> set tez.runtime.pipelined.shuffle=true; // this reduces failure
>> tolerance (i.e retries are more expensive, happy path is faster)
>>
>> select count(distinct id) from ip_table;
>>
>> Java's hashCode() implementation is pretty horrible (& Hive defaults to
>> using it). If you're seeing a high collision count, I think I might know
>> what's happening here.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
>
> --
> Regards,
> Premal Shah.
>

Reply via email to