Hi, 

I've noticed something that I find strange with the hash-aggregate feature of 
Postgres. I'm currently running Postgres v8.4.1 on Debian Linux 64-bit.

I have a simple query that when planned either uses hash-aggregates or a sort 
depending on the amount of working memory available. The problem is that when 
it uses the hash-aggregates, the query runs 25% slower than when using the sort 
method.

The table in question contains about 60 columns, many of which are boolean, 
32-bit integers and some are 64-bit integers. Many fields are text - and some 
of these can be quite long (eg 32Kb).



The SQL is as follows: 

explain analyse 
select distinct T1.* 
  from role T1 
 where T1.endDate is null and T1.latest=true and T1.active=true and 
       T1.deceased=false and T1.desk in (BIG LIST OF INTEGERS); 


select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by 
GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"

show enable_hashagg --> "on" 
set work_mem='8MB' 
show work_mem --> "8MB" 

Explain analyse of the SQL above: 
Unique  (cost=47033.71..48410.27 rows=8881 width=1057) (actual 
time=18.803..38.969 rows=6449 loops=1) 
  ->  Sort  (cost=47033.71..47055.91 rows=8881 width=1057) (actual 
time=18.801..20.560 rows=6449 loops=1) 
        Sort Key: id, version, latest, active, deceased, person, formalnotes, 
informalnotes, description, desk, rolelevel, roletype, promotiondate, 
primaryrole, headofplace, careergrading, startdate, enddate, percentsalary, 
deskf, rolelevelf, roletypef, promotiondatef, primaryrolef, headofplacef, 
careergradingf, startdatef, enddatef, percentsalaryf, descriptionf, deskmv, 
rolelevelmv, roletypemv, promotiondatemv, primaryrolemv, headofplacemv, 
careergradingmv, startdatemv, enddatemv, percentsalarymv, descriptionmv, 
hasattachments, hasrelationships, hasprojects, audwho, audwhen, audcreated, 
costcentre, reportsto, manages, startdateest, enddateest, hasstarperformers, 
projectnames, sourcefrom, sourceto, checkedwho, checkedwhen, checkednotes, 
hasqueries, querytitles

        Sort Method:  quicksort  Memory: 2001kB 
        ->  Bitmap Heap Scan on role t1  (cost=4888.59..42321.27 rows=8881 
width=1057) (actual time=7.041..12.504 rows=6449 loops=1)

              Recheck Cond: (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])) 
              Filter: ((enddate IS NULL) AND latest AND active AND (NOT 
deceased)) 
              ->  Bitmap Index Scan on role_ix2  (cost=0.00..4886.37 rows=10984 
width=0) (actual time=6.948..6.948 rows=9296 loops=1)

                    Index Cond: ((latest = true) AND (active = true) AND 
(deceased = false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))

Total runtime: 40.777 ms 



This execution of the query used a sort to perform the "distinct". 



Now for the second run: 

select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by 
GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"

show enable_hashagg --> "on" 
set work_mem='64MB' 
show work_mem --> "64MB" 

Explain analyse of the SQL above: 
HashAggregate  (cost=43675.63..43764.44 rows=8881 width=1057) (actual 
time=46.556..55.694 rows=6449 loops=1) 
  ->  Bitmap Heap Scan on role t1  (cost=4888.59..42321.27 rows=8881 
width=1057) (actual time=7.179..13.023 rows=6449 loops=1)

        Recheck Cond: (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])) 
        Filter: ((enddate IS NULL) AND latest AND active AND (NOT deceased)) 
        ->  Bitmap Index Scan on role_ix2  (cost=0.00..4886.37 rows=10984 
width=0) (actual time=7.086..7.086 rows=9296 loops=1)

              Index Cond: ((latest = true) AND (active = true) AND (deceased = 
false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))

Total runtime: 57.536 ms 




I've tested this with v8.4.4 as well with the same results. I also tested the 
same query with our previous production version of Postgres (v8.3.8) and that 
version only appears to use sorting not hash-aggregates.



Obviously, I can re-write the query to use a "distinct on (...)" clause to 
improve performance - which is what I've done, but my question is: Why is the 
hash-aggregate slower than the sort?


Is it something to do with the number of columns? ie. When sorting, the first 
few columns defined on the table (id, version) make the row unique - but when 
using the hash-aggregate feature, presumably every column needs to be hashed 
which takes longer especially for long text fields?

Thanks, 
--Jatinder 





Coalition Development Ltd 1st  Floor, One Newhams Row, London, United Kingdom, 
SE1 3UZ
Registration Number - 04328897 Registered Office - Direct Control 3rd Floor, 
Marvic House, Bishops Road, London, United Kingdom, SW6 7AD

Reply via email to