A hash join modification patch is under review for 8.4 that needs
performance testing.   We would appreciate help with this testing. 

 

A testing version of the patch is attached in addition to testing
instructions and where to retrieve a sample data set.   The basic idea
of the patch is that it reduces disk operations for large multi-batch
hash joins where there is skew in the probe relation.  The patch
collects statistics on performance benefits when using the optimization.

 

--

Ramon Lawrence and Bryce Cutt

Overview
--------

This document provides an overview of how to test the histojoin patch.  The 
patch performs skew optimization for large, multi-batch hash joins.

Installation
------------
The patch should compile cleanly against CVS head.


Execution
---------
The skew optimization can be turned on by:

set enable_hashjoin_usestatmcvs = on;

and off by:

set enable_hashjoin_usestatmcvs = off;

If a hash join has detectable skew in the larger probe relation, then the skew 
optimization will output the amount of skew it sees and the number of tuples it 
will buffer in memory to exploit that skew.  When the hash join completes, it 
will output statistics on the number of tuples actually matched by the 
in-memory (IM) skew partition and the number of tuples in partition 0. The 
improvements in join I/Os is also given.

Sample (from LI-P TPCH 10G 1Z):

Values: 100 Skew: 0.27  Est. tuples: 59986052.00 Batches: 512  Est. Save: 
16114709.99
Total Inner Tuples: 2000000
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778 
Total Output Tuples: 59986052 
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98


Data Set
--------
A sample test data set is TPC-H scale factor 1 GB.  A pg_dump can be downloaded 
from:

http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip

The larger 10 GB data sets are available on request.  You can also download the 
generator itself (works only on Windows) at:

http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip

The only joins with significant skew in the database are Part-LineItem and 
Supplier-LineItem.


Result Notes
------------

1) The percentage benefit increases with the amount of skew.  Relations with no 
skew are not affected.  Relations with minimal skew show no noticeable 
improvement or negative impact.

2) Since disk I/Os in the join is only one part of the query execution time, 
overall execution times do not improve the same amount as the reduction in disk 
I/Os.  For CPU-bound queries, the disk I/O improvement may not have a 
significant effect on the overall time.

3) The relations are quite large.  Thus, queries with SELECT * that join 
several relations are very costly and the generation of the tuples dominates 
the execution time (especially if executing the query through a client such as 
pgAdmin).


Previous Results
----------------

The join with LineItem-Part on TPCH 1G 1Z shows about a 26% improvement in I/Os 
performed during the join and about 5-10% improvement in overall time.  The 
join with LineItem-Supplier is similar.  Data sets with higher skew show even 
better performance.  For example, Lineitem-Part on TPCH 10G 2Z has 90% of probe 
relation tuples matching 100 most common values.  The improvement in I/Os is 
about 90% and time about 50%.

Some sample test queries:

Query #1a:
SELECT * FROM Part, Lineitem WHERE p_partkey = l_partkey;

Query #1b:
SELECT count(*) FROM Part, Lineitem WHERE p_partkey = l_partkey;

Query #2a:
SELECT * FROM Supplier, Lineitem WHERE s_suppkey = l_suppkey;

Query #2b:
SELECT count(*) FROM Supplier, Lineitem WHERE s_suppkey = l_suppkey;

Query #3a:
SELECT * FROM Part, Lineitem, Supplier WHERE p_partkey = l_partkey and 
s_suppkey = l_suppkey;

Query #3b:
SELECT count(*) FROM Part, Lineitem, Supplier WHERE p_partkey = l_partkey and 
s_suppkey = l_suppkey;

Attachment: histojoin_testing.patch
Description: histojoin_testing.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to