Hi,

While testing some stuff, I noticed heapam_estimate_rel_size (or rather
table_block_relation_estimate_size it calls) ignores fillfactor, so that
for a table without statistics it ends up with reltuple estimate much
higher than reality. For example, with fillfactor=10 the estimate is
about 10x higher.

I ran into this while doing some tests with hash indexes, where I use
fillfactor to make the table look bigger (as if the tuples were wider),
and I ran into this:

  drop table hash_test;
  create table hash_test (a int, b text) with (fillfactor=10);
  insert into hash_test select 1 + ((i - 1) / 10000), md5(i::text)
         from generate_series(1, 1000000) s(i);
  -- analyze hash_test;
  create index hash_test_idx on hash_test using hash (a);

  select pg_size_pretty(pg_relation_size('hash_test_idx'));

If you run it like this (without the analyze), the index will be 339MB.
With the analyze, it's 47MB.

This only happens if there are no relpages/reltuples statistics yet, in
which case table_block_relation_estimate_size estimates density from
tuple width etc.

So it seems the easiest "fix" is to do ANALYZE before creating the index
(and not after it, as I had in my scripts). But I wonder how many people
fail to realize this - it sure took me a while to realize the indexes
are too large and even longer what is causing it. I wouldn't be very
surprised if many people had bloated hash indexes after bulk loads.

So maybe we should make table_block_relation_estimate_size smarter to
also consider the fillfactor in the "no statistics" branch, per the
attached patch.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a5e6c92f35..3a26846f01 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -737,11 +737,19 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
 		 * and (c) different table AMs might use different padding schemes.
 		 */
 		int32		tuple_width;
+		int			fillfactor;
+
+		/*
+		 * Without reltuples/relpages, we also need to consider fillfactor.
+		 * The other branch considers it implicitly by calculating density
+		 * from actual relpages/reltuples statistics.
+		 */
+		fillfactor = RelationGetFillFactor(rel, HEAP_DEFAULT_FILLFACTOR);
 
 		tuple_width = get_rel_data_width(rel, attr_widths);
 		tuple_width += overhead_bytes_per_tuple;
 		/* note: integer division is intentional here */
-		density = usable_bytes_per_page / tuple_width;
+		density = (usable_bytes_per_page * fillfactor / 100) / tuple_width;
 	}
 	*tuples = rint(density * (double) curpages);
 

Reply via email to