Hi,

On 2021-10-21 18:27:25 -0400, Tom Lane wrote:
> Today, pg_dump does a lot of internal lookups via binary search
> in presorted arrays.  I thought it might improve matters
> to replace those binary searches with hash tables, theoretically
> converting O(log N) searches into O(1) searches.  So I tried making
> a hash table indexed by CatalogId (tableoid+oid) with simplehash.h,
> and replacing as many data structures as I could with that.

That does sound like a good idea in theory...
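For anyone who hasn't used simplehash.h from frontend code: a
CatalogId-keyed table of the kind described could be declared roughly
like this (a sketch of my own; the CatalogIdEntry layout and the
catalogid_* names are illustrative, not necessarily what the patch
does):

    #include "postgres_fe.h"
    #include "common/hashfn.h"

    typedef struct CatalogId        /* as already defined in pg_dump.h */
    {
        Oid         tableoid;
        Oid         oid;
    } CatalogId;

    typedef struct CatalogIdEntry
    {
        CatalogId   key;            /* hash key: tableoid + oid */
        void       *object;         /* e.g. the DumpableObject with that id */
        char        status;         /* status byte required by simplehash.h */
    } CatalogIdEntry;

    #define SH_PREFIX       catalogid
    #define SH_ELEMENT_TYPE CatalogIdEntry
    #define SH_KEY_TYPE     CatalogId
    #define SH_KEY          key
    #define SH_HASH_KEY(tb, k) \
        hash_bytes((const unsigned char *) &(k), sizeof(CatalogId))
    #define SH_EQUAL(tb, a, b) \
        ((a).oid == (b).oid && (a).tableoid == (b).tableoid)
    #define SH_SCOPE        static inline
    #define SH_RAW_ALLOCATOR pg_malloc0  /* frontend code, no palloc() */
    #define SH_DECLARE
    #define SH_DEFINE
    #include "lib/simplehash.h"

After which a lookup that today is an O(log N) bsearch() over the
presorted DumpableObject array becomes an O(1) expected hash probe:

    catalogid_hash *tab = catalogid_create(8192, NULL);
    bool        found;
    CatalogIdEntry *entry;

    entry = catalogid_insert(tab, objId, &found);  /* register an object */
    entry = catalogid_lookup(tab, objId);          /* find it again */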
> This makes the code shorter and (IMO anyway) cleaner, but
>
> (a) the executable size increases by a few KB --- apparently, even
> the minimum subset of simplehash.h's functionality is code-wasteful.

Hm. Surprised a bit by that. In an optimized build the difference is
smaller, at least:

optimized:
   text    data     bss     dec     hex filename
 448066    7048    1368  456482   6f722 src/bin/pg_dump/pg_dump
 447530    7048    1496  456074   6f58a src/bin/pg_dump/pg_dump.orig

debug:
   text    data     bss     dec     hex filename
 516883    7024    1352  525259   803cb src/bin/pg_dump/pg_dump
 509819    7024    1480  518323   7e8b3 src/bin/pg_dump/pg_dump.orig

The fact that optimization plays such a role makes me wonder if a good
chunk of the difference is the slightly more complicated
find{Type,Func,...}ByOid() functions.

> (b) I couldn't measure any change in performance at all. I tried
> it on the regression database and on a toy DB with 10000 simple
> tables. Maybe on a really large DB you'd notice some difference,
> but I'm not very optimistic now.

Did you measure the runtime of pg_dump, or how much CPU it used? I
think a lot of the time the backend is a bigger bottleneck than
pg_dump...

For the regression test DB, the majority of the time seems to be spent
below two things: 1) libpq, 2) sortDumpableObjects(). I don't think 2)
hits the binary search / hashtable path?

It does seem interesting that a substantial part of the time is spent
in/below PQexec() and PQfnumber(). Especially the latter shouldn't be
too hard to optimize away...

Greetings,

Andres Freund
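PS: To expand on the PQfnumber() remark: it does a linear, case-folding
scan of the result's column names on every call, so the obvious way to
optimize it away is to resolve the column numbers once per result set
rather than once per row. A sketch of that pattern (the function and
column names here are invented, not pg_dump's actual code):

    #include <libpq-fe.h>

    static void
    process_result(PGresult *res)
    {
        /* resolve column numbers once, outside the row loop */
        int         i_oid = PQfnumber(res, "oid");
        int         i_relname = PQfnumber(res, "relname");
        int         ntups = PQntuples(res);

        for (int row = 0; row < ntups; row++)
        {
            const char *oid = PQgetvalue(res, row, i_oid);
            const char *relname = PQgetvalue(res, row, i_relname);

            /* ... build whatever object these values describe ... */
            (void) oid;
            (void) relname;
        }
    }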