Hello Tom!

I noticed you are currently working on improvements to pg_dump.

Some time ago I experimented with dumping a customer database in parallel 
directory mode (-F directory -j 2 to 4).

I noticed it took quite a long time to complete.

Further investigation showed that in this mode, with multiple jobs, the tables 
are processed in decreasing size order, which makes sense: it avoids a long 
tail where one job is left with a big table that prolongs the overall dump time.

Exactly one table took very long, even though it seemed to be of only moderate size.

It turned out that the size determination fails to consider the size of TOAST 
tables, and this table had a large associated TOAST table backing its bytea column(s).
Even with an ANALYZE at load time there was no size information for the 
TOAST table in the catalogs.
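
Something along these lines shows the discrepancy (the table name is just a 
placeholder): pg_class.relpages for the TOAST relation carries no useful size 
information, while pg_table_size() reports the real on-disk size:

    -- 'big_bytea_table' is only an example name
    SELECT c.relname,
           c.relpages,                           -- what the size ordering sees
           t.relname  AS toast_relname,
           t.relpages AS toast_relpages,         -- no useful value in my case
           pg_table_size(c.oid) AS total_bytes   -- includes TOAST, but goes to disk
    FROM pg_class c
    LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
    WHERE c.relname = 'big_bytea_table';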

I thought of the following alternatives to ameliorate this:

1. Using the pg_table_size() function in the catalog query (see the sketch after this list)
Pro: This reflects the correct size of every relation
Con: This goes out to disk and may have a significant impact on databases with 
very many tables

2. Teaching VACUUM to set the TOAST table size the same way it sets it for normal tables

3. Have a command/function for occasionally setting the (approximate) size of 
TOAST tables
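
As a sketch of alternative 1 (column aliases are only illustrative), the 
catalog query could order on pg_table_size(), or, much more cheaply, on the 
relpages of the main relation plus its TOAST relation; the cheap variant would 
only become reliable if something like alternative 2 or 3 kept the TOAST 
relpages up to date:

    -- exact, but touches the disk for every relation
    SELECT c.oid, c.relname,
           pg_table_size(c.oid) AS sortsize
    FROM pg_class c
    WHERE c.relkind = 'r'
    ORDER BY sortsize DESC;

    -- catalog-only approximation: add the TOAST relation's relpages
    SELECT c.oid, c.relname,
           c.relpages + COALESCE(t.relpages, 0) AS approx_pages
    FROM pg_class c
    LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
    WHERE c.relkind = 'r'
    ORDER BY approx_pages DESC;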

I think that, with further work under way (not yet ready), pg_dump can really 
profit from the parallel/non-compressing mode, especially considering the huge 
amount of bytea/blob/string data in many big customer scenarios.

Thoughts?

Hans Buschmann

