Hi Andrew,
This is very interesting.
I had started looking at pg_dumpall trying to work out an approach. I
noticed parallel.c essentially already does all the thread creation and
coordination that I knew would be needed. Given that is a solved
problem, I started to look further (continued below).
On 22-Jul-2024 11:50, Andrew Dunstan wrote:
On 2024-07-19 Fr 9:46 AM, Thomas Simpson wrote:
Hi Scott,
I realize some of the background was snipped on what I sent to the
hackers list; I'll try to fill in the details.
The short background is that a very large database ran out of space
during a vacuum full, taking down the server. There is a replica which
was applying the WALs, so it too ran out of space. On restart after
clearing some space, the database came back up but left behind the
in-progress rebuild files. I've cleared that replica and am using it
as my rebuild target just now.
Trying to identify the 'orphan' files and move them away always led
to the database spotting that the supposedly unused files were gone
and refusing to start, so I had no successful way to clean up and get
the space back.
The last resort after discussion is pg_dumpall & reload. I'm doing this
via a network pipe (netcat) as I do not have the vast amount of
storage necessary to store the dump file (in any format).
On 19-Jul-2024 09:26, Scott Ribe wrote:
Do you actually have 100G networking between the nodes? Because if
not, a single CPU should be able to saturate 10G.
The servers connect via a 10G WAN; sending is not the issue, it's
applying the incoming stream on the destination that is the
bottleneck.
Likewise the receiving end would need disk capable of keeping up.
Which brings up the question, why not write to disk, but directly to
the destination rather than write locally then copy?
In this case, it's not a local write, it's piped via netcat.
Do you require dump-reload because of suspected corruption? That's a
tough one. But if not, if the goal is just to get up and running on
a new server, why not pg_basebackup, streaming replica, promote?
That depends on the level of data modification activity being low
enough that pg_basebackup can keep up with WAL as it's generated and
apply it faster than new WAL comes in, but given that your server is
currently keeping up with writing that much WAL and flushing that
many changes, it seems likely it would keep up as long as the network
connection is fast enough. Anyway, in that scenario, you don't need
to care how long pg_basebackup takes.
If you do need a dump/reload because of suspected corruption, the
only thing I can think of is something like doing it a table at a
time--partitioning would help here, if practical.
The basebackup is, to the best of my understanding, essentially just
copying the database files. Since the failed vacuum has left extra
files, my expectation is these too would be copied, leaving me in the
same position I started in. If I'm wrong, please tell me as that
would be vastly quicker - it is how I originally set up the replica
and it took only a few hours on the 10G link.
The inability to get a clean start if I move any files out of the way
leads me to be concerned about some underlying corruption/issue, and
the recommendation earlier in the discussion was to opt for
dump/reload as the fail-safe.
Resigned to my fate, my thought was to see if there is a way to
improve the dump-reload approach for the future. Since dump-reload
is the ultimate upgrade suggestion in the documentation, it seems
worthwhile to see if its performance can be improved, especially as
very large databases like mine are a thing with PostgreSQL. From a
quick review of pg_dump.c (I'm no expert on it, obviously), it feels
like it's already doing most of what needs to be done, and the missing
piece is some sort of multi-thread coordination with a restore client
to ensure each thread can successfully complete each task it has
before accepting more work. I realize that's actually difficult to
implement.
There is a plan for a non-text mode for pg_dumpall. I have started
work on it, and hope to have a WIP patch in a month or so. It's not my
intention to parallelize it for the first cut, but it could definitely
be parallelized in future. However, it will require writing to disk
somewhere, albeit that the data will be compressed. It's well nigh
impossible to parallelize text format dumps.
Restoration of custom and directory format dumps has long been
parallelized. Parallel dumps require directory format, and so will
non-text pg_dumpall.
My general approach (which I'm sure is naive) was:
Add to pg_dumpall the concept of a backup phase; I have the basic hooks
in place. Phase 0 = role grants etc., the stuff before dumping the
actual databases. I intercepted the fprintf(OPF, ...) calls with a hook
function that, for a normal run, just ends up doing the same as
fprintf, but in parallel mode sends the info via the network (still to
be done, though I think I may need to alter the fprintf stuff with more
granularity about what is being processed at each output to help this
part, such as outputRoleCreate, outputComment etc.).
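
To make the hook idea concrete, here is a minimal sketch of the sort of
thing I mean (dumpallPrintf, sendToRestore, the mode flag and so on are
names I've made up for illustration; none of this is in pg_dumpall.c
today):

/*
 * Sketch only: every fprintf(OPF, ...) in pg_dumpall would go through a
 * wrapper like this, so the global SQL (roles, tablespaces, grants) can
 * either be written to the output file as now, or shipped straight to a
 * waiting restore process in the proposed network mode.
 */
#include <stdarg.h>
#include <stdio.h>

typedef enum
{
    OUTPUT_MODE_FILE,           /* current behaviour: write to OPF */
    OUTPUT_MODE_NETWORK         /* proposed: stream to the restore side */
} OutputMode;

static OutputMode output_mode = OUTPUT_MODE_FILE;
static FILE *OPF;               /* output file, as in pg_dumpall.c */
static int   net_fd = -1;       /* assumed socket to the restore listener */

/* assumed helper that writes a buffer down the restore connection */
extern void sendToRestore(int fd, const char *buf, size_t len);

static void
dumpallPrintf(const char *fmt, ...)
{
    char        buf[8192];
    va_list     ap;
    int         len;

    va_start(ap, fmt);
    len = vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    if (len < 0)
        return;                 /* a real patch would report the error */
    if (len >= (int) sizeof(buf))
        len = (int) sizeof(buf) - 1;    /* a real patch would not truncate */

    if (output_mode == OUTPUT_MODE_FILE)
        fputs(buf, OPF);        /* same net effect as fprintf(OPF, ...) */
    else
        sendToRestore(net_fd, buf, (size_t) len);
}

The more granular entry points I mentioned (outputRoleCreate,
outputComment, ...) would then just be thin wrappers that call this
with some indication of what kind of object is being emitted.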
Each subsequent phase is a whole database, incrementing at each pg_dump
call. The actual pg_dump gets a new format, -F N for network, based
around the directory format; my intention was to make multiple network
pipes and send the data over them in place of the files within the
directory, essentially relying on whatever is already done to organize
parallel dumps to disk to be sufficient for coordinating the network
streaming.
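
Roughly, the idea is that each stream the directory format would write
as a file in the dump directory instead gets framed and sent down one
of the network pipes. A small header along these lines (entirely
made-up names, just to show the shape of it) would let the restore side
tell which phase and which piece of the dump each chunk belongs to:

/*
 * Sketch of wire framing for a hypothetical -F N (network) format.  Each
 * chunk the directory format would write to a file is instead sent with
 * a header describing where it fits.
 */
#include <stdint.h>

typedef enum
{
    MSG_PHASE_BEGIN = 1,    /* start of a phase: 0 = globals, then one per database */
    MSG_ENTRY_DATA,         /* a chunk of data for one dump entry (e.g. a table) */
    MSG_PHASE_END,          /* dump side has finished sending this phase */
    MSG_PHASE_ACK,          /* restore side confirms the phase restored cleanly */
    MSG_RESTORE_ERROR       /* restore side failed; dump side should halt */
} NetMsgType;

typedef struct
{
    uint32_t    msg_type;   /* one of NetMsgType */
    uint32_t    phase;      /* 0 = roles/globals, N = Nth database dumped */
    uint32_t    entry_id;   /* which entry within the phase this chunk is for */
    uint32_t    data_len;   /* number of payload bytes following this header */
} NetMsgHeader;

Each parallel worker that today writes its own file under the dump
directory would instead write these framed chunks down its own pipe.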
The restore side needs to do a network listen plus some handshaking to
confirm completion of the incoming phases, any necessary dependency
tracking on restore, etc.
My goal was to actively avoid the disk usage part through coordination
over the network between dump and restore, even though my starting
point is the pg_backup_directory code. Any problem on the restore side
would feed back and halt the dump side with an error, so this is a new
failure mode compared with how it works just now.
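
Putting the restore side together with the framing above, the listen
loop I have in mind is roughly the following (again a sketch with
invented helper names, not code from a patch). A failed restore sends
MSG_RESTORE_ERROR back, which is the new failure mode where the dump
side stops rather than carrying on:

/*
 * Sketch of the restore-side handshake loop, reusing the NetMsgHeader
 * framing above.  Each phase is applied as it arrives and acknowledged;
 * any failure is reported back so the dump side halts instead of
 * streaming data that can no longer be restored.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* assumed helpers; names are illustrative only */
extern bool recvMsg(int fd, NetMsgHeader *hdr, char *payload, size_t maxlen);
extern void sendReply(int fd, uint32_t phase, uint32_t msg_type);
extern bool applyToCluster(uint32_t phase, uint32_t entry_id,
                           const char *data, size_t len);

static void
restoreListenLoop(int fd)
{
    NetMsgHeader hdr;
    char         payload[65536];

    while (recvMsg(fd, &hdr, payload, sizeof(payload)))
    {
        switch (hdr.msg_type)
        {
            case MSG_PHASE_BEGIN:
                fprintf(stderr, "starting phase %u\n", hdr.phase);
                break;

            case MSG_ENTRY_DATA:
                if (!applyToCluster(hdr.phase, hdr.entry_id,
                                    payload, hdr.data_len))
                {
                    /* feed the failure back so the dump side halts */
                    sendReply(fd, hdr.phase, MSG_RESTORE_ERROR);
                    return;
                }
                break;

            case MSG_PHASE_END:
                /* confirm completion before more work is accepted */
                sendReply(fd, hdr.phase, MSG_PHASE_ACK);
                break;

            default:
                break;
        }
    }
}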
I'll hold off a bit as I'm very interested in any feedback you have,
particularly if you see serious flaws in my thought process here.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Thanks
Tom