On Fri, May 8, 2020 at 5:27 PM Andres Freund <and...@anarazel.de> wrote:
> I wonder if there's cases where recursively forwarding like this will
> cause noticable performance effects. The only operation that seems
> frequent enough to potentially be noticable would be "chunks" of the
> file. So perhaps it'd be good to make sure we read in large enough
> chunks?
Yeah, that needs to be tested. Right now the chunk size is 32kB, but it might be a good idea to go larger. Another thing to consider is that the chunk size is currently tied to the protocol message size, and I'm not sure whether the size that's optimal for disk reads is also optimal for protocol messages.

> This, to me, indicates that we should measure the progress solely based
> on how much of the "source" data was processed. The overhead of tar, the
> reduction due to compression, shouldn't be included.

I don't think it's a particularly bad thing that we include a small amount of progress for sending an empty file, a directory, or a symlink. That could make the results more meaningful if you have a database with lots of empty relations in it. However, I agree that the effect of compression shouldn't be included. To get there, I think we need to redesign the wire protocol. Right now, the server has no way of letting the client know how many uncompressed bytes it has sent, and the client has no way of figuring that out without uncompressing the data, which seems like something we want to avoid.

There are some other problems with the current wire protocol, too:

1. The syntax for the BASE_BACKUP command is large and unwieldy. We really ought to adopt an extensible options syntax, as COPY, VACUUM, EXPLAIN, etc. do, rather than using a zillion ad-hoc bolt-ons, each with bespoke lexer and parser support.

2. The client is sent a list of tablespaces and is supposed to use that to expect an equal number of archives, computing the name for each one on the client side from the tablespace info. However, I think we should be able to support modes like "put all the tablespaces in a single archive" or "send a separate archive for every 256GB" or "ship it all to the cloud and don't send me any archives". To get there, I think the server should send the archive name to the client, and the client should just keep receiving the next archive until it's told that there are no more. Then, whether there's one archive, ten archives, or no archives, the client doesn't have to care: it just receives what the server sends until it hears that there are no more. The client also doesn't need to know how the server names the archives; the server can, for example, adjust the archive names based on which compression format is chosen, without its naming conventions having to be known on the client side.

I think we should keep support for the current BASE_BACKUP command, but either add a new variant that uses an extensible options syntax or invent a whole new command with a different name (BACKUP, SEND_BACKUP, whatever) that takes extensible options. This command should send back all the archives and the backup manifest using a single COPY stream rather than multiple COPY streams. Within the COPY stream, we'd invent a sub-protocol based on, say, the first letter of the message:

t = Tablespace boundary. No further message payload. Indicates, for progress reporting purposes, that we are advancing to the next tablespace.

f = Filename. The remainder of the message payload is the name of the next file that will be transferred.

d = Data. The next four bytes contain the number of uncompressed bytes covered by this message, for progress reporting purposes. The rest of the message is payload, possibly compressed. It could be empty, if the data is being shipped elsewhere and these messages are only being sent to update the client's notion of progress.
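To make that a bit more concrete, here's a rough sketch of what client-side dispatch for such a sub-protocol might look like. The names here (BackupStreamState, handle_backup_message) are made up purely for illustration, and I'm assuming the length field in 'd' messages would be sent in network byte order like other protocol integers:

/*
 * Hypothetical client-side dispatch for the sub-protocol sketched above.
 * Nothing like this exists in the tree today; it's only meant to show the
 * shape of the thing.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct BackupStreamState
{
    uint64_t    uncompressed_bytes; /* progress, in source bytes */
    char        current_file[1024]; /* name from the last 'f' message */
} BackupStreamState;

/* "msg" is the payload of one CopyData message, "len" its length. */
static void
handle_backup_message(BackupStreamState *state, const char *msg, size_t len)
{
    if (len == 0)
        return;

    switch (msg[0])
    {
        case 't':               /* tablespace boundary, no payload */
            /* progress reporting advances to the next tablespace */
            break;

        case 'f':               /* rest of the payload is the file name */
            snprintf(state->current_file, sizeof(state->current_file),
                     "%.*s", (int) (len - 1), msg + 1);
            break;

        case 'd':               /* 4-byte uncompressed length, then data */
            if (len >= 5)
            {
                uint32_t    raw_len;

                /* assuming network byte order for the length field */
                raw_len = ((uint32_t) (unsigned char) msg[1] << 24) |
                          ((uint32_t) (unsigned char) msg[2] << 16) |
                          ((uint32_t) (unsigned char) msg[3] << 8) |
                          ((uint32_t) (unsigned char) msg[4]);
                state->uncompressed_bytes += raw_len;

                /*
                 * The payload, if any, starts at msg + 5 and is len - 5
                 * bytes long; it may be empty if the archive is being
                 * shipped elsewhere and this message exists only to
                 * report progress.
                 */
            }
            break;

        default:
            fprintf(stderr, "unexpected backup message type '%c'\n", msg[0]);
            break;
    }
}

Note that progress accounting here is driven entirely by the four-byte uncompressed length, so it's unaffected by whatever compression the server applies to the payload, which is the point of the redesign.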
> I've not though enough about the specifics, but I think it looks like
> it's going roughly in a better direction.

Good to hear.

> One thing I wonder about is how stateful the interface is. Archivers
> will pretty much always track which file is currently open etc. Somehow
> such a repeating state machine seems a bit ugly - but I don't really
> have a better answer.

I thought about that a bit, too. There might be some way to unify that by having a common context object, defined by basebackup.c and handed to every archiver, so that they all have the commonly-desired details available without needing bespoke code. I'm not sure at this point whether that would actually produce a nicer result, though. Even if we don't have it initially, it seems like it wouldn't be very hard to add later, so I'm not too stressed about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company