On 4/16/24 06:33, Robert Haas wrote:
On Sun, Apr 14, 2024 at 11:53 PM David Steele <da...@pgmasters.net> wrote:
Since incremental backup is using INCREMENTAL as a keyword (rather than
storing incremental info in the manifest) it is vulnerable to any file
in PGDATA with the pattern INCREMENTAL.*.
Yeah, that's true. I'm not greatly concerned about this, because I
think anyone who creates a file called INCREMENTAL.CONFIG is just
being perverse.
Well, it's INCREMENTAL.* and you might be surprised (though I doubt it)
at what users will do. We've been caught out by wacky file names (and
database names) a few times.
However, we could ignore those files just in subtrees
where they're not expected to occur i.e. only pay attention to them
under base, global, and pg_tblspc. I didn't do this because I thought
someone might eventually want to do something like incremental backup
of SLRU files, and then it would be annoying if there were a bunch of
random pathname restrictions. But if we're really concerned about
this, I think it would be a reasonable fix; you're not really supposed
to decide to store your configuration files in directories that exist
for the purpose of storing relation data files.
I think it would be reasonable to restrict what can be put in base/ and
global/ but users generally feel free to create whatever they want in
the root of PGDATA, despite being strongly encouraged not to.
Anyway, I think it should be fixed or documented as a caveat since it
causes a hard failure on restore.
We could fix the issue by forbidding this file pattern in PGDATA, i.e.
error when it is detected during pg_basebackup, but in my view it would
be better (at least eventually) to add incremental info to the manifest.
That would also allow us to skip storing zero-length files and
incremental stubs (with no changes) as files.
I did think about this, and I do like the idea of being able to elide
incremental stubs. If you have a tremendous number of relations and
very few of them have changed at all, the incremental stubs might take
up a significant percentage of the space consumed by the incremental
backup, which kind of sucks, even if the incremental backup is still
quite small compared to the original database. And, like you, I had
the idea of trying to use the backup_manifest to do it.
But ... I didn't really end up feeling very comfortable with it. Right
now, the backup manifest is something we only use to verify the
integrity of the backup. If we were to do this, it would become a
critical part of the backup.
For my 2c the manifest is absolutely a critical part of the backup. I'm
having a hard time even seeing why we have the --no-manifest option for
pg_combinebackup at all. If the manifest is missing who knows what else
might be missing as well? If present, why wouldn't we use it?
I know Tomas added some optimizations that work best with --no-manifest
but if we can eventually read compressed tars (which I expect to be the
general case) then those optimizations are not very useful.
I don't think I like the idea that
removing the backup_manifest should be allowed to, in effect, corrupt
the backup. But I think I dislike even more the idea that the data
that is used to verify the backup gets mushed together with the backup
data itself. Maybe in practice it's fine, but it doesn't seem very
conceptually clean.
I don't think this is a problem. The manifest can do more than store
verification info, IMO.
There are also some practical considerations that made me not want to
go in this direction: we'd need to make sure that the relevant
information from the backup manifest was available to the
reconstruction process at the right part of the code. When iterating
over a directory, it would need to consider all of the files actually
present in that directory plus any "hallucinated" incremental stubs
from the manifest. I didn't feel confident of my ability to implement
that in the available time without messing anything up.
I think it would be better to iterate over the manifest than files in a
directory. In any case we still need to fix [1], which presents more or
less the same problem.
I think we should consider other possible designs here. For example,
instead of including this in the manifest, we could ship one
INCREMENTAL.STUBS file per directory that contains a list of all of
the incremental stubs that should be created in that directory. My
guess is that something like this will turn out to be way simpler to
code.
So we'd store a mini manifest per directory rather than just put the
info in the main manifest? Sounds like unnecessary complexity to me.
Regards,
-David
[1]
https://www.postgresql.org/message-id/flat/9badd24d-5bd9-4c35-ba85-4c38a2feb73e%40pgmasters.net