[PATCH] add relation and block-level filtering to pg_waldump
Greetings,

This patch adds the ability to specify a RelFileNode and optional BlockNum to
limit output of pg_waldump records to only those which match the given
criteria. This should be more performant than `pg_waldump | grep` as well as
more reliable given specific variations in output style depending on how the
blocks are specified.

This currently affects only the main fork, but we could presumably add the
option to filter by fork as well, if that is considered useful.

Best,

David

From 9194b2cb07172e636030b9b4e979b7f2caf7cbc0 Mon Sep 17 00:00:00 2001
From: David Christensen
Date: Thu, 24 Feb 2022 11:00:46 -0600
Subject: [PATCH] Add relation/block filtering to pg_waldump

This feature allows you to only output records that are targeting a
specific RelFileNode and optional BlockNumber within this relation.

Currently only applies this filter to the relation's main fork.
---
 doc/src/sgml/ref/pg_waldump.sgml | 23 ++++++++
 src/bin/pg_waldump/pg_waldump.c  | 74 +++++++++++++++++++++++++++-
 2 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
index 5735a161ce..c953703bc8 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -100,6 +100,29 @@ PostgreSQL documentation
+
+      -k block
+      --block=block
+
+
+       Display only records touching the given block. (Requires also
+       providing the relation via --relation.)
+
+
+
+
+
+      -l tbl/db/rel
+      --relation=tbl/db/rel
+
+
+       Display only records touching the given relation. The relation is
+       specified via tablespace oid, database oid, and relfilenode separated
+       by slashes.
+
+
+
      -n limit
      --limit=limit

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..faae547a5c 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -55,6 +55,10 @@ typedef struct XLogDumpConfig
 	bool		filter_by_rmgr_enabled;
 	TransactionId filter_by_xid;
 	bool		filter_by_xid_enabled;
+	RelFileNode filter_by_relation;
+	bool		filter_by_relation_enabled;
+	BlockNumber filter_by_relation_block;
+	bool		filter_by_relation_block_enabled;
 } XLogDumpConfig;

 typedef struct Stats
@@ -394,6 +398,34 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	return count;
 }

+/*
+ * Boolean to return whether the given WAL record matches a specific
+ * relation and optional block
+ */
+static bool
+XLogRecordMatchesRelationBlock(XLogReaderState *record, RelFileNode matchRnode,
+							   BlockNumber matchBlock)
+{
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blk;
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk);
+
+		if (forknum == MAIN_FORKNUM &&
+			matchRnode.spcNode == rnode.spcNode &&
+			matchRnode.dbNode == rnode.dbNode &&
+			matchRnode.relNode == rnode.relNode &&
+			(matchBlock == InvalidBlockNumber || matchBlock == blk))
+			return true;
+	}
+	return false;
+}
+
 /*
  * Calculate the size of a record, split into !FPI and FPI parts.
 */
@@ -767,6 +799,8 @@ usage(void)
 	printf(_("  -b, --bkp-details      output detailed information about backup blocks\n"));
 	printf(_("  -e, --end=RECPTR       stop reading at WAL location RECPTR\n"));
 	printf(_("  -f, --follow           keep retrying after reaching end of WAL\n"));
+	printf(_("  -k, --block=N          only show records matching a given relation block (requires -l)\n"));
+	printf(_("  -l, --relation=N/N/N   only show records that touch a specific relation\n"));
 	printf(_("  -n, --limit=N          number of records to display\n"));
 	printf(_("  -p, --path=PATH        directory in which to find log segment files or a\n"
 			 "                         directory with a ./pg_wal that contains such files\n"));
@@ -802,12 +836,14 @@ main(int argc, char **argv)
 	static struct option long_options[] = {
 		{"bkp-details", no_argument, NULL, 'b'},
+		{"block", required_argument, NULL, 'k'},
 		{"end", required_argument, NULL, 'e'},
 		{"follow", no_argument, NULL, 'f'},
 		{"help", no_argument, NULL, '?'},
 		{"limit", required_argument, NULL, 'n'},
 		{"path", required_argument, NULL, 'p'},
 		{"quiet", no_argument, NULL, 'q'},
+		{"relation", required_argument, NULL, 'l'},
 		{"rmgr", required_argument, NULL, 'r'},
 		{"start", required_argument, NULL,
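The quoted patch does not reach the point where the filter is applied;
presumably the new check is consulted from the main record loop alongside the
existing rmgr/xid filters, along these lines (a sketch using the config
fields and function defined in the patch; the surrounding loop is paraphrased
from pg_waldump rather than quoted):

	/* Sketch: skip records that do not touch the requested relation/block */
	while ((record = XLogReadRecord(xlogreader, &errormsg)) != NULL)
	{
		/* ... existing rmgr/xid filter checks ... */

		if (config.filter_by_relation_enabled &&
			!XLogRecordMatchesRelationBlock(xlogreader,
											config.filter_by_relation,
											config.filter_by_relation_block_enabled ?
											config.filter_by_relation_block :
											InvalidBlockNumber))
			continue;

		/* ... per-record display and stats accumulation ... */
	}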
Re: [PATCH] add relation and block-level filtering to pg_waldump
Added to commitfest as: https://commitfest.postgresql.org/37/3565/
Re: [PATCH] add relation and block-level filtering to pg_waldump
> Cool. I think we can report an error instead of reading wal files,
> if the tablespace, database, or relation is invalid. Is there any
> WAL record that has an invalid tablespace, database, or relation OID?

The only sort of validity check we could do here is range checking for the
underlying data types (which we certainly could/should add if it's known to
never be valid for the underlying types); non-existence of objects is a
no-go, since that depends purely on the WAL range you are looking at and
you'd have to, you know, scan it to see if it existed before marking it as
invalid. :)

Thanks,

David
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Fri, Feb 25, 2022 at 7:33 AM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:

> Thanks for the patch. This is not adding something that users can't do
> right now, but definitely improves the usability of the pg_waldump as
> it avoids external filterings. Also, it can give the stats/info at
> table and block level. So, +1 from my side.

Thanks for the feedback; I will be incorporating most of this into a new
version, with a couple of responses below.

> 3) Crossing 80 char limit

This is neither here nor there, but have we as a project considered
increasing that to something more modern? I know a lot of current projects
consider 132 to be a more reasonable limit. (Will reduce it down to that for
now, but consider this a vote towards increasing that limit.)

> 5) I would also see a need for "filter by FPW" i.e. list all WAL
> records with "FPW".

Yes, that wouldn't be too hard to add to this; I can add it to the next
version. We probably ought to also make the fork number specifiable as well.
Other thoughts could be some wildcard value in the relation part, so
`1234/23456/*` could filter WAL to a specific database only, say, or some
other multiple specifier, like `--block=1234,123456,121234`. (I honestly
consider this to be more advanced than we'd need to support in this patch,
but it probably wouldn't be too hard to add.)

Thanks,

David
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Fri, Feb 25, 2022 at 7:08 AM Japin Li wrote:
>
> On Fri, 25 Feb 2022 at 20:48, David Christensen
> <david.christen...@crunchydata.com> wrote:
> >> Cool. I think we can report an error instead of reading wal files,
> >> if the tablespace, database, or relation is invalid. Is there any
> >> WAL record that has an invalid tablespace, database, or relation OID?
> >
> > The only sort of validity check we could do here is range checking for
> > the underlying data types (which we certainly could/should add if it's
> > known to never be valid for the underlying types);
>
> The invalid OID I said here is such as a negative number or zero; for
> those parameters, we do not need to read the WAL files, since they are
> always invalid.

Agreed. Can add some additional range validation to the parsed values.

David
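Something along these lines, presumably (a sketch only; the error handling
mirrors pg_waldump's existing bad_argument pattern, and the "zero is never
valid" check is per the suggestion above):

	/* Sketch: validate the tbl/db/rel argument while parsing --relation */
	unsigned int	spc,
					db,
					rel;

	if (sscanf(optarg, "%u/%u/%u", &spc, &db, &rel) != 3)
	{
		pg_log_error("invalid relation specification: \"%s\"", optarg);
		goto bad_argument;
	}

	/*
	 * OIDs are unsigned, so sscanf has already rejected negative input;
	 * zero (InvalidOid) was suggested upthread as never valid either.
	 */
	if (spc == 0 || db == 0 || rel == 0)
	{
		pg_log_error("invalid relation specification: \"%s\"", optarg);
		goto bad_argument;
	}

	config.filter_by_relation.spcNode = (Oid) spc;
	config.filter_by_relation.dbNode = (Oid) db;
	config.filter_by_relation.relNode = (Oid) rel;
	config.filter_by_relation_enabled = true;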
Re: [PATCH] add relation and block-level filtering to pg_waldump
Bharath Rupireddy writes:
> On Fri, Feb 25, 2022 at 12:36 AM David Christensen wrote:
>>
>> Greetings,
>>
>> This patch adds the ability to specify a RelFileNode and optional
>> BlockNum to limit output of pg_waldump records to only those which match
>> the given criteria. This should be more performant than
>> `pg_waldump | grep` as well as more reliable given specific variations
>> in output style depending on how the blocks are specified.
>>
>> This currently affects only the main fork, but we could presumably add
>> the option to filter by fork as well, if that is considered useful.
>
> Thanks for the patch. This is not adding something that users can't do
> right now, but definitely improves the usability of the pg_waldump as
> it avoids external filterings. Also, it can give the stats/info at
> table and block level. So, +1 from my side.

Attached is V2 with additional feedback from this email, as well as the
specification of the ForkNumber and FPW as specifiable options.

Best,

David

From 1b04f04317d364006371bdab0db9086f79138b25 Mon Sep 17 00:00:00 2001
From: David Christensen
Date: Fri, 25 Feb 2022 12:52:56 -0600
Subject: [PATCH] Add additional filtering options to pg_waldump

This feature allows you to only output records that are targeting a
specific RelFileNode and optional BlockNumber within this relation,
while specifying which ForkNum you want to filter to.

We also add the independent ability to filter via Full Page Write.
---
 doc/src/sgml/ref/pg_waldump.sgml |  48 +++++++++
 src/bin/pg_waldump/pg_waldump.c  | 128 ++++++++++++++++++-
 2 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
index 5735a161ce..f157175764 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -100,6 +100,44 @@ PostgreSQL documentation
+
+      -k block
+      --block=block
+
+
+       Display only records touching the given block. (Requires also
+       providing the relation via --relation.)
+
+
+
+
+
+      --fork=fork
+
+
+       When using the --relation filter, output only records
+       from the given fork. The valid values here are 0
+       for the main fork, 1 for the Free Space
+       Map, 2 for the Visibility Map,
+       and 3 for the Init fork. If unspecified, defaults
+       to the main fork.
+
+
+
+
+
+      -l tbl/db/rel
+      --relation=tbl/db/rel
+
+
+       Display only records touching the given relation. The relation is
+       specified via tablespace OID, database OID, and relfilenode separated
+       by slashes, for instance, 1234/12345/12345. This
+       is the same format used for relations in the WAL dump output.
+
+
+
      -n limit
      --limit=limit
@@ -183,6 +221,16 @@ PostgreSQL documentation
+
+      -w
+      --fullpage
+
+
+       Filter records to only those which have full page writes.
+
+
+
+
      -x xid
      --xid=xid

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..a527cd4dc6 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -55,6 +55,12 @@ typedef struct XLogDumpConfig
 	bool		filter_by_rmgr_enabled;
 	TransactionId filter_by_xid;
 	bool		filter_by_xid_enabled;
+	RelFileNode filter_by_relation;
+	bool		filter_by_relation_enabled;
+	BlockNumber filter_by_relation_block;
+	bool		filter_by_relation_block_enabled;
+	ForkNumber	filter_by_relation_forknum;
+	bool		filter_by_fpw;
 } XLogDumpConfig;

 typedef struct Stats
@@ -394,6 +400,56 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	return count;
 }

+/*
+ * Boolean to return whether the given WAL record matches a specific
+ * relation and optional block
+ */
+static bool
+XLogRecordMatchesRelationBlock(XLogReaderState *record, RelFileNode matchRnode,
+							   BlockNumber matchBlock, ForkNumber matchFork)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		RelFileNode rnode;
+		ForkNumber	forknum;
+		BlockNumber blk;
+
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk);
+
+		if (forknum == matchFork &&
+			matchRnode.spcNode == rnode.spcNode &&
+			matchRnode.dbNode == rnode.dbNode &&
+			matchRnode.relNode == rnode.relNode &&
+			(matchBlock == InvalidBlockNumber || matchBlock == blk))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Boolean to return whether the given WAL record contains a full page write
+ */
+static bool
+XLogRecordHasFPW(XLogReaderState *record
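(The quoted patch is truncated mid-function here; the body of
XLogRecordHasFPW presumably just walks the block references looking for a
block image — a sketch of the likely shape, not the literal patch text:)

	static bool
	XLogRecordHasFPW(XLogReaderState *record)
	{
		int			block_id;

		for (block_id = 0; block_id <= record->max_block_id; block_id++)
		{
			if (!XLogRecHasBlockRef(record, block_id))
				continue;

			if (XLogRecHasBlockImage(record, block_id))
				return true;
		}

		return false;
	}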
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Sun, Mar 20, 2022 at 11:56 PM Thomas Munro wrote:
> On Mon, Mar 21, 2022 at 4:36 PM Thomas Munro wrote:
> > On Sat, Feb 26, 2022 at 7:58 AM David Christensen wrote:
> > > Attached is V2 with additional feedback from this email, as well as
> > > the specification of the ForkNumber and FPW as specifiable options.
> >
> > Trivial fixup needed after commit 3f1ce973.
>
> [04:30:50.630] pg_waldump.c:963:26: error: format ‘%u’ expects
> argument of type ‘unsigned int *’, but argument 3 has type ‘ForkNumber *’
> [-Werror=format=]
> [04:30:50.630]   963 | if (sscanf(optarg, "%u",
> &config.filter_by_relation_forknum) != 1 ||
> [04:30:50.630]       |                    ~^    ~~~~~~~~~~~~~~~~~~~~~~~~~~
> [04:30:50.630]       |                     |    |
> [04:30:50.630]       |                     |    ForkNumber *
> [04:30:50.630]       |                     unsigned int *
>
> And now that this gets to the CompilerWarnings CI task, it looks like
> GCC doesn't like an enum as a scanf %u destination (I didn't see that
> warning locally when I compiled the above fixup because clearly Clang
> is cool with it...). Probably needs a temporary unsigned int to
> sscanf into first.

Do you need me to fix this, or are you incorporating that into a V4 of this
patch? (Similar to your fixup prior in this thread?)
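The fix being described would look roughly like this (a sketch; the
temporary's name is illustrative):

	/*
	 * Sketch of the suggested fix: scan into a plain unsigned int, then
	 * range-check and assign to the enum.
	 */
	unsigned int forknum;

	if (sscanf(optarg, "%u", &forknum) != 1 || forknum > MAX_FORKNUM)
	{
		pg_log_error("could not parse fork \"%s\"", optarg);
		goto bad_argument;
	}
	config.filter_by_relation_forknum = (ForkNumber) forknum;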
Re: [PATCH] add relation and block-level filtering to pg_waldump
Updated to include the V3 fixes as well as the unsigned int/enum fix.

v4-0001-Add-additional-filtering-options-to-pg_waldump.patch
Description: Binary data
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Mon, Mar 21, 2022 at 4:39 PM Thomas Munro wrote:

[snip]

> I guess you did this because init fork references aren't really
> expected in the WAL, but I think it's more consistent to allow up to
> MAX_FORKNUM, not least because your documentation mentions 3 as a
> valid value. So I adjusted this to allow MAX_FORKNUM. Make sense?

Makes sense, but I think I'd actually thought it was +1 of the max forks, so
you give me more credit than I deserve in this case... :-)

> Here are some more details I noticed, as a likely future user of this
> very handy feature, which I haven't changed, because they seem more
> debatable and you might disagree...
>
> 1. I think it'd be less surprising if the default value for --fork
> wasn't 0... why not show all forks?

Agreed; made it default to all, with the ability to filter down if desired.

> 2. I think it'd be less surprising if --fork without --relation
> either raised an error (like --block without --relation), or were
> allowed, with the meaning "show me this fork of all relations".

Agreed; reworked to support the use case of only showing target forks.

> 3. It seems funny to have no short switch for --fork when everything
> else has one... what about -F?

Good idea; I hadn't seen capitals in the getopt list so didn't consider
them, but I like this.

Enclosed is v6, incorporating these fixes and docs tweaks.

Best,

David

v6-0001-Add-additional-filtering-options-to-pg_waldump.patch
Description: Binary data
Re: [PATCH] pgbench: add multiconnect option
On Sat, Mar 19, 2022 at 11:43 AM Fabien COELHO wrote:
>
> Hi Sami,
>
> > Pgbench is a simple benchmark tool by design, and I wonder if adding
> > a multiconnect feature will cause pgbench to be used incorrectly.
>
> Maybe, but I do not see how it would be worse than what pgbench already
> allows.

I agree that pgbench is simple; perhaps really too simple when it comes to
being able to measure much more than basic query flows. What pgbench does
have in its favor is being distributed with the core distribution.

I think there is definitely space for a more complicated benchmarking tool
that exercises more scenarios and more realistic query patterns and
scenarios. Whether that is distributed with the core is another question.

David
Re: [PATCH] Proof of concept for GUC improvements
> On Mar 21, 2022, at 7:53 PM, Tom Lane wrote:
>
> Andres Freund writes:
>> My impression is that there's not a lot of enthusiasm for the concept? If
>> that's true we maybe ought to mark the CF entry as rejected?
>
> Yeah, I'm kind of leaning that way too. I don't see how we can
> incorporate the symbolic values into any existing display paths
> without breaking applications that expect the old output.
> That being the case, it seems like we'd have "two ways to do it"
> indefinitely, which would add enough confusion that I'm not
> sure there's a net gain. In particular, I foresee novice questions
> along the lines of "I set foo to disabled, why is it showing
> as zero?"

Yeah, my main motivation here was about trying to have fewer special values
in the config files, but I guess it would effectively be making everything
an enum while still relying on knowing just what the specific magic values
are, so not really a net gain in this department.

For the record, I thought this would have a fairly low chance of getting in;
I was mainly curious what level of effort it would take to get something
like this going and wanted some feedback on the actual idea.

> If we'd done it like this from the beginning, it'd have been
> great, but retrofitting it now is a lot less appealing.

Yeah, agreed on this. As far as I'm concerned we can reject.

Thanks,

David
Re: psql \df choose functions by their arguments
> * Removed the tab-complete bit, it was too fragile and unhelpful

I can't speak for the specific patch, but tab completion of proc args for
\df, \ef and friends has long been a desired feature of mine, particularly
when you are dealing with functions with huge numbers of arguments and the
same name, which I have (sadly) come across many times in the wild.

Removing this because it was brittle is fine, but it would be good to see if
we could figure out a way to have this kind of feature in psql IMHO.

Best,

David
Re: [PATCH] Add `truncate` option to subscription commands
Hi,

At this time I do not have time to make the necessary changes for this
commitfest, so I am voluntarily withdrawing this patch, but will revisit at
a future time.

Best,

David

On Wed, Oct 28, 2020 at 1:06 PM Rahila Syed wrote:
>
> Hi David,
>
> The feature seems useful to me. The code will need to be refactored due
> to changes in commit b05fe7b442.
>
> Please see the following comments.
>
> 1. Is there a specific reason behind having a new relstate for truncate?
> The current state flow is
> INIT->DATATSYNC->SYNCWAIT->CATCHUP->SYNCDONE->READY.
> Can we accommodate the truncate in either INIT or DATASYNC?
>
> 2.
> +		StartTransactionCommand();
> +		rel = table_open(MyLogicalRepWorker->relid, RowExclusiveLock);
> +
> +		rels = lappend(rels, rel);
> +		relids = lappend_oid(relids, MyLogicalRepWorker->relid);
> +
> +		ExecuteTruncateGuts(rels, relids, NIL, DROP_RESTRICT, false);
> +		CommitTransactionCommand();
>
> Truncate is being performed in a separate transaction from the data copy;
> I think that leaves a window open for concurrent transactions to modify
> the data after the truncate and before the copy.
>
> 3. Regarding the truncate of the referenced table, I think one approach
> can be to perform the following:
> i. lock the referencing and referenced tables against writes
> ii. drop the foreign key constraints
> iii. truncate
> iv. sync
> v. recreate the constraints
> vi. release lock.
> However, I am not sure of the implications of locking these tables on the
> main apply process.
>
> Thank you,
>
> On Mon, Oct 12, 2020 at 11:31 PM David Christensen wrote:
>>
>> > On Oct 11, 2020, at 10:00 PM, Amit Kapila wrote:
>> >
>> > On Mon, Oct 12, 2020 at 3:44 AM David Christensen wrote:
>> >>
>> >> On Oct 11, 2020, at 1:14 PM, Euler Taveira wrote:
>> >>
>> >> On Fri, 9 Oct 2020 at 15:54, David Christensen wrote:
>> >>>
>> >>> Enclosed find a patch to add a "truncate" option to subscription
>> >>> commands.
>> >>>
>> >>> When adding new tables to a subscription (either via `CREATE
>> >>> SUBSCRIPTION` or `REFRESH PUBLICATION`), tables on the target which
>> >>> are being newly subscribed will be truncated before the data copy
>> >>> step. This saves explicit coordination of a manual `TRUNCATE` on the
>> >>> target tables and allows the results of the initial data sync to be
>> >>> the same as on the publisher at the time of sync.
>> >>>
>> >>> To preserve compatibility with existing behavior, the default value
>> >>> for this parameter is `false`.
>> >>
>> >> Truncate will fail for tables whose foreign keys refer to it. If such
>> >> a feature cannot handle foreign keys, the usefulness will be
>> >> restricted.
>> >>
>> >> This is true for existing "truncate" with FKs, so doesn't seem to be
>> >> any different to me.
>> >
>> > What would happen if there are multiple tables and truncate on only
>> > one of them failed due to an FK check? Does it give an error in such a
>> > case; if so, will the other tables be truncated?
>>
>> Currently each SyncRep relation is sync'd separately in its own worker
>> process; we are doing the truncate at the initialization step of this,
>> so it's inherently in its own transaction. I think if we are going to do
>> any sort of validation on this, it would have to be at the point of the
>> CREATE SUBSCRIPTION/REFRESH PUBLICATION where we have the relation list
>> and can do sanity-checking there.
>>
>> Obviously if someone changes the schema at some point between when it
>> does this and when relation syncs start there is a race condition, but
>> the same issue would affect other data sync things, so I don't care to
>> solve that as part of this patch.
>>
>> >> Hypothetically if you checked all new tables and could verify if
>> >> there were FK cycles only already in the new tables being added then
>> >> "truncate cascade" would be fine. Arguably if they had existing
>> >> tables that were part of an FK that wasn't fully replicated they wer
Re: [PATCH] Proof of concept for GUC improvements
> Hi, > > According to the cfbot, the patch doesn't apply anymore and needs a > rebase: http://cfbot.cputube.org/patch_36_3290.log V4 rebased attached. special-guc-values-v4.patch Description: Binary data
CREATE ROLE IF NOT EXISTS
Greetings -hackers,

Enclosed is a patch that implements CREATE ROLE IF NOT EXISTS (along with
the same support for USER/GROUP). This is a fairly straightforward approach
in that we do no validation of anything other than existence, with the user
needing to ensure that permissions/grants are set up in the proper way.

Comments?

Best,

David

CREATE-ROLE-IF-NOT-EXISTS.patch
Description: Binary data
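For the curious, the core of such an approach in the backend would
presumably follow the usual IF NOT EXISTS pattern inside CreateRole() —
a sketch of that pattern, not the literal patch contents:

	/*
	 * Sketch: short-circuit in CreateRole() when the role already exists
	 * and IF NOT EXISTS was given.
	 */
	if (stmt->if_not_exists &&
		OidIsValid(get_role_oid(stmt->role, true)))	/* missing_ok */
	{
		ereport(NOTICE,
				(errcode(ERRCODE_DUPLICATE_OBJECT),
				 errmsg("role \"%s\" already exists, skipping",
						stmt->role)));
		return InvalidOid;
	}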
Re: CREATE ROLE IF NOT EXISTS
On Tue, Oct 19, 2021 at 4:29 PM Isaac Morland wrote:
> On Tue, 19 Oct 2021 at 16:12, David Christensen
> <david.christen...@crunchydata.com> wrote:
>
>> Greetings -hackers,
>>
>> Enclosed is a patch that implements CREATE ROLE IF NOT EXISTS (along
>> with the same support for USER/GROUP). This is a fairly straightforward
>> approach in that we do no validation of anything other than existence,
>> with the user needing to ensure that permissions/grants are set up in
>> the proper way.
>
> One little tricky aspect that occurs to me is the ALTER ROLE to set the
> role flag options: it really needs to mention *all* the available options
> if it is to leave the role in a specific state regardless of how it
> started out. For example, if the existing role has BYPASSRLS but you want
> the default NOBYPASSRLS you have to say so explicitly.
>
> Because of this, I think my preference, based just on thinking about
> setting the flag options, would be for CREATE OR REPLACE.
>
> However, I'm wondering about the role name options: IN ROLE, ROLE, ADMIN.
> With OR REPLACE should they replace the set of memberships or augment it?
> Either seems potentially problematic to me. By contrast it's absolutely
> clear what IF NOT EXISTS should do with these.
>
> So I'm not sure what I think overall.

Sure, the ambiguity here for merging options was exactly the reason I went
with the IF NOT EXISTS route. Whatever concerns with merging already exist
with ALTER ROLE, so nothing new is introduced by this functionality, at
least that was my original thought.

David
Re: [PATCH] Proof of concept for GUC improvements
> On Nov 3, 2021, at 5:35 AM, Daniel Gustafsson wrote:
>
>>> On 15 Oct 2021, at 23:54, Cary Huang wrote:
>>
>> I scanned through the GUC list and found that the following parameters
>> can potentially be categorized in the "special_disabled0" group, just
>> for your reference.
>
> This patch no longer applies, can you please submit a rebased version?
> Also, do you have any thoughts on Cary's suggestion in the above review?

Hi, enclosed is a v3, which includes the rebase as well as the additional
int GUCs mentioned in Cary's review. I can add support for Reals (required
for the JIT and several other ones), but wanted to get feedback/buy-in on
the overall idea first.

Best,

David

special-guc-values-v3.patch
Description: Binary data
Re: CREATE ROLE IF NOT EXISTS
> This fails the roleattributes test in "make check", with what seems to be
> a trivial change in the output. Can you please submit a rebased version
> fixing the test?

Updated version attached.

David

CREATE-ROLE-IF-NOT-EXISTS-v2.patch
Description: Binary data
Re: CREATE ROLE IF NOT EXISTS
On Mon, Nov 8, 2021 at 1:22 PM Mark Dilger wrote:
>
> On Nov 8, 2021, at 10:38 AM, Stephen Frost wrote:
>
> > I don't quite follow this. The entire point of Alice writing a script
> > that uses IF NOT EXISTS is to have that command not fail if, indeed,
> > that role already exists, but for the rest of the script to be run.
> > That there's some potential attacker with CREATEROLE running around
> > creating roles that they think someone *else* might create is really
> > stretching things to a very questionable level- especially with
> > CREATEROLE where Charlie could just CREATE a new role which is a member
> > of Bob anyway after the fact and then GRANT that role to themselves.
>
> I don't see why this is "stretching things to a very questionable level".
> It might help this discussion if you could provide pseudo-code or similar
> for adding roles which is well-written and secure, and which benefits
> from this syntax. I would expect the amount of locking and checking for
> pre-existing roles that such logic would require would make the IF NOT
> EXIST option useless. Perhaps I'm wrong?

The main motivator for me writing this was trying to increase idempotency
for things like scripting, where you want to be able to minimize the effort
required to get things into a particular state. I agree with Stephen that,
whether or not this is a best-practices approach, this is something that is
being done in the wild via DO blocks or similar, so providing a tool to
handle this better seems useful on its own.

This originally came from me looking into the failures to load certain
`pg_dump` or `pg_dumpall` output when generated with the `--clean` flag,
which necessarily cannot work, as it fails with the error `current user
cannot be dropped`. Not that I am promoting the use of `pg_dumpall
--clean`, as there are clearly better solutions here, but something which
generates unusable output does not seem that useful. Instead, you could
generate `CREATE ROLE IF NOT EXISTS username` statements and emit `ALTER
ROLE ...`, which is what it is already doing (modulo `IF NOT EXISTS`).

This seems to introduce no further security vectors compared to existing
practice in the field and increases utility in some cases, so seems
generally useful to me.

If CINE semantics are at issue, what about CREATE OR REPLACE semantics with
some sort of merge into the existing role? I don't care strongly about
which approach is taken; I just think the overall "make this role exist in
this form" behavior without an error is useful in my own work, and CINE was
easier to implement as a first pass.

Best,

David
Re: CREATE ROLE IF NOT EXISTS
On Tue, Nov 9, 2021 at 9:55 AM Mark Dilger wrote:
>
> On Nov 9, 2021, at 7:36 AM, David Christensen
> <david.christen...@crunchydata.com> wrote:
>
> > If CINE semantics are at issue, what about the CREATE OR REPLACE
> > semantics with some sort of merge into the existing role? I don't care
> > strongly about which approach is taken, just think the overall "make
> > this role exist in this form" without an error is useful in my own
> > work, and CINE was easier to implement as a first pass.
>
> CREATE OR REPLACE might be a better option, not with the "merge into the
> existing role" part, but rather as drop+create. If a malicious actor has
> already added other roles to the role, or created a table with a
> malicious trigger definition, the drop part will fail, which is good from
> a security viewpoint. Of course, the drop portion will also fail under
> other conditions which don't entail any security concerns, but maybe they
> could be addressed in a series of follow-on patches?
>
> I understand this idea is not as useful for creating idempotent scripts,
> but maybe it gets you part of the way there?

Well, CREATE OR REPLACE via just setting the role's attributes explicitly
based on what you passed it could work (not strictly DROP + CREATE, in that
you're keeping existing ownerships, etc., and can avoid cross-db
permissions/ownership checks). Seems like some sort of merge logic could be
in order, as you wouldn't really want to lose existing permissions granted
to a role, but you want to ensure that /at least/ the permissions granted
exist for this role.

David
Re: CREATE ROLE IF NOT EXISTS
On Tue, Nov 9, 2021 at 10:22 AM Stephen Frost wrote:
> Greetings,
>
> * David Christensen (david.christen...@crunchydata.com) wrote:
> > Well, the CREATE OR REPLACE via just setting the role's attributes
> > explicitly based on what you passed it could work (not strictly DROP +
> > CREATE, in that you're keeping existing ownerships, etc, and can avoid
> > cross-db permissions/ownership checks). Seems like some sort of merge
> > logic could be in order, as you wouldn't really want to lose existing
> > permissions granted to a role, but you want to ensure that /at least/
> > the permissions granted exist for this role.
>
> What happens with role attributes that aren't explicitly mentioned
> though? Do those get reset to 'default' or are they left as-is?

Since we have the ability to specify explicit negative options (NOCREATEDB
vs CREATEDB, etc.), I'd say leave them as-is if not specified, otherwise
ensure they match what you included in the command. This would also ensure
forward compatibility if new permissions/attributes were introduced, as we
don't want to explicitly require that all permissions be itemized in order
to use this.

> I suspect that most implementations will end up just explicitly setting
> all of the role attributes, of course, because they want the role to
> look like how it is defined in whatever manifest is declaring the
> role, but we should still think about how we want this to work if we're
> going in this direction.

Agreed.

> In terms of least-surprise, I do tend to think that the answer is "only
> care about what is explicitly put into the command"- that is, if it
> isn't in the CREATE ROLE statement then it gets left as-is. Not sure
> how others feel about that though.

This is also what would make the most sense to me.

David
Re: CREATE ROLE IF NOT EXISTS
Modulo other issues/discussions, here is a version of this patch that
implements CREATE OR REPLACE ROLE just by handing off to AlterRole if it's
determined that the role already exists; presumably any/all additional
considerations would need to be added in both places were there a separate
code path for this.

It might be worth refactoring the AlterRole handling into a helper if there
are any deviations in messages, etc., but this could be a decent approach to
handling the problem (which arguably would have similar
restrictions/requirements in ALTER ROLE itself) in a single location.

Best,

David

CREATE-OR-REPLACE-ROLE.patch
Description: Binary data
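To visualize that hand-off — a hedged sketch only; the field name
(or_replace), the RoleSpec construction, and the AlterRole() signature vary
by branch and may differ from the actual patch:

	/*
	 * Sketch: in CreateRole(), when OR REPLACE was specified and the role
	 * already exists, delegate to the ALTER ROLE code path instead of
	 * erroring.  Names here are illustrative.
	 */
	if (stmt->or_replace &&
		OidIsValid(get_role_oid(stmt->role, true)))	/* missing_ok */
	{
		AlterRoleStmt *astmt = makeNode(AlterRoleStmt);
		RoleSpec   *rolespec = makeNode(RoleSpec);

		rolespec->roletype = ROLESPEC_CSTRING;
		rolespec->rolename = pstrdup(stmt->role);
		rolespec->location = -1;

		astmt->role = rolespec;
		astmt->options = stmt->options;
		astmt->action = +1;		/* add, rather than drop, memberships */

		return AlterRole(astmt);	/* signature differs by branch */
	}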
Re: CREATE ROLE IF NOT EXISTS
On Mon, Nov 22, 2021 at 6:49 AM Daniel Gustafsson wrote:
> > On 10 Nov 2021, at 18:14, David Christensen
> > <david.christen...@crunchydata.com> wrote:
>
> > Modulo other issues/discussions, here is a version of this patch..
>
> This patch fails to compile since you renamed the if_not_exists member in
> CreateRoleStmt but still set it in the parser.

D'oh! Enclosed is a fixed/rebased version.

Best,

David

CREATE-OR-REPLACE-ROLE-v2.patch
Description: Binary data
Documentation patch for ALTER SUBSCRIPTION
Greetings,

Enclosed find a documentation patch that clarifies the behavior of ALTER
SUBSCRIPTION … REFRESH PUBLICATION with new tables; I ran into a situation
today where the docs were not clear that existing tables would not be
re-copied, so this remedies that situation. Should apply back to Pg 10.

Best,

David

0001-Be-explicit-about-the-behavior-of-REFRESH-PUBLICATIO.patch
Description: Binary data

--
David Christensen
Senior Software and Database Engineer
End Point Corporation
da...@endpoint.com
785-727-1171
Re: Adding comments to help understand psql hidden queries
On Thu, Feb 1, 2024 at 4:34 PM Greg Sabino Mullane wrote:
>
> The use of the --echo-hidden flag in psql is used to show people the way
> psql performs its magic for its backslash commands. None of them has more
> magic than "\d relation", but it suffers from needing a lot of separate
> queries to gather all of the information it needs. Unfortunately, those
> queries can get overwhelming and it's hard to figure out which one does
> what, especially for those not already very familiar with the system
> catalogs. Attached is a patch to add a small SQL comment to the top of
> each SELECT query inside describeOneTableDetails. All other functions use
> a single query, and thus need no additional context. But "\d mytable" has
> the potential to run over a dozen SQL queries! The new format looks like
> this:
>
> /******** QUERY *********/
> /* Get information about row-level policies */
> SELECT pol.polname, pol.polpermissive,
>   CASE WHEN pol.polroles = '{0}' THEN NULL ELSE
> pg_catalog.array_to_string(array(select rolname from pg_catalog.pg_roles
> where oid = any (pol.polroles) order by 1),',') END,
>   pg_catalog.pg_get_expr(pol.polqual, pol.polrelid),
>   pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid),
>   CASE pol.polcmd
>     WHEN 'r' THEN 'SELECT'
>     WHEN 'a' THEN 'INSERT'
>     WHEN 'w' THEN 'UPDATE'
>     WHEN 'd' THEN 'DELETE'
>     END AS cmd
> FROM pg_catalog.pg_policy pol
> WHERE pol.polrelid = '134384' ORDER BY 1;
> /*************************/
>
> Cheers,
> Greg

Thanks, this looks like some helpful information. In the same vein, I'm
including a patch which adds information about the command that generates
the given query as well (atop your commit). This will modify the query line
to include the command itself:

/******** QUERY (\dRs) *********/

Best,

David

0001-Add-output-of-the-command-that-got-us-here-to-the-QU.patch
Description: Binary data
Constant Splitting/Refactoring
As part of the reserved page space/page features[1] work, there is a need to
convert some of the existing constants into variables, since the size of
those will no longer be fixed at compile time. At FOSDEM, there was some
interest expressed about seeing what a cleanup patch like that would look
like. Enclosed is a series of cleanup patches for HEAD.

This splits each of the 4 constants that care about page size into
Cluster-specific and -Limit variants, the first intended to become a
variable in time, and the second being the maximum value such a variable may
take (largely used for static allocations). Since these patches define these
symbols to have the same values they previously had, there are no changes in
functionality.

These were largely mechanical changes, and as such we should perhaps
consider making the same changes to back-branches to make it so context
lines and the like would be the same, simplifying maintainers' efforts when
applying code in back branches that touch similar areas. The script I have
to make these changes is simple, and could be run against the back branches
with only the comments surrounding the Calc() pieces needing to be adjusted
once.

These were based on HEAD as of a few minutes ago, 902900b308f, and pass all
tests.

Thanks,

David

[1] https://commitfest.postgresql.org/47/3986/

v1-0002-Split-MaxHeapTupleSize-into-runtime-and-max-value.patch
Description: Binary data

v1-0001-Split-MaxHeapTuplesPerPage-into-runtime-and-max-v.patch
Description: Binary data

v1-0004-Split-MaxTIDsPerBTreePage-into-runtime-and-max-va.patch
Description: Binary data

v1-0003-Split-MaxIndexTuplesPerPage-into-runtime-and-max-.patch
Description: Binary data
Re: Adding comments to help understand psql hidden queries
Hi Jim,

Thanks for the feedback. Enclosed is a v2 of this series, rebased and with
that warning fixed; @Greg Sabino Mullane I just created a commit on your
behalf with the message to hackers. I'm also creating a commitfest entry
for this thread.

Best,

David

v2-0002-Add-output-of-the-command-that-got-us-here-to-the.patch
Description: Binary data

v2-0001-Include-SQL-comments-on-echo-hidden-output.patch
Description: Binary data
Re: Adding comments to help understand psql hidden queries
Created the CF entry in commitfest 48 but didn't see it was already in 47; closing the CFEntry in 48. (Doesn't appear to be a different status than "Withdrawn"...)
Re: Avoiding inadvertent debugging mode for pgbench
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:      tested, passed
Spec compliant:          not tested
Documentation:           tested, passed

Did a quick review of this one; CFbot is now happy, and local regression
tests all pass. I think the idea here is sane; it's particularly confusing
for some tools included in the main distribution to have different
semantics, and this seems low on the breakage risk, so worth the tradeoffs
IMO.

The new status of this patch is: Ready for Committer
Re: Adding comments to help understand psql hidden queries
On Fri, Mar 22, 2024 at 9:47 AM Greg Sabino Mullane wrote:
>
> On Thu, Mar 21, 2024 at 6:20 PM Peter Eisentraut wrote:
>>
>> lines are supposed to align vertically. With your patch, the first line
>> would have variable length depending on the command.
>
> Yes, that is a good point. Aligning those would be quite tricky; what if
> we just kept a standard width for the closing query? Probably the 24
> stars we currently have to match "QUERY", which it appears nobody has
> changed for translation purposes yet anyway. (If I am reading the code
> correctly, it would be up to the translators to maintain the vertical
> alignment).

I think it's easier to keep the widths balanced than constant (patch
version included here), but if we needed to squeeze the opening string to a
standard width that would be possible without too much trouble.

The internal comment strings seem to be a bit more of a pain if we wanted
all of the comments to be the same width, as we'd need a table or something
so we can compute the longest string width, etc.; that doesn't seem worth
the convolutions IMHO.

No changes to Greg's patch, just keeping 'em both so cfbot stays happy.

David

v3-0002-Add-output-of-the-command-that-got-us-here-to-the.patch
Description: Binary data

v3-0001-Include-SQL-comments-on-echo-hidden-output.patch
Description: Binary data
Re: Adding comments to help understand psql hidden queries
I got Greg's blessing on squashing the commits down, and am now including a
v4 with additional improvements on the output formatting front. Main
changes:

- all generated comments are the same width
- width has been bumped to 80
- removed _() functions for consumers of the new output functions

This patch adds two new helper functions, OutputComment() and
OutputCommentStars(), to output and wrap the comments to the appropriate
widths. Everything should continue to work just fine if we want to adjust
the width by just adjusting the MAX_COMMENT_WIDTH symbol.

I removed _() in the output of the query/stars since there'd be no sensible
existing translations for the constructed string, which included the query
string itself. If we need it for the "QUERY" string, this could be added
fairly easily, but the existing piece would have been nonsensical and never
used in practice.

Thanks,

David

v4-0001-Improve-SQL-comments-on-echo-hidden-output.patch
Description: Binary data
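For illustration, a banner helper along these lines — a sketch based only on
the description above; the actual names in the patch are OutputComment() and
OutputCommentStars(), but the body here is a guess:

	#include <stdio.h>
	#include <string.h>

	#define MAX_COMMENT_WIDTH 80

	/*
	 * Sketch: emit a star banner for the given label (such as the QUERY
	 * header shown upthread), padded out to MAX_COMMENT_WIDTH, splitting
	 * the stars evenly around the label.
	 */
	static void
	OutputCommentStars(FILE *out, const char *label)
	{
		/* account for the comment delimiters and spaces around the label */
		int			stars = MAX_COMMENT_WIDTH - (int) strlen(label) - 6;
		int			left = stars / 2;
		int			right = stars - left;

		fprintf(out, "/*");
		for (int i = 0; i < left; i++)
			fputc('*', out);
		fprintf(out, " %s ", label);
		for (int i = 0; i < right; i++)
			fputc('*', out);
		fprintf(out, "*/\n");
	}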
Re: Adding comments to help understand psql hidden queries
On Thu, Apr 4, 2024 at 9:32 AM Peter Eisentraut wrote:
>
> On 03.04.24 19:16, David Christensen wrote:
> > I removed _() in the output of the query/stars since there'd be no
> > sensible existing translations for the constructed string, which
> > included the query string itself. If we need it for the "QUERY"
> > string, this could be added fairly easily, but the existing piece
> > would have been nonsensical and never used in practice.
>
> "QUERY" is currently translated. Your patch loses that.

I see; enclosed is v5 which fixes this. The effective diff from the last
one is:

-    char       *label = "QUERY";
+    char       *label = _("QUERY");

and

-        label = psprintf("QUERY (\\%s)", curcmd);
+        label = psprintf(_("QUERY (\\%s)"), curcmd);

Best,

David

v5-0001-Improve-SQL-comments-on-echo-hidden-output.patch
Description: Binary data
Re: Constant Splitting/Refactoring
This should target PG 18, but that status is not available in the CF app yet, so just making a note here.
Re: Constant Splitting/Refactoring
Here is a version 2 of this patch, rebased atop 97d85be365. As before, this
is a cleanup/prerequisite patch series for the page features/reserved page
size patches[1]. (Said patch series is going to be updated shortly.)

This splits each of the 4 constants that care about page size into
Cluster-specific and -Limit variants, the first intended to become a
variable in time, and the second being the maximum value such a variable may
take (largely used for static allocations). Since these patches define these
symbols to have the same values they previously had, there are no changes in
functionality.

These were largely mechanical changes, and as such we should perhaps
consider making the same changes to back-branches to make it so context
lines and the like would be the same, simplifying maintainers' efforts when
applying code in back branches that touch similar areas. The script I have
to make these changes is simple, and could be run against the back branches
with only the comments surrounding the Calc() pieces needing to be adjusted
once.

Thanks,

David

[1] https://commitfest.postgresql.org/47/3986/

v2-0001-Split-MaxHeapTuplesPerPage-into-runtime-and-max-v.patch
Description: Binary data

v2-0002-Split-MaxHeapTupleSize-into-runtime-and-max-value.patch
Description: Binary data

v2-0004-Split-MaxTIDsPerBTreePage-into-runtime-and-max-va.patch
Description: Binary data

v2-0003-Split-MaxIndexTuplesPerPage-into-runtime-and-max-.patch
Description: Binary data
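To make the renaming pattern concrete, a sketch using MaxHeapTuplesPerPage
as the example (the Calc()/Limit/Cluster spellings follow the description
above; the exact macro bodies in the patches may differ):

	/* Before: a single compile-time constant derived from BLCKSZ. */
	#define MaxHeapTuplesPerPage \
		((int) ((BLCKSZ - SizeOfPageHeaderData) / \
				(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

	/*
	 * After: a Calc() macro parameterized by the usable page size, a
	 * *Limit value for static allocations (the maximum the variable can
	 * ever take), and a Cluster* symbol intended to become a runtime
	 * variable once reserved page space can shrink the usable size.
	 */
	#define CalcMaxHeapTuplesPerPage(pageSize) \
		((int) (((pageSize) - SizeOfPageHeaderData) / \
				(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

	#define MaxHeapTuplesPerPageLimit	CalcMaxHeapTuplesPerPage(BLCKSZ)
	#define ClusterMaxHeapTuplesPerPage	MaxHeapTuplesPerPageLimit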
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Hi Bharath,

I can get one sent in tomorrow.

Thanks,

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier wrote:
>
> On Wed, Dec 14, 2022 at 04:44:34PM -0600, David Christensen wrote:
> > I can get one sent in tomorrow.

This v10 should incorporate your feedback as well as Bharath's.

> -XLogRecordHasFPW(XLogReaderState *record)
> +XLogRecordHasFPI(XLogReaderState *record)
> This still refers to a FPW, so let's leave that out as well as any
> renamings of this kind..

Reverted those changes.

> +	if (config.save_fpi_path != NULL)
> +	{
> +		/* Create the dir if it doesn't exist */
> +		if (pg_mkdir_p(config.save_fpi_path, pg_dir_create_mode) < 0)
> +		{
> +			pg_log_error("could not create output directory \"%s\": %m",
> +						 config.save_fpi_path);
> +			goto bad_argument;
> +		}
> +	}
> It seems to me that you could allow things to work even if the
> directory exists and is empty. See for example
> verify_dir_is_empty_or_create() in pg_basebackup.c.

The `pg_mkdir_p()` supports an existing directory (and I don't think we
want to require it to be empty first), so this only errors when it can't
create a directory for some reason.

> +my $file_re =
> +  qr/^([0-9A-F]{8})-([0-9A-F]{8})[.][0-9]+[.][0-9]+[.][0-9]+[.][0-9]+(?:_vm|_init|_fsm|_main)?$/;
> This is artistic to parse for people not used to regexps (I do, a
> little). Perhaps this could use a small comment with an example of
> name or a reference describing this format?

Added a description of what this is looking for.

> +# Set umask so test directories and files are created with default permissions
> +umask(0077);
> Incorrect copy-paste coming from elsewhere like the TAP tests of
> pg_basebackup with group permissions? Doesn't
> PostgreSQL::Test::Utils::tempdir give already enough protection in
> terms of umask() and permissions?

I'd expect that's where that came from. Removed.

> +	if (config.save_fpi_path != NULL)
> +	{
> +		/* We fsync our output directory only; since these files are not part
> +		 * of the production database we do not require the performance hit
> +		 * that fsyncing every FPI would entail, so are doing this as a
> +		 * compromise. */
> +
> +		fsync_fname(config.save_fpi_path, true);
> +	}
> Why is it necessary to flush anything at all, then?

I personally don't think it is, but added it per Bharath's request. Removed
in this revision.

> +my $relation = $node->safe_psql('postgres',
> +	q{SELECT format('%s/%s/%s', CASE WHEN reltablespace = 0 THEN
> dattablespace ELSE reltablespace END, pg_database.oid,
> pg_relation_filenode(pg_class.oid)) FROM pg_class, pg_database WHERE
> relname = 'test_table' AND datname = current_database()}
> Could you rewrite that to be on multiple lines?

Sure, reformatted.

> +diag "using walfile: $walfile";
> You should avoid the use of "diag", as this would cause extra output
> noise with a simple make check.

Had been using it for debugging and didn't realize it'd cause issues.
Removed both instances.

> +$node->safe_psql('postgres', "SELECT
> pg_drop_replication_slot('regress_pg_waldump_slot')")
> That's not really necessary, the nodes are wiped out at the end of the
> test.

Removed.

> +$node->safe_psql('postgres', <<EOF);
> +SELECT 'init' FROM
> pg_create_physical_replication_slot('regress_pg_waldump_slot', true, false);
> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> +CHECKPOINT; -- required to force FPI for next writes
> +UPDATE test_table SET a = a + 1;
> Using an EOF to execute a multi-line query would be a first. Couldn't
> you use the same thing as anywhere else? 009_twophase.pl just to
> mention one.
> (Mentioned by Bharath upthread, where he asked for an
> extra opinion so here it is.)

Fair enough; while idiomatic perl to me, it's not a hill to die on.
Converted to a standard multiline string.

Best,

David

v10-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-st.patch
Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Dec 21, 2022 at 5:47 AM Bharath Rupireddy wrote:
>
> On Fri, Dec 16, 2022 at 4:47 AM David Christensen wrote:
> >
> > On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier wrote:
> > >
> > > On Wed, Dec 14, 2022 at 04:44:34PM -0600, David Christensen wrote:
> > > > I can get one sent in tomorrow.
> >
> > This v10 should incorporate your feedback as well as Bharath's.
>
> Thanks for the patch. Here're some minor comments:
>
> 1. +my $node = PostgreSQL::Test::Cluster->new('primary');
> Can the name be other than 'primary' because we don't create a standby
> for this test? Something like - 'node_a' or 'node_extract_fpi' or some
> other.

Sure, no issues.

> 2. +$node->init(extra => ['-k'], allows_streaming => 1);
> When enabled with allows_streaming, there are a bunch of things that
> happen to the node while initializing, I don't think we need all of
> them for this.

I think the "allows_streaming" was required to ensure the WAL files were
preserved properly, and was the approach we ended up taking rather than
trying to fail the archive_command or other approaches I'd taken earlier.
I'd rather keep this if we can, unless you can propose a different approach
that would continue to work in the same way.

> 3. +$node->init(extra => ['-k'], allows_streaming => 1);
> Can we use --data-checksums instead of -k for more readability?
> Perhaps, a comment on why we need that option helps greatly.

Yeah, can spell that out; I don't recall exactly why we needed it offhand,
but will confirm or remove it if insignificant.

> 4.
> +						page = (Page) buf.data;
> +
> +						if (!XLogRecHasBlockRef(record, block_id))
> +							continue;
> +
> +						if (!XLogRecHasBlockImage(record, block_id))
> +							continue;
> +
> +						if (!RestoreBlockImage(record, block_id, page))
> +							continue;
> Can you shift page = (Page) buf.data; just before the last if
> condition RestoreBlockImage() so that it doesn't get executed for the
> other two continue statements?

Sure; since it was just setting a pointer value I didn't consider it to be
a hotspot for optimization.

Best,

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Dec 19, 2022 at 12:23 AM Michael Paquier wrote:
>
> On Thu, Dec 15, 2022 at 05:17:46PM -0600, David Christensen wrote:
> > On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier wrote:
> > This v10 should incorporate your feedback as well as Bharath's.
>
> Thanks for the new version. I have minor comments.
>
> >> It seems to me that you could allow things to work even if the
> >> directory exists and is empty. See for example
> >> verify_dir_is_empty_or_create() in pg_basebackup.c.
> >
> > The `pg_mkdir_p()` supports an existing directory (and I don't think
> > we want to require it to be empty first), so this only errors when it
> > can't create a directory for some reason.
>
> Sure, but things can also be made so as we don't fail if the directory
> exists and is empty? This would be more consistent with the base
> directories created by pg_basebackup and initdb.

I guess I'm feeling a little dense here; how is this failing if there is an
existing empty directory?

> >> +$node->safe_psql('postgres', <<EOF);
> >> +SELECT 'init' FROM
> >> pg_create_physical_replication_slot('regress_pg_waldump_slot', true,
> >> false);
> >> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> >> +CHECKPOINT; -- required to force FPI for next writes
> >> +UPDATE test_table SET a = a + 1;
> >> Using an EOF to execute a multi-line query would be a first. Couldn't
> >> you use the same thing as anywhere else? 009_twophase.pl just to
> >> mention one. (Mentioned by Bharath upthread, where he asked for an
> >> extra opinion so here it is.)
> >
> > Fair enough, while idiomatic perl to me, not a hill to die on;
> > converted to a standard multiline string.
>
> By the way, knowing that we have an option called --fullpage, could it
> be better to use --save-fullpage for the option name?

Works for me. I think I'd just wanted to avoid reformatting the entire
usage message, which is why I'd gone with the shorter version.

> +		OPF = fopen(filename, PG_BINARY_W);
> +		if (!OPF)
> +			pg_fatal("couldn't open file for output: %s", filename);
> [..]
> +		if (fwrite(page, BLCKSZ, 1, OPF) != 1)
> +			pg_fatal("couldn't write out complete full page image to file: %s", filename);
> These should be more generic, as of "could not open file \"%s\"" and
> "could not write file \"%s\"" as the file name provides all the
> information about what this writes.

Sure, will update.

Best,

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Fri, Dec 23, 2022 at 12:57 PM David Christensen wrote:
>
> On Wed, Dec 21, 2022 at 5:47 AM Bharath Rupireddy wrote:

[snip]

> > 2. +$node->init(extra => ['-k'], allows_streaming => 1);
> > When enabled with allows_streaming, there are a bunch of things that
> > happen to the node while initializing, I don't think we need all of
> > them for this.
>
> I think the "allows_streaming" was required to ensure the WAL files
> were preserved properly, and was the approach we ended up taking
> rather than trying to fail the archive_command or other approaches I'd
> taken earlier. I'd rather keep this if we can, unless you can propose
> a different approach that would continue to work in the same way.

Confirmed that we needed this in order to create the replication slot, so
this /is/ required for the test to work.

Enclosing v11 with yours and Michael's latest feedback.

Best,

David

v11-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-st.patch
Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
> On Dec 26, 2022, at 1:29 AM, Michael Paquier wrote:
>
> On Sat, Dec 24, 2022 at 06:23:29PM +0530, Bharath Rupireddy wrote:
>> Thanks for the patch. I've made the above change as well as renamed
>> the test file name to be save_fpi.pl, everything else remains the same
>> as v11. Here's the v12 patch which LGTM. I'll mark it as RfC -
>> https://commitfest.postgresql.org/41/3628/.
>
> I have done a review of that, and here are my notes:
> - The variable names were a bit inconsistent, so I have switched most
> of the new code to use "fullpage".
> - The code was not able to handle the case of a target directory
> existing but empty, so I have added a wrapper on pg_check_dir().
> - XLogRecordHasFPW() could be checked directly in the function saving
> the blocks. Still, there is no need for it as we apply the same
> checks again in the inner loop of the routine.
> - The new test has been renamed.
> - RestoreBlockImage() would report a failure and the code would just
> skip it and continue its work. This could point to a compression
> failure for example, so like any code paths calling this routine I
> think that we'd better do a pg_fatal() and fail hard.
> - I did not understand why there is a reason to make this option
> conditional on the record prints or even the stats, so I have moved
> the FPW save routine into a separate code path. The other two could
> be silenced (or not) using --quiet for example, for the same result as
> v12 without impacting the usability of this feature.
> - Few tweaks to the docs, the --help output, the comments and the
> tests.
> - Indentation applied.
>
> Being able to filter the blocks saved using start/end LSNs or just
> --relation is really cool, especially as the file names use the same
> order as what's needed for this option.

Sounds good, definitely along the lines of what I'd originally envisioned.

Thanks,

David
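Pulling the quoted fragments together, the core of the save-fullpage path
presumably looks roughly like this — a sketch assembled from the snippets
reviewed upthread, not the committed code; `config.save_fullpage_path` and
the filename format (which mirrors the test's regexp) are assumptions:

	PGAlignedBlock buf;
	Page		page = buf.data;

	/* Sketch: write each full-page image in the record out as one file */
	for (int block_id = 0; block_id <= record->max_block_id; block_id++)
	{
		RelFileLocator rnode;
		ForkNumber	fork;
		BlockNumber blk;
		char		filename[MAXPGPATH];
		FILE	   *OPF;

		if (!XLogRecHasBlockRef(record, block_id) ||
			!XLogRecHasBlockImage(record, block_id))
			continue;

		if (!RestoreBlockImage(record, block_id, page))
			pg_fatal("%s", record->errormsg_buf);

		XLogRecGetBlockTag(record, block_id, &rnode, &fork, &blk);

		/* e.g. "0000000B-4C93A3F8.1663.16384.16385.0_main" */
		snprintf(filename, MAXPGPATH, "%s/%08X-%08X.%u.%u.%u.%u_%s",
				 config.save_fullpage_path,
				 LSN_FORMAT_ARGS(record->ReadRecPtr),
				 rnode.spcOid, rnode.dbOid, rnode.relNumber, blk,
				 forkNames[fork]);

		OPF = fopen(filename, PG_BINARY_W);
		if (!OPF)
			pg_fatal("could not open file \"%s\": %m", filename);
		if (fwrite(page, BLCKSZ, 1, OPF) != 1)
			pg_fatal("could not write file \"%s\"", filename);
		fclose(OPF);
	}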
Re: Improving btree performance through specializing by key shape, take 2
Hi Matthias,

I'm going to look at this patch series if you're still interested. What was
the status of your final performance testing for the 0008 patch alone vs
the specialization series? Last I saw on the thread you were going to see
if the specialization was required or not.

Best,

David
Re: Kerberos delegation support in libpq and postgres_fdw
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:      tested, passed
Spec compliant:          not tested
Documentation:           not tested

Did a code review pass here; here is some feedback.

+	/* If password was used to connect, make sure it was one provided */
+	if (PQconnectionUsedPassword(conn) && dblink_connstr_has_pw(connstr))
+		return;

Do we need to consider whether these passwords are the same? Is there a
different vector where a different password could be acquired from a
different source (PGPASSWORD, say) while both of these criteria are true?
Seems like it probably doesn't matter that much considering we only checked
the password alone in the previous version of this code.

---

Looks like the pg_gssinfo struct hides the `proxy_creds` def behind:

#if defined(ENABLE_GSS) || defined(ENABLE_SSPI)
typedef struct
{
	gss_buffer_desc outbuf;		/* GSSAPI output token buffer */
#ifdef ENABLE_GSS
...
	bool		proxy_creds;	/* GSSAPI Delegated/proxy credentials */
#endif
} pg_gssinfo;
#endif

Which means that for the later check in `be_gssapi_get_proxy()` we have:

/*
 * Return if GSSAPI delegated/proxy credentials were included on this
 * connection.
 */
bool
be_gssapi_get_proxy(Port *port)
{
	if (!port || !port->gss)
		return NULL;

	return port->gss->proxy_creds;
}

So in theory it'd be possible to have SSPI enabled but GSS disabled and
we'd fail to compile in that case. (It may be that this routine is never
*actually* called in that case, just noting compile-time considerations.)
I'm not seeing guards in the actual PQ* routines, but don't think I've done
an exhaustive search.

---

gss_accept_deleg

+
+        Forward (delegate) GSS credentials to the server. The default is
+        disable which means credentials will not be forwarded
+        to the server. Set this to enable to have
+        credentials forwarded when possible.

When is this not possible? Server policy? External factors?

---

     Only superusers may connect to foreign servers without password
-    authentication, so always specify the password option
-    for user mappings belonging to non-superusers.
+    authentication or using gssapi proxied credentials, so specify the
+    password option for user mappings belonging to
+    non-superusers who are not able to proxy GSSAPI credentials.

s/gssapi/GSSAPI/; this is kind of confusing, as it makes it sound like only
superusers may use GSSAPI proxied credentials, which I disbelieve to be
true. Additionally, it sounds like you're wanting to explicitly maintain a
denylist for users to not be allowed proxying; is that correct?

---

libpq/auth.c:

	if (proxy != NULL)
	{
		pg_store_proxy_credential(proxy);
		port->gss->proxy_creds = true;
	}

Per GSS docs, seems like we should be comparing to GSS_C_NO_CREDENTIAL and
validating that the gflags has the `deleg_flag` bit set before considering
whether there are valid credentials; in practice this might be the same
effect (haven't looked at what that symbol actually resolves to, but NULL
would be sensible). Are there other cases we might need to consider here,
like valid credentials that are expired? (GSS_S_CREDENTIALS_EXPIRED)

---

+	/*
+	 * Set KRB5CCNAME for this backend, so that later calls to gss_acquire_cred
+	 * will find the proxied credentials we stored.
+	 */

So I'm not seeing this in other use in the code; I assume this is just used
by the krb5 libs? Similar questions for the other places pg_gss_accept_deleg
is used.
---

+int
+PQconnectionUsedGSSAPI(const PGconn *conn)
+{
+	if (!conn)
+		return false;
+	if (conn->gssapi_used)
+		return true;
+	else
+		return false;
+}

Micro-gripe: this routine seems like it could be simpler, though the compiler probably has the same thing to say for either, so maybe code clarity is better as written:

int
PQconnectionUsedGSSAPI(const PGconn *conn)
{
	return conn && conn->gssapi_used;
}

---

Anything required for adding meson support? I notice src/test/kerberos has its Makefile updated, but no meson.build files are changed.

---

Two tests in src/test/kerberos/t/001_auth.pl at :535 and :545 have the same test description:

+ 'succeeds with GSS-encrypted access required and hostgssenc hba and credentials not forwarded',

Since the first test has only `gssencmode` defined (so an implicit `gssdeleg` value) and the second has `gssdeleg=disable`, I'd suggest that the test on :545 should have its description updated to add the word "explicitly": 'succeeds with GSS-encrypted access required and hostgssenc hba and credentials
Re: Kerberos delegation support in libpq and postgres_fdw
On Wed, Apr 5, 2023 at 3:30 PM Stephen Frost wrote:
> Greetings,
>
> * David Christensen (david...@pgguru.net) wrote:
> > Did a code review pass here; here is some feedback.
>
> Thanks!
>
> > + /* If password was used to connect, make sure it was one provided */
> > + if (PQconnectionUsedPassword(conn) && dblink_connstr_has_pw(connstr))
> > +     return;
> >
> > Do we need to consider whether these passwords are the same? Is there a different vector where a different password could be acquired from a different source (PGPASSWORD, say) while both of these criteria are true? Seems like it probably doesn't matter that much considering we only checked the password alone in the previous version of this code.
>
> Note that this patch isn't really changing how these checks are being done but more moving them around and allowing a GSSAPI-based approach with credential delegation to also be allowed.
>
> That said, as noted in the comments above dblink_connstr_check():
>
> * For non-superusers, insist that the connstr specify a password, except
> * if GSSAPI credentials have been proxied (and we check that they are used
> * for the connection in dblink_security_check later). This prevents a
> * password or GSSAPI credentials from being picked up from .pgpass, a
> * service file, the environment, etc. We don't want the postgres user's
> * passwords or Kerberos credentials to be accessible to non-superusers.
>
> The point of these checks is, indeed, to ensure that environmental values such as a .pgpass or environment variable don't end up getting picked up and used (or, if they do, we realize it post-connection and then throw away the connection).
>
> libpq does explicitly prefer to use the password passed in as part of the connection string and won't attempt to look up passwords in a .pgpass file or similar if a password has been included in the connection string.

The (manufactured) case I was thinking of was when we connected to a backend with one password but the dblink or postgres_fdw connection includes an explicit password to a different server. But now I'm thinking that this PQconnectionUsedPassword() is checking the outgoing connection for dblink itself, not the connection of the backend that connected to the main server, so I think this objection is moot, like you say.

> > Looks like the pg_gssinfo struct hides the `proxy_creds` def behind:
> >
> > #if defined(ENABLE_GSS) || defined(ENABLE_SSPI)
> > typedef struct
> > {
> > 	gss_buffer_desc outbuf;		/* GSSAPI output token buffer */
> > #ifdef ENABLE_GSS
> > ...
> > 	bool		proxy_creds;	/* GSSAPI Delegated/proxy credentials */
> > #endif
> > } pg_gssinfo;
> > #endif
>
> ... right, proxy_creds only exists (today anyway) if ENABLE_GSS is set.
>
> > Which means that for the later check in `be_gssapi_get_proxy()` we have: [analysis snipped]
>
> Fairly confident the analysis here is wrong; further, the cfbot seems to agree that there isn't a compile failure here:
>
> https://cirrus-ci.com/task/6589717672624128
>
> [20:19:15.985] gss: NO
>
> (we always build with SSPI on Windows, per src/include/port/win32_port.h).

Cool; since we have coverage for that case, it seems like my concern was unwarranted.

[snip]

> > Only superusers may connect to foreign servers without password
> > -authentication, so always specify the password option
> > -for user mappings belonging to non-superusers.
> > +authentication or using gssapi proxied credentials, so specify the
> > +password option for user mappings belonging to
> > +non-superusers who are not able to proxy GSSAPI credentials.
> >
> > s/gssapi/GSSAPI/; this is kind of confusing, as it makes it sound like only superusers may use GSSAPI proxied credentials, which I disbelieve to be true. Additionally, it sounds like you're wanting to explicitly maintain a denylist of users who are not allowed proxying; is that correct?
>
> Updated to GSSAPI and reworded in the updated patch (attached). Certainly open to suggestions on how to improve the documentation here.
>
> There is no 'denylist' for users when it comes to GSSAPI proxied credentials. If there's a use-case for that then it could be added in the future.

Okay, I think your revisions here seem more clear, thanks.

> > ---
> >
> > libpq/auth.c:
> >
> > if (proxy != NULL)
> >
Re: Kerberos delegation support in libpq and postgres_fdw
Reviewed v8; largely looking good, though I notice this hunk, which may arguably be a bug fix, but doesn't appear to be relevant to this specific patch, so could probably be debated independently (and if a bug, should probably be backpatched):

diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 4229d2048c..11d41979c6 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -288,6 +288,9 @@ InitPgFdwOptions(void)
 		{"sslcert", UserMappingRelationId, true},
 		{"sslkey", UserMappingRelationId, true},
 
+		/* gssencmode is also libpq option, same to above. */
+		{"gssencmode", UserMappingRelationId, true},
+
 		{NULL, InvalidOid, false}
 	};

That said, should "gssdeleg" be exposed as a user mapping option as well? (This shows up in postgres_fdw; not sure if there are other places that would be relevant, like in dblink somewhere, just a thought.)

Best,

David
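If we did want that, it would presumably just be one more entry in the same table (untested sketch, following the pattern of the hunk above):

+		/* gssdeleg is also a libpq option, same as above. */
+		{"gssdeleg", UserMappingRelationId, true},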
Re: Kerberos delegation support in libpq and postgres_fdw
Ok, based on the interdiff there, I'm happy with that last change. Marking as Ready For Committer. Best, David
Re: [PATCH] add relation and block-level filtering to pg_waldump
> On Mar 24, 2022, at 6:43 AM, Thomas Munro wrote: > > On Fri, Mar 25, 2022 at 12:26 AM Thomas Munro wrote: >>> On Fri, Mar 25, 2022 at 12:01 AM Peter Eisentraut >>> wrote: >>> Or even: Why are we exposing fork *numbers* in the user interface? >>> Even low-level tools such as pageinspect use fork *names* in their >>> interface. >> >> I wondered about that but thought it seemed OK for such a low level >> tool. It's a fair point though, especially if other low level tools >> are doing that. Here's a patch to change it. > > Oh, and there's already a name lookup function to use for this. +1 on the semantic names. David
Re: [PATCH] add relation and block-level filtering to pg_waldump
> On Mar 24, 2022, at 4:12 PM, Thomas Munro wrote: > > On Fri, Mar 25, 2022 at 1:43 AM David Christensen > wrote: >>>> On Mar 24, 2022, at 6:43 AM, Thomas Munro wrote: >>> On Fri, Mar 25, 2022 at 12:26 AM Thomas Munro >>> wrote: >>>>> On Fri, Mar 25, 2022 at 12:01 AM Peter Eisentraut >>>>> wrote: >>>>> Or even: Why are we exposing fork *numbers* in the user interface? >>>>> Even low-level tools such as pageinspect use fork *names* in their >>>>> interface. >>>> >>>> I wondered about that but thought it seemed OK for such a low level >>>> tool. It's a fair point though, especially if other low level tools >>>> are doing that. Here's a patch to change it. >>> >>> Oh, and there's already a name lookup function to use for this. >> >> +1 on the semantic names. > > Cool. > > I had another thought while changing that (and also re-alphabetising): > Why don't we switch to -B for --block and -R for --relation? I > gather you used -k and -l because -b and -r were already taken, but > since we already started using upper case for -F, it seems consistent > this way. Or were they chosen for consistency with something else? Works here; was just trying to get semi-memorable ones from the available lowercase ones, but I like your idea here, and it kind of puts them in the same mental space for remembering. > It's also slightly more helpful to a user if the help says > --relation=T/D/R instead of N/N/N (TS/DB/REL would be nicer but > doesn't fit in the space). 0001-Improve-command-line-switches-in-pg_waldump-option.patch Description: Binary data
[PATCH] Teach pg_waldump to extract FPIs from the WAL
Hi -hackers,

Enclosed is a patch to allow extraction/saving of FPI from the WAL stream via pg_waldump. Description from the commit:

Extracts full-page images from the WAL stream into a target directory, which must be empty or not exist. These images are subject to the same filtering rules as normal display in pg_waldump, which means that you can isolate the full page writes to a target relation, among other things.

Files are saved with a filename formatted to make things somewhat sortable; for instance:

-01C0.1663.1.6117.0
-01000150.1664.0.6115.0
-010001E0.1664.0.6114.0
-01000270.1663.1.6116.0
-01000300.1663.1.6113.0
-01000390.1663.1.6112.0
-01000420.1663.1.8903.0
-010004B0.1663.1.8902.0
-01000540.1663.1.6111.0
-010005D0.1663.1.6110.0

It's noteworthy that the raw images do not have the current LSN stored with them in the WAL stream (as would be true for on-heap versions of the blocks), nor would the checksum be valid in them (though WAL itself has checksums, so there is some protection there). This patch chooses to place the LSN and calculate the proper checksum (if non-zero in the source image) in the output block. (This could perhaps be a targeted flag if we decide we don't always want this.)

These images could be loaded/inspected via `pg_read_binary_file()` and used in the `pageinspect` suite of tools to perform detailed analysis on the pages in question, based on historical information, and may come in handy for forensics work.

Best,

David

v1-pg_waldump-save-fpi.patch
Description: Binary data
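For clarity, the fix-up applied to each extracted block amounts to roughly the following (a simplified sketch of the patch's approach; `blk` here is the block number taken from the record's block reference):

	/* make the page look as it would after replaying this record */
	PageSetLSN(page, record->ReadRecPtr);

	/*
	 * A non-zero pd_checksum means the source cluster had checksums
	 * enabled, so recompute it to match the LSN we just stamped.
	 */
	if (((PageHeader) page)->pd_checksum != 0)
		((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blk);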
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Sat, Apr 23, 2022 at 9:49 AM Matthias van de Meent wrote: > Regardless of my (lack of) opinion on the inclusion of this patch in > PG (I did not significantly review this patch); I noticed that you do > not yet identify the 'fork' of the FPI in the file name. > > A lack of fork identifier in the exported file names would make > debugging much more difficult due to the relatively difficult to > identify data contained in !main forks, so I think this oversight > should be fixed, be it through `_forkname` postfix like normal fork > segments, or be it through `.` numerical in- or postfix in > the filename. > > -Matthias Hi Matthias, great point. Enclosed is a revised version of the patch that adds the fork identifier to the end if it's a non-main fork. Best, David v2-pg_waldump-save-fpi.patch Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 1:11 AM Michael Paquier wrote: > > On Sat, Apr 23, 2022 at 01:43:36PM -0500, David Christensen wrote: > > Hi Matthias, great point. Enclosed is a revised version of the patch > > that adds the fork identifier to the end if it's a non-main fork. > > Like Alvaro, I have seen cases where this would have been really > handy. So +1 from me, as well, to have more tooling like what you are > proposing. Fine for me to use one file for each block with a name > like what you are suggesting for each one of them. > > + /* we accept an empty existing directory */ > + if (stat(config.save_fpw_path, &st) == 0 && S_ISDIR(st.st_mode)) > + { > I don't think that there is any need to rely on a new logic if there > is already some code in place able to do the same work. See > verify_dir_is_empty_or_create() in pg_basebackup.c, as one example, > that relies on pg_check_dir(). I think that you'd better rely at > least on what pgcheckdir.c offers. Yeah, though I am tending towards what another user had suggested and just accepting an existing directory rather than requiring it to be empty, so thinking I might just skip this one. (Will review the file you've pointed out to see if there is a relevant function though.) > + {"raw-fpi", required_argument, NULL, 'W'}, > I think that we'd better rename this option. "fpi", that is not used > much in the user-facing docs, is additionally not adapted when we have > an other option called -w/--fullpage. I can think of > --save-fullpage. Makes sense. > + PageSetLSN(page, record->ReadRecPtr); > + /* if checksum field is non-zero then we have checksums > enabled, > +* so recalculate the checksum with new LSN (yes, this is > a hack) > +*/ > Yeah, that looks like a hack, but putting in place a page on a cluster > that has checksums enabled would be more annoying with > zero_damaged_pages enabled if we don't do that, so that's fine by me > as-is. Perhaps you should mention that FPWs don't have their > pd_checksum updated when written. Can make a mention; you thinking just a comment in the code is sufficient, or want there to be user-visible docs as well? > + /* we will now extract the fullpage image from the XLogRecord and save > +* it to a calculated filename */ > The format of this comment is incorrect. > > +The LSN of the record with this block, formatted > +as %08x-%08X instead of the > +conventional %X/%X due to filesystem naming > +limits > The last part of the sentence about %X/%X could just be removed. That > could be confusing, at worse. Makes sense. > + PageSetLSN(page, record->ReadRecPtr); > Why is pd_lsn set? This was to make the page as extracted equivalent to the effect of applying the WAL record block on replay (the LSN and checksum both); since recovery calls this function to mark the LSN where the page came from this is why I did this as well. (I do see perhaps a case for --raw output that doesn't munge the page whatsoever, just made comparisons easier.) > git diff --check complains a bit. Can look into this. > This stuff should include some tests. With --end, the tests can > be cheap. Yeah, makes sense, will include some in the next version. David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 2:00 AM Drouvot, Bertrand wrote: > > Hi, > > On 4/25/22 8:11 AM, Michael Paquier wrote: > > On Sat, Apr 23, 2022 at 01:43:36PM -0500, David Christensen wrote: > >> Hi Matthias, great point. Enclosed is a revised version of the patch > >> that adds the fork identifier to the end if it's a non-main fork. > > Like Alvaro, I have seen cases where this would have been really > > handy. So +1 from me, as well, to have more tooling like what you are > > proposing. > > +1 on the idea. > > FWIW, there is an extension doing this [1] but having the feature > included in pg_waldump would be great. Cool, glad to see there is some interest; definitely some overlap in forensics inside and outside the database both, as there are different use cases for both. David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 6:03 AM Bharath Rupireddy wrote:
> Thanks for working on this. I'm just thinking if we can use these FPIs to repair the corrupted pages? I would like to understand more detailed usages of the FPIs other than inspecting with pageinspect.

My main use case was for being able to look at potential corruption, either in the WAL stream, on heap, or in tools associated with the WAL stream. I suppose you could use the page images to replace corrupted on-disk pages (and in fact I think I've heard of a tool or two that try to do that), though I don't know that I consider this the primary purpose (and having toast tables and the like, as well as clog, would make it potentially hard to just drop in a page version without issues). Might help in extreme situations though.

> Given that others have realistic use-cases (of course I would like to know more about those), +1 for the idea. However, I would suggest adding a function to extract raw FPI data to the pg_walinspect extension that got recently committed in PG 15, the output of which can directly be fed to pageinspect functions or

Yeah, makes sense to have some overlap here; will review what is there and see if there is some shared code base we can utilize. (ISTR some work towards getting these two tools using more of the same code, and this seems like another such instance.)

> Few comments:
> 1) I think it's good to mention the stored file name format.
> + printf(_(" -W, --raw-fpi=path save found full page images to given path\n"));

+1, though I've also thought there could be uses for multiple possible output formats here (most immediately, there may be cases where we want *each* FPI for a block vs. the *latest*, so file names with/without the LSN component seem the easiest way forward here). That would introduce some additional complexity though, so might need to see if others think that makes any sense.

> 2)
> + for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> + {
> + /* we will now extract the fullpage image from the XLogRecord and save
> + * it to a calculated filename */
> +
> + if (XLogRecHasBlockImage(record, block_id))
>
> I think we need XLogRecHasBlockRef to be true before checking XLogRecHasBlockImage; otherwise, we will see some build farms failing. Recently I've seen this failure for pg_walinspect:
>
> for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> {
> if (!XLogRecHasBlockRef(record, block_id))
> continue;
>
> if (XLogRecHasBlockImage(record, block_id))
> *fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
> }

Good point; my previous patch that got committed here (127aea2a65) probably also needed this treatment.

> 3) Please correct the commenting format:
> + /* we will now extract the fullpage image from the XLogRecord and save
> + * it to a calculated filename */

Ack.

> 4) Usually we start errors with lower-case letters: "could not ..."
> + pg_fatal("Couldn't open file for output: %s", filename);
> + pg_fatal("Couldn't write out complete FPI to file: %s", filename);
> And the variable name too:
> + FILE *OPF;

Ack.

> 5) Not sure how the FPIs of TOASTed tables get stored, but it would be good to check.

What would be different here? Are there issues you can think of, or just more from the pageinspect side of things?

> 6) Good to specify the known usages of FPIs in the documentation.

Ack. Prob good to get additional info/use cases from others, as mine is fairly short. :-)

> 7) Isn't it good to emit an error if RestoreBlockImage returns false?
> + if (RestoreBlockImage(record, block_id, page))
> + {

Ack.

> 8) I think I don't mind if a non-empty directory is specified - IMO better usability is this: if the directory is non-empty, just go add the FPI files; if an FPI file exists, just replace it; if the directory doesn't exist, create it and write the FPI files.
> + /* we accept an empty existing directory */
> + if (stat(config.save_fpw_path, &st) == 0 && S_ISDIR(st.st_mode))
> + {

Agreed; I was mainly trying to prevent accidental expansion inside `pg_wal` when an earlier version of the patch implied `.` as the current dir with an optional path, but I've since made the path non-optional and agree that this is unnecessarily restrictive.

> 9) Instead of following:
> + if (XLogRecordHasFPW(xlogreader_state))
> + XLogRecordSaveFPWs(xlogreader_state, config.save_fpw_path);
> I will just do this in XLogRecordSaveFPWs:
> for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> {
> if (!XLogRecHasBlockRef(record, block_id))
> continue;
>
> if (XLogRecHasBlockImage(record, block_id))
> {
>
> }
> }

Yeah, a little redundant.

> 10) Along with pg_pwrite(), can we also fsync the files (of course users can choose it optionally) so that the writes will be durable for the OS crashes?

Can add; you thinking a separate flag to disable this with default true?
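Something like the following is what I'd have in mind after each image is written out (sketch only; the flag name is invented):

	/* optionally make the extracted image durable across OS crashes */
	if (config.fsync_fpi)		/* hypothetical flag, default true */
		fsync_fname(filename, false);	/* reports errors itself */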
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 9:54 PM Michael Paquier wrote: > On Mon, Apr 25, 2022 at 10:11:10AM -0500, David Christensen wrote: > > On Mon, Apr 25, 2022 at 1:11 AM Michael Paquier wrote: > >> I don't think that there is any need to rely on a new logic if there > >> is already some code in place able to do the same work. See > >> verify_dir_is_empty_or_create() in pg_basebackup.c, as one example, > >> that relies on pg_check_dir(). I think that you'd better rely at > >> least on what pgcheckdir.c offers. > > > > Yeah, though I am tending towards what another user had suggested and > > just accepting an existing directory rather than requiring it to be > > empty, so thinking I might just skip this one. (Will review the file > > you've pointed out to see if there is a relevant function though.) > > OK. FWIW, pg_check_dir() is used in initdb and pg_basebackup because > these care about the behavior to use when specifying a target path. > You could reuse it, but use a different policy depending on its > returned result for the needs of what you see fit in this case. I have a new version of the patch (pending tests) that uses pg_check_dir's return value to handle things appropriately, so at least some code reuse now. It did end up simplifying a lot. > >> + PageSetLSN(page, record->ReadRecPtr); > >> + /* if checksum field is non-zero then we have > >> checksums enabled, > >> +* so recalculate the checksum with new LSN (yes, this > >> is a hack) > >> +*/ > >> Yeah, that looks like a hack, but putting in place a page on a cluster > >> that has checksums enabled would be more annoying with > >> zero_damaged_pages enabled if we don't do that, so that's fine by me > >> as-is. Perhaps you should mention that FPWs don't have their > >> pd_checksum updated when written. > > > > Can make a mention; you thinking just a comment in the code is > > sufficient, or want there to be user-visible docs as well? > > I was thinking about a comment, at least. New patch version has significantly more comments. > > This was to make the page as extracted equivalent to the effect of > > applying the WAL record block on replay (the LSN and checksum both); > > since recovery calls this function to mark the LSN where the page came > > from this is why I did this as well. (I do see perhaps a case for > > --raw output that doesn't munge the page whatsoever, just made > > comparisons easier.) > > Hm. Okay. The argument goes both ways, I guess, depending on what we > want to do with those raw pages. Still you should not need pd_lsn if > the point is to be able to stick the pages back in place to attempt to > get back as much data as possible when loading it back to the shared > buffers? Yeah, I can see that too; I think there's at least enough of an argument for a flag to apply the fixups or just extract only the raw page pre-modification. Not sure which should be the "default" behavior; either `--raw` or `--fixup-metadata` or something could work. (Naming suggestions welcomed.) David
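P.S. For reference, the directory handling now looks roughly like this (a sketch of the approach, not the final code):

	switch (pg_check_dir(config.save_fpw_path))
	{
		case 0:
			/* directory does not exist: create it */
			if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
				pg_fatal("could not create directory \"%s\": %m",
						 config.save_fpw_path);
			break;
		case -1:
			/* stat() failed */
			pg_fatal("could not access directory \"%s\": %m",
					 config.save_fpw_path);
			break;
		default:
			/* exists, empty or not: accept it as-is */
			break;
	}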
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 9:42 PM Michael Paquier wrote:
> On Mon, Apr 25, 2022 at 10:24:52AM -0500, David Christensen wrote:
> > On Mon, Apr 25, 2022 at 6:03 AM Bharath Rupireddy wrote:
> >> Thanks for working on this. I'm just thinking if we can use these FPIs to repair the corrupted pages? I would like to understand more detailed usages of the FPIs other than inspecting with pageinspect.
> >
> > My main use case was for being able to look at potential corruption, either in the WAL stream, on heap, or in tools associated with the WAL stream. I suppose you could use the page images to replace corrupted on-disk pages (and in fact I think I've heard of a tool or two that try to do that), though I don't know that I consider this the primary purpose (and having toast tables and the like, as well as clog, would make it potentially hard to just drop in a page version without issues). Might help in extreme situations though.
>
> You could do a bunch of things with those images, even make things worse if you are not careful enough.

True. :-) This does seem like a tool geared towards "expert mode", so maybe we just assume if you need it you know what you're doing?

> >> 10) Along with pg_pwrite(), can we also fsync the files (of course users can choose it optionally) so that the writes will be durable for the OS crashes?
> >
> > Can add; you thinking a separate flag to disable this with default true?
>
> We expect data generated by tools like pg_dump, pg_receivewal (depending on the use of --synchronous) or pg_basebackup to be consistent when we exit from the call. FWIW, flushing this data does not seem like a strong requirement for something aimed at being used for page-level chirurgy or lookups, because the WAL segments should still be around even if the host holding the archives is unplugged.

I have added the fsync to the latest patch (forthcoming), but I have no strong preferences here as to the correct/expected behavior.

Best,

David
Re: Moving forward with TDE [PATCH v3]
On Tue, Oct 31, 2023 at 4:30 PM Bruce Momjian wrote: > On Tue, Oct 31, 2023 at 04:23:17PM -0500, David Christensen wrote: > > Greetings, > > > > I am including an updated version of this patch series; it has been > rebased > > onto 6ec62b7799 and reworked somewhat. > > > > The patches are as follows: > > > > 0001 - doc updates > > 0002 - Basic key management and cipher support > > 0003 - Backend-related changes to support heap encryption > > 0004 - modifications to bin tools and programs to manage key rotation > and add > > other knowledge > > 0005 - Encrypted/authenticated WAL > > > > These are very broad strokes at this point and should be split up a bit > more to > > make things more granular and easier to review, but I wanted to get this > update > > out. > > This lacks temp table file encryption, right? Temporary /files/ are handled in a different patch set and are not included here (not sure of the status of integrating at this point). I believe that this patch should handle temporary heap files just fine since they go through the Page API.
Re: Moving forward with TDE [PATCH v3]
Hi, thanks for the detailed feedback here. I do think it's worth addressing the question Stephen raised as far as what we use for the IV[1]; whether LSN or something else entirely, and if so what. The choice of LSN here is fairly fundamental to the existing implementation, so if we decide to do something different some of this might be moot. Best, David [1] https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg154613.html
Re: Moving forward with TDE [PATCH v3]
On Fri, Nov 3, 2023 at 9:53 PM Andres Freund wrote:
> On 2023-11-02 19:32:28 -0700, Andres Freund wrote:
> > > From 327e86d52be1df8de9c3a324cb06b85ba5db9604 Mon Sep 17 00:00:00 2001
> > > From: David Christensen
> > > Date: Fri, 29 Sep 2023 15:16:00 -0400
> > > Subject: [PATCH v3 5/5] Add encrypted/authenticated WAL
> > >
> > > When using an encrypted cluster, we need to ensure that the WAL is also encrypted. While we could go with a page-based approach, we use instead a per-record approach, using GCM for the encryption method and storing the AuthTag in the xl_crc field.
>
> What was the reason for this decision?

This was mainly to prevent IV reuse by using per-record encryption rather than per-page, since partial writes out of the WAL buffer would result in reuse there. This was somewhat of an experiment, since authenticated data per record was basically equivalent in function to the CRC. There was a switch here so normal clusters use the crc field with the existing CRC implementation; only encrypted clusters use this alternate approach.
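For illustration, the per-record encrypt path is conceptually something like the following (an untested sketch using OpenSSL's EVP API; the function and its parameters are invented for exposition and this is not the patch's code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/*
 * Sketch only: encrypt a WAL record payload in place with AES-128-GCM,
 * deriving the IV from the record's LSN (unique per record, which is the
 * point of the per-record scheme) and truncating the 16-byte auth tag to
 * the 4 bytes available in xl_crc.  Note a 4-byte tag is far weaker than
 * GCM's full tag; it is playing the same role the CRC did.
 */
static bool
encrypt_wal_record(EVP_CIPHER_CTX *ctx, const unsigned char *key,
				   uint64_t lsn, unsigned char *data, int len,
				   uint32_t *xl_crc_out)
{
	unsigned char iv[12] = {0};	/* GCM's default 96-bit IV */
	unsigned char tag[16];
	int			outlen;

	memcpy(iv, &lsn, sizeof(lsn));	/* could also mix in timeline ID */

	if (EVP_EncryptInit_ex(ctx, EVP_aes_128_gcm(), NULL, key, iv) != 1)
		return false;
	if (EVP_EncryptUpdate(ctx, data, &outlen, data, len) != 1)
		return false;
	if (EVP_EncryptFinal_ex(ctx, data + outlen, &outlen) != 1)
		return false;
	if (EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, sizeof(tag), tag) != 1)
		return false;

	memcpy(xl_crc_out, tag, sizeof(*xl_crc_out));	/* truncated tag */
	return true;
}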
Re: Moving forward with TDE [PATCH v3]
On Tue, Nov 7, 2023 at 6:47 PM Andres Freund wrote:
> Hi,
>
> On 2023-11-06 11:26:44 +0100, Matthias van de Meent wrote:
> > On Sat, 4 Nov 2023 at 03:38, Andres Freund wrote:
> > > On 2023-11-02 22:09:40 +0100, Matthias van de Meent wrote:
> > > > I'm quite surprised at the significant number of changes being made outside the core storage manager files. I thought that changing out mdsmgr with an encrypted smgr (that could wrap mdsmgr if so desired) would be the most obvious change to implement cluster-wide encryption with the least code touched, as relations don't need to know whether the files they're writing are encrypted, right? Is there a reason to not implement this at the smgr level that I overlooked in the documentation of these patches?
> > >
> > > You can't really implement encryption transparently inside an smgr without significant downsides. You need a way to store an initialization vector associated with the page (or you can store that elsewhere, but then you've doubled the worst-case amount of random reads/writes). The patch uses the LSN as the IV (which I doubt is a good idea). For authenticated encryption further additional storage space is required.
> >
> > I am unaware of any user of the smgr API that doesn't also use the buffer cache, and thus implicitly the Page layout with PageHeader [^1]
>
> Everything indeed uses a PageHeader - but there are a number of places that do *not* utilize pd_lower/upper/special. E.g. visibilitymap.c just assumes that those fields are zero - and changing that wouldn't be trivial / free, because we do a lot of bitmasking/shifting with constants derived from
>
> #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
>
> which obviously wouldn't be constant anymore if you could reserve space on the page.

While no longer constants, I was able to get this working with variable values in a way that did not have the overhead of the original patch for the vismap specifically, using Montgomery multiplication for division/mod. This was actually the heaviest of the changes from moving to runtime-calculated values, so we might be able to use this approach in this specific case even if only this change is required for this specific fork.

> > The API of smgr is also tailored to page-sized quanta of data with mostly relation-level information. I don't see why there would be a veil covering the layout of Page for smgr when all other information already points to the use of PageHeader and Page layouts. In my view, it would even make sense to allow the smgr to get exclusive access to some part of the page in the current Page layout.
> >
> > Yes, I agree that there will be an impact on usable page size if you want authenticated encryption, and that AMs will indeed need to account for storage space now being used by the smgr - inconvenient, but it serves a purpose. That would happen regardless of whether smgr or some higher system decides where to store the data for encryption - as long as it is on the page, the AM effectively can't use those bytes.
> > But I'd say that's best solved by making the Page documentation and PageInit API explicit about the potential use of that space by the chosen storage method (encrypted, plain, ...)
instead of requiring the > > various AMs to manually consider encryption when using Postgres' APIs > > for writing data to disk without hitting shared buffers; page space > > management is already a task of AMs, but handling the actual > > encryption is not. > > I don't particularly disagree with any detail here - but to me reserving > space > for nonces etc at PageInit() time pretty much is the opposite of handling > encryption inside smgr. > Originally, I was anticipating that we might want different space amounts reserved on different classes of pages (apart from encryption), so while we'd be storing the default page reserved size in pg_control we'd not be limited to this in the structure of the page calls. We could presumably just move the logic into PageInit() itself if every reserved allocation is the same and individual call sites wouldn't need to know about it. The call sites do have more context as to the requirements of the page or the "type" of page in play, which if we made it dependent on page type would need to get passed in somehow, which was where the reserved_page_size parameter came in to the current patch. > > > Should the AM really care whether the data on disk is encrypted or > > not? I don't think so. When the disk contains encrypted bytes, but > > smgrread() and smgrwrite() both produce and accept plaintext data, > > who's going to complain? Requiring AMs to be mindful about encryption > > on all common paths only adds pitfalls where encryption would be > > forgotten by the developer of AMs in one path or another. > > I agree with that - I think the way the
Re: Moving forward with TDE [PATCH v3]
On Tue, Nov 7, 2023 at 5:49 PM Andres Freund wrote:
> Hi,
>
> On 2023-11-06 09:56:37 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > I still am quite unconvinced that using the LSN as a nonce is a good design decision.
> >
> > This is a really important part of the overall path to moving this forward, so I wanted to jump to it and have a specific discussion around this. I agree that there are downsides to using the LSN, some of which we could possibly address (eg: include the timeline ID in the IV), but others that would be harder to deal with.
> >
> > The question then is- what's the alternative?
> >
> > One approach would be to change the page format to include space for an explicit nonce. I don't see the community accepting such a new page format as the only format we support though as that would mean no pg_upgrade support along with wasted space if TDE isn't being used.
>
> Right.

Hmm, if we /were/ to introduce some sort of page format change, couldn't that be a use case for modifying the pd_version field? Could v4 pages be read in and written out as v5 pages with different interpretations?

> > Ideally, we'd instead be able to support multiple page formats where users could decide when they create their cluster what features they want- and luckily we even have such an effort underway with patches posted for review [1].
>
> I think there are some details wrong with that patch - IMO the existing macros should just continue to work as-is and instead the places that want the more narrow definition should be moved to the new macros and it changes places that should continue to use compile time constants - but it doesn't seem like a fundamentally bad idea to me. I certainly like it much better than making the page size runtime configurable.

There had been some discussion about this WRT renaming macros and the like (constants, etc.); I think a new pass eliminating the variable-blocksize pieces and seeing if we can minimize churn here is worthwhile. I will take a look and see what the minimally-viable set of changes is here.

> (I'll try to reply with the above points to [1])
>
> > Certainly, with the base page-special-feature patch, we could have an option for users to choose that they want a better nonce than the LSN, or we could bundle that assumption in with, say, the authenticated-encryption feature (if you want authenticated encryption, then you want more from the encryption system than the basics, and therefore we presume you also want a better nonce than the LSN).
>
> I don't think we should support using the LSN as a nonce if we have an alternative. The cost and complexity overhead is just not worth it. Yes, it'll be harder for users to migrate to encryption, but adding complexity elsewhere in the system to get an inferior result isn't worth it.

From my read, XTS (which I'd see as inferior to authenticated encryption, but better than some other options) could use the LSN as an IV without leakage concerns, perhaps mixing in the BlockNumber as well. If we are going to allow multiple encryption types, I think we may need to consider that the needs for IVs may be different, so this may need to be something that is selectable per encryption type.
I am unclear how much of a requirement this is, but it seems like having a design that supports making this pluggable (even if only via a static internal lookup table of encryption type, block length, IV source, etc.) is the most future-proof approach, if we ever had to retire an encryption method or prevent creation of specific methods, say.

> > Another approach would be a separate fork, but that then has a number of downsides too- every write has to touch that too, and a single page of nonces would cover a pretty large number of pages also.
>
> Yea, the costs of doing so is nontrivial. If you were trying to implement encryption on the smgr level - which I doubt you should but am not certain about - my suggestion would be to interleave pages with metadata like nonces and AEAD with the data pages. I.e. one metadata page would be followed by (BLCKSZ - SizeOfPageHeaderData) / (sizeof(nonce) + sizeof(AEAD)) pages containing actual relation data. That way you still get decent locality during scans and writes.

Hmm, this is actually an interesting idea, I will think about this a bit.

> Relation forks were a mistake, we shouldn't use them in more places.
>
> I think it'd be much better if we also encrypted forks, rather than just the main fork...

I believe the existing code should just work by modifying the PageNeedsToBeEncrypted macro; I will test that and see if anything blows up.

David
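To make the pluggability idea concrete, the kind of static internal table I have in mind would be roughly the following (pure sketch; none of these names exist in any patch):

#include <stdint.h>

/* one entry per supported encryption method (sketch only) */
typedef struct EncryptionMethod
{
	const char *name;			/* e.g. "aes-128-xts", "aes-128-gcm" */
	int			key_len;		/* key length in bytes */
	int			iv_source;		/* how the IV/tweak is derived: LSN, ... */
	int			page_overhead;	/* per-page bytes reserved (0 for XTS) */
} EncryptionMethod;

/* e.g. a 16-byte XTS tweak built from LSN plus block identity */
typedef struct XtsTweak
{
	uint64_t	lsn;			/* page LSN */
	uint32_t	blkno;			/* block number within the relation */
	uint32_t	fork;			/* fork number, or zero padding */
} XtsTweak;						/* 16 bytes, the XTS tweak width */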
Re: [PATCHES] Post-special page storage TDE support
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost wrote:
> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > > From: David Christensen
> > > Date: Tue, 9 May 2023 16:56:15 -0500
> > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> > >
> > > This space is reserved for extended data on the Page structure which will be ultimately used for encrypted data, extended checksums, and potentially other things. This data appears at the end of the Page, after any `pd_special` area, and will be calculated at runtime based on specific ControlFile features.
> > >
> > > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as we will require logical replication to move data into a cluster with different settings here.
> >
> > The first part of the last paragraph makes it sound like pg_upgrade won't be supported across this commit, rather than just between different settings...

Yeah, that's vague, but you picked up on what I meant.

> > I think as a whole this is not an insane idea. A few comments:

Thanks for all the feedback!

> > - Why is it worth sacrificing space on every page to indicate which features were enabled? I think there'd need to be some convincing reasons for introducing such overhead.
>
> In conversations with folks (my memory specifically is a discussion with Peter G, added to CC, and my apologies to Peter if I'm misremembering) there was a pretty strong push that a page should be able to 'stand alone' and not depend on something else (eg: pg_control, or whatever) to provide the info needed to be able to interpret the page. For my part, I don't have a particularly strong feeling on that, but that's what led to this design.

Unsurprisingly, I agree that it's useful to keep these features on the page itself; from a forensic standpoint it seems much easier to interpret what is happening, and it would also allow you to have different features on a given page or type of page depending on need. The initial patch utilizes pg_control to store the cluster page features, but there's no reason it couldn't be dependent on fork/page type or stored in pg_tablespace to utilize different features.

Thanks,

David
Re: [PATCHES] Post-special page storage TDE support
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund wrote:
> Hi,
>
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will be ultimately used for encrypted data, extended checksums, and potentially other things. This data appears at the end of the Page, after any `pd_special` area, and will be calculated at runtime based on specific ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as we will require logical replication to move data into a cluster with different settings here.
>
> The first part of the last paragraph makes it sound like pg_upgrade won't be supported across this commit, rather than just between different settings...

Thanks for the review.

> I think as a whole this is not an insane idea. A few comments:
>
> - IMO the patch touches many places it shouldn't need to touch, because of essentially renaming a lot of existing macro names to *Limit, necessitating modifying a lot of users. I think instead the few places that care about the runtime limit should be modified.
>
> As-is the patch would cause a lot of fallout in extensions that just do things like defining an on-stack array of Datums or such - even though all they'd need is to change the define to the *Limit one.
>
> Even leaving extensions aside, it just makes reviewing (and I'm sure maintaining) the patch very tedious.

You make a good point, and I think you're right that we could teach the places that care about runtime vs. compile-time differences about the changes while leaving other callers alone. The *Limit ones were introduced since we need constant values here from the Calc...() macros, but we could try keeping the existing *Limit with the old name and switching things around. I suspect there will be the same amount of code churn, but less mechanical.

> - I'm a bit worried about how the extra special page will be managed - if there are multiple features that want to use it, who gets to put their data at what offset?
>
> After writing this I saw that 0002 tries to address this - but I don't like the design. It introduces runtime overhead that seems likely to be visible.

Agreed this could be optimized.

> - Checking for features using PageGetFeatureOffset() seems the wrong design to me - instead of a branch for some feature being disabled, perfectly predictable for the CPU, we need to do an external function call every time to figure out that yes, checksums are *still* disabled.

This is probably not a supported approach (it felt a little icky), but I'd played around with const pointers to structs of const elements, where the initial values of a global var were populated early on (so set once and never changed post-init), and the compiler didn't complain and things seemed to work OK; not sure if this approach might help balance the early mutability and constant lookup needs:

typedef struct PageFeatureOffsets
{
	const Size feature0offset;
	const Size feature1offset;
	...
} PageFeatureOffsets;

PageFeatureOffsets offsets = {0};
const PageFeatureOffsets *exposedOffsets = &offsets;

void
InitOffsets()
{
	*((Size *) &offsets.feature0offset) = ...;
	*((Size *) &offsets.feature1offset) = ...;
	...
}

> - Recomputing offsets every time in PageGetFeatureOffset() seems too expensive. The offsets can't change while running, as PageGetFeatureOffset() doesn't have enough information to distinguish between different kinds of relations

Yes, this was a simple approach for ease of implementation; there is certainly a way to precompute a lookup table from the page feature bitmask into the offsets themselves, or otherwise precompute, turn the function call into an inline/macro, etc.

> - so why do we need to recompute offsets on every single page? I'd instead add a distinct offset variable for each feature.

This would work iff there is a single page feature set across all pages in the cluster; I'm not sure we don't want more flexibility here.

> - Modifying every single PageInit() call doesn't make sense to me. That'll just create a lot of breakage for - as far as I can tell - no win.

This was a placeholder to allow different features depending on page type; to keep things simple for now I just used the same values
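For the lookup-table variant, the shape I have in mind is roughly the following (illustrative sketch; the feature count and sizes are invented), which reduces PageGetFeatureOffset() to a pair of array indexes:

#define MAX_PAGE_FEATURES	2	/* illustrative value */

/* reserved size of each feature's area, allocated from the page's end */
static const Size feature_size[MAX_PAGE_FEATURES] = {8, 16};

/* offset of each feature, for every possible feature bitmask */
static Size feature_offset[1 << MAX_PAGE_FEATURES][MAX_PAGE_FEATURES];

void
InitPageFeatureOffsets(void)
{
	for (int mask = 0; mask < (1 << MAX_PAGE_FEATURES); mask++)
	{
		Size		offset = BLCKSZ;

		for (int f = MAX_PAGE_FEATURES - 1; f >= 0; f--)
		{
			if (mask & (1 << f))
			{
				offset -= feature_size[f];
				feature_offset[mask][f] = offset;
			}
			else
				feature_offset[mask][f] = 0;	/* feature absent */
		}
	}
}

/* PageGetFeatureOffset(page, f) then becomes feature_offset[mask][f] */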
Re: [PATCHES] Post-special page storage TDE support
On Mon, Nov 13, 2023 at 2:27 PM Andres Freund wrote: > Hi, > > On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote: > > On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost wrote: > > > In conversations with folks (my memory specifically is a discussion > with > > > Peter G, added to CC, and my apologies to Peter if I'm misremembering) > > > there was a pretty strong push that a page should be able to 'stand > > > alone' and not depend on something else (eg: pg_control, or whatever) > to > > > provide info needed be able to interpret the page. For my part, I > don't > > > have a particularly strong feeling on that, but that's what lead to > this > > > design. > > > > The term that I have used in the past is "self-contained". Meaning > > capable of being decoded more or less as-is, without any metadata, by > > tools like pg_filedump. > > I'm not finding that very convincing - without cluster wide data, like > keys, a > tool like pg_filedump isn't going to be able to do much with encrypted > pages. Given the need to look at some global data, figuring out the offset > at > which data starts based on a value in pg_control isn't meaningfully worse > than > having the data on each page. > > Storing redundant data in each page header, when we've wanted space in the > page header for plenty other things, just doesn't seem a good use of said > space. > This scheme would open up space per page that would now be available for plenty of other things; the encoding in the header and the corresponding available space in the footer would seem to open up quite a few options now, no?
Re: [PATCHES] Post-special page storage TDE support
On Mon, Nov 13, 2023 at 2:52 PM Andres Freund wrote:
>
> > This scheme would open up space per page that would now be available for plenty of other things; the encoding in the header and the corresponding available space in the footer would seem to open up quite a few options now, no?
>
> Sure, if you're willing to rewrite the whole cluster to upgrade and willing to permanently sacrifice some data density. If the stored data is actually specific to the page - that is the place to put the data. If not, then the tradeoff is much more complicated IMO.
>
> Of course this isn't a new problem - storing the page size on each page was just silly, it's never going to change across the cluster and even more definitely not going to change within a single relation.

Crazy idea: since the stored pagesize is already a fixed cost that likely isn't going away, what if, rather than using the pd_checksum field, we instead reinterpret pd_pagesize_version? 4 would mean "no page features", but anything 5 or higher could be looked up as an external page feature set, with storage semantics outside of the realm of the page itself (other than what the page features code itself needs to know); i.e., move away from the on-page bitmap into a more abstract representation of features, which could be something along the lines of what you were suggesting, including extension support.

It seems like this could also support adding/removing features on page read/write as long as there was sufficient space in the reserved_page space: read the old feature set on page read, convert to the new feature set, which will write out the page with the additional/changed format. Obviously there would be bookkeeping to be done in terms of making sure all pages had been converted from one format to another, but at the page level this would be straightforward.

Just thinking aloud here...

David
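Roughly, for the reading side, something like this (a sketch; these names are invented and nothing like this exists yet):

#define PG_PAGE_BASE_LAYOUT_VERSION 4

/*
 * Layout version 4 means "no page features"; treat anything higher as an
 * identifier into an externally-stored page feature set.
 */
static inline uint16
PageGetFeatureSetId(Page page)
{
	uint16		version = PageGetPageLayoutVersion(page);

	return (version <= PG_PAGE_BASE_LAYOUT_VERSION) ? 0 :
		version - PG_PAGE_BASE_LAYOUT_VERSION;
}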
Re: [PATCHES] Post-special page storage TDE support
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund wrote:
> Hi,
>
> - IMO the patch touches many places it shouldn't need to touch, because of essentially renaming a lot of existing macro names to *Limit, necessitating modifying a lot of users. I think instead the few places that care about the runtime limit should be modified.
>
> As-is the patch would cause a lot of fallout in extensions that just do things like defining an on-stack array of Datums or such - even though all they'd need is to change the define to the *Limit one.
>
> Even leaving extensions aside, it just makes reviewing (and I'm sure maintaining) the patch very tedious.

Hi Andres et al,

So I've been looking at alternate approaches to this issue and considering how to reduce churn, and I think we still need the *Limit variants. Let's take a simple example: just looking at MaxHeapTuplesPerPage and breaking down instances in the code, loosely partitioning into whether it's used as an array index or other usage (this doesn't discriminate between code and comments, unfortunately), we get the following breakdown:

$ git grep -hoE [[]?MaxHeapTuplesPerPage | sort | uniq -c
  18 [MaxHeapTuplesPerPage
  51 MaxHeapTuplesPerPage

That is 18 places where we would need to adjust, in a fairly mechanical fashion, to use MaxHeapTuplesPerPageLimit instead of MaxHeapTuplesPerPage, vs. 51 other uses, some significant fraction of which are non-comment; even if you assumed half were in comments, there would presumably need to be some sort of adjustment in verbiage, since we are going to be changing some of the interpretation.

I am working on a patch to clean up some of the assumptions that smgr currently makes about its space usage and how the individual access methods consider it, as they should only be calculating things based on how much space is available after smgr is done with it. That has traditionally been BLCKSZ - SizeOfPageHeaderData, but this patch (included) factors that out into a single expression that we can now use in access methods, so we can then reserve additional page space and not need to adjust the access methods further. Building on top of this patch, we'd define something like this to handle the #defines that need to be dynamic:

extern Size reserved_page_space;

#define PageUsableSpace (BLCKSZ - SizeOfPageHeaderData - reserved_page_space)
#define MaxHeapTuplesPerPage CalcMaxHeapTuplesPerPage(PageUsableSpace)
#define MaxHeapTuplesPerPageLimit CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)
#define CalcMaxHeapTuplesPerPage(freesize) \
	((int) ((freesize) / \
			(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

In my view, extensions that are expecting to need no changes when it comes to changing how these are interpreted are better off needing to change only the static allocation in a mechanical sense than to revisit any other uses of code; this seems more likely to guarantee a correct result than if you exceed the page space and start overwriting things you weren't intending to because you're not aware that you need to check for dynamic limits on your own.

Take another thing which would need adjusting for reserving page space, MaxHeapTupleSize:

$ git grep -ohE '[[]?MaxHeapTupleSize' | sort | uniq -c
   3 [MaxHeapTupleSize
  16 MaxHeapTupleSize

Here there are 3 static arrays which would need to be adjusted vs. 16 other instances.
If we kept the MaxHeapTupleSize interpretation the same and didn't adjust an extension, it would compile just fine, but with too large a length compared to the smaller PageUsableSpace, so you could conceivably overwrite into the reserved space depending on what you were doing. (Since by definition reserved_page_space >= 0, PageUsableSpace will always be <= BLCKSZ - SizeOfPageHeaderData, so any expression using it as a basis will be smaller.)

In short, I think the approach I took originally actually will reduce errors out-of-core, and while there is still churn, it is necessary churn. I can produce a second patch which implements this calc/limit atop this first one as well.

Thanks,

David

0001-Create-PageUsableSpace-to-represent-space-post-smgr.patch
Description: Binary data
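To illustrate the mechanical nature of the extension-side fix under this scheme, a static allocation changes like so:

/* before: no longer a compile-time constant once the limit is dynamic */
Datum		values[MaxHeapTuplesPerPage];

/* after: sized to the compile-time upper bound, always sufficient */
Datum		values[MaxHeapTuplesPerPageLimit];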
Re: [PATCHES] Post-special page storage TDE support
Hi again!

Per some offline discussion with Stephen, I've continued to work on some modifications here; this particular patchset is intended to facilitate review by highlighting the mechanical nature of many of these changes. As such, I have taken the following approach to this rework:

0001 - Create PageUsableSpace to represent space post-smgr
0002 - Add support for fast, non-division-based div/mod algorithms
0003 - Use fastdiv code in visibility map
0004 - Make PageUsableSpace incorporate variable-sized limit
0005 - Add Calc, Limit and Dynamic forms of all variable constants
0006 - Split MaxHeapTuplesPerPage into Limit and Dynamic variants
0007 - Split MaxIndexTuplesPerPage into Limit and Dynamic variants
0008 - Split MaxHeapTupleSize into Limit and Dynamic variants
0009 - Split MaxTIDsPerBTreePage into Limit and Dynamic variants

0001 - 0003 have appeared in this thread or in other forms on the list already, though 0001 refactors things slightly more aggressively, and adds a StaticAssert() to ensure that this change is still sane.

0004 adds the ReservedPageSpace variable, and also redefines the previous BLCKSZ - SizeOfPageHeaderData as PageUsableSpaceMax; there are a few related fixups.

0005 adds the macros to compute the former constants while leaving their original definitions to evaluate to the same place (the infamous Calc* and *Limit, plus we invite *Dynamic to the party as well; the names are terrible and there must be something better).

0006 - 0009 all take the same approach: we undefine the old constant name and modify the existing uses of the symbol to be either the *Limit or *Dynamic form, depending on whether the changed available space would impact the calculations. Since we are touching every use of the symbol, this facilitates review of the impact, though I would contend that almost every piece I've spot-checked seems like it really does need to know about the runtime limit.

Perhaps there is more we could do here. I could also see a variable per constant rather than recalculating this every time, in which case the *Dynamic would just be the variable and we'd need a hook to initialize it or otherwise set it on first use.

There are a number of additional things remaining to be done to get this to fully work, but I did want to get some of this out there for review.
Still to do (almost all in some form in the original patch, so I just need to extract the relevant pieces):

- set reserved-page-size via initdb
- load reserved-page-size from pg_control
- apply to the running cluster
- some form of compatibility for these constants in common and ensuring bin/ works
- some toast-related changes (this requires a patch to support dynamic relopts, which I can extract, as the existing code is using a constant lookup table)
- probably some more pieces I'm forgetting

Thanks,

David

v2-0005-Add-Calc-Limit-and-Dynamic-forms-of-all-variable-.patch
Description: Binary data
v2-0001-Create-PageUsableSpace-to-represent-space-post-sm.patch
Description: Binary data
v2-0002-Add-support-for-fast-non-division-based-div-mod-a.patch
Description: Binary data
v2-0003-Use-fastdiv-code-in-visibility-map.patch
Description: Binary data
v2-0004-Make-PageUsableSpace-incorporate-variable-sized-l.patch
Description: Binary data
v2-0007-Split-MaxIndexTuplesPerPage-into-Limit-and-Dynami.patch
Description: Binary data
v2-0008-Split-MaxHeapTupleSize-into-Limit-and-Dynamic-var.patch
Description: Binary data
v2-0006-Split-MaxHeapTuplesPerPage-into-Limit-and-Dynamic.patch
Description: Binary data
v2-0009-Split-MaxTIDsPerBTreePage-into-Limit-and-Dynamic-.patch
Description: Binary data
Re: Initdb-time block size specification
Enclosed are TPC-H results for 1GB shared_buffers, 64MB work_mem on a 64GB laptop with SSD storage; everything else is default settings. TL;DR: unpatched version: 17.30 seconds, patched version: 17.15 seconds; there are some slight variations in runtime, but this seems to be within the noise level at this point.

tpch.unpatched.rst
Description: Binary data
tpch.patched.rst
Description: Binary data
Re: Initdb-time block size specification
> + * pg_fastmod - calculates the modulus of a 32-bit number against a constant
> + * divisor without using the division operator
> + */
> +static inline uint32 pg_fastmod(uint32 n, uint32 divisor, uint64 fastinv)
> +{
> +#ifdef HAVE_INT128
> +	uint64_t lowbits = fastinv * n;
> +	return ((uint128) lowbits * divisor) >> 64;
> +#else
> +	return n % divisor;
> +#endif
> +}
>
> Requiring 128-bit arithmetic to avoid serious regression is a non-starter as written. Software that relies on fast 128-bit multiplication has to do backflips to get that working on multiple platforms. But I'm not sure it's necessary -- if the max block number is UINT32_MAX and max block size is UINT16_MAX, can't we just use 64-bit multiplication?

I was definitely hand-waving additional implementation here for non-native 128-bit support; the modulus algorithm as presented requires 4 times the space of the divisor, so a uint16 implementation should work for all 64-bit machines. Certainly open to other ideas or implementations; this was the one I was able to find initially. If the 16-bit approach is all that is needed in practice, we can also see about narrowing the domain and not worry about making this a general-purpose function.

Best,

David
Re: Initdb-time block size specification
> I was definitely hand-waving additional implementation here for
> non-native 128-bit support; the modulus algorithm as presented
> requires an intermediate type four times the width of the divisor,
> so a uint16 implementation should work for all 64-bit machines.
> Certainly open to other ideas or implementations; this was the one
> I was able to find initially. If the 16-bit approach is all that is
> needed in practice, we can also see about narrowing the domain and
> not worry about making this a general-purpose function.

Here's a patch atop the series which converts to 16-bit uints and passes regressions, but which I don't consider well-vetted at this point.

David

0001-Switch-to-using-only-16-bit-to-avoid-128-bit-require.patch Description: Binary data
Re: Initdb-time block size specification
On Tue, Sep 5, 2023 at 9:04 AM Robert Haas wrote:
>
> On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra wrote:
> > Except that no one is forcing you to actually go 130mph or 32mph, right?
> > You make it seem like this patch forces people to use some other page
> > size, but that's clearly not what it's doing - it gives you the option
> > to use smaller or larger blocks, if you choose to. Just like increasing
> > the speed limit to 130mph doesn't mean you can't keep going 65mph.
> >
> > The thing is - we *already* allow using a different block size, except
> > that you have to do a custom build. This just makes it easier.
>
> Right. Which is worth doing if it doesn't hurt performance and is
> likely to be useful to a lot of people, and is not worth doing if it
> will hurt performance and be useful to relatively few people. My bet
> is on the latter.

Agreed that this doesn't make sense if there are major performance regressions; however, the goal here is patch evaluation: measuring this against other workloads and seeing if that is the case. From my localized testing, things were within acceptable noise levels with the latest version.

I agree with Tomas' earlier thoughts: we already allow different block sizes, and if there are baked-in algorithmic assumptions about block size (which there probably are), then identifying those, and the places in the code where we need additional work or test coverage, will only improve things overall for those non-standard block sizes.

Best,

David
Re: Initdb-time block size specification
On Tue, Sep 5, 2023 at 2:52 PM Hannu Krosing wrote:
>
> Something I also asked at this year's Unconference - do we currently
> have Build Farm animals testing with different page sizes?
>
> I'd say that testing all sizes from 4KB up (so 4, 8, 16, 32) should be
> done at least before each release if not continuously.
>
> -- Cheers
>
> Hannu

The regression tests currently have a lot of breakage when running against non-standard block sizes, so I would assume the answer here is no. I would expect that we will want to add regression test variants or otherwise normalize results so they will work with differing block sizes, but we have not done that yet.
Re: DELETE CASCADE
On Sun, Jan 30, 2022 at 9:47 PM Julien Rouhaud wrote:
> Hi,
>
> On Tue, Jan 25, 2022 at 10:26:53PM +0800, Julien Rouhaud wrote:
> >
> > It's been almost 4 months since your last email, and almost 2 weeks since
> > the notice that this patch doesn't apply anymore. Without an update in the
> > next couple of days this patch will be closed as Returned with Feedback
> > per the commitfest rules.
>
> Closed.

Sounds good; when I get time to look at this again I will resubmit (if people think the base functionality is worth it, which is still a topic of discussion).

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Fri, Nov 4, 2022 at 1:38 PM Justin Pryzby wrote:
> > > As I recall, that's due to relying on "cp". And "rsync", which
> > > shouldn't be assumed to exist by regression tests.

Will poke around other TAP tests to see if there's a more consistent interface, what perl version we can assume, which modules are available, etc. If there's not some trivial wrapper at this point that all TAP tests could use regardless of OS, it would definitely be good to introduce such a method.

> > I will work on supporting Windows compatibility here. Is there some
> > list of guidelines for what you can and can't use? I don't have a Windows
> > machine available to develop on.
>
> I think a lot (most?) of the developers here don't have a windows environment
> available, so have been using cirrusci's tests to verify. If you
> haven't used cirrusci directly (not via cfbot) before, start at:
> src/tools/ci/README

Thanks, good starting point.

> > Was it failing on windows? I was attempting to skip it as I recall.
>
> I don't see anything about skipping, and cirrus's logs from 2
> commitfests ago were pruned. I looked at this patch earlier this year,
> but never got around to replacing the calls to rsync and cp.

Ah, it's skipped (not fixed) in my git repo; I just never got around to submitting that version through email. That explains it.

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Hi Justin et al, Enclosed is v5 of this patch which now passes the CirrusCI checks for all supported OSes. I went ahead and reworked the test a bit so it's a little more amenable to the OS-agnostic approach for testing. Best, David v5-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
Re: [PATCHES] Post-special page storage TDE support
Looking into some CF bot failures which didn't show up locally. Will send a v3 when resolved.
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Tue, Nov 8, 2022 at 4:45 PM Justin Pryzby wrote:
>
> On Mon, Nov 07, 2022 at 05:01:01PM -0600, David Christensen wrote:
> > Hi Justin et al,
> >
> > Enclosed is v5 of this patch which now passes the CirrusCI checks for
> > all supported OSes. I went ahead and reworked the test a bit so it's a
> > little more amenable to the OS-agnostic approach for testing.
>
> Great, thanks.
>
> This includes the changes that I'd started a few months ago.
> Plus adding the test which was missing for meson.

Cool, will review, thanks.

> +format:
> LSN.TSOID.DBOID.RELNODE.BLKNO
>
> I'd prefer if the abbreviations were "reltablespace" and "datoid"

Sure, no issues there.

> Also, should the test case call pg_relation_filenode() rather than using
> relfilenode directly? Is it a problem that the test code assumes
> pagesize=8192?

Both good points. Is pagesize just exposed via `current_setting('block_size')` or is there a different approach?

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Enclosed is v6, which squashes your refactor and adds the additional recent suggestions. Thanks! v6-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 9, 2022 at 6:30 AM Bharath Rupireddy wrote:
>
> On Wed, Nov 9, 2022 at 5:08 AM David Christensen wrote:
> >
> > Enclosed is v6, which squashes your refactor and adds the additional
> > recent suggestions.
>
> Thanks for working on this feature. Here are some comments for now. I
> haven't looked at the tests; I will take another look at the code and
> tests after these and all other comments are addressed.
>
> 1. For ease of review, please split the test patch to 0002.

Per later discussion, it seems like new feature tests are fine in the same patch, yes?

> 2. I'm unable to understand the use-case for the --fixup-fpi option.
> pg_waldump is supposed to be just a WAL reader, and must not return any
> modified information; with the --fixup-fpi option, the patch violates this
> principle, i.e., it sets the page LSN and returns. Without actually
> replaying the WAL record on the page, how is it correct to just set
> the LSN? How will it be useful? ISTM, we must ignore this option
> unless there's a strong use-case.

How I was envisioning this was for cases like extreme surgery for corrupted pages, where you extract the page from WAL but it has lsn and checksum set so you could do something like `dd if=fixup-block of=relation ...`, so it *simulates* the replay of said full-page blocks in cases where for some reason you can't replay the intermediate records; since this is always a full-page block, it's capturing what would be the snapshot, so you could manually insert it somewhere as needed without needing to replay (say if dealing with an incomplete or corrupted WAL stream).

> 3.
> +if (fork >= 0 && fork <= MAX_FORKNUM)
> +{
> +	if (fork)
> +		sprintf(forkname, "_%s", forkNames[fork]);
> +	else
> +		forkname[0] = 0;
> +}
> +else
> +	pg_fatal("found invalid fork number: %u", fork);
>
> Isn't the above complex? What's the problem with something like below?
> Why do we need the if (fork) - else block?
>
> if (fork >= 0 && fork <= MAX_FORKNUM)
> 	sprintf(forkname, "_%s", forkNames[fork]);
> else
> 	pg_fatal("found invalid fork number: %u", fork);

This was to suppress any suffix for main forks, but yes, could simplify and include the `_` in the suffix name. Will include such a change (a rough sketch follows at the end of this message).

> 3.
> +char	page[BLCKSZ] = {0};
> I think when writing to a file, we need PGAlignedBlock rather than a
> simple char array of bytes; see the description around PGAlignedBlock
> for why it is so.

Easy enough change, and makes sense.

> 4.
> +if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
> Why pg_pwrite(), why not just fwrite()? If fwrite() is used, you can
> avoid fileno() system calls, no? Do you need the file position to
> remain the same after writing, hence pg_pwrite()?

I don't recall the original motivation, TBH.

> 5.
> +if (!RestoreBlockImage(record, block_id, page))
> +	continue;
> +
> +/* we have our extracted FPI, let's save it now */
> After extracting the page from the WAL record, do we need to perform a
> checksum on it?

That is there in fixup mode (or should be). Are you thinking this should also be set if not in fixup mode? That defeats the purpose of the raw page extract, which is to see *exactly* what the WAL stream has.

> 6.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Use pg_dir_create_mode instead of hard-coded 0700?

Sure.

> 7.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> +fsync(fileno(OPF));
> +fclose(OPF);
> Since you're creating the directory in case it's not available, you
> need to fsync the directory too?

Sure.

> 8.
> +case 'W':
> +case 'X':
> +	config.fixup_fpw = (option == 'X');
> +	config.save_fpw_path = pg_strdup(optarg);
> +	break;
> Just set config.fixup_fpw = false before the switch block starts,
> like the other variables, and then perhaps doing like below is more
> readable:
> case 'W':
> 	config.save_fpw_path = pg_strdup(optarg);
> case 'X':
> 	config.fixup_fpw = true;
> 	config.save_fpw_path = pg_strdup(optarg);

Like separate opt processing with their own `break` statement? Probably a bit more readable/conventional.

> 9.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Should we use pg_mkdir_p() instead of mkdir()?

Sure.

Thanks,

David
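Here is the sketch promised under item 3 above; the buffer sizing and exact shape are my assumptions rather than the patch's actual code:

/* sketch: build the filename fork suffix, empty for the main fork */
char	forkname[FORKNAMECHARS + 2];	/* '_' + fork name + '\0' */

if (fork < 0 || fork > MAX_FORKNUM)
	pg_fatal("found invalid fork number: %u", fork);

if (fork == MAIN_FORKNUM)
	forkname[0] = '\0';
else
	snprintf(forkname, sizeof(forkname), "_%s", forkNames[fork]);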
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
> > 6.
> > +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> > Use pg_dir_create_mode instead of hard-coded 0700?
>
> I think I thought of that when I first looked at the patch ... but I'm
> not sure, since it says:
>
> src/include/common/file_perm.h-/* Modes for creating directories and files IN THE DATA DIRECTORY */
> src/include/common/file_perm.h:extern PGDLLIMPORT int pg_dir_create_mode;

Looks like it's pretty evenly split in src/bin:

$ git grep -o -E -w '(pg_mkdir_p|mkdir)' '**.c' | sort | uniq -c
   3 initdb/initdb.c:mkdir
   3 initdb/initdb.c:pg_mkdir_p
   1 pg_basebackup/bbstreamer_file.c:mkdir
   2 pg_basebackup/pg_basebackup.c:pg_mkdir_p
   1 pg_dump/pg_backup_directory.c:mkdir
   1 pg_rewind/file_ops.c:mkdir
   4 pg_upgrade/pg_upgrade.c:mkdir

So if that is the preferred approach I'll go ahead and use it; a rough sketch follows below.

> I was wondering if there's any reason to do "CREATE DATABASE". The vast
> majority of TAP tests don't.
>
> $ git grep -ho 'safe_psql[^ ]*' '*pl' | sort | uniq -c | sort -nr | head
>    1435 safe_psql('postgres',
>     335 safe_psql(
>      23 safe_psql($connect_db,

If there was a reason, I don't recall offhand; I will test removing it and if things still work will consider it good enough.

David
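The directory-creation sketch mentioned above, assuming pg_mkdir_p()'s mkdir -p semantics (where an already-existing directory is not an error):

/* sketch: create the output directory with data-directory permissions */
if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
	pg_fatal("could not create directory \"%s\": %m",
			 config.save_fpw_path);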
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 9, 2022 at 2:08 PM David Christensen wrote:
> Justin sez:
> > I was wondering if there's any reason to do "CREATE DATABASE". The vast
> > majority of TAP tests don't.
> >
> > $ git grep -ho 'safe_psql[^ ]*' '*pl' | sort | uniq -c | sort -nr | head
> >    1435 safe_psql('postgres',
> >     335 safe_psql(
> >      23 safe_psql($connect_db,
>
> If there was a reason, I don't recall offhand; I will test removing it
> and if things still work will consider it good enough.

Things blew up when I did that; rather than hunt it down, I just left it in. :-)

Enclosed is v7, with the changes suggested thus far.

v7-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Thu, Nov 10, 2022 at 2:33 AM Bharath Rupireddy wrote:
>
> On Thu, Nov 10, 2022 at 1:31 AM David Christensen wrote:
>
> Thanks for providing the v7 patch, please see my comments and responses below.

Hi Bharath,

Thanks for the feedback.

> > > 2. I'm unable to understand the use-case for the --fixup-fpi option.
> > > pg_waldump is supposed to be just a WAL reader, and must not return any
> > > modified information; with the --fixup-fpi option, the patch violates this
> > > principle, i.e., it sets the page LSN and returns. Without actually
> > > replaying the WAL record on the page, how is it correct to just set
> > > the LSN? How will it be useful? ISTM, we must ignore this option
> > > unless there's a strong use-case.
> >
> > How I was envisioning this was for cases like extreme surgery for
> > corrupted pages, where you extract the page from WAL but it has lsn
> > and checksum set so you could do something like `dd if=fixup-block
> > of=relation ...`, so it *simulates* the replay of said full-page blocks
> > in cases where for some reason you can't replay the intermediate
> > records; since this is always a full-page block, it's capturing what
> > would be the snapshot, so you could manually insert it somewhere as needed
> > without needing to replay (say if dealing with an incomplete or
> > corrupted WAL stream).
>
> Recovery sets the page LSN after it replayed the WAL record on the
> page, right? Recovery does this - base_page/FPI +
> apply_WAL_record_and_then_set_applied_WAL_record's_LSN =
> new_version_of_page. Essentially, in your patch, you are just setting
> the WAL record LSN with the page contents being the base page's. I'm
> still not sure what's the real use-case here. We don't have an
> independent function in postgres that, given a base page and a WAL
> record, just replays the WAL record and outputs the new version of the
> page, so I think what you do in the patch with the fixup option seems
> wrong to me.

Well, if it's not the same output, then I guess you're right and there's not a use for the `--fixup` mode. By the same token, I'd say calculating/setting the checksum also wouldn't need to be done; we should just include the page as it appears in the WAL stream.

> > > > > 5.
> > > > > +if (!RestoreBlockImage(record, block_id, page))
> > > > > +	continue;
> > > > > +
> > > > > +/* we have our extracted FPI, let's save it now */
> > > > > After extracting the page from the WAL record, do we need to perform a
> > > > > checksum on it?
>
> I think you just need to do the following; this will ensure the
> authenticity of the page that pg_waldump returns.
> if (((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blk))
> {
> 	pg_fatal("page checksum failed");
> }

The WAL already has a checksum, so I'm not certain this makes sense on its own. Also, I'm inclined to make it a warning if it doesn't match rather than a fatal. (I'd also have to verify that the checksum is properly set on the page prior to copying the FPI into WAL, which I'm pretty sure it is, but not certain.)

> > > case 'W':
> > > 	config.save_fpw_path = pg_strdup(optarg);
> > > case 'X':
> > > 	config.fixup_fpw = true;
> > > 	config.save_fpw_path = pg_strdup(optarg);
> >
> > Like separate opt processing with their own `break` statement?
> > Probably a bit more readable/conventional.
>
> Yes.

Moot with the removal of the --fixup mode.

> Some more comments:
>
> 1.
> +PGAlignedBlock	zerobuf;
> Essentially, it's not a zero buffer, so please rename the variable to
> something like 'buf' or 'page_buf' or some other?

Sure.

> 2.
> +if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
> Replace pg_pwrite() with fwrite() and avoid the fileno() system calls;
> that should suffice here. AFAICS, we don't need pg_pwrite.

Sure.

> 3.
> +if (config.save_fpw_path != NULL)
> +{
> +	/* Create the dir if it doesn't exist */
> +	if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
> I think you still need pg_check_dir() here; how about something like below?

I was assuming pg_mkdir_p() acted just like mkdir -p, where it's just an idempotent action, so an existing dir is just treated the same. What's the benefit here? I would assume that if a non-dir file existed at that path, or other permissions issues arose, we'd just get an error from pg_mkdir_p(). (Will review the code there and confirm.)

> 4.
> +/* Create the dir if it doesn
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Fri, Nov 11, 2022 at 8:15 AM Justin Pryzby wrote:
>
> On Wed, Nov 09, 2022 at 02:37:29PM -0600, David Christensen wrote:
> > On Wed, Nov 9, 2022 at 2:08 PM David Christensen wrote:
> > > Justin sez:
> > > > I was wondering if there's any reason to do "CREATE DATABASE". The vast
> > > > majority of TAP tests don't.
> > >
> > > If there was a reason, I don't recall offhand; I will test removing it
> > > and if things still work will consider it good enough.
> >
> > Things blew up when I did that; rather than hunt it down, I just left it in. :-)
>
> +$primary->safe_psql('db1', <
> It worked for me when I removed the 3 references to db1.
> That's good for efficiency of the test.

I did figure that out later; fixed in git.

> > +my $blocksize = 8192;
> I think this should be just "my $blocksize;" rather than setting a value
> which is later overwritten.

Yep. Fixed in git.

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Fri, Nov 11, 2022 at 4:57 AM Bharath Rupireddy wrote:
> > Well, if it's not the same output, then I guess you're right and there's
> > not a use for the `--fixup` mode. By the same token, I'd say
> > calculating/setting the checksum also wouldn't need to be done; we
> > should just include the page as it appears in the WAL stream.
>
> Let's hear from others; we may be missing something here. I recommend
> keeping the --fixup patch as 0002, so in case we decide to discard it,
> it's easier; however, I'll leave that to you.

I've whacked it in `git` for now; I can resurrect it if people consider it useful.

> > > > > 5.
> > > > > +if (!RestoreBlockImage(record, block_id, page))
> > > > > +	continue;
> > > > > +
> > > > > +/* we have our extracted FPI, let's save it now */
> > > > > After extracting the page from the WAL record, do we need to perform a
> > > > > checksum on it?
> > >
> > > I think you just need to do the following; this will ensure the
> > > authenticity of the page that pg_waldump returns.
> > > if (((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blk))
> > > {
> > > 	pg_fatal("page checksum failed");
> > > }
> >
> > The WAL already has a checksum, so I'm not certain this makes sense on its
> > own. Also, I'm inclined to make it a warning if it doesn't match rather
> > than a fatal. (I'd also have to verify that the checksum is properly
> > set on the page prior to copying the FPI into WAL, which I'm pretty
> > sure it is, but not certain.)
>
> How about having it as an Assert()?

Based on empirical testing, the checksums don't match, so asserting/alerting on each block extracted seems next to useless; I'm going to just remove that.

> > > 5.
> > > +fsync(fileno(OPF));
> > > +fclose(OPF);
> > > I think just the fsync() isn't enough; you still need fsync_fname()
> > > and/or fsync_parent_path(), perhaps after the for (block_id = 0; block_id
> > > <= XLogRecMaxBlockId(record); block_id++) loop.
> >
> > I'm not sure I get the value of the fsyncs here; if you are using this
> > tool in this capacity you're by definition doing some sort of
> > transient investigative steps. Since the WAL was fsync'd, you could
> > always rerun/recreate as needed in the unlikely event of an OS crash
> > in the middle of this investigation. Since this is outside the
> > purview of the database operations proper (unlike, say, initdb), it seems
> > like it's unnecessary (or definitely shouldn't need to be selectable).
> > My thoughts are that if we're going to fsync, just do the fsyncs
> > unconditionally rather than complicate the interface further.
>
> -1 for fsync() per file created, as it can create a lot of sync load on
> production servers, impacting performance. How about just syncing the
> directory at the end, assuming it doesn't cost as much as an fsync() per
> FPI file created would?

I can fsync the dir if that's a useful compromise.

> > > 4.
> > > +$primary->append_conf('postgresql.conf', "wal_level='replica'");
> > > +$primary->append_conf('postgresql.conf', 'archive_mode=on');
> > > +$primary->append_conf('postgresql.conf', "archive_command='/bin/false'");
> > > Why do you need to set wal_level to replica? Out of the box your
> > > cluster comes with replica only, no?
> > > And why do you need archive_mode on and set the command to do nothing?
> > > Why is archiving needed for testing your feature in the first place?
> >
> > I think it had shown "minimal" in my testing; I was purposefully
> > failing archives so the WAL would stick around. Maybe a custom
> > archive command that just copied a single WAL file into a known
> > location so I could use that instead of the current approach would
> > work, though I'm not sure how Windows support would work with that. Open
> > to other ideas to more cleanly get a single WAL file that isn't the
> > last one. (Earlier versions of this test were using /all/ of the
> > generated WAL files rather than a single one, so maybe I am
> > overcomplicating things for the single WAL file case.)
>
> Typically we create a physical replication slot at the beginning so
> that the server keeps the WAL required for you in pg_wal itself; for
> instance, please see pg_walinspect:
>
> -- Make sure checkpoints don't interfere with the test.
> SELECT 'init' FROM
> pg_create_physical_replication_slot('regress_pg_walinspect_slot',
> true, false);

Will see if I can get something like this to work; I'm currently stopping the server before running the file-based tests, but I suppose there's no reason to do so, so a temporary slot that holds it around until the test is complete is probably fine.

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Enclosed is v8, which uses the replication slot method to retain WAL as well as fsync'ing the output directory when everything is done. v8-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Tue, Nov 15, 2022 at 4:41 AM Bharath Rupireddy wrote:
>
> On Tue, Nov 15, 2022 at 1:29 AM David Christensen wrote:
> >
> > Enclosed is v8, which uses the replication slot method to retain WAL
> > as well as fsync'ing the output directory when everything is done.
>
> Thanks. It's mostly in good shape. However, a few more comments:
>
> 1.
> +if it does not exist. The images saved will be subject to the same
> +filtering and limiting criteria as display records, but in this
> +mode pg_waldump will not output any other
> +information.
> May I know what's the intention of the statement 'The images saved ...'?
> If it's not necessary and doesn't convey anything useful to the user,
> can we remove it?

Basically I mean that if you're limiting to a specific relation or rmgr type, etc., it only saves those FPIs. (So filtering is applied first, before considering whether to save the FPI or not.)

> 2.
> +#include "storage/checksum.h"
> +#include "storage/checksum_impl.h"
> I think we don't need the above includes as we got rid of verifying
> page checksums. The patch compiles without them for me.

Good catch.

> 3.
> +char *save_fpw_path;
> Can we rename the above variable to save_fpi_path, just to be in sync
> with what we expose to the user, the option name 'save-fpi'?

Sure.

> 4.
> +if (config.save_fpw_path != NULL)
> +{
> +	/* Fsync our output directory */
> +	fsync_fname(config.save_fpw_path, true);
> +}
> I guess adding a comment there as to why we aren't fsyncing every
> file that gets created, but only the directory once at the end, would
> help clarify doubts that other members might get while looking at the
> code.

Can do.

> 5.
> +if (config.save_fpw_path != NULL)
> +{
> +	/* Fsync our output directory */
> +	fsync_fname(config.save_fpw_path, true);
> +}
> So, are we sure that we don't want to fsync for the time_to_stop exit(0)
> cases, say when CTRL+C'ed? Looks like we handle time_to_stop safely,
> meaning exiting with return code 0; shouldn't we fsync the directory?

We can. Like I've said before, since these aren't production parts of the cluster I don't personally have much of an opinion on whether fsync() is appropriate at all, so I don't have strong feelings here.

> 6.
> +else if (config.save_fpw_path)
> Let's use the same convention to check non-NULLness,
> config.save_fpw_path != NULL.

Good catch.

> 7.
> +CHECKPOINT;
> +SELECT pg_switch_wal();
> +UPDATE test_table SET a = a + 1;
> +SELECT pg_switch_wal();
> I don't think switching WAL after the checkpoint is necessary here,
> because the checkpoint ensures all the WAL gets flushed to disk.
> Please remove it.

The point is to ensure we have a clean WAL segment that we know will contain the relation we are filtering by. I will test whether this still holds without the extra pg_switch_wal(), but that's the rationale.

> PS: I've seen the following code:
> +my $walfile = [sort { $a <=> $b } glob("$waldir/00*")]->[1]; # we
> want the second WAL file, which will be a complete WAL file with
> full-page writes for our specific relation.

I don't understand the question.

> 8.
> +$node->safe_psql('postgres', <
> +EOF
> Why is EOF used here? Can't we do something like below to execute
> multiple statements?
> $node->safe_psql(
> 	'postgres', qq[
> 	SELECT data FROM pg_logical_slot_get_changes('regression_slot1', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> 	SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> 	SELECT data FROM pg_logical_slot_get_changes('regression_slot3', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> 	SELECT data FROM pg_logical_slot_get_changes('regression_slot4', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> ]);
>
> Same here:
> +$node->safe_psql('postgres', <
> +SELECT pg_drop_replication_slot('regress_pg_waldump_slot');
> +EOQ

As a long-time perl programmer, heredocs seem more natural and easier to read than a string that accomplishes the same function. If there is an established project style I'll stick with it, but it just rolled out that way. :-)

> 9.
> +my $walfile = [sort { $a <=> $b } glob("
Re: Moving forward with TDE
> On Nov 15, 2022, at 1:08 PM, Jacob Champion wrote:
>
> On Mon, Oct 24, 2022 at 9:29 AM David Christensen wrote:
>> I would love to open a discussion about how to move forward and get
>> some of these features built out. The historical threads here are
>> quite long and complicated; is there a "current state" other than the
>> wiki that reflects the general thinking on this feature? Any major
>> developments in direction that would not be reflected in the code from
>> May 2021?
>
> I don't think the patchset here has incorporated the results of the
> discussion [1] that happened at the end of 2021. For example, it looks
> like AES-CTR is still in use for the pages, which I thought was
> already determined to be insufficient.

Good to know about the next steps, thanks.

> The following next steps were proposed in that thread:
>
>> 1. modify temporary file I/O to use a more centralized API
>> 2. modify the existing cluster file encryption patch to use XTS with an
>> IV that uses more than the LSN
>> 3. add XTS regression test code like CTR
>> 4. create WAL encryption code using CTR
>
> Does this patchset need review before those steps are taken (or was
> there additional conversation/work that I missed)?

This was just a refresh of the old patches on the wiki to work as written on HEAD. If there are known TODOs here, then that work still needs to be done. I was going to take 2) and Stephen was going to work on 3); I am not sure about the other two but will review the thread you pointed to.

Thanks for pointing that out.

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Enclosed is v9.

- code style consistency (FPI instead of FPW) internally
- cleanup of no-longer-needed checksum-related pieces from code and tests
- test cleanup/simplification
- other comment cleanup

Passes all CI checks.

Best,

David

v9-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
ScanSourceDatabasePgClass
Hi Robert,

In 9c08aea6a you introduced the block-by-block strategy for creating a copy of the database. In the main loop, this utilizes this call:

buf = ReadBufferWithoutRelcache(rlocator, MAIN_FORKNUM, blkno,
                                RBM_NORMAL, bstrategy, false);

Here, the last parameter is "false" for the permanence factor of this relation. Since we know that pg_class is in fact a permanent relation, this ends up causing issues for the TDE patches that I am working on updating, due to using the opposite value when calculating the page's IV and thus failing the decryption when trying to create a database based on template0.

Is there a reason why this needs to be "false" here? I recognize that this routine is accessing the table outside of a formal connection, so there might be more subtle effects that I am not aware of. If so, this should be documented. If it's an oversight, I think we should change it to "true" to match the actual permanence state of the relation.

I did test changing it to true and didn't notice any adverse effects in `make installcheck-world`, but let me know if there is more to this story than meets the eye. (I did review the original discussion at https://www.postgresql.org/message-id/flat/CA%2BTgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T%3DCYCvRyFHywSBUQ%40mail.gmail.com and did not notice any discussion of this specific parameter choice.)

Thanks,

David
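For clarity, the proposed one-line change would be (just a sketch of the suggestion above, not a committed patch):

/* pg_class is a permanent relation, so pass permanent = true */
buf = ReadBufferWithoutRelcache(rlocator, MAIN_FORKNUM, blkno,
                                RBM_NORMAL, bstrategy, true);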
Re: Moving forward with TDE
Hi Jacob,

Thanks, I've added this patch in my tree [1]. (For now, I'm just adding fixes and the like atop the original separate patches, but will eventually get things winnowed down into probably the same 12 parts the originals were reviewed in.)

Best,

David

[1] https://github.com/pgguru/postgres/tree/tde
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 16, 2022 at 3:30 AM Bharath Rupireddy wrote:
> 1.
> -if (config.filter_by_fpw && !XLogRecordHasFPW(xlogreader_state))
> +if (config.filter_by_fpw && !XLogRecordHasFPI(xlogreader_state))
> These changes are not related to this feature, hence renaming those
> variables/function names must be dealt with separately. If required,
> another patch can be submitted to change filter_by_fpw to
> filter_by_fpi and XLogRecordHasFPW() to XLogRecordHasFPI().

Not required; I can revert the changes unrelated to this specific patch. (I'd written the original ones for it, so didn't really think anything of it... :-))

> 2.
> +/* We fsync our output directory only; since these files are not part
> + * of the production database we do not require the performance hit
> + * that fsyncing every FPI would entail, so are doing this as a
> + * compromise. */
> The commenting style doesn't match the standard that we follow
> elsewhere in postgres; please refer to other multi-line comments.

Will fix.

> 3.
> +	fsync_fname(config.save_fpi_path, true);
> +}
> It looks like fsync_fname()/fsync() in general isn't recursive, in the
> sense that it doesn't fsync the files under the directory, but the
> directory only. So, the idea of a directory fsync doesn't seem worth it.
> We either 1) get rid of fsync entirely, or 2) fsync all the files after
> they are created and the directory at the end, or 3) do option (2) with a
> --no-sync option similar to its friends. Since option (2) is a no-go,
> we can either choose option (1) or option (3). My vote at this point
> is for option (1).

Agree to remove.

> 4.
> +($walfile_name, $blocksize) = split '\|' =>
> $node->safe_psql('postgres', "SELECT pg_walfile_name(pg_switch_wal()),
> current_setting('block_size')");
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> I think there's something wrong with this, no? pg_switch_wal() can, at
> times, return the end+1 of the prior segment (see below snippet) and I'm
> not sure if such a case can happen here.
>
>  * The return value is either the end+1 address of the switch record,
>  * or the end+1 address of the prior segment if we did not need to
>  * write a switch record because we are already at segment start.
>  */
> XLogRecPtr
> RequestXLogSwitch(bool mark_unimportant)

I think this approach is pretty common for getting the walfile name, no? While there might be an edge case here, since the rest of the test is a controlled environment I'm inclined to not worry about it; this would require the changes prior to this point to exactly fill a WAL segment, which strikes me as so unlikely as to be impossible in this specific scenario.

> 5.
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> +ok(-f $walfile, "Got a WAL file");
> Is this checking if the WAL file is present or not in PGDATA/pg_wal?
> If yes, I think this isn't required, as pg_switch_wal() ensures that
> the WAL is written and flushed to disk.

You are correct; probably another artifact of the earlier version. That said, I'm not sure I see the harm in keeping it as a sanity check.

> 6.
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> Isn't "pgdata" hardcoded here? I think you might need to do the following:
> $node->data_dir . '/pg_wal/' . $walfile_name;

Can fix.

> 7.
> +# save filename for later verification
> +$files{$file}++;
>
> +# validate that we ended up with some FPIs saved
> +ok(keys %files > 0, 'verify we processed some files');
> Why do we need to store filenames in a hash when we later just check
> its size? Can't we use a boolean (file_found) or an int
> variable (file_count) to verify that we found the file?

Another artifact; we were comparing the files output between two separate lists of arbitrary numbers of pages being written out and verifying the raw/fixup versions had the same lists.

> 8.
> +$node->safe_psql('postgres', <
> +SELECT 'init' FROM
> pg_create_physical_replication_slot('regress_pg_waldump_slot', true,
> false);
> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> +CHECKPOINT; -- required to force FPI for next writes
> +UPDATE test_table SET a = a + 1;
> +EOF
> The EOF with append_conf() is being used in 4 files, while elsewhere in
> the TAP test files (more than 100?) qq[] or quotes is being used. I
> have no strong opinion here; I'll leave it to the other reviewers or
> committer.

I'm inclined to leave it just for (personal) readability, but can change if there's a strong consensus against.

> > > 11.
> > > +# verify filename formats matches w/--save-fpi
> > > +for my $fullpath (glob "$tmp_folder/raw/*")
> > > Do we need to look for the exact match of the file that gets created
> > > in the save-fpi path? While checking for this is great, it makes the
> > > test code non-portable (may not work on Windows or other platforms,
> > > no?) and complex? This way, you can get rid of get_block_info(
Re: Moving forward with TDE
Hi Dilip, Thanks for the feedback here. I will review the docs changes and add to my tree. Best, David
[PATCH] expand the units that pg_size_pretty supports on output
Hi folks,

Enclosed is a patch that expands the unit output for pg_size_pretty(numeric) going up to yottabytes; I reworked the existing numeric output code to account for the larger number of units we're using rather than just adding nesting levels.

There are also a few other places that could benefit from expanded units, but this is a standalone starting point.

Best,

David

0001-Expand-the-units-that-pg_size_pretty-numeric-knows-a.patch Description: Binary data
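For a sense of the reworked shape, a table-driven loop rather than one nesting level per unit might look something like this (an illustrative sketch only; the real code operates on numeric, since int64 cannot even represent a yottabyte):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Sketch: pick the largest unit at which the value still reads >= 10240. */
static const char *const size_units[] = {
	"bytes", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
};

static void
size_pretty(int64_t size, char *buf, size_t buflen)
{
	int			unit = 0;
	int			nunits = sizeof(size_units) / sizeof(size_units[0]);

	while (llabs(size) >= 10 * 1024 && unit < nunits - 1)
	{
		size = (size + 512) / 1024;		/* divide by 1024, with rounding */
		unit++;
	}
	snprintf(buf, buflen, "%lld %s", (long long) size, size_units[unit]);
}

The point of the table is that adding ZB and YB becomes a two-entry change rather than two more levels of conditionals.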
Issue in recent pg_stat_statements?
-hackers,

So in doing some recent work on pg_stat_statements, I noticed that while the regression test still passes on HEAD, it appears that 4f0b096 (per git bisect) changed/broke how this works compared to historical versions. Essentially, when doing a fresh install of pg_stat_statements on a fresh db (outside of the regression framework), it's not returning any rows from the view. I didn't see any related documentation changes, so as far as I know, this should still be recording all statements as per normal.

My full steps to reproduce from a clean CentOS 7 install are attached. I have also been able to reproduce this on OS X and Fedora 33. The TL;DR is:

CREATE EXTENSION pg_stat_statements;
CREATE TABLE foo (a int, b text);
INSERT INTO foo VALUES (1,'a');
SELECT * FROM foo;
SELECT * FROM pg_stat_statements; -- returns nothing

Settings for pg_stat_statements:

postgres=# select name, setting from pg_settings where name like 'pg_stat_statements%';
               name                | setting
-----------------------------------+---------
 pg_stat_statements.max            | 5000
 pg_stat_statements.save           | on
 pg_stat_statements.track          | top
 pg_stat_statements.track_planning | off
 pg_stat_statements.track_utility  | on
(5 rows)

Is this an expected change, or is this in fact broken? In previous revisions, this was showing the INSERT and SELECT at the very least. I'm unclear as to why the regression test is still passing, so want to verify that I'm not doing something wrong in the testing.

Best,

David

#!/bin/bash -e

sudo yum install -y git
sudo yum groupinstall -y "Development Tools"
sudo yum install -y readline-devel zlib-devel 'perl(IPC::Run)' 'perl(Test::More)' 'perl(Time::HiRes)'

mkdir -p ~/Checkouts
cd ~/Checkouts
git clone https://github.com/postgres/postgres postgresql

cd ~/Checkouts/postgresql
git checkout 4f0b096

make distclean || true
./configure --enable-cassert --enable-debug CFLAGS="-ggdb -O0 -g3 -fno-omit-frame-pointer" --enable-tap-tests
make -j10 && \
( \
cd contrib/pg_stat_statements/; \
make check \
)

cd tmp_install/usr/local/pgsql

export PGUSER=postgres
export PGHOST=/tmp/
export LD_LIBRARY_PATH=./lib:$LD_LIBRARY_PATH
PATH=$HOME/tmp_install/usr/local/pgsql:$PATH

bin/initdb -D data -U postgres
bin/pg_ctl -D data -l logfile start
bin/psql -c 'ALTER SYSTEM SET shared_preload_libraries = $$pg_stat_statements$$'
bin/pg_ctl -D data -l logfile restart
bin/psql <
Re: Issue in recent pg_stat_statements?
> > Is this an expected change, or is this in fact broken? In previous
> > revisions, this was showing the INSERT and SELECT at the very least. I'm
> > unclear as to why the regression test is still passing, so want to verify
> > that I'm not doing something wrong in the testing.
>
> Yes, you want to look into the queryid functionality. See
>
> https://www.postgresql.org/message-id/flat/35457b09-36f8-add3-1d07-6034fa585ca8%40oss.nttdata.com
>
> Interface changes may still be coming in 14 for that. Or warnings.

Hmm, I'm unclear as to why you would potentially want to use pg_stat_statements *without* this functionality. At the very least, it violates POLA; I spent the better part of a day thinking this was a bug, since the expected behavior seemed so obvious that I wouldn't have expected anything different.

In any case, this discussion is better had on a different thread. Thanks at least for explaining what I was seeing.

Best,

David
Re: Issue in recent pg_stat_statements?
On Mon, Apr 26, 2021 at 12:18 PM Julien Rouhaud wrote:
> On Mon, Apr 26, 2021 at 11:40 PM David Christensen wrote:
> > >
> > > > Is this an expected change, or is this in fact broken? In previous
> > > > revisions, this was showing the INSERT and SELECT at the very least. I'm
> > > > unclear as to why the regression test is still passing, so want to verify
> > > > that I'm not doing something wrong in the testing.
> > >
> > > Yes, you want to look into the queryid functionality. See
> > > https://www.postgresql.org/message-id/flat/35457b09-36f8-add3-1d07-6034fa585ca8%40oss.nttdata.com
> > >
> > > Interface changes may still be coming in 14 for that. Or warnings.
> >
> > Hmm, I'm unclear as to why you would potentially want to use
> > pg_stat_statements *without* this functionality.
>
> Using pg_stat_statements with different query_id semantics without
> having to fork pg_stat_statements.

I can see that argument for allowing alternatives, but the current default of nothing seems particularly non-useful, so some sensible default value would seem to be in order, or I can predict a whole mess of future user complaints.
Re: [PATCH] expand the units that pg_size_pretty supports on output
A second patch to teach the same units to pg_size_bytes().

Best,

David

On Wed, Apr 14, 2021 at 11:13 AM David Christensen <david.christen...@crunchydata.com> wrote:

> Hi folks,
>
> Enclosed is a patch that expands the unit output for
> pg_size_pretty(numeric) going up to yottabytes; I reworked the existing
> numeric output code to account for the larger number of units we're using
> rather than just adding nesting levels.
>
> There are also a few other places that could benefit from expanded units,
> but this is a standalone starting point.
>
> Best,
>
> David

0001-Expand-the-supported-units-in-pg_size_bytes-to-cover.patch Description: Binary data
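The parsing side is conceptually the inverse: after consuming the numeric prefix, look the suffix up in a unit table and scale by the matching power of 1024. A hedged sketch of such a lookup table (illustrative names only, not the patch's actual structures):

/* sketch: unit suffixes for pg_size_bytes(), multiplier = 1024^exponent */
struct size_unit
{
	const char *name;
	int			exponent;
};

static const struct size_unit size_bytes_units[] = {
	{"bytes", 0}, {"kB", 1}, {"MB", 2}, {"GB", 3}, {"TB", 4},
	{"PB", 5}, {"EB", 6}, {"ZB", 7}, {"YB", 8},
};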
[PATCH] comment fixes for delayChkptFlags
Enclosed is a trivial fix for a typo and misnamed field I noted when doing some code review. Best, David 0001-fix-comment-typos-for-delayChkptFlags.patch Description: Binary data
Re: [PATCHES] Post-special page storage TDE support
> > Explicitly
> > locking (assuming you stay in your lane) should only need to guard
> > against access from other
> > backends of this type if using shared buffers, so will be use-case
> > dependent.
>
> I'm not sure what you mean here?

I'm mainly pointing out that the specific code that manages this feature is the only code that has to worry about modifying said page region.

> > This does have a runtime overhead due to moving some offset
> > calculations from compile time to
> > runtime. It is thought that the utility of this feature will outweigh
> > the costs here.
>
> Have you done some benchmarking to give an idea of how much overhead we're
> talking about?

Not yet, but I am going to work on this. I suspect the current code could be improved, but I will try to get some sort of measurement of the additional overhead.

> > Candidates for page features include 32-bit or 64-bit checksums,
> > encryption tags, or additional
> > per-page metadata.
> >
> > While we are not currently getting rid of the pd_checksum field, this
> > mechanism could be used to
> > free up that 16 bits for some other purpose.
>
> IIUC there's a hard requirement of initdb-time initialization, as there's
> otherwise no guarantee that you will find enough free space in each page at
> runtime. It seems like a very hard requirement for a full replacement of the
> current checksum approach (even if I agree that the current implementation
> limitations are far from ideal), especially since there's no technical reason
> that would prevent us from dynamically enabling data-checksums without doing
> all the work when the cluster is down.

As implemented, that is correct; we are currently assuming this specific feature mechanism is set at initdb time only. Checksums are not the primary motivation here, but were something that I could use for an immediate illustration of the feature.

That said, presumably you could define a way to set the features per-relation (say with a template field in pg_class) which would propagate to a relation on rewrite, so there could be ways to handle things incrementally, were this an overall goal.

Thanks for looking,

David
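To illustrate the kind of calculation that moves from compile time to runtime in this discussion (a hypothetical sketch; the names are illustrative and not necessarily what the patches use):

/* Hypothetical sketch: usable page space becomes a runtime quantity. */
extern int	reserved_page_size;		/* fixed at initdb, read from pg_control */

static inline Size
page_usable_space(void)
{
	/* previously the compile-time constant BLCKSZ - SizeOfPageHeaderData */
	return BLCKSZ - SizeOfPageHeaderData - reserved_page_size;
}

The overhead question above is essentially whether replacing such constants with a load and a subtraction (or a fastdiv/fastmod, where division is involved) is measurable in hot paths.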
Re: [PATCHES] Post-special page storage TDE support
Hi Matthias,

> Did you read the related thread with related discussion from last June, "Re:
> better page-level checksums" [0]? In that I argued that space at the end of a
> page is already allocated for the AM, and that reserving variable space at
> the end of the page for non-AM usage is wasting the AM's performance
> potential.

Yes, I had read parts of that thread among others, but have given it a re-read. I can see the point you're making here, and agree that if we can allocate between pd_special and pd_upper that could make sense. I am a little unclear as to what performance impacts there would be for the AM if this additional space were ahead of or behind the page special area; it seems like if this is something that needs to live on the page *somewhere*, just being aligned correctly would be sufficient from the AM's standpoint. Considering that I am trying to make this have zero storage impact if these features are not active, the impact on a cluster with no additional features would be moot from a storage perspective, no?

> Apart from that: Is this variable-sized 'metadata' associated with smgr
> infrastructure only, or is it also available for AM features? If not, then
> this is a strong -1. The amount of tasks smgr needs to do on a page is
> generally much less than the amount of tasks an AM needs to do; so in my view
> the AM has priority in prime page real estate, not smgr or related
> infrastructure.

I will confess to a slightly wobbly understanding of the delineation of responsibility here. I was under the impression that modifying any consumer of PageHeaderData would be sufficient to cover all AMs for the types of cluster-wide options we'd be concerned about (say extended checksums, multiple page encryption schemes, or other per-page information we haven't yet anticipated). Reading smgr/README and the various access/*/README has not made the distinction clear to me yet.

> re: PageFeatures
> I'm not sure I understand the goal, nor the reasoning. Shouldn't this be part
> of the storage manager (smgr) implementation / can't this be part of the smgr
> of the relation?

For at least the feature cases I'm anticipating, this would apply to any disk page that may have user data, set (at least initially) at initdb time, so it should apply to any pages in the cluster, regardless of AM.

> re: use of pd_checksum
> I mentioned this in the above-mentioned thread too, in [1], that we could use
> pd_checksum as an extra area marker for this storage-specific data, which
> would be located between pd_upper and pd_special.

I do think that we could indeed use this as an additional in-page pointer, but at least for this version I was keeping things backwards-compatible. Peter G (I think) also made some good points about how to include the various status bits on the page somehow in terms of making a page completely self-contained.

> Re: patch contents
>
> 0001:
> > + specialSize = MAXALIGN(specialSize) + reserved_page_size;
>
> This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or
> an assertion that reserved_page_size is MAXALIGNED, would be better.

It is currently aligned via the space calculation return value, but I agree that folding it into an assert or reworking it explicitly is clearer.

> > PageValidateSpecialPointer(Page page)
> > {
> >     Assert(page);
> > -   Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> > +   Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ);
>
> This check is incorrect. With your code it would allow pd_special past the
> end of the block. If you want to put the reserved_space_size effectively
> inside the special area, this check should instead be:
>
> +   Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));
>
> Or, equally valid
>
> +   Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);

Yup, I think I inverted my logic there; thanks.

> > + * +-+-++-+
> > + * | ... tuple2 tuple1 | "special space" | "reserved" |
> > + * +---++-+
>
> Could you fix the table display if / when you revise the patchset? It seems
> to me that the corners don't line up with the column borders.

Sure thing.

> 0002:
> > Make the output of "select_views" test stable
> > Changing the reserved_page_size has resulted in non-stable results for this
> > test.
>
> This makes sense; what kind of instability are we talking about? Are there
> different results for runs with the same binary, or is this across
> compilations?

When running with the same compilation/initdb settings, the test results are stable, but they differ depending on which options you chose, so `make installcheck` output will fail when testing a cluster with different options vs. upstream HEAD without these patches, etc.

> 0003 and up were not yet reviewed in depth.

Thanks, I appreciate the feedback so far.
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Nov 4, 2022, at 9:02 AM, Justin Pryzby wrote:
>
> On Fri, Nov 04, 2022 at 11:52:59AM +0900, Ian Lawrence Barwick wrote:
>> 2022年5月3日(火) 8:45 David Christensen:
>>>
>>> ...and pushing a couple fixups pointed out by cfbot, so here's v4.
>>
>> cfbot reports the patch no longer applies [1]. As CommitFest 2022-11 is
>> currently underway, this would be an excellent time to update the patch.
>
> More important than needing to be rebased, the patch has never passed
> its current tests on windows.
>
> As I recall, that's due to relying on "cp". And "rsync", which
> shouldn't be assumed to exist by regression tests.

I will work on supporting Windows compatibility here. Is there some list of guidelines for what you can and can't use? I don't have a Windows machine available to develop on.

Was it failing on Windows? I was attempting to skip it as I recall.

Best,

David
Re: [PATCH] expand the units that pg_size_pretty supports on output
New versions attached to address the initial CF feedback and rebase on HEAD as of now.

0001-Expand-the-units-that-pg_size_pretty-numeric-knows-a.patch - expands the units that pg_size_pretty() can handle, up to YB.
0002-Expand-the-supported-units-in-pg_size_bytes-to-cover.patch - expands the units that pg_size_bytes() can handle, up to YB.

0001-Expand-the-units-that-pg_size_pretty-numeric-knows-a.patch Description: Binary data
0002-Expand-the-supported-units-in-pg_size_bytes-to-cover.patch Description: Binary data