[PATCH] add relation and block-level filtering to pg_waldump
Greetings,

This patch adds the ability to specify a RelFileNode and optional BlockNum to
limit output of pg_waldump records to only those which match the given
criteria. This should be more performant than `pg_waldump | grep` as well as
more reliable given specific variations in output style depending on how the
blocks are specified.

This currently affects only the main fork, but we could presumably add the
option to filter by fork as well, if that is considered useful.

Best,

David

From 9194b2cb07172e636030b9b4e979b7f2caf7cbc0 Mon Sep 17 00:00:00 2001
From: David Christensen
Date: Thu, 24 Feb 2022 11:00:46 -0600
Subject: [PATCH] Add relation/block filtering to pg_waldump

This feature allows you to only output records that are targeting a
specific RelFileNode and optional BlockNumber within this relation.

Currently only applies this filter to the relation's main fork.
---
 doc/src/sgml/ref/pg_waldump.sgml | 23 ++++++++
 src/bin/pg_waldump/pg_waldump.c  | 74 +++++++++++++++++++++++++++-
 2 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
index 5735a161ce..c953703bc8 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -100,6 +100,29 @@ PostgreSQL documentation
+
+      -k block
+      --block=block
+
+
+       Display only records touching the given block. (Requires also
+       providing the relation via --relation.)
+
+
+
+
+
+      -l tbl/db/rel
+      --relation=tbl/db/rel
+
+
+       Display only records touching the given relation. The relation is
+       specified via tablespace oid, database oid, and relfilenode separated
+       by slashes.
+
+
+
      -n limit
      --limit=limit

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..faae547a5c 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -55,6 +55,10 @@ typedef struct XLogDumpConfig
 	bool		filter_by_rmgr_enabled;
 	TransactionId filter_by_xid;
 	bool		filter_by_xid_enabled;
+	RelFileNode filter_by_relation;
+	bool		filter_by_relation_enabled;
+	BlockNumber filter_by_relation_block;
+	bool		filter_by_relation_block_enabled;
 } XLogDumpConfig;

 typedef struct Stats
@@ -394,6 +398,34 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	return count;
 }

+/*
+ * Boolean to return whether the given WAL record matches a specific
+ * relation and optional block
+ */
+static bool
+XLogRecordMatchesRelationBlock(XLogReaderState *record, RelFileNode matchRnode,
+							   BlockNumber matchBlock)
+{
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blk;
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk);
+
+		if (forknum == MAIN_FORKNUM &&
+			matchRnode.spcNode == rnode.spcNode &&
+			matchRnode.dbNode == rnode.dbNode &&
+			matchRnode.relNode == rnode.relNode &&
+			(matchBlock == InvalidBlockNumber || matchBlock == blk))
+			return true;
+	}
+	return false;
+}
+
 /*
  * Calculate the size of a record, split into !FPI and FPI parts.
 */
@@ -767,6 +799,8 @@ usage(void)
 	printf(_("  -b, --bkp-details      output detailed information about backup blocks\n"));
 	printf(_("  -e, --end=RECPTR       stop reading at WAL location RECPTR\n"));
 	printf(_("  -f, --follow           keep retrying after reaching end of WAL\n"));
+	printf(_("  -k, --block=N          only show records matching a given relation block (requires -l)\n"));
+	printf(_("  -l, --relation=N/N/N   only show records that touch a specific relation\n"));
 	printf(_("  -n, --limit=N          number of records to display\n"));
 	printf(_("  -p, --path=PATH        directory in which to find log segment files or a\n"
 			 "                         directory with a ./pg_wal that contains such files\n"));
@@ -802,12 +836,14 @@ main(int argc, char **argv)
 	static struct option long_options[] = {
 		{"bkp-details", no_argument, NULL, 'b'},
+		{"block", required_argument, NULL, 'k'},
 		{"end", required_argument, NULL, 'e'},
 		{"follow", no_argument, NULL, 'f'},
 		{"help", no_argument, NULL, '?'},
 		{"limit", required_argument, NULL, 'n'},
 		{"path", required_argument, NULL, 'p'},
 		{"quiet", no_argument, NULL, 'q'},
+		{"relation", required_argument, NULL, 'l'},
 		{"rmgr", required_argument, NULL, 'r'},
 		{"start", required_argument, NULL,
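The quoted patch does not reach the point where the filter is applied;
presumably the new check is consulted from the main record loop alongside the
existing rmgr/xid filters, along these lines (a sketch using the config
fields and function defined in the patch; the surrounding loop is paraphrased
from pg_waldump rather than quoted):

	/* Sketch: skip records that do not touch the requested relation/block */
	while ((record = XLogReadRecord(xlogreader, &errormsg)) != NULL)
	{
		/* ... existing rmgr/xid filter checks ... */

		if (config.filter_by_relation_enabled &&
			!XLogRecordMatchesRelationBlock(xlogreader,
											config.filter_by_relation,
											config.filter_by_relation_block_enabled ?
											config.filter_by_relation_block :
											InvalidBlockNumber))
			continue;

		/* ... per-record display and stats accumulation ... */
	}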
Re: [PATCH] add relation and block-level filtering to pg_waldump
Added to commitfest as: https://commitfest.postgresql.org/37/3565/
Re: [PATCH] add relation and block-level filtering to pg_waldump
> Cool. I think we can report an error instead of reading wal files,
> if the tablespace, database, or relation is invalid. Is there any
> WAL record that has an invalid tablespace, database, or relation OID?

The only sort of validity check we could do here is range checking for the
underlying data types (which we certainly could/should add if it's known to
never be valid for the underlying types); non-existence of objects is a
no-go, since that depends purely on the WAL range you are looking at and
you'd have to, you know, scan it to see if it existed before marking it as
invalid. :)

Thanks,

David
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Fri, Feb 25, 2022 at 7:33 AM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:

> Thanks for the patch. This is not adding something that users can't do
> right now, but definitely improves the usability of the pg_waldump as
> it avoids external filterings. Also, it can give the stats/info at
> table and block level. So, +1 from my side.

Thanks for the feedback; I will be incorporating most of this into a new
version, with a couple of responses below.

> 3) Crossing 80 char limit

This is neither here nor there, but have we as a project considered
increasing that to something more modern? I know a lot of current projects
consider 132 to be a more reasonable limit. (Will reduce it down to that for
now, but consider this a vote towards increasing that limit.)

> 5) I would also see a need for "filter by FPW" i.e. list all WAL
> records with "FPW".

Yes, that wouldn't be too hard to add to this; I can add it to the next
version. We probably ought to also make the fork number specifiable as well.
Other thoughts could be some wildcard value in the relation part, so
`1234/23456/*` could filter WAL to a specific database only, say, or some
other multiple specifier, like `--block=1234,123456,121234`. (I honestly
consider this to be more advanced than we'd need to support in this patch,
but it probably wouldn't be too hard to add.)

Thanks,

David
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Fri, Feb 25, 2022 at 7:08 AM Japin Li wrote:
>
> On Fri, 25 Feb 2022 at 20:48, David Christensen
> <david.christen...@crunchydata.com> wrote:
> >> Cool. I think we can report an error instead of reading wal files,
> >> if the tablespace, database, or relation is invalid. Is there any
> >> WAL record that has an invalid tablespace, database, or relation OID?
> >
> > The only sort of validity check we could do here is range checking for
> > the underlying data types (which we certainly could/should add if it's
> > known to never be valid for the underlying types);
>
> The invalid OID I said here is such as a negative number or zero; for
> those parameters, we do not need to read the WAL files, since they are
> always invalid.

Agreed. Can add some additional range validation to the parsed values.

David
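Something along these lines, presumably (a sketch only; the error handling
mirrors pg_waldump's existing bad_argument pattern, and the "zero is never
valid" check is per the suggestion above):

	/* Sketch: validate the tbl/db/rel argument while parsing --relation */
	unsigned int	spc,
					db,
					rel;

	if (sscanf(optarg, "%u/%u/%u", &spc, &db, &rel) != 3)
	{
		pg_log_error("invalid relation specification: \"%s\"", optarg);
		goto bad_argument;
	}

	/*
	 * OIDs are unsigned, so sscanf has already rejected negative input;
	 * zero (InvalidOid) was suggested upthread as never valid either.
	 */
	if (spc == 0 || db == 0 || rel == 0)
	{
		pg_log_error("invalid relation specification: \"%s\"", optarg);
		goto bad_argument;
	}

	config.filter_by_relation.spcNode = (Oid) spc;
	config.filter_by_relation.dbNode = (Oid) db;
	config.filter_by_relation.relNode = (Oid) rel;
	config.filter_by_relation_enabled = true;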
Re: [PATCH] add relation and block-level filtering to pg_waldump
Bharath Rupireddy writes:
> On Fri, Feb 25, 2022 at 12:36 AM David Christensen wrote:
>>
>> Greetings,
>>
>> This patch adds the ability to specify a RelFileNode and optional
>> BlockNum to limit output of pg_waldump records to only those which match
>> the given criteria. This should be more performant than
>> `pg_waldump | grep` as well as more reliable given specific variations
>> in output style depending on how the blocks are specified.
>>
>> This currently affects only the main fork, but we could presumably add
>> the option to filter by fork as well, if that is considered useful.
>
> Thanks for the patch. This is not adding something that users can't do
> right now, but definitely improves the usability of the pg_waldump as
> it avoids external filterings. Also, it can give the stats/info at
> table and block level. So, +1 from my side.

Attached is V2 with additional feedback from this email, as well as the
specification of the ForkNumber and FPW as specifiable options.

Best,

David

From 1b04f04317d364006371bdab0db9086f79138b25 Mon Sep 17 00:00:00 2001
From: David Christensen
Date: Fri, 25 Feb 2022 12:52:56 -0600
Subject: [PATCH] Add additional filtering options to pg_waldump

This feature allows you to only output records that are targeting a
specific RelFileNode and optional BlockNumber within this relation,
while specifying which ForkNum you want to filter to.

We also add the independent ability to filter via Full Page Write.
---
 doc/src/sgml/ref/pg_waldump.sgml |  48 +++++++++
 src/bin/pg_waldump/pg_waldump.c  | 128 ++++++++++++++++++-
 2 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
index 5735a161ce..f157175764 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -100,6 +100,44 @@ PostgreSQL documentation
+
+      -k block
+      --block=block
+
+
+       Display only records touching the given block. (Requires also
+       providing the relation via --relation.)
+
+
+
+
+
+      --fork=fork
+
+
+       When using the --relation filter, output only records
+       from the given fork. The valid values here are 0
+       for the main fork, 1 for the Free Space
+       Map, 2 for the Visibility Map,
+       and 3 for the Init fork. If unspecified, defaults
+       to the main fork.
+
+
+
+
+
+      -l tbl/db/rel
+      --relation=tbl/db/rel
+
+
+       Display only records touching the given relation. The relation is
+       specified via tablespace OID, database OID, and relfilenode separated
+       by slashes, for instance, 1234/12345/12345. This
+       is the same format used for relations in the WAL dump output.
+
+
+
      -n limit
      --limit=limit
@@ -183,6 +221,16 @@ PostgreSQL documentation
+
+      -w
+      --fullpage
+
+
+       Filter records to only those which have full page writes.
+
+
+
+
      -x xid
      --xid=xid

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..a527cd4dc6 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -55,6 +55,12 @@ typedef struct XLogDumpConfig
 	bool		filter_by_rmgr_enabled;
 	TransactionId filter_by_xid;
 	bool		filter_by_xid_enabled;
+	RelFileNode filter_by_relation;
+	bool		filter_by_relation_enabled;
+	BlockNumber filter_by_relation_block;
+	bool		filter_by_relation_block_enabled;
+	ForkNumber	filter_by_relation_forknum;
+	bool		filter_by_fpw;
 } XLogDumpConfig;

 typedef struct Stats
@@ -394,6 +400,56 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	return count;
 }

+/*
+ * Boolean to return whether the given WAL record matches a specific
+ * relation and optional block
+ */
+static bool
+XLogRecordMatchesRelationBlock(XLogReaderState *record, RelFileNode matchRnode,
+							   BlockNumber matchBlock, ForkNumber matchFork)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		RelFileNode rnode;
+		ForkNumber	forknum;
+		BlockNumber blk;
+
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk);
+
+		if (forknum == matchFork &&
+			matchRnode.spcNode == rnode.spcNode &&
+			matchRnode.dbNode == rnode.dbNode &&
+			matchRnode.relNode == rnode.relNode &&
+			(matchBlock == InvalidBlockNumber || matchBlock == blk))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Boolean to return whether the given WAL record contains a full page write
+ */
+static bool
+XLogRecordHasFPW(XLogReaderState *record
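(The quoted patch is truncated mid-function here; the body of
XLogRecordHasFPW presumably just walks the block references looking for a
block image — a sketch of the likely shape, not the literal patch text:)

	static bool
	XLogRecordHasFPW(XLogReaderState *record)
	{
		int			block_id;

		for (block_id = 0; block_id <= record->max_block_id; block_id++)
		{
			if (!XLogRecHasBlockRef(record, block_id))
				continue;

			if (XLogRecHasBlockImage(record, block_id))
				return true;
		}

		return false;
	}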
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Sun, Mar 20, 2022 at 11:56 PM Thomas Munro wrote:
> On Mon, Mar 21, 2022 at 4:36 PM Thomas Munro wrote:
> > On Sat, Feb 26, 2022 at 7:58 AM David Christensen wrote:
> > > Attached is V2 with additional feedback from this email, as well as
> > > the specification of the ForkNumber and FPW as specifiable options.
> >
> > Trivial fixup needed after commit 3f1ce973.
>
> [04:30:50.630] pg_waldump.c:963:26: error: format ‘%u’ expects
> argument of type ‘unsigned int *’, but argument 3 has type ‘ForkNumber *’
> [-Werror=format=]
> [04:30:50.630]   963 | if (sscanf(optarg, "%u",
> &config.filter_by_relation_forknum) != 1 ||
> [04:30:50.630]       |                    ~^    ~~~~~~~~~~~~~~~~~~~~~~~~~~
> [04:30:50.630]       |                     |    |
> [04:30:50.630]       |                     |    ForkNumber *
> [04:30:50.630]       |                     unsigned int *
>
> And now that this gets to the CompilerWarnings CI task, it looks like
> GCC doesn't like an enum as a scanf %u destination (I didn't see that
> warning locally when I compiled the above fixup because clearly Clang
> is cool with it...). Probably needs a temporary unsigned int to
> sscanf into first.

Do you need me to fix this, or are you incorporating that into a V4 of this
patch? (Similar to your fixup prior in this thread?)
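The fix being described would look roughly like this (a sketch; the
temporary's name is illustrative):

	/*
	 * Sketch of the suggested fix: scan into a plain unsigned int, then
	 * range-check and assign to the enum.
	 */
	unsigned int forknum;

	if (sscanf(optarg, "%u", &forknum) != 1 || forknum > MAX_FORKNUM)
	{
		pg_log_error("could not parse fork \"%s\"", optarg);
		goto bad_argument;
	}
	config.filter_by_relation_forknum = (ForkNumber) forknum;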
Re: [PATCH] add relation and block-level filtering to pg_waldump
Updated to include the V3 fixes as well as the unsigned int/enum fix.

v4-0001-Add-additional-filtering-options-to-pg_waldump.patch
Description: Binary data
Re: [PATCH] add relation and block-level filtering to pg_waldump
On Mon, Mar 21, 2022 at 4:39 PM Thomas Munro wrote:

[snip]

> I guess you did this because init fork references aren't really
> expected in the WAL, but I think it's more consistent to allow up to
> MAX_FORKNUM, not least because your documentation mentions 3 as a
> valid value. So I adjusted this to allow MAX_FORKNUM. Make sense?

Makes sense, but I think I'd actually thought it was +1 of the max forks, so
you give me more credit than I deserve in this case... :-)

> Here are some more details I noticed, as a likely future user of this
> very handy feature, which I haven't changed, because they seem more
> debatable and you might disagree...
>
> 1. I think it'd be less surprising if the default value for --fork
> wasn't 0... why not show all forks?

Agreed; made it default to all, with the ability to filter down if desired.

> 2. I think it'd be less surprising if --fork without --relation
> either raised an error (like --block without --relation), or were
> allowed, with the meaning "show me this fork of all relations".

Agreed; reworked to support the use case of only showing target forks.

> 3. It seems funny to have no short switch for --fork when everything
> else has one... what about -F?

Good idea; I hadn't seen capitals in the getopt list so didn't consider
them, but I like this.

Enclosed is v6, incorporating these fixes and docs tweaks.

Best,

David

v6-0001-Add-additional-filtering-options-to-pg_waldump.patch
Description: Binary data
Re: [PATCH] pgbench: add multiconnect option
On Sat, Mar 19, 2022 at 11:43 AM Fabien COELHO wrote:
>
> Hi Sami,
>
> > Pgbench is a simple benchmark tool by design, and I wonder if adding
> > a multiconnect feature will cause pgbench to be used incorrectly.
>
> Maybe, but I do not see how it would be worse than what pgbench already
> allows.

I agree that pgbench is simple; perhaps really too simple when it comes to
being able to measure much more than basic query flows. What pgbench does
have in its favor is being distributed with the core distribution.

I think there is definitely space for a more complicated benchmarking tool
that exercises more scenarios and more realistic query patterns and
scenarios. Whether that is distributed with the core is another question.

David
Re: [PATCH] Proof of concept for GUC improvements
> On Mar 21, 2022, at 7:53 PM, Tom Lane wrote:
>
> Andres Freund writes:
>> My impression is that there's not a lot of enthusiasm for the concept? If
>> that's true we maybe ought to mark the CF entry as rejected?
>
> Yeah, I'm kind of leaning that way too. I don't see how we can
> incorporate the symbolic values into any existing display paths
> without breaking applications that expect the old output.
> That being the case, it seems like we'd have "two ways to do it"
> indefinitely, which would add enough confusion that I'm not
> sure there's a net gain. In particular, I foresee novice questions
> along the lines of "I set foo to disabled, why is it showing
> as zero?"

Yeah, my main motivation here was about trying to have fewer special values
in the config files, but I guess it would effectively be making everything
an enum while still relying on knowing just what the specific magic values
are, so not really a net gain in this department.

For the record, I thought this would have a fairly low chance of getting in;
I was mainly curious what level of effort it would take to get something
like this going and wanted some feedback on the actual idea.

> If we'd done it like this from the beginning, it'd have been
> great, but retrofitting it now is a lot less appealing.

Yeah, agreed on this. As far as I'm concerned we can reject.

Thanks,

David
Re: psql \df choose functions by their arguments
> * Removed the tab-complete bit, it was too fragile and unhelpful

I can't speak for the specific patch, but tab completion of proc args for
\df, \ef and friends has long been a desired feature of mine, particularly
when you are dealing with functions with huge numbers of arguments and the
same name, which I have (sadly) come across many times in the wild.

Removing this because it was brittle is fine, but it would be good to see if
we could figure out a way to have this kind of feature in psql IMHO.

Best,

David
Re: [PATCH] Add `truncate` option to subscription commands
Hi,

At this time I do not have time to make the necessary changes for this
commitfest, so I am voluntarily withdrawing this patch, but will revisit at
a future time.

Best,

David

On Wed, Oct 28, 2020 at 1:06 PM Rahila Syed wrote:
>
> Hi David,
>
> The feature seems useful to me. The code will need to be refactored due
> to changes in commit b05fe7b442.
>
> Please see the following comments.
>
> 1. Is there a specific reason behind having a new relstate for truncate?
> The current state flow is
> INIT->DATATSYNC->SYNCWAIT->CATCHUP->SYNCDONE->READY.
> Can we accommodate the truncate in either INIT or DATASYNC?
>
> 2.
> +		StartTransactionCommand();
> +		rel = table_open(MyLogicalRepWorker->relid, RowExclusiveLock);
> +
> +		rels = lappend(rels, rel);
> +		relids = lappend_oid(relids, MyLogicalRepWorker->relid);
> +
> +		ExecuteTruncateGuts(rels, relids, NIL, DROP_RESTRICT, false);
> +		CommitTransactionCommand();
>
> Truncate is being performed in a separate transaction from the data copy;
> I think that leaves a window open for concurrent transactions to modify
> the data after the truncate and before the copy.
>
> 3. Regarding the truncate of the referenced table, I think one approach
> can be to perform the following:
> i. lock the referencing and referenced tables against writes
> ii. drop the foreign key constraints
> iii. truncate
> iv. sync
> v. recreate the constraints
> vi. release lock.
> However, I am not sure of the implications of locking these tables on the
> main apply process.
>
> Thank you,
>
> On Mon, Oct 12, 2020 at 11:31 PM David Christensen wrote:
>>
>> > On Oct 11, 2020, at 10:00 PM, Amit Kapila wrote:
>> >
>> > On Mon, Oct 12, 2020 at 3:44 AM David Christensen wrote:
>> >>
>> >> On Oct 11, 2020, at 1:14 PM, Euler Taveira wrote:
>> >>
>> >> On Fri, 9 Oct 2020 at 15:54, David Christensen wrote:
>> >>>
>> >>> Enclosed find a patch to add a "truncate" option to subscription
>> >>> commands.
>> >>>
>> >>> When adding new tables to a subscription (either via `CREATE
>> >>> SUBSCRIPTION` or `REFRESH PUBLICATION`), tables on the target which
>> >>> are being newly subscribed will be truncated before the data copy
>> >>> step. This saves explicit coordination of a manual `TRUNCATE` on the
>> >>> target tables and allows the results of the initial data sync to be
>> >>> the same as on the publisher at the time of sync.
>> >>>
>> >>> To preserve compatibility with existing behavior, the default value
>> >>> for this parameter is `false`.
>> >>
>> >> Truncate will fail for tables whose foreign keys refer to it. If such
>> >> a feature cannot handle foreign keys, the usefulness will be
>> >> restricted.
>> >>
>> >> This is true for existing "truncate" with FKs, so doesn't seem to be
>> >> any different to me.
>> >
>> > What would happen if there are multiple tables and truncate on only
>> > one of them failed due to an FK check? Does it give an error in such a
>> > case; if so, will the other tables be truncated?
>>
>> Currently each SyncRep relation is sync'd separately in its own worker
>> process; we are doing the truncate at the initialization step of this,
>> so it's inherently in its own transaction. I think if we are going to do
>> any sort of validation on this, it would have to be at the point of the
>> CREATE SUBSCRIPTION/REFRESH PUBLICATION where we have the relation list
>> and can do sanity-checking there.
>>
>> Obviously if someone changes the schema at some point between when it
>> does this and when relation syncs start there is a race condition, but
>> the same issue would affect other data sync things, so I don't care to
>> solve that as part of this patch.
>>
>> >> Hypothetically if you checked all new tables and could verify if
>> >> there were FK cycles only already in the new tables being added then
>> >> "truncate cascade" would be fine. Arguably if they had existing
>> >> tables that were part of an FK that wasn't fully replicated they wer
Re: [PATCH] Proof of concept for GUC improvements
> Hi, > > According to the cfbot, the patch doesn't apply anymore and needs a > rebase: http://cfbot.cputube.org/patch_36_3290.log V4 rebased attached. special-guc-values-v4.patch Description: Binary data
CREATE ROLE IF NOT EXISTS
Greetings -hackers,

Enclosed is a patch that implements CREATE ROLE IF NOT EXISTS (along with
the same support for USER/GROUP). This is a fairly straightforward approach
in that we do no validation of anything other than existence, with the user
needing to ensure that permissions/grants are set up in the proper way.

Comments?

Best,

David

CREATE-ROLE-IF-NOT-EXISTS.patch
Description: Binary data
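For the curious, the core of such an approach in the backend would
presumably follow the usual IF NOT EXISTS pattern inside CreateRole() —
a sketch of that pattern, not the literal patch contents:

	/*
	 * Sketch: short-circuit in CreateRole() when the role already exists
	 * and IF NOT EXISTS was given.
	 */
	if (stmt->if_not_exists &&
		OidIsValid(get_role_oid(stmt->role, true)))	/* missing_ok */
	{
		ereport(NOTICE,
				(errcode(ERRCODE_DUPLICATE_OBJECT),
				 errmsg("role \"%s\" already exists, skipping",
						stmt->role)));
		return InvalidOid;
	}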
Re: CREATE ROLE IF NOT EXISTS
On Tue, Oct 19, 2021 at 4:29 PM Isaac Morland wrote:
> On Tue, 19 Oct 2021 at 16:12, David Christensen
> <david.christen...@crunchydata.com> wrote:
>
>> Greetings -hackers,
>>
>> Enclosed is a patch that implements CREATE ROLE IF NOT EXISTS (along
>> with the same support for USER/GROUP). This is a fairly straightforward
>> approach in that we do no validation of anything other than existence,
>> with the user needing to ensure that permissions/grants are set up in
>> the proper way.
>
> One little tricky aspect that occurs to me is the ALTER ROLE to set the
> role flag options: it really needs to mention *all* the available options
> if it is to leave the role in a specific state regardless of how it
> started out. For example, if the existing role has BYPASSRLS but you want
> the default NOBYPASSRLS you have to say so explicitly.
>
> Because of this, I think my preference, based just on thinking about
> setting the flag options, would be for CREATE OR REPLACE.
>
> However, I'm wondering about the role name options: IN ROLE, ROLE, ADMIN.
> With OR REPLACE should they replace the set of memberships or augment it?
> Either seems potentially problematic to me. By contrast it's absolutely
> clear what IF NOT EXISTS should do with these.
>
> So I'm not sure what I think overall.

Sure, the ambiguity here for merging options was exactly the reason I went
with the IF NOT EXISTS route. Whatever concerns with merging already exist
with ALTER ROLE, so nothing new is introduced by this functionality, at
least that was my original thought.

David
Re: [PATCH] Proof of concept for GUC improvements
> On Nov 3, 2021, at 5:35 AM, Daniel Gustafsson wrote:
>
>>> On 15 Oct 2021, at 23:54, Cary Huang wrote:
>>
>> I scanned through the GUC list and found that the following parameters
>> can potentially be categorized in the "special_disabled0" group, just
>> for your reference.
>
> This patch no longer applies, can you please submit a rebased version?
> Also, do you have any thoughts on Cary's suggestion in the above review?

Hi, enclosed is a v3, which includes the rebase as well as the additional
int GUCs mentioned in Cary's review. I can add support for Reals (required
for the JIT and several other ones), but wanted to get feedback/buy-in on
the overall idea first.

Best,

David

special-guc-values-v3.patch
Description: Binary data
Re: CREATE ROLE IF NOT EXISTS
> This fails the roleattributes test in "make check", with what seems to be
> a trivial change in the output. Can you please submit a rebased version
> fixing the test?

Updated version attached.

David

CREATE-ROLE-IF-NOT-EXISTS-v2.patch
Description: Binary data
Re: CREATE ROLE IF NOT EXISTS
On Mon, Nov 8, 2021 at 1:22 PM Mark Dilger wrote:
>
> On Nov 8, 2021, at 10:38 AM, Stephen Frost wrote:
>
> > I don't quite follow this. The entire point of Alice writing a script
> > that uses IF NOT EXISTS is to have that command not fail if, indeed,
> > that role already exists, but for the rest of the script to be run.
> > That there's some potential attacker with CREATEROLE running around
> > creating roles that they think someone *else* might create is really
> > stretching things to a very questionable level- especially with
> > CREATEROLE where Charlie could just CREATE a new role which is a member
> > of Bob anyway after the fact and then GRANT that role to themselves.
>
> I don't see why this is "stretching things to a very questionable level".
> It might help this discussion if you could provide pseudo-code or similar
> for adding roles which is well-written and secure, and which benefits
> from this syntax. I would expect the amount of locking and checking for
> pre-existing roles that such logic would require would make the IF NOT
> EXIST option useless. Perhaps I'm wrong?

The main motivator for me writing this was trying to increase idempotency
for things like scripting, where you want to be able to minimize the effort
required to get things into a particular state. I agree with Stephen that,
whether or not this is a best-practices approach, this is something that is
being done in the wild via DO blocks or similar, so providing a tool to
handle this better seems useful on its own.

This originally came from me looking into the failures to load certain
`pg_dump` or `pg_dumpall` output when generated with the `--clean` flag,
which necessarily cannot work, as it fails with the error `current user
cannot be dropped`. Not that I am promoting the use of `pg_dumpall
--clean`, as there are clearly better solutions here, but something which
generates unusable output does not seem that useful. Instead, you could
generate `CREATE ROLE IF NOT EXISTS username` statements and emit `ALTER
ROLE ...`, which is what it is already doing (modulo `IF NOT EXISTS`).

This seems to introduce no further security vectors compared to existing
practice in the field and increases utility in some cases, so seems
generally useful to me.

If CINE semantics are at issue, what about CREATE OR REPLACE semantics with
some sort of merge into the existing role? I don't care strongly about
which approach is taken; I just think the overall "make this role exist in
this form" behavior without an error is useful in my own work, and CINE was
easier to implement as a first pass.

Best,

David
Re: CREATE ROLE IF NOT EXISTS
On Tue, Nov 9, 2021 at 9:55 AM Mark Dilger wrote:
>
> On Nov 9, 2021, at 7:36 AM, David Christensen
> <david.christen...@crunchydata.com> wrote:
>
> > If CINE semantics are at issue, what about the CREATE OR REPLACE
> > semantics with some sort of merge into the existing role? I don't care
> > strongly about which approach is taken, just think the overall "make
> > this role exist in this form" without an error is useful in my own
> > work, and CINE was easier to implement as a first pass.
>
> CREATE OR REPLACE might be a better option, not with the "merge into the
> existing role" part, but rather as drop+create. If a malicious actor has
> already added other roles to the role, or created a table with a
> malicious trigger definition, the drop part will fail, which is good from
> a security viewpoint. Of course, the drop portion will also fail under
> other conditions which don't entail any security concerns, but maybe they
> could be addressed in a series of follow-on patches?
>
> I understand this idea is not as useful for creating idempotent scripts,
> but maybe it gets you part of the way there?

Well, CREATE OR REPLACE via just setting the role's attributes explicitly
based on what you passed it could work (not strictly DROP + CREATE, in that
you're keeping existing ownerships, etc., and can avoid cross-db
permissions/ownership checks). Seems like some sort of merge logic could be
in order, as you wouldn't really want to lose existing permissions granted
to a role, but you want to ensure that /at least/ the permissions granted
exist for this role.

David
Re: CREATE ROLE IF NOT EXISTS
On Tue, Nov 9, 2021 at 10:22 AM Stephen Frost wrote:
> Greetings,
>
> * David Christensen (david.christen...@crunchydata.com) wrote:
> > Well, the CREATE OR REPLACE via just setting the role's attributes
> > explicitly based on what you passed it could work (not strictly DROP +
> > CREATE, in that you're keeping existing ownerships, etc, and can avoid
> > cross-db permissions/ownership checks). Seems like some sort of merge
> > logic could be in order, as you wouldn't really want to lose existing
> > permissions granted to a role, but you want to ensure that /at least/
> > the permissions granted exist for this role.
>
> What happens with role attributes that aren't explicitly mentioned
> though? Do those get reset to 'default' or are they left as-is?

Since we have the ability to specify explicit negative options (NOCREATEDB
vs CREATEDB, etc.), I'd say leave them as-is if not specified, otherwise
ensure they match what you included in the command. This would also ensure
forward compatibility if new permissions/attributes were introduced, as we
don't want to explicitly require that all permissions be itemized in order
to use this.

> I suspect that most implementations will end up just explicitly setting
> all of the role attributes, of course, because they want the role to
> look like how it is defined in whatever manifest is declaring the
> role, but we should still think about how we want this to work if we're
> going in this direction.

Agreed.

> In terms of least-surprise, I do tend to think that the answer is "only
> care about what is explicitly put into the command"- that is, if it
> isn't in the CREATE ROLE statement then it gets left as-is. Not sure
> how others feel about that though.

This is also what would make the most sense to me.

David
Re: CREATE ROLE IF NOT EXISTS
Modulo other issues/discussions, here is a version of this patch that
implements CREATE OR REPLACE ROLE just by handing off to AlterRole if it's
determined that the role already exists; presumably any/all additional
considerations would need to be added in both places were there a separate
code path for this.

It might be worth refactoring the AlterRole handling into a helper if there
are any deviations in messages, etc., but this could be a decent approach to
handling the problem (which arguably would have similar
restrictions/requirements in ALTER ROLE itself) in a single location.

Best,

David

CREATE-OR-REPLACE-ROLE.patch
Description: Binary data
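To visualize that hand-off — a hedged sketch only; the field name
(or_replace), the RoleSpec construction, and the AlterRole() signature vary
by branch and may differ from the actual patch:

	/*
	 * Sketch: in CreateRole(), when OR REPLACE was specified and the role
	 * already exists, delegate to the ALTER ROLE code path instead of
	 * erroring.  Names here are illustrative.
	 */
	if (stmt->or_replace &&
		OidIsValid(get_role_oid(stmt->role, true)))	/* missing_ok */
	{
		AlterRoleStmt *astmt = makeNode(AlterRoleStmt);
		RoleSpec   *rolespec = makeNode(RoleSpec);

		rolespec->roletype = ROLESPEC_CSTRING;
		rolespec->rolename = pstrdup(stmt->role);
		rolespec->location = -1;

		astmt->role = rolespec;
		astmt->options = stmt->options;
		astmt->action = +1;		/* add, rather than drop, memberships */

		return AlterRole(astmt);	/* signature differs by branch */
	}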
Re: CREATE ROLE IF NOT EXISTS
On Mon, Nov 22, 2021 at 6:49 AM Daniel Gustafsson wrote:
> > On 10 Nov 2021, at 18:14, David Christensen
> > <david.christen...@crunchydata.com> wrote:
>
> > Modulo other issues/discussions, here is a version of this patch..
>
> This patch fails to compile since you renamed the if_not_exists member in
> CreateRoleStmt but still set it in the parser.

D'oh! Enclosed is a fixed/rebased version.

Best,

David

CREATE-OR-REPLACE-ROLE-v2.patch
Description: Binary data
Documentation patch for ALTER SUBSCRIPTION
Greetings,

Enclosed find a documentation patch that clarifies the behavior of ALTER
SUBSCRIPTION … REFRESH PUBLICATION with new tables; I ran into a situation
today where the docs were not clear that existing tables would not be
re-copied, so this remedies that situation. Should apply back to Pg 10.

Best,

David

0001-Be-explicit-about-the-behavior-of-REFRESH-PUBLICATIO.patch
Description: Binary data

--
David Christensen
Senior Software and Database Engineer
End Point Corporation
da...@endpoint.com
785-727-1171
Re: Adding comments to help understand psql hidden queries
On Thu, Feb 1, 2024 at 4:34 PM Greg Sabino Mullane wrote:
>
> The use of the --echo-hidden flag in psql is used to show people the way
> psql performs its magic for its backslash commands. None of them has more
> magic than "\d relation", but it suffers from needing a lot of separate
> queries to gather all of the information it needs. Unfortunately, those
> queries can get overwhelming and it's hard to figure out which one does
> what, especially for those not already very familiar with the system
> catalogs. Attached is a patch to add a small SQL comment to the top of
> each SELECT query inside describeOneTableDetails. All other functions use
> a single query, and thus need no additional context. But "\d mytable" has
> the potential to run over a dozen SQL queries! The new format looks like
> this:
>
> /******** QUERY *********/
> /* Get information about row-level policies */
> SELECT pol.polname, pol.polpermissive,
>   CASE WHEN pol.polroles = '{0}' THEN NULL ELSE
> pg_catalog.array_to_string(array(select rolname from pg_catalog.pg_roles
> where oid = any (pol.polroles) order by 1),',') END,
>   pg_catalog.pg_get_expr(pol.polqual, pol.polrelid),
>   pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid),
>   CASE pol.polcmd
>     WHEN 'r' THEN 'SELECT'
>     WHEN 'a' THEN 'INSERT'
>     WHEN 'w' THEN 'UPDATE'
>     WHEN 'd' THEN 'DELETE'
>     END AS cmd
> FROM pg_catalog.pg_policy pol
> WHERE pol.polrelid = '134384' ORDER BY 1;
> /*************************/
>
> Cheers,
> Greg

Thanks, this looks like some helpful information. In the same vein, I'm
including a patch which adds information about the command that generates
the given query as well (atop your commit). This will modify the query line
to include the command itself:

/******** QUERY (\dRs) *********/

Best,

David

0001-Add-output-of-the-command-that-got-us-here-to-the-QU.patch
Description: Binary data
Constant Splitting/Refactoring
As part of the reserved page space/page features[1] work, there is a need to
convert some of the existing constants into variables, since the size of
those will no longer be fixed at compile time. At FOSDEM, there was some
interest expressed about seeing what a cleanup patch like that would look
like. Enclosed is a series of cleanup patches for HEAD.

This splits each of the 4 constants that care about page size into
Cluster-specific and -Limit variants, the first intended to become a
variable in time, and the second being the maximum value such a variable may
take (largely used for static allocations). Since these patches define these
symbols to have the same values they previously had, there are no changes in
functionality.

These were largely mechanical changes, and as such we should perhaps
consider making the same changes to back-branches to make it so context
lines and the like would be the same, simplifying maintainers' efforts when
applying code in back branches that touch similar areas. The script I have
to make these changes is simple, and could be run against the back branches
with only the comments surrounding the Calc() pieces needing to be adjusted
once.

These were based on HEAD as of a few minutes ago, 902900b308f, and pass all
tests.

Thanks,

David

[1] https://commitfest.postgresql.org/47/3986/

v1-0002-Split-MaxHeapTupleSize-into-runtime-and-max-value.patch
Description: Binary data

v1-0001-Split-MaxHeapTuplesPerPage-into-runtime-and-max-v.patch
Description: Binary data

v1-0004-Split-MaxTIDsPerBTreePage-into-runtime-and-max-va.patch
Description: Binary data

v1-0003-Split-MaxIndexTuplesPerPage-into-runtime-and-max-.patch
Description: Binary data
Re: Adding comments to help understand psql hidden queries
Hi Jim,

Thanks for the feedback. Enclosed is a v2 of this series, rebased and with
that warning fixed; @Greg Sabino Mullane I just created a commit on your
behalf with the message to hackers. I'm also creating a commitfest entry
for this thread.

Best,

David

v2-0002-Add-output-of-the-command-that-got-us-here-to-the.patch
Description: Binary data

v2-0001-Include-SQL-comments-on-echo-hidden-output.patch
Description: Binary data
Re: Adding comments to help understand psql hidden queries
Created the CF entry in commitfest 48 but didn't see it was already in 47; closing the CFEntry in 48. (Doesn't appear to be a different status than "Withdrawn"...)
Re: Avoiding inadvertent debugging mode for pgbench
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:      tested, passed
Spec compliant:          not tested
Documentation:           tested, passed

Did a quick review of this one; CFbot is now happy, and local regression
tests all pass. I think the idea here is sane; it's particularly confusing
for some tools included in the main distribution to have different
semantics, and this seems low on the breakage risk, so worth the tradeoffs
IMO.

The new status of this patch is: Ready for Committer
Re: Adding comments to help understand psql hidden queries
On Fri, Mar 22, 2024 at 9:47 AM Greg Sabino Mullane wrote:
>
> On Thu, Mar 21, 2024 at 6:20 PM Peter Eisentraut wrote:
>>
>> lines are supposed to align vertically. With your patch, the first line
>> would have variable length depending on the command.
>
> Yes, that is a good point. Aligning those would be quite tricky; what if
> we just kept a standard width for the closing query? Probably the 24
> stars we currently have to match "QUERY", which it appears nobody has
> changed for translation purposes yet anyway. (If I am reading the code
> correctly, it would be up to the translators to maintain the vertical
> alignment).

I think it's easier to keep the widths balanced than constant (patch
version included here), but if we needed to squeeze the opening string to a
standard width that would be possible without too much trouble.

The internal comment strings seem to be a bit more of a pain if we wanted
all of the comments to be the same width, as we'd need a table or something
so we can compute the longest string width, etc.; that doesn't seem worth
the convolutions IMHO.

No changes to Greg's patch, just keeping 'em both so cfbot stays happy.

David

v3-0002-Add-output-of-the-command-that-got-us-here-to-the.patch
Description: Binary data

v3-0001-Include-SQL-comments-on-echo-hidden-output.patch
Description: Binary data
Re: Adding comments to help understand psql hidden queries
I got Greg's blessing on squashing the commits down, and am now including a
v4 with additional improvements on the output formatting front. Main
changes:

- all generated comments are the same width
- width has been bumped to 80
- removed _() functions for consumers of the new output functions

This patch adds two new helper functions, OutputComment() and
OutputCommentStars(), to output and wrap the comments to the appropriate
widths. Everything should continue to work just fine if we want to adjust
the width by just adjusting the MAX_COMMENT_WIDTH symbol.

I removed _() in the output of the query/stars since there'd be no sensible
existing translations for the constructed string, which included the query
string itself. If we need it for the "QUERY" string, this could be added
fairly easily, but the existing piece would have been nonsensical and never
used in practice.

Thanks,

David

v4-0001-Improve-SQL-comments-on-echo-hidden-output.patch
Description: Binary data
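For illustration, a banner helper along these lines — a sketch based only on
the description above; the actual names in the patch are OutputComment() and
OutputCommentStars(), but the body here is a guess:

	#include <stdio.h>
	#include <string.h>

	#define MAX_COMMENT_WIDTH 80

	/*
	 * Sketch: emit a star banner for the given label (such as the QUERY
	 * header shown upthread), padded out to MAX_COMMENT_WIDTH, splitting
	 * the stars evenly around the label.
	 */
	static void
	OutputCommentStars(FILE *out, const char *label)
	{
		/* account for the comment delimiters and spaces around the label */
		int			stars = MAX_COMMENT_WIDTH - (int) strlen(label) - 6;
		int			left = stars / 2;
		int			right = stars - left;

		fprintf(out, "/*");
		for (int i = 0; i < left; i++)
			fputc('*', out);
		fprintf(out, " %s ", label);
		for (int i = 0; i < right; i++)
			fputc('*', out);
		fprintf(out, "*/\n");
	}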
Re: Adding comments to help understand psql hidden queries
On Thu, Apr 4, 2024 at 9:32 AM Peter Eisentraut wrote:
>
> On 03.04.24 19:16, David Christensen wrote:
> > I removed _() in the output of the query/stars since there'd be no
> > sensible existing translations for the constructed string, which
> > included the query string itself. If we need it for the "QUERY"
> > string, this could be added fairly easily, but the existing piece
> > would have been nonsensical and never used in practice.
>
> "QUERY" is currently translated. Your patch loses that.

I see; enclosed is v5 which fixes this. The effective diff from the last
one is:

-    char       *label = "QUERY";
+    char       *label = _("QUERY");

and

-        label = psprintf("QUERY (\\%s)", curcmd);
+        label = psprintf(_("QUERY (\\%s)"), curcmd);

Best,

David

v5-0001-Improve-SQL-comments-on-echo-hidden-output.patch
Description: Binary data
Re: Constant Splitting/Refactoring
This should target PG 18, but that status is not available in the CF app yet, so just making a note here.
Re: Constant Splitting/Refactoring
Here is a version 2 of this patch, rebased atop 97d85be365. As before, this
is a cleanup/prerequisite patch series for the page features/reserved page
size patches[1]. (Said patch series is going to be updated shortly.)

This splits each of the 4 constants that care about page size into
Cluster-specific and -Limit variants, the first intended to become a
variable in time, and the second being the maximum value such a variable may
take (largely used for static allocations). Since these patches define these
symbols to have the same values they previously had, there are no changes in
functionality.

These were largely mechanical changes, and as such we should perhaps
consider making the same changes to back-branches to make it so context
lines and the like would be the same, simplifying maintainers' efforts when
applying code in back branches that touch similar areas. The script I have
to make these changes is simple, and could be run against the back branches
with only the comments surrounding the Calc() pieces needing to be adjusted
once.

Thanks,

David

[1] https://commitfest.postgresql.org/47/3986/

v2-0001-Split-MaxHeapTuplesPerPage-into-runtime-and-max-v.patch
Description: Binary data

v2-0002-Split-MaxHeapTupleSize-into-runtime-and-max-value.patch
Description: Binary data

v2-0004-Split-MaxTIDsPerBTreePage-into-runtime-and-max-va.patch
Description: Binary data

v2-0003-Split-MaxIndexTuplesPerPage-into-runtime-and-max-.patch
Description: Binary data
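To make the renaming pattern concrete, a sketch using MaxHeapTuplesPerPage
as the example (the Calc()/Limit/Cluster spellings follow the description
above; the exact macro bodies in the patches may differ):

	/* Before: a single compile-time constant derived from BLCKSZ. */
	#define MaxHeapTuplesPerPage \
		((int) ((BLCKSZ - SizeOfPageHeaderData) / \
				(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

	/*
	 * After: a Calc() macro parameterized by the usable page size, a
	 * *Limit value for static allocations (the maximum the variable can
	 * ever take), and a Cluster* symbol intended to become a runtime
	 * variable once reserved page space can shrink the usable size.
	 */
	#define CalcMaxHeapTuplesPerPage(pageSize) \
		((int) (((pageSize) - SizeOfPageHeaderData) / \
				(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

	#define MaxHeapTuplesPerPageLimit	CalcMaxHeapTuplesPerPage(BLCKSZ)
	#define ClusterMaxHeapTuplesPerPage	MaxHeapTuplesPerPageLimit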
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Hi Bharath,

I can get one sent in tomorrow.

Thanks,

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier wrote:
>
> On Wed, Dec 14, 2022 at 04:44:34PM -0600, David Christensen wrote:
> > I can get one sent in tomorrow.

This v10 should incorporate your feedback as well as Bharath's.

> -XLogRecordHasFPW(XLogReaderState *record)
> +XLogRecordHasFPI(XLogReaderState *record)
> This still refers to a FPW, so let's leave that out as well as any
> renamings of this kind..

Reverted those changes.

> +	if (config.save_fpi_path != NULL)
> +	{
> +		/* Create the dir if it doesn't exist */
> +		if (pg_mkdir_p(config.save_fpi_path, pg_dir_create_mode) < 0)
> +		{
> +			pg_log_error("could not create output directory \"%s\": %m",
> +						 config.save_fpi_path);
> +			goto bad_argument;
> +		}
> +	}
> It seems to me that you could allow things to work even if the
> directory exists and is empty. See for example
> verify_dir_is_empty_or_create() in pg_basebackup.c.

The `pg_mkdir_p()` supports an existing directory (and I don't think we
want to require it to be empty first), so this only errors when it can't
create a directory for some reason.

> +my $file_re =
> +  qr/^([0-9A-F]{8})-([0-9A-F]{8})[.][0-9]+[.][0-9]+[.][0-9]+[.][0-9]+(?:_vm|_init|_fsm|_main)?$/;
> This is artistic to parse for people not used to regexps (I do, a
> little). Perhaps this could use a small comment with an example of
> name or a reference describing this format?

Added a description of what this is looking for.

> +# Set umask so test directories and files are created with default permissions
> +umask(0077);
> Incorrect copy-paste coming from elsewhere like the TAP tests of
> pg_basebackup with group permissions? Doesn't
> PostgreSQL::Test::Utils::tempdir give already enough protection in
> terms of umask() and permissions?

I'd expect that's where that came from. Removed.

> +	if (config.save_fpi_path != NULL)
> +	{
> +		/* We fsync our output directory only; since these files are not part
> +		 * of the production database we do not require the performance hit
> +		 * that fsyncing every FPI would entail, so are doing this as a
> +		 * compromise. */
> +
> +		fsync_fname(config.save_fpi_path, true);
> +	}
> Why is it necessary to flush anything at all, then?

I personally don't think it is, but added it per Bharath's request. Removed
in this revision.

> +my $relation = $node->safe_psql('postgres',
> +	q{SELECT format('%s/%s/%s', CASE WHEN reltablespace = 0 THEN
> dattablespace ELSE reltablespace END, pg_database.oid,
> pg_relation_filenode(pg_class.oid)) FROM pg_class, pg_database WHERE
> relname = 'test_table' AND datname = current_database()}
> Could you rewrite that to be on multiple lines?

Sure, reformatted.

> +diag "using walfile: $walfile";
> You should avoid the use of "diag", as this would cause extra output
> noise with a simple make check.

Had been using it for debugging and didn't realize it'd cause issues.
Removed both instances.

> +$node->safe_psql('postgres', "SELECT
> pg_drop_replication_slot('regress_pg_waldump_slot')")
> That's not really necessary, the nodes are wiped out at the end of the
> test.

Removed.

> +$node->safe_psql('postgres', <<EOF);
> +SELECT 'init' FROM
> pg_create_physical_replication_slot('regress_pg_waldump_slot', true, false);
> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> +CHECKPOINT; -- required to force FPI for next writes
> +UPDATE test_table SET a = a + 1;
> Using an EOF to execute a multi-line query would be a first. Couldn't
> you use the same thing as anywhere else? 009_twophase.pl just to
> mention one.
> (Mentioned by Bharath upthread, where he asked for an
> extra opinion so here it is.)

Fair enough; while idiomatic perl to me, it's not a hill to die on.
Converted to a standard multiline string.

Best,

David

v10-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-st.patch
Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Dec 21, 2022 at 5:47 AM Bharath Rupireddy wrote:
>
> On Fri, Dec 16, 2022 at 4:47 AM David Christensen wrote:
> >
> > On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier wrote:
> > >
> > > On Wed, Dec 14, 2022 at 04:44:34PM -0600, David Christensen wrote:
> > > > I can get one sent in tomorrow.
> >
> > This v10 should incorporate your feedback as well as Bharath's.
>
> Thanks for the patch. Here're some minor comments:
>
> 1. +my $node = PostgreSQL::Test::Cluster->new('primary');
> Can the name be other than 'primary' because we don't create a standby
> for this test? Something like - 'node_a' or 'node_extract_fpi' or some
> other.

Sure, no issues.

> 2. +$node->init(extra => ['-k'], allows_streaming => 1);
> When enabled with allows_streaming, there are a bunch of things that
> happen to the node while initializing, I don't think we need all of
> them for this.

I think the "allows_streaming" was required to ensure the WAL files were
preserved properly, and was the approach we ended up taking rather than
trying to fail the archive_command or other approaches I'd taken earlier.
I'd rather keep this if we can, unless you can propose a different approach
that would continue to work in the same way.

> 3. +$node->init(extra => ['-k'], allows_streaming => 1);
> Can we use --data-checksums instead of -k for more readability?
> Perhaps, a comment on why we need that option helps greatly.

Yeah, can spell that out; I don't recall exactly why we needed it offhand,
but will confirm or remove it if insignificant.

> 4.
> +						page = (Page) buf.data;
> +
> +						if (!XLogRecHasBlockRef(record, block_id))
> +							continue;
> +
> +						if (!XLogRecHasBlockImage(record, block_id))
> +							continue;
> +
> +						if (!RestoreBlockImage(record, block_id, page))
> +							continue;
> Can you shift page = (Page) buf.data; just before the last if
> condition RestoreBlockImage() so that it doesn't get executed for the
> other two continue statements?

Sure; since it was just setting a pointer value I didn't consider it to be
a hotspot for optimization.

Best,

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Dec 19, 2022 at 12:23 AM Michael Paquier wrote:
>
> On Thu, Dec 15, 2022 at 05:17:46PM -0600, David Christensen wrote:
> > On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier wrote:
> > This v10 should incorporate your feedback as well as Bharath's.
>
> Thanks for the new version. I have minor comments.
>
> >> It seems to me that you could allow things to work even if the
> >> directory exists and is empty. See for example
> >> verify_dir_is_empty_or_create() in pg_basebackup.c.
> >
> > The `pg_mkdir_p()` supports an existing directory (and I don't think
> > we want to require it to be empty first), so this only errors when it
> > can't create a directory for some reason.
>
> Sure, but things can also be made so as we don't fail if the directory
> exists and is empty? This would be more consistent with the base
> directories created by pg_basebackup and initdb.

I guess I'm feeling a little dense here; how is this failing if there is an
existing empty directory?

> >> +$node->safe_psql('postgres', <<EOF);
> >> +SELECT 'init' FROM
> >> pg_create_physical_replication_slot('regress_pg_waldump_slot', true,
> >> false);
> >> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> >> +CHECKPOINT; -- required to force FPI for next writes
> >> +UPDATE test_table SET a = a + 1;
> >> Using an EOF to execute a multi-line query would be a first. Couldn't
> >> you use the same thing as anywhere else? 009_twophase.pl just to
> >> mention one. (Mentioned by Bharath upthread, where he asked for an
> >> extra opinion so here it is.)
> >
> > Fair enough, while idiomatic perl to me, not a hill to die on;
> > converted to a standard multiline string.
>
> By the way, knowing that we have an option called --fullpage, could it
> be better to use --save-fullpage for the option name?

Works for me. I think I'd just wanted to avoid reformatting the entire
usage message, which is why I'd gone with the shorter version.

> +		OPF = fopen(filename, PG_BINARY_W);
> +		if (!OPF)
> +			pg_fatal("couldn't open file for output: %s", filename);
> [..]
> +		if (fwrite(page, BLCKSZ, 1, OPF) != 1)
> +			pg_fatal("couldn't write out complete full page image to file: %s", filename);
> These should be more generic, as of "could not open file \"%s\"" and
> "could not write file \"%s\"" as the file name provides all the
> information about what this writes.

Sure, will update.

Best,

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Fri, Dec 23, 2022 at 12:57 PM David Christensen wrote:
>
> On Wed, Dec 21, 2022 at 5:47 AM Bharath Rupireddy wrote:

[snip]

> > 2. +$node->init(extra => ['-k'], allows_streaming => 1);
> > When enabled with allows_streaming, there are a bunch of things that
> > happen to the node while initializing, I don't think we need all of
> > them for this.
>
> I think the "allows_streaming" was required to ensure the WAL files
> were preserved properly, and was the approach we ended up taking
> rather than trying to fail the archive_command or other approaches I'd
> taken earlier. I'd rather keep this if we can, unless you can propose
> a different approach that would continue to work in the same way.

Confirmed that we needed this in order to create the replication slot, so
this /is/ required for the test to work.

Enclosing v11 with yours and Michael's latest feedback.

Best,

David

v11-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-st.patch
Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
> On Dec 26, 2022, at 1:29 AM, Michael Paquier wrote:
>
> On Sat, Dec 24, 2022 at 06:23:29PM +0530, Bharath Rupireddy wrote:
>> Thanks for the patch. I've made the above change as well as renamed
>> the test file name to be save_fpi.pl, everything else remains the same
>> as v11. Here's the v12 patch which LGTM. I'll mark it as RfC -
>> https://commitfest.postgresql.org/41/3628/.
>
> I have done a review of that, and here are my notes:
> - The variable names were a bit inconsistent, so I have switched most
> of the new code to use "fullpage".
> - The code was not able to handle the case of a target directory
> existing but empty, so I have added a wrapper on pg_check_dir().
> - XLogRecordHasFPW() could be checked directly in the function saving
> the blocks. Still, there is no need for it as we apply the same
> checks again in the inner loop of the routine.
> - The new test has been renamed.
> - RestoreBlockImage() would report a failure and the code would just
> skip it and continue its work. This could point to a compression
> failure for example, so like any code paths calling this routine I
> think that we'd better do a pg_fatal() and fail hard.
> - I did not understand why there is a reason to make this option
> conditional on the record prints or even the stats, so I have moved
> the FPW save routine into a separate code path. The other two could
> be silenced (or not) using --quiet for example, for the same result as
> v12 without impacting the usability of this feature.
> - Few tweaks to the docs, the --help output, the comments and the
> tests.
> - Indentation applied.
>
> Being able to filter the blocks saved using start/end LSNs or just
> --relation is really cool, especially as the file names use the same
> order as what's needed for this option.

Sounds good, definitely along the lines of what I'd originally envisioned.

Thanks,

David
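Pulling the quoted fragments together, the core of the save-fullpage path
presumably looks roughly like this — a sketch assembled from the snippets
reviewed upthread, not the committed code; `config.save_fullpage_path` and
the filename format (which mirrors the test's regexp) are assumptions:

	PGAlignedBlock buf;
	Page		page = buf.data;

	/* Sketch: write each full-page image in the record out as one file */
	for (int block_id = 0; block_id <= record->max_block_id; block_id++)
	{
		RelFileLocator rnode;
		ForkNumber	fork;
		BlockNumber blk;
		char		filename[MAXPGPATH];
		FILE	   *OPF;

		if (!XLogRecHasBlockRef(record, block_id) ||
			!XLogRecHasBlockImage(record, block_id))
			continue;

		if (!RestoreBlockImage(record, block_id, page))
			pg_fatal("%s", record->errormsg_buf);

		XLogRecGetBlockTag(record, block_id, &rnode, &fork, &blk);

		/* e.g. "0000000B-4C93A3F8.1663.16384.16385.0_main" */
		snprintf(filename, MAXPGPATH, "%s/%08X-%08X.%u.%u.%u.%u_%s",
				 config.save_fullpage_path,
				 LSN_FORMAT_ARGS(record->ReadRecPtr),
				 rnode.spcOid, rnode.dbOid, rnode.relNumber, blk,
				 forkNames[fork]);

		OPF = fopen(filename, PG_BINARY_W);
		if (!OPF)
			pg_fatal("could not open file \"%s\": %m", filename);
		if (fwrite(page, BLCKSZ, 1, OPF) != 1)
			pg_fatal("could not write file \"%s\"", filename);
		fclose(OPF);
	}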
Re: Improving btree performance through specializing by key shape, take 2
Hi Matthias,

I'm going to look at this patch series if you're still interested. What was
the status of your final performance testing for the 0008 patch alone vs
the specialization series? Last I saw on the thread you were going to see
if the specialization was required or not.

Best,

David
Re: Kerberos delegation support in libpq and postgres_fdw
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:      tested, passed
Spec compliant:          not tested
Documentation:           not tested

Did a code review pass here; here is some feedback.

+	/* If password was used to connect, make sure it was one provided */
+	if (PQconnectionUsedPassword(conn) && dblink_connstr_has_pw(connstr))
+		return;

Do we need to consider whether these passwords are the same? Is there a
different vector where a different password could be acquired from a
different source (PGPASSWORD, say) while both of these criteria are true?
Seems like it probably doesn't matter that much considering we only checked
the password alone in the previous version of this code.

---

Looks like the pg_gssinfo struct hides the `proxy_creds` def behind:

#if defined(ENABLE_GSS) || defined(ENABLE_SSPI)
typedef struct
{
	gss_buffer_desc outbuf;		/* GSSAPI output token buffer */
#ifdef ENABLE_GSS
...
	bool		proxy_creds;	/* GSSAPI Delegated/proxy credentials */
#endif
} pg_gssinfo;
#endif

Which means that for the later check in `be_gssapi_get_proxy()` we have:

/*
 * Return if GSSAPI delegated/proxy credentials were included on this
 * connection.
 */
bool
be_gssapi_get_proxy(Port *port)
{
	if (!port || !port->gss)
		return NULL;

	return port->gss->proxy_creds;
}

So in theory it'd be possible to have SSPI enabled but GSS disabled and
we'd fail to compile in that case. (It may be that this routine is never
*actually* called in that case, just noting compile-time considerations.)
I'm not seeing guards in the actual PQ* routines, but don't think I've done
an exhaustive search.

---

gss_accept_deleg

+
+        Forward (delegate) GSS credentials to the server. The default is
+        disable which means credentials will not be forwarded
+        to the server. Set this to enable to have
+        credentials forwarded when possible.

When is this not possible? Server policy? External factors?

---

     Only superusers may connect to foreign servers without password
-    authentication, so always specify the password option
-    for user mappings belonging to non-superusers.
+    authentication or using gssapi proxied credentials, so specify the
+    password option for user mappings belonging to
+    non-superusers who are not able to proxy GSSAPI credentials.

s/gssapi/GSSAPI/; this is kind of confusing, as it makes it sound like only
superusers may use GSSAPI proxied credentials, which I disbelieve to be
true. Additionally, it sounds like you're wanting to explicitly maintain a
denylist for users to not be allowed proxying; is that correct?

---

libpq/auth.c:

	if (proxy != NULL)
	{
		pg_store_proxy_credential(proxy);
		port->gss->proxy_creds = true;
	}

Per GSS docs, seems like we should be comparing to GSS_C_NO_CREDENTIAL and
validating that the gflags has the `deleg_flag` bit set before considering
whether there are valid credentials; in practice this might be the same
effect (haven't looked at what that symbol actually resolves to, but NULL
would be sensible). Are there other cases we might need to consider here,
like valid credentials that are expired? (GSS_S_CREDENTIALS_EXPIRED)

---

+	/*
+	 * Set KRB5CCNAME for this backend, so that later calls to gss_acquire_cred
+	 * will find the proxied credentials we stored.
+	 */

So I'm not seeing this in other use in the code; I assume this is just used
by the krb5 libs? Similar questions for the other places pg_gss_accept_deleg
is used.
---

+int
+PQconnectionUsedGSSAPI(const PGconn *conn)
+{
+	if (!conn)
+		return false;
+	if (conn->gssapi_used)
+		return true;
+	else
+		return false;
+}

Micro-gripe: this routine seems like it could be simpler, though the compiler probably has the same thing to say for either, so maybe code clarity is better as written:

int
PQconnectionUsedGSSAPI(const PGconn *conn)
{
	return conn && conn->gssapi_used;
}

---

Anything required for adding meson support? I notice src/test/kerberos has its Makefile updated, but no meson.build files are changed.

---

Two tests in src/test/kerberos/t/001_auth.pl at :535 and :545 have the same test description:

+ 'succeeds with GSS-encrypted access required and hostgssenc hba and credentials not forwarded',

Since the first test has only `gssencmode` defined (so an implicit `gssdeleg` value) and the second has `gssdeleg=disable`, I'd suggest that the test on :545 should have its description updated to add the word "explicitly": 'succeeds with GSS-encrypted access required and hostgssenc hba and credentials
Re: Kerberos delegation support in libpq and postgres_fdw
On Wed, Apr 5, 2023 at 3:30 PM Stephen Frost wrote:
> Greetings,
>
> * David Christensen (david...@pgguru.net) wrote:
> > Did a code review pass here; here is some feedback.
>
> Thanks!
>
> > + /* If password was used to connect, make sure it was one provided */
> > + if (PQconnectionUsedPassword(conn) && dblink_connstr_has_pw(connstr))
> > +     return;
> >
> > Do we need to consider whether these passwords are the same? Is there a different vector where a different password could be acquired from a different source (PGPASSWORD, say) while both of these criteria are true? Seems like it probably doesn't matter that much considering we only checked the password alone in the previous version of this code.
>
> Note that this patch isn't really changing how these checks are being done but more moving them around and allowing a GSSAPI-based approach with credential delegation to also be allowed.
>
> That said, as noted in the comments above dblink_connstr_check():
>
> * For non-superusers, insist that the connstr specify a password, except
> * if GSSAPI credentials have been proxied (and we check that they are used
> * for the connection in dblink_security_check later). This prevents a
> * password or GSSAPI credentials from being picked up from .pgpass, a
> * service file, the environment, etc. We don't want the postgres user's
> * passwords or Kerberos credentials to be accessible to non-superusers.
>
> The point of these checks is, indeed, to ensure that environmental values such as a .pgpass or environment variable don't end up getting picked up and used (or, if they do, we realize it post-connection and then throw away the connection).
>
> libpq does explicitly prefer to use the password passed in as part of the connection string and won't attempt to look up passwords in a .pgpass file or similar if a password has been included in the connection string.

The (manufactured) case I was thinking of was when we connected to a backend with one password but the dblink or postgres_fdw connection includes an explicit password to a different server. But now I'm thinking that this PQconnectionUsedPassword() is checking the outgoing connection for dblink itself, not the connection of the backend that connected to the main server, so I think this objection is moot, like you say.

> > Looks like the pg_gssinfo struct hides the `proxy_creds` def behind:
> >
> > #if defined(ENABLE_GSS) || defined(ENABLE_SSPI)
> > typedef struct
> > {
> > 	gss_buffer_desc outbuf;		/* GSSAPI output token buffer */
> > #ifdef ENABLE_GSS
> > ...
> > 	bool		proxy_creds;	/* GSSAPI Delegated/proxy credentials */
> > #endif
> > } pg_gssinfo;
> > #endif
>
> ... right, proxy_creds only exists (today anyway) if ENABLE_GSS is set.
>
> > Which means that for the later check in `be_gssapi_get_proxy()` we have: [analysis snipped]
>
> Fairly confident the analysis here is wrong; further, the cfbot seems to agree that there isn't a compile failure here:
>
> https://cirrus-ci.com/task/6589717672624128
>
> [20:19:15.985] gss: NO
>
> (we always build with SSPI on Windows, per src/include/port/win32_port.h).

Cool; since we have coverage for that case, it seems like my concern was unwarranted.

[snip]

> > Only superusers may connect to foreign servers without password
> > -authentication, so always specify the password option
> > -for user mappings belonging to non-superusers.
> > +authentication or using gssapi proxied credentials, so specify the
> > +password option for user mappings belonging to
> > +non-superusers who are not able to proxy GSSAPI credentials.
> >
> > s/gssapi/GSSAPI/; this is kind of confusing, as it makes it sound like only superusers may use GSSAPI proxied credentials, which I disbelieve to be true. Additionally, it sounds like you're wanting to explicitly maintain a denylist of users who are not allowed proxying; is that correct?
>
> Updated to GSSAPI and reworded in the updated patch (attached). Certainly open to suggestions on how to improve the documentation here.
>
> There is no 'denylist' for users when it comes to GSSAPI proxied credentials. If there's a use-case for that then it could be added in the future.

Okay, I think your revisions here seem more clear, thanks.

> > ---
> >
> > libpq/auth.c:
> >
> > if (proxy != NULL)
> >
Re: Kerberos delegation support in libpq and postgres_fdw
Reviewed v8; largely looking good, though I notice this hunk, which may arguably be a bug fix, but doesn't appear to be relevant to this specific patch, so could probably be debated independently (and if a bug, should probably be backpatched):

diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 4229d2048c..11d41979c6 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -288,6 +288,9 @@ InitPgFdwOptions(void)
 		{"sslcert", UserMappingRelationId, true},
 		{"sslkey", UserMappingRelationId, true},
 
+		/* gssencmode is also libpq option, same to above. */
+		{"gssencmode", UserMappingRelationId, true},
+
 		{NULL, InvalidOid, false}
 	};

That said, should "gssdeleg" be exposed as a user mapping option as well? (This shows up in postgres_fdw; not sure if there are other places that would be relevant, like in dblink somewhere, just a thought.)

Best,

David
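If we did want that, it would presumably just be one more entry in the same table (untested sketch, following the pattern of the hunk above):

+		/* gssdeleg is also a libpq option, same as above. */
+		{"gssdeleg", UserMappingRelationId, true},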
Re: Kerberos delegation support in libpq and postgres_fdw
Ok, based on the interdiff there, I'm happy with that last change. Marking as Ready For Committer. Best, David
Re: [PATCH] add relation and block-level filtering to pg_waldump
> On Mar 24, 2022, at 6:43 AM, Thomas Munro wrote: > > On Fri, Mar 25, 2022 at 12:26 AM Thomas Munro wrote: >>> On Fri, Mar 25, 2022 at 12:01 AM Peter Eisentraut >>> wrote: >>> Or even: Why are we exposing fork *numbers* in the user interface? >>> Even low-level tools such as pageinspect use fork *names* in their >>> interface. >> >> I wondered about that but thought it seemed OK for such a low level >> tool. It's a fair point though, especially if other low level tools >> are doing that. Here's a patch to change it. > > Oh, and there's already a name lookup function to use for this. +1 on the semantic names. David
Re: [PATCH] add relation and block-level filtering to pg_waldump
> On Mar 24, 2022, at 4:12 PM, Thomas Munro wrote: > > On Fri, Mar 25, 2022 at 1:43 AM David Christensen > wrote: >>>> On Mar 24, 2022, at 6:43 AM, Thomas Munro wrote: >>> On Fri, Mar 25, 2022 at 12:26 AM Thomas Munro >>> wrote: >>>>> On Fri, Mar 25, 2022 at 12:01 AM Peter Eisentraut >>>>> wrote: >>>>> Or even: Why are we exposing fork *numbers* in the user interface? >>>>> Even low-level tools such as pageinspect use fork *names* in their >>>>> interface. >>>> >>>> I wondered about that but thought it seemed OK for such a low level >>>> tool. It's a fair point though, especially if other low level tools >>>> are doing that. Here's a patch to change it. >>> >>> Oh, and there's already a name lookup function to use for this. >> >> +1 on the semantic names. > > Cool. > > I had another thought while changing that (and also re-alphabetising): > Why don't we switch to -B for --block and -R for --relation? I > gather you used -k and -l because -b and -r were already taken, but > since we already started using upper case for -F, it seems consistent > this way. Or were they chosen for consistency with something else? Works here; was just trying to get semi-memorable ones from the available lowercase ones, but I like your idea here, and it kind of puts them in the same mental space for remembering. > It's also slightly more helpful to a user if the help says > --relation=T/D/R instead of N/N/N (TS/DB/REL would be nicer but > doesn't fit in the space). 0001-Improve-command-line-switches-in-pg_waldump-option.patch Description: Binary data
[PATCH] Teach pg_waldump to extract FPIs from the WAL
Hi -hackers,

Enclosed is a patch to allow extraction/saving of FPI from the WAL stream via pg_waldump. Description from the commit:

Extracts full-page images from the WAL stream into a target directory, which must be empty or not exist. These images are subject to the same filtering rules as normal display in pg_waldump, which means that you can isolate the full page writes to a target relation, among other things.

Files are saved with a filename formatted to make things somewhat sortable; for instance:

-01C0.1663.1.6117.0
-01000150.1664.0.6115.0
-010001E0.1664.0.6114.0
-01000270.1663.1.6116.0
-01000300.1663.1.6113.0
-01000390.1663.1.6112.0
-01000420.1663.1.8903.0
-010004B0.1663.1.8902.0
-01000540.1663.1.6111.0
-010005D0.1663.1.6110.0

It's noteworthy that the raw images do not have the current LSN stored with them in the WAL stream (as would be true for on-heap versions of the blocks), nor would the checksum be valid in them (though WAL itself has checksums, so there is some protection there). This patch chooses to place the LSN and calculate the proper checksum (if non-zero in the source image) in the output block. (This could perhaps be a targeted flag if we decide we don't always want this.)

These images could be loaded/inspected via `pg_read_binary_file()` and used in the `pageinspect` suite of tools to perform detailed analysis on the pages in question, based on historical information, and may come in handy for forensics work.

Best,

David

v1-pg_waldump-save-fpi.patch
Description: Binary data
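For clarity, the fix-up applied to each extracted block amounts to roughly the following (a simplified sketch of the patch's approach; `blk` here is the block number taken from the record's block reference):

	/* make the page look as it would after replaying this record */
	PageSetLSN(page, record->ReadRecPtr);

	/*
	 * A non-zero pd_checksum means the source cluster had checksums
	 * enabled, so recompute it to match the LSN we just stamped.
	 */
	if (((PageHeader) page)->pd_checksum != 0)
		((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blk);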
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Sat, Apr 23, 2022 at 9:49 AM Matthias van de Meent wrote: > Regardless of my (lack of) opinion on the inclusion of this patch in > PG (I did not significantly review this patch); I noticed that you do > not yet identify the 'fork' of the FPI in the file name. > > A lack of fork identifier in the exported file names would make > debugging much more difficult due to the relatively difficult to > identify data contained in !main forks, so I think this oversight > should be fixed, be it through `_forkname` postfix like normal fork > segments, or be it through `.` numerical in- or postfix in > the filename. > > -Matthias Hi Matthias, great point. Enclosed is a revised version of the patch that adds the fork identifier to the end if it's a non-main fork. Best, David v2-pg_waldump-save-fpi.patch Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 1:11 AM Michael Paquier wrote: > > On Sat, Apr 23, 2022 at 01:43:36PM -0500, David Christensen wrote: > > Hi Matthias, great point. Enclosed is a revised version of the patch > > that adds the fork identifier to the end if it's a non-main fork. > > Like Alvaro, I have seen cases where this would have been really > handy. So +1 from me, as well, to have more tooling like what you are > proposing. Fine for me to use one file for each block with a name > like what you are suggesting for each one of them. > > + /* we accept an empty existing directory */ > + if (stat(config.save_fpw_path, &st) == 0 && S_ISDIR(st.st_mode)) > + { > I don't think that there is any need to rely on a new logic if there > is already some code in place able to do the same work. See > verify_dir_is_empty_or_create() in pg_basebackup.c, as one example, > that relies on pg_check_dir(). I think that you'd better rely at > least on what pgcheckdir.c offers. Yeah, though I am tending towards what another user had suggested and just accepting an existing directory rather than requiring it to be empty, so thinking I might just skip this one. (Will review the file you've pointed out to see if there is a relevant function though.) > + {"raw-fpi", required_argument, NULL, 'W'}, > I think that we'd better rename this option. "fpi", that is not used > much in the user-facing docs, is additionally not adapted when we have > an other option called -w/--fullpage. I can think of > --save-fullpage. Makes sense. > + PageSetLSN(page, record->ReadRecPtr); > + /* if checksum field is non-zero then we have checksums > enabled, > +* so recalculate the checksum with new LSN (yes, this is > a hack) > +*/ > Yeah, that looks like a hack, but putting in place a page on a cluster > that has checksums enabled would be more annoying with > zero_damaged_pages enabled if we don't do that, so that's fine by me > as-is. Perhaps you should mention that FPWs don't have their > pd_checksum updated when written. Can make a mention; you thinking just a comment in the code is sufficient, or want there to be user-visible docs as well? > + /* we will now extract the fullpage image from the XLogRecord and save > +* it to a calculated filename */ > The format of this comment is incorrect. > > +The LSN of the record with this block, formatted > +as %08x-%08X instead of the > +conventional %X/%X due to filesystem naming > +limits > The last part of the sentence about %X/%X could just be removed. That > could be confusing, at worse. Makes sense. > + PageSetLSN(page, record->ReadRecPtr); > Why is pd_lsn set? This was to make the page as extracted equivalent to the effect of applying the WAL record block on replay (the LSN and checksum both); since recovery calls this function to mark the LSN where the page came from this is why I did this as well. (I do see perhaps a case for --raw output that doesn't munge the page whatsoever, just made comparisons easier.) > git diff --check complains a bit. Can look into this. > This stuff should include some tests. With --end, the tests can > be cheap. Yeah, makes sense, will include some in the next version. David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 2:00 AM Drouvot, Bertrand wrote: > > Hi, > > On 4/25/22 8:11 AM, Michael Paquier wrote: > > On Sat, Apr 23, 2022 at 01:43:36PM -0500, David Christensen wrote: > >> Hi Matthias, great point. Enclosed is a revised version of the patch > >> that adds the fork identifier to the end if it's a non-main fork. > > Like Alvaro, I have seen cases where this would have been really > > handy. So +1 from me, as well, to have more tooling like what you are > > proposing. > > +1 on the idea. > > FWIW, there is an extension doing this [1] but having the feature > included in pg_waldump would be great. Cool, glad to see there is some interest; definitely some overlap in forensics inside and outside the database both, as there are different use cases for both. David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 6:03 AM Bharath Rupireddy wrote:
> Thanks for working on this. I'm just thinking if we can use these FPIs to repair the corrupted pages? I would like to understand more detailed usages of the FPIs other than inspecting with pageinspect.

My main use case was for being able to look at potential corruption, either in the WAL stream, on heap, or in tools associated with the WAL stream. I suppose you could use the page images to replace corrupted on-disk pages (and in fact I think I've heard of a tool or two that try to do that), though I don't know that I consider this the primary purpose (and having toast tables and the like, as well as clog, would make it potentially hard to just drop in a page version without issues). Might help in extreme situations though.

> Given that others have realistic use-cases (of course I would like to know more about those), +1 for the idea. However, I would suggest adding a function to extract raw FPI data to the pg_walinspect extension that got recently committed in PG 15, the output of which can directly be fed to pageinspect functions or

Yeah, makes sense to have some overlap here; will review what is there and see if there is some shared code base we can utilize. (ISTR some work towards getting these two tools using more of the same code, and this seems like another such instance.)

> Few comments:
> 1) I think it's good to mention the stored file name format.
> + printf(_(" -W, --raw-fpi=path save found full page images to given path\n"));

+1, though I've also thought there could be uses for multiple possible output formats here (most immediately, there may be cases where we want *each* FPI for a block vs. the *latest*, so file names with/without the LSN component seem the easiest way forward here). That would introduce some additional complexity though, so might need to see if others think that makes any sense.

> 2)
> + for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> + {
> + /* we will now extract the fullpage image from the XLogRecord and save
> + * it to a calculated filename */
> +
> + if (XLogRecHasBlockImage(record, block_id))
>
> I think we need XLogRecHasBlockRef to be true before checking XLogRecHasBlockImage; otherwise, we will see some build farms failing. Recently I've seen this failure for pg_walinspect:
>
> for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> {
> if (!XLogRecHasBlockRef(record, block_id))
> continue;
>
> if (XLogRecHasBlockImage(record, block_id))
> *fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
> }

Good point; my previous patch that got committed here (127aea2a65) probably also needed this treatment.

> 3) Please correct the commenting format:
> + /* we will now extract the fullpage image from the XLogRecord and save
> + * it to a calculated filename */

Ack.

> 4) Usually we start errors with lower-case letters: "could not ..."
> + pg_fatal("Couldn't open file for output: %s", filename);
> + pg_fatal("Couldn't write out complete FPI to file: %s", filename);
> And the variable name too:
> + FILE *OPF;

Ack.

> 5) Not sure how the FPIs of TOASTed tables get stored, but it would be good to check.

What would be different here? Are there issues you can think of, or just more from the pageinspect side of things?

> 6) Good to specify the known usages of FPIs in the documentation.

Ack. Prob good to get additional info/use cases from others, as mine is fairly short. :-)

> 7) Isn't it good to emit an error if RestoreBlockImage returns false?
> + if (RestoreBlockImage(record, block_id, page))
> + {

Ack.

> 8) I think I don't mind if a non-empty directory is specified - IMO better usability is this: if the directory is non-empty, just go add the FPI files; if an FPI file exists, just replace it; if the directory doesn't exist, create it and write the FPI files.
> + /* we accept an empty existing directory */
> + if (stat(config.save_fpw_path, &st) == 0 && S_ISDIR(st.st_mode))
> + {

Agreed; I was mainly trying to prevent accidental expansion inside `pg_wal` when an earlier version of the patch implied `.` as the current dir with an optional path, but I've since made the path non-optional and agree that this is unnecessarily restrictive.

> 9) Instead of following:
> + if (XLogRecordHasFPW(xlogreader_state))
> + XLogRecordSaveFPWs(xlogreader_state, config.save_fpw_path);
> I will just do this in XLogRecordSaveFPWs:
> for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> {
> if (!XLogRecHasBlockRef(record, block_id))
> continue;
>
> if (XLogRecHasBlockImage(record, block_id))
> {
>
> }
> }

Yeah, a little redundant.

> 10) Along with pg_pwrite(), can we also fsync the files (of course users can choose it optionally) so that the writes will be durable for the OS crashes?

Can add; you thinking a separate flag to disable this with default true?
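Something like the following is what I'd have in mind after each image is written out (sketch only; the flag name is invented):

	/* optionally make the extracted image durable across OS crashes */
	if (config.fsync_fpi)		/* hypothetical flag, default true */
		fsync_fname(filename, false);	/* reports errors itself */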
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 9:54 PM Michael Paquier wrote: > On Mon, Apr 25, 2022 at 10:11:10AM -0500, David Christensen wrote: > > On Mon, Apr 25, 2022 at 1:11 AM Michael Paquier wrote: > >> I don't think that there is any need to rely on a new logic if there > >> is already some code in place able to do the same work. See > >> verify_dir_is_empty_or_create() in pg_basebackup.c, as one example, > >> that relies on pg_check_dir(). I think that you'd better rely at > >> least on what pgcheckdir.c offers. > > > > Yeah, though I am tending towards what another user had suggested and > > just accepting an existing directory rather than requiring it to be > > empty, so thinking I might just skip this one. (Will review the file > > you've pointed out to see if there is a relevant function though.) > > OK. FWIW, pg_check_dir() is used in initdb and pg_basebackup because > these care about the behavior to use when specifying a target path. > You could reuse it, but use a different policy depending on its > returned result for the needs of what you see fit in this case. I have a new version of the patch (pending tests) that uses pg_check_dir's return value to handle things appropriately, so at least some code reuse now. It did end up simplifying a lot. > >> + PageSetLSN(page, record->ReadRecPtr); > >> + /* if checksum field is non-zero then we have > >> checksums enabled, > >> +* so recalculate the checksum with new LSN (yes, this > >> is a hack) > >> +*/ > >> Yeah, that looks like a hack, but putting in place a page on a cluster > >> that has checksums enabled would be more annoying with > >> zero_damaged_pages enabled if we don't do that, so that's fine by me > >> as-is. Perhaps you should mention that FPWs don't have their > >> pd_checksum updated when written. > > > > Can make a mention; you thinking just a comment in the code is > > sufficient, or want there to be user-visible docs as well? > > I was thinking about a comment, at least. New patch version has significantly more comments. > > This was to make the page as extracted equivalent to the effect of > > applying the WAL record block on replay (the LSN and checksum both); > > since recovery calls this function to mark the LSN where the page came > > from this is why I did this as well. (I do see perhaps a case for > > --raw output that doesn't munge the page whatsoever, just made > > comparisons easier.) > > Hm. Okay. The argument goes both ways, I guess, depending on what we > want to do with those raw pages. Still you should not need pd_lsn if > the point is to be able to stick the pages back in place to attempt to > get back as much data as possible when loading it back to the shared > buffers? Yeah, I can see that too; I think there's at least enough of an argument for a flag to apply the fixups or just extract only the raw page pre-modification. Not sure which should be the "default" behavior; either `--raw` or `--fixup-metadata` or something could work. (Naming suggestions welcomed.) David
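P.S. For reference, the directory handling now looks roughly like this (a sketch of the approach, not the final code):

	switch (pg_check_dir(config.save_fpw_path))
	{
		case 0:
			/* directory does not exist: create it */
			if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
				pg_fatal("could not create directory \"%s\": %m",
						 config.save_fpw_path);
			break;
		case -1:
			/* stat() failed */
			pg_fatal("could not access directory \"%s\": %m",
					 config.save_fpw_path);
			break;
		default:
			/* exists, empty or not: accept it as-is */
			break;
	}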
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Mon, Apr 25, 2022 at 9:42 PM Michael Paquier wrote:
> On Mon, Apr 25, 2022 at 10:24:52AM -0500, David Christensen wrote:
> > On Mon, Apr 25, 2022 at 6:03 AM Bharath Rupireddy wrote:
> >> Thanks for working on this. I'm just thinking if we can use these FPIs to repair the corrupted pages? I would like to understand more detailed usages of the FPIs other than inspecting with pageinspect.
> >
> > My main use case was for being able to look at potential corruption, either in the WAL stream, on heap, or in tools associated with the WAL stream. I suppose you could use the page images to replace corrupted on-disk pages (and in fact I think I've heard of a tool or two that try to do that), though I don't know that I consider this the primary purpose (and having toast tables and the like, as well as clog, would make it potentially hard to just drop in a page version without issues). Might help in extreme situations though.
>
> You could do a bunch of things with those images, even make things worse if you are not careful enough.

True. :-) This does seem like a tool geared towards "expert mode", so maybe we just assume if you need it you know what you're doing?

> >> 10) Along with pg_pwrite(), can we also fsync the files (of course users can choose it optionally) so that the writes will be durable for the OS crashes?
> >
> > Can add; you thinking a separate flag to disable this with default true?
>
> We expect data generated by tools like pg_dump, pg_receivewal (depending on the use of --synchronous) or pg_basebackup to be consistent when we exit from the call. FWIW, flushing this data does not seem like a strong requirement for something aimed at being used for page-level chirurgy or lookups, because the WAL segments should still be around even if the host holding the archives is unplugged.

I have added the fsync to the latest patch (forthcoming), but I have no strong preferences here as to the correct/expected behavior.

Best,

David
Re: Moving forward with TDE [PATCH v3]
On Tue, Oct 31, 2023 at 4:30 PM Bruce Momjian wrote: > On Tue, Oct 31, 2023 at 04:23:17PM -0500, David Christensen wrote: > > Greetings, > > > > I am including an updated version of this patch series; it has been > rebased > > onto 6ec62b7799 and reworked somewhat. > > > > The patches are as follows: > > > > 0001 - doc updates > > 0002 - Basic key management and cipher support > > 0003 - Backend-related changes to support heap encryption > > 0004 - modifications to bin tools and programs to manage key rotation > and add > > other knowledge > > 0005 - Encrypted/authenticated WAL > > > > These are very broad strokes at this point and should be split up a bit > more to > > make things more granular and easier to review, but I wanted to get this > update > > out. > > This lacks temp table file encryption, right? Temporary /files/ are handled in a different patch set and are not included here (not sure of the status of integrating at this point). I believe that this patch should handle temporary heap files just fine since they go through the Page API.
Re: Moving forward with TDE [PATCH v3]
Hi, thanks for the detailed feedback here. I do think it's worth addressing the question Stephen raised as far as what we use for the IV[1]; whether LSN or something else entirely, and if so what. The choice of LSN here is fairly fundamental to the existing implementation, so if we decide to do something different some of this might be moot. Best, David [1] https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg154613.html
Re: Moving forward with TDE [PATCH v3]
On Fri, Nov 3, 2023 at 9:53 PM Andres Freund wrote:
> On 2023-11-02 19:32:28 -0700, Andres Freund wrote:
> > > From 327e86d52be1df8de9c3a324cb06b85ba5db9604 Mon Sep 17 00:00:00 2001
> > > From: David Christensen
> > > Date: Fri, 29 Sep 2023 15:16:00 -0400
> > > Subject: [PATCH v3 5/5] Add encrypted/authenticated WAL
> > >
> > > When using an encrypted cluster, we need to ensure that the WAL is also encrypted. While we could go with a page-based approach, we use instead a per-record approach, using GCM for the encryption method and storing the AuthTag in the xl_crc field.
>
> What was the reason for this decision?

This was mainly to prevent IV reuse by using per-record encryption rather than per-page, since partial writes out of the WAL buffer would result in reuse there. This was somewhat of an experiment, since authenticated data per record was basically equivalent in function to the CRC. There was a switch here so normal clusters use the crc field with the existing CRC implementation; only encrypted clusters use this alternate approach.
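For illustration, the per-record encrypt path is conceptually something like the following (an untested sketch using OpenSSL's EVP API; the function and its parameters are invented for exposition and this is not the patch's code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/*
 * Sketch only: encrypt a WAL record payload in place with AES-128-GCM,
 * deriving the IV from the record's LSN (unique per record, which is the
 * point of the per-record scheme) and truncating the 16-byte auth tag to
 * the 4 bytes available in xl_crc.  Note a 4-byte tag is far weaker than
 * GCM's full tag; it is playing the same role the CRC did.
 */
static bool
encrypt_wal_record(EVP_CIPHER_CTX *ctx, const unsigned char *key,
				   uint64_t lsn, unsigned char *data, int len,
				   uint32_t *xl_crc_out)
{
	unsigned char iv[12] = {0};	/* GCM's default 96-bit IV */
	unsigned char tag[16];
	int			outlen;

	memcpy(iv, &lsn, sizeof(lsn));	/* could also mix in timeline ID */

	if (EVP_EncryptInit_ex(ctx, EVP_aes_128_gcm(), NULL, key, iv) != 1)
		return false;
	if (EVP_EncryptUpdate(ctx, data, &outlen, data, len) != 1)
		return false;
	if (EVP_EncryptFinal_ex(ctx, data + outlen, &outlen) != 1)
		return false;
	if (EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, sizeof(tag), tag) != 1)
		return false;

	memcpy(xl_crc_out, tag, sizeof(*xl_crc_out));	/* truncated tag */
	return true;
}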
Re: Moving forward with TDE [PATCH v3]
On Tue, Nov 7, 2023 at 6:47 PM Andres Freund wrote:
> Hi,
>
> On 2023-11-06 11:26:44 +0100, Matthias van de Meent wrote:
> > On Sat, 4 Nov 2023 at 03:38, Andres Freund wrote:
> > > On 2023-11-02 22:09:40 +0100, Matthias van de Meent wrote:
> > > > I'm quite surprised at the significant number of changes being made outside the core storage manager files. I thought that changing out mdsmgr with an encrypted smgr (that could wrap mdsmgr if so desired) would be the most obvious change to implement cluster-wide encryption with the least code touched, as relations don't need to know whether the files they're writing are encrypted, right? Is there a reason to not implement this at the smgr level that I overlooked in the documentation of these patches?
> > >
> > > You can't really implement encryption transparently inside an smgr without significant downsides. You need a way to store an initialization vector associated with the page (or you can store that elsewhere, but then you've doubled the worst-case amount of random reads/writes). The patch uses the LSN as the IV (which I doubt is a good idea). For authenticated encryption further additional storage space is required.
> >
> > I am unaware of any user of the smgr API that doesn't also use the buffer cache, and thus implicitly the Page layout with PageHeader [^1]
>
> Everything indeed uses a PageHeader - but there are a number of places that do *not* utilize pd_lower/upper/special. E.g. visibilitymap.c just assumes that those fields are zero - and changing that wouldn't be trivial / free, because we do a lot of bitmasking/shifting with constants derived from
>
> #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
>
> which obviously wouldn't be constant anymore if you could reserve space on the page.

While no longer constants, I was able to get this working with variable values in a way that did not have the overhead of the original patch for the vismap specifically, using Montgomery multiplication for division/mod. This was actually the heaviest of the changes from moving to runtime-calculated values, so we might be able to use this approach in this specific case even if only this change is required for this specific fork.

> > The API of smgr is also tailored to page-sized quanta of data with mostly relation-level information. I don't see why there would be a veil covering the layout of Page for smgr when all other information already points to the use of PageHeader and Page layouts. In my view, it would even make sense to allow the smgr to get exclusive access to some part of the page in the current Page layout.
> >
> > Yes, I agree that there will be an impact on usable page size if you want authenticated encryption, and that AMs will indeed need to account for storage space now being used by the smgr - inconvenient, but it serves a purpose. That would happen regardless of whether smgr or some higher system decides where to store the data for encryption - as long as it is on the page, the AM effectively can't use those bytes.
> > But I'd say that's best solved by making the Page documentation and PageInit API explicit about the potential use of that space by the chosen storage method (encrypted, plain, ...)
instead of requiring the > > various AMs to manually consider encryption when using Postgres' APIs > > for writing data to disk without hitting shared buffers; page space > > management is already a task of AMs, but handling the actual > > encryption is not. > > I don't particularly disagree with any detail here - but to me reserving > space > for nonces etc at PageInit() time pretty much is the opposite of handling > encryption inside smgr. > Originally, I was anticipating that we might want different space amounts reserved on different classes of pages (apart from encryption), so while we'd be storing the default page reserved size in pg_control we'd not be limited to this in the structure of the page calls. We could presumably just move the logic into PageInit() itself if every reserved allocation is the same and individual call sites wouldn't need to know about it. The call sites do have more context as to the requirements of the page or the "type" of page in play, which if we made it dependent on page type would need to get passed in somehow, which was where the reserved_page_size parameter came in to the current patch. > > > Should the AM really care whether the data on disk is encrypted or > > not? I don't think so. When the disk contains encrypted bytes, but > > smgrread() and smgrwrite() both produce and accept plaintext data, > > who's going to complain? Requiring AMs to be mindful about encryption > > on all common paths only adds pitfalls where encryption would be > > forgotten by the developer of AMs in one path or another. > > I agree with that - I think the way the
Re: Moving forward with TDE [PATCH v3]
On Tue, Nov 7, 2023 at 5:49 PM Andres Freund wrote:
> Hi,
>
> On 2023-11-06 09:56:37 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > I still am quite unconvinced that using the LSN as a nonce is a good design decision.
> >
> > This is a really important part of the overall path to moving this forward, so I wanted to jump to it and have a specific discussion around this. I agree that there are downsides to using the LSN, some of which we could possibly address (eg: include the timeline ID in the IV), but others that would be harder to deal with.
> >
> > The question then is- what's the alternative?
> >
> > One approach would be to change the page format to include space for an explicit nonce. I don't see the community accepting such a new page format as the only format we support though as that would mean no pg_upgrade support along with wasted space if TDE isn't being used.
>
> Right.

Hmm, if we /were/ to introduce some sort of page format change, couldn't that be a use case for modifying the pd_version field? Could v4 pages be read in and written out as v5 pages with different interpretations?

> > Ideally, we'd instead be able to support multiple page formats where users could decide when they create their cluster what features they want- and luckily we even have such an effort underway with patches posted for review [1].
>
> I think there are some details wrong with that patch - IMO the existing macros should just continue to work as-is and instead the places that want the more narrow definition should be moved to the new macros and it changes places that should continue to use compile time constants - but it doesn't seem like a fundamentally bad idea to me. I certainly like it much better than making the page size runtime configurable.

There had been some discussion about this WRT renaming macros and the like (constants, etc.); I think a new pass eliminating the variable-blocksize pieces and seeing if we can minimize churn here is worthwhile. I will take a look and see what the minimally-viable set of changes is here.

> (I'll try to reply with the above points to [1])
>
> > Certainly, with the base page-special-feature patch, we could have an option for users to choose that they want a better nonce than the LSN, or we could bundle that assumption in with, say, the authenticated-encryption feature (if you want authenticated encryption, then you want more from the encryption system than the basics, and therefore we presume you also want a better nonce than the LSN).
>
> I don't think we should support using the LSN as a nonce if we have an alternative. The cost and complexity overhead is just not worth it. Yes, it'll be harder for users to migrate to encryption, but adding complexity elsewhere in the system to get an inferior result isn't worth it.

From my read, XTS (which I'd see as inferior to authenticated encryption, but better than some other options) could use the LSN as an IV without leakage concerns, perhaps mixing in the BlockNumber as well. If we are going to allow multiple encryption types, I think we may need to consider that the needs for IVs may be different, so this may need to be something that is selectable per encryption type.
I am unclear how much of a requirement this is, but it seems like having a design that supports making this pluggable (even if only via a static internal lookup table of encryption type, block length, IV source, etc.) is the most future-proof approach, if we ever had to retire an encryption method or prevent creation of specific methods, say.

> > Another approach would be a separate fork, but that then has a number of downsides too- every write has to touch that too, and a single page of nonces would cover a pretty large number of pages also.
>
> Yea, the costs of doing so is nontrivial. If you were trying to implement encryption on the smgr level - which I doubt you should but am not certain about - my suggestion would be to interleave pages with metadata like nonces and AEAD with the data pages. I.e. one metadata page would be followed by (BLCKSZ - SizeOfPageHeaderData) / (sizeof(nonce) + sizeof(AEAD)) pages containing actual relation data. That way you still get decent locality during scans and writes.

Hmm, this is actually an interesting idea, I will think about this a bit.

> Relation forks were a mistake, we shouldn't use them in more places.
>
> I think it'd be much better if we also encrypted forks, rather than just the main fork...

I believe the existing code should just work by modifying the PageNeedsToBeEncrypted macro; I will test that and see if anything blows up.

David
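To make the pluggability idea concrete, the kind of static internal table I have in mind would be roughly the following (pure sketch; none of these names exist in any patch):

#include <stdint.h>

/* one entry per supported encryption method (sketch only) */
typedef struct EncryptionMethod
{
	const char *name;			/* e.g. "aes-128-xts", "aes-128-gcm" */
	int			key_len;		/* key length in bytes */
	int			iv_source;		/* how the IV/tweak is derived: LSN, ... */
	int			page_overhead;	/* per-page bytes reserved (0 for XTS) */
} EncryptionMethod;

/* e.g. a 16-byte XTS tweak built from LSN plus block identity */
typedef struct XtsTweak
{
	uint64_t	lsn;			/* page LSN */
	uint32_t	blkno;			/* block number within the relation */
	uint32_t	fork;			/* fork number, or zero padding */
} XtsTweak;						/* 16 bytes, the XTS tweak width */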
Re: [PATCHES] Post-special page storage TDE support
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost wrote:
> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > > From: David Christensen
> > > Date: Tue, 9 May 2023 16:56:15 -0500
> > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> > >
> > > This space is reserved for extended data on the Page structure which will be ultimately used for encrypted data, extended checksums, and potentially other things. This data appears at the end of the Page, after any `pd_special` area, and will be calculated at runtime based on specific ControlFile features.
> > >
> > > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as we will require logical replication to move data into a cluster with different settings here.
> >
> > The first part of the last paragraph makes it sound like pg_upgrade won't be supported across this commit, rather than just between different settings...

Yeah, that's vague, but you picked up on what I meant.

> > I think as a whole this is not an insane idea. A few comments:

Thanks for all the feedback!

> > - Why is it worth sacrificing space on every page to indicate which features were enabled? I think there'd need to be some convincing reasons for introducing such overhead.
>
> In conversations with folks (my memory specifically is a discussion with Peter G, added to CC, and my apologies to Peter if I'm misremembering) there was a pretty strong push that a page should be able to 'stand alone' and not depend on something else (eg: pg_control, or whatever) to provide the info needed to be able to interpret the page. For my part, I don't have a particularly strong feeling on that, but that's what led to this design.

Unsurprisingly, I agree that it's useful to keep these features on the page itself; from a forensic standpoint it seems much easier to interpret what is happening, and it would also allow you to have different features on a given page or type of page depending on need. The initial patch utilizes pg_control to store the cluster page features, but there's no reason it couldn't be dependent on fork/page type or stored in pg_tablespace to utilize different features.

Thanks,

David
Re: [PATCHES] Post-special page storage TDE support
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund wrote:
> Hi,
>
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will be ultimately used for encrypted data, extended checksums, and potentially other things. This data appears at the end of the Page, after any `pd_special` area, and will be calculated at runtime based on specific ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as we will require logical replication to move data into a cluster with different settings here.
>
> The first part of the last paragraph makes it sound like pg_upgrade won't be supported across this commit, rather than just between different settings...

Thanks for the review.

> I think as a whole this is not an insane idea. A few comments:
>
> - IMO the patch touches many places it shouldn't need to touch, because of essentially renaming a lot of existing macro names to *Limit, necessitating modifying a lot of users. I think instead the few places that care about the runtime limit should be modified.
>
> As-is the patch would cause a lot of fallout in extensions that just do things like defining an on-stack array of Datums or such - even though all they'd need is to change the define to the *Limit one.
>
> Even leaving extensions aside, it just makes reviewing (and I'm sure maintaining) the patch very tedious.

You make a good point, and I think you're right that we could teach the places that care about runtime vs. compile-time differences about the changes while leaving other callers alone. The *Limit ones were introduced since we need constant values here from the Calc...() macros, but we could try keeping the existing *Limit with the old name and switching things around. I suspect there will be the same amount of code churn, but less mechanical.

> - I'm a bit worried about how the extra special page will be managed - if there are multiple features that want to use it, who gets to put their data at what offset?
>
> After writing this I saw that 0002 tries to address this - but I don't like the design. It introduces runtime overhead that seems likely to be visible.

Agreed this could be optimized.

> - Checking for features using PageGetFeatureOffset() seems the wrong design to me - instead of a branch for some feature being disabled, perfectly predictable for the CPU, we need to do an external function call every time to figure out that yes, checksums are *still* disabled.

This is probably not a supported approach (it felt a little icky), but I'd played around with const pointers to structs of const elements, where the initial values of a global var were populated early on (so set once and never changed post-init), and the compiler didn't complain and things seemed to work OK; not sure if this approach might help balance the early mutability and constant lookup needs:

typedef struct PageFeatureOffsets
{
	const Size feature0offset;
	const Size feature1offset;
	...
} PageFeatureOffsets;

PageFeatureOffsets offsets = {0};
const PageFeatureOffsets *exposedOffsets = &offsets;

void
InitOffsets()
{
	*((Size *) &offsets.feature0offset) = ...;
	*((Size *) &offsets.feature1offset) = ...;
	...
}

> - Recomputing offsets every time in PageGetFeatureOffset() seems too expensive. The offsets can't change while running, as PageGetFeatureOffset() doesn't have enough information to distinguish between different kinds of relations

Yes, this was a simple approach for ease of implementation; there is certainly a way to precompute a lookup table from the page feature bitmask into the offsets themselves, or otherwise precompute, turn the function call into an inline/macro, etc.

> - so why do we need to recompute offsets on every single page? I'd instead add a distinct offset variable for each feature.

This would work iff there is a single page feature set across all pages in the cluster; I'm not sure we don't want more flexibility here.

> - Modifying every single PageInit() call doesn't make sense to me. That'll just create a lot of breakage for - as far as I can tell - no win.

This was a placeholder to allow different features depending on page type; to keep things simple for now I just used the same values
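For the lookup-table variant, the shape I have in mind is roughly the following (illustrative sketch; the feature count and sizes are invented), which reduces PageGetFeatureOffset() to a pair of array indexes:

#define MAX_PAGE_FEATURES	2	/* illustrative value */

/* reserved size of each feature's area, allocated from the page's end */
static const Size feature_size[MAX_PAGE_FEATURES] = {8, 16};

/* offset of each feature, for every possible feature bitmask */
static Size feature_offset[1 << MAX_PAGE_FEATURES][MAX_PAGE_FEATURES];

void
InitPageFeatureOffsets(void)
{
	for (int mask = 0; mask < (1 << MAX_PAGE_FEATURES); mask++)
	{
		Size		offset = BLCKSZ;

		for (int f = MAX_PAGE_FEATURES - 1; f >= 0; f--)
		{
			if (mask & (1 << f))
			{
				offset -= feature_size[f];
				feature_offset[mask][f] = offset;
			}
			else
				feature_offset[mask][f] = 0;	/* feature absent */
		}
	}
}

/* PageGetFeatureOffset(page, f) then becomes feature_offset[mask][f] */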
Re: [PATCHES] Post-special page storage TDE support
On Mon, Nov 13, 2023 at 2:27 PM Andres Freund wrote: > Hi, > > On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote: > > On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost wrote: > > > In conversations with folks (my memory specifically is a discussion > with > > > Peter G, added to CC, and my apologies to Peter if I'm misremembering) > > > there was a pretty strong push that a page should be able to 'stand > > > alone' and not depend on something else (eg: pg_control, or whatever) > to > > > provide info needed be able to interpret the page. For my part, I > don't > > > have a particularly strong feeling on that, but that's what lead to > this > > > design. > > > > The term that I have used in the past is "self-contained". Meaning > > capable of being decoded more or less as-is, without any metadata, by > > tools like pg_filedump. > > I'm not finding that very convincing - without cluster wide data, like > keys, a > tool like pg_filedump isn't going to be able to do much with encrypted > pages. Given the need to look at some global data, figuring out the offset > at > which data starts based on a value in pg_control isn't meaningfully worse > than > having the data on each page. > > Storing redundant data in each page header, when we've wanted space in the > page header for plenty other things, just doesn't seem a good use of said > space. > This scheme would open up space per page that would now be available for plenty of other things; the encoding in the header and the corresponding available space in the footer would seem to open up quite a few options now, no?
Re: [PATCHES] Post-special page storage TDE support
On Mon, Nov 13, 2023 at 2:52 PM Andres Freund wrote:
>
> > This scheme would open up space per page that would now be available for plenty of other things; the encoding in the header and the corresponding available space in the footer would seem to open up quite a few options now, no?
>
> Sure, if you're willing to rewrite the whole cluster to upgrade and willing to permanently sacrifice some data density. If the stored data is actually specific to the page - that is the place to put the data. If not, then the tradeoff is much more complicated IMO.
>
> Of course this isn't a new problem - storing the page size on each page was just silly, it's never going to change across the cluster and even more definitely not going to change within a single relation.

Crazy idea: since the stored pagesize is already a fixed cost that likely isn't going away, what if, rather than using the pd_checksum field, we instead reinterpret pd_pagesize_version? 4 would mean "no page features", but anything 5 or higher could be looked up as an external page feature set, with storage semantics outside of the realm of the page itself (other than what the page features code itself needs to know); i.e., move away from the on-page bitmap into a more abstract representation of features, which could be something along the lines of what you were suggesting, including extension support.

It seems like this could also support adding/removing features on page read/write as long as there was sufficient space in the reserved_page space: read the old feature set on page read, convert to the new feature set, which will write out the page with the additional/changed format. Obviously there would be bookkeeping to be done in terms of making sure all pages had been converted from one format to another, but at the page level this would be straightforward.

Just thinking aloud here...

David
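Roughly, for the reading side, something like this (a sketch; these names are invented and nothing like this exists yet):

#define PG_PAGE_BASE_LAYOUT_VERSION 4

/*
 * Layout version 4 means "no page features"; treat anything higher as an
 * identifier into an externally-stored page feature set.
 */
static inline uint16
PageGetFeatureSetId(Page page)
{
	uint16		version = PageGetPageLayoutVersion(page);

	return (version <= PG_PAGE_BASE_LAYOUT_VERSION) ? 0 :
		version - PG_PAGE_BASE_LAYOUT_VERSION;
}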
Re: [PATCHES] Post-special page storage TDE support
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund wrote:
> Hi,
>
> - IMO the patch touches many places it shouldn't need to touch, because of essentially renaming a lot of existing macro names to *Limit, necessitating modifying a lot of users. I think instead the few places that care about the runtime limit should be modified.
>
> As-is the patch would cause a lot of fallout in extensions that just do things like defining an on-stack array of Datums or such - even though all they'd need is to change the define to the *Limit one.
>
> Even leaving extensions aside, it just makes reviewing (and I'm sure maintaining) the patch very tedious.

Hi Andres et al,

So I've been looking at alternate approaches to this issue and considering how to reduce churn, and I think we still need the *Limit variants. Let's take a simple example: just looking at MaxHeapTuplesPerPage and breaking down instances in the code, loosely partitioning into whether it's used as an array index or other usage (this doesn't discriminate between code and comments, unfortunately), we get the following breakdown:

$ git grep -hoE [[]?MaxHeapTuplesPerPage | sort | uniq -c
  18 [MaxHeapTuplesPerPage
  51 MaxHeapTuplesPerPage

That is 18 places where we would need to adjust, in a fairly mechanical fashion, to use MaxHeapTuplesPerPageLimit instead of MaxHeapTuplesPerPage, vs. 51 other uses, some significant fraction of which are non-comment; even if you assumed half were in comments, there would presumably need to be some sort of adjustment in verbiage, since we are going to be changing some of the interpretation.

I am working on a patch to clean up some of the assumptions that smgr currently makes about its space usage and how the individual access methods consider it, as they should only be calculating things based on how much space is available after smgr is done with it. That has traditionally been BLCKSZ - SizeOfPageHeaderData, but this patch (included) factors that out into a single expression that we can now use in access methods, so we can then reserve additional page space and not need to adjust the access methods further. Building on top of this patch, we'd define something like this to handle the #defines that need to be dynamic:

extern Size reserved_page_space;

#define PageUsableSpace (BLCKSZ - SizeOfPageHeaderData - reserved_page_space)
#define MaxHeapTuplesPerPage CalcMaxHeapTuplesPerPage(PageUsableSpace)
#define MaxHeapTuplesPerPageLimit CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)
#define CalcMaxHeapTuplesPerPage(freesize) \
	((int) ((freesize) / \
			(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

In my view, extensions that are expecting to need no changes when it comes to changing how these are interpreted are better off needing to change only the static allocation in a mechanical sense than to revisit any other uses of code; this seems more likely to guarantee a correct result than if you exceed the page space and start overwriting things you weren't intending to because you're not aware that you need to check for dynamic limits on your own.

Take another thing which would need adjusting for reserving page space, MaxHeapTupleSize:

$ git grep -ohE '[[]?MaxHeapTupleSize' | sort | uniq -c
   3 [MaxHeapTupleSize
  16 MaxHeapTupleSize

Here there are 3 static arrays which would need to be adjusted vs. 16 other instances.
If we kept the MaxHeapTupleSize interpretation the same and didn't adjust an extension, it would compile just fine, but with too large a length compared to the smaller PageUsableSpace, so you could conceivably overwrite into the reserved space depending on what you were doing. (Since by definition reserved_page_space >= 0, PageUsableSpace will always be <= BLCKSZ - SizeOfPageHeaderData, so any expression using it as a basis will be smaller.)

In short, I think the approach I took originally actually will reduce errors out-of-core, and while there is still churn, it is necessary churn. I can produce a second patch which implements this calc/limit atop this first one as well.

Thanks,

David

0001-Create-PageUsableSpace-to-represent-space-post-smgr.patch
Description: Binary data
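To illustrate the mechanical nature of the extension-side fix under this scheme, a static allocation changes like so:

/* before: no longer a compile-time constant once the limit is dynamic */
Datum		values[MaxHeapTuplesPerPage];

/* after: sized to the compile-time upper bound, always sufficient */
Datum		values[MaxHeapTuplesPerPageLimit];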
Re: [PATCHES] Post-special page storage TDE support
Hi again!

Per some offline discussion with Stephen, I've continued to work on some modifications here; this particular patchset is intended to facilitate review by highlighting the mechanical nature of many of these changes. As such, I have taken the following approach to this rework:

0001 - Create PageUsableSpace to represent space post-smgr
0002 - Add support for fast, non-division-based div/mod algorithms
0003 - Use fastdiv code in visibility map
0004 - Make PageUsableSpace incorporate variable-sized limit
0005 - Add Calc, Limit and Dynamic forms of all variable constants
0006 - Split MaxHeapTuplesPerPage into Limit and Dynamic variants
0007 - Split MaxIndexTuplesPerPage into Limit and Dynamic variants
0008 - Split MaxHeapTupleSize into Limit and Dynamic variants
0009 - Split MaxTIDsPerBTreePage into Limit and Dynamic variants

0001 - 0003 have appeared in this thread or in other forms on the list already, though 0001 refactors things slightly more aggressively, and adds a StaticAssert() to ensure that this change is still sane.

0004 adds the ReservedPageSpace variable, and also redefines the previous BLCKSZ - SizeOfPageHeaderData as PageUsableSpaceMax; there are a few related fixups.

0005 adds the macros to compute the former constants while leaving their original definitions to evaluate to the same place (the infamous Calc* and *Limit, plus we invite *Dynamic to the party as well; the names are terrible and there must be something better).

0006 - 0009 all take the same approach: we undefine the old constant name and modify the existing uses of the symbol to be either the *Limit or *Dynamic form, depending on whether the changed available space would impact the calculations. Since we are touching every use of the symbol, this facilitates review of the impact, though I would contend that almost every piece I've spot-checked seems like it really does need to know about the runtime limit.

Perhaps there is more we could do here. I could also see a variable per constant rather than recalculating this every time, in which case the *Dynamic would just be the variable and we'd need a hook to initialize it or otherwise set it on first use.

There are a number of additional things remaining to be done to get this to fully work, but I did want to get some of this out there for review.
Still to do (almost all in some form in the original patch, so I just need to extract the relevant pieces):

- set reserved-page-size via initdb
- load reserved-page-size from pg_control
- apply to the running cluster
- some form of compatibility for these constants in common and ensuring bin/ works
- some toast-related changes (this requires a patch to support dynamic relopts, which I can extract, as the existing code is using a constant lookup table)
- probably some more pieces I'm forgetting

Thanks,

David

v2-0005-Add-Calc-Limit-and-Dynamic-forms-of-all-variable-.patch
Description: Binary data
v2-0001-Create-PageUsableSpace-to-represent-space-post-sm.patch
Description: Binary data
v2-0002-Add-support-for-fast-non-division-based-div-mod-a.patch
Description: Binary data
v2-0003-Use-fastdiv-code-in-visibility-map.patch
Description: Binary data
v2-0004-Make-PageUsableSpace-incorporate-variable-sized-l.patch
Description: Binary data
v2-0007-Split-MaxIndexTuplesPerPage-into-Limit-and-Dynami.patch
Description: Binary data
v2-0008-Split-MaxHeapTupleSize-into-Limit-and-Dynamic-var.patch
Description: Binary data
v2-0006-Split-MaxHeapTuplesPerPage-into-Limit-and-Dynamic.patch
Description: Binary data
v2-0009-Split-MaxTIDsPerBTreePage-into-Limit-and-Dynamic-.patch
Description: Binary data
Re: Initdb-time block size specification
Enclosed are TPC-H results for 1GB shared_buffers, 64MB work_mem on a 64GB laptop with SSD storage; everything else is default settings. TL;DR: unpatched version: 17.30 seconds, patched version: 17.15 seconds; there are some slight variations in runtime, but this seems to be within the noise level at this point.

tpch.unpatched.rst
Description: Binary data
tpch.patched.rst
Description: Binary data
Re: Initdb-time block size specification
> + * pg_fastmod - calculates the modulus of a 32-bit number against a constant
> + * divisor without using the division operator
> + */
> +static inline uint32 pg_fastmod(uint32 n, uint32 divisor, uint64 fastinv)
> +{
> +#ifdef HAVE_INT128
> +	uint64_t lowbits = fastinv * n;
> +	return ((uint128) lowbits * divisor) >> 64;
> +#else
> +	return n % divisor;
> +#endif
> +}
>
> Requiring 128-bit arithmetic to avoid serious regression is a non-starter as written. Software that relies on fast 128-bit multiplication has to do backflips to get that working on multiple platforms. But I'm not sure it's necessary -- if the max block number is UINT32_MAX and max block size is UINT16_MAX, can't we just use 64-bit multiplication?

I was definitely hand-waving additional implementation here for non-native 128-bit support; the modulus algorithm as presented requires 4 times the space of the divisor, so a uint16 implementation should work for all 64-bit machines. Certainly open to other ideas or implementations; this was the one I was able to find initially. If the 16-bit approach is all that is needed in practice, we can also see about narrowing the domain and not worry about making this a general-purpose function.

Best,

David
Re: Initdb-time block size specification
> I was definitely hand-waving additional implementation here for
> non-native 128-bit support; the modulus algorithm as presented
> requires an intermediate type four times the width of the divisor,
> so a uint16 implementation should work for all 64-bit machines.
> Certainly open to other ideas or implementations; this was the one
> I was able to find initially. If the 16-bit approach is all that is
> needed in practice, we can also see about narrowing the domain and
> not worry about making this a general-purpose function.

Here's a patch atop the series which converts to 16-bit uints and passes regressions, but which I don't consider well-vetted at this point.

David

0001-Switch-to-using-only-16-bit-to-avoid-128-bit-require.patch Description: Binary data
Re: Initdb-time block size specification
On Tue, Sep 5, 2023 at 9:04 AM Robert Haas wrote:
>
> On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra wrote:
> > Except that no one is forcing you to actually go 130mph or 32mph, right?
> > You make it seem like this patch forces people to use some other page
> > size, but that's clearly not what it's doing - it gives you the option
> > to use smaller or larger blocks, if you choose to. Just like increasing
> > the speed limit to 130mph doesn't mean you can't keep going 65mph.
> >
> > The thing is - we *already* allow using a different block size, except
> > that you have to do a custom build. This just makes it easier.
>
> Right. Which is worth doing if it doesn't hurt performance and is
> likely to be useful to a lot of people, and is not worth doing if it
> will hurt performance and be useful to relatively few people. My bet
> is on the latter.

Agreed that this doesn't make sense if there are major performance regressions; however, the goal here is patch evaluation: measuring this against other workloads and seeing if that is the case. From my localized testing, things were within acceptable noise levels with the latest version.

I agree with Tomas' earlier thoughts: we already allow different block sizes, and if there are baked-in algorithmic assumptions about block size (which there probably are), then identifying those, and the places in the code where we need additional work or test coverage, will only improve things overall for those non-standard block sizes.

Best,

David
Re: Initdb-time block size specification
On Tue, Sep 5, 2023 at 2:52 PM Hannu Krosing wrote:
>
> Something I also asked at this year's Unconference - do we currently
> have Build Farm animals testing with different page sizes?
>
> I'd say that testing all sizes from 4KB up (so 4, 8, 16, 32) should be
> done at least before each release if not continuously.
>
> -- Cheers
>
> Hannu

The regression tests currently have a lot of breakage when running against non-standard block sizes, so I would assume the answer here is no. I would expect that we will want to add regression test variants or otherwise normalize results so they will work with differing block sizes, but we have not done that yet.
Re: DELETE CASCADE
On Sun, Jan 30, 2022 at 9:47 PM Julien Rouhaud wrote:
> Hi,
>
> On Tue, Jan 25, 2022 at 10:26:53PM +0800, Julien Rouhaud wrote:
> >
> > It's been almost 4 months since your last email, and almost 2 weeks since
> > the notice that this patch doesn't apply anymore. Without an update in the
> > next couple of days this patch will be closed as Returned with Feedback
> > per the commitfest rules.
>
> Closed.

Sounds good; when I get time to look at this again I will resubmit (if people think the base functionality is worth it, which is still a topic of discussion).

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Fri, Nov 4, 2022 at 1:38 PM Justin Pryzby wrote:
> > > As I recall, that's due to relying on "cp". And "rsync", which
> > > shouldn't be assumed to exist by regression tests.

Will poke around other TAP tests to see if there's a more consistent interface, what perl version we can assume, which modules are available, etc. If there's not some trivial wrapper at this point that all TAP tests could use regardless of OS, it would definitely be good to introduce such a method.

> > I will work on supporting Windows compatibility here. Is there some
> > list of guidelines for what you can and can't use? I don't have a Windows
> > machine available to develop on.
>
> I think a lot (most?) of the developers here don't have a windows environment
> available, so have been using cirrusci's tests to verify. If you
> haven't used cirrusci directly (not via cfbot) before, start at:
> src/tools/ci/README

Thanks, good starting point.

> > Was it failing on windows? I was attempting to skip it as I recall.
>
> I don't see anything about skipping, and cirrus's logs from 2
> commitfests ago were pruned. I looked at this patch earlier this year,
> but never got around to replacing the calls to rsync and cp.

Ah, it's skipped (not fixed) in my git repo; I just never got around to submitting that version through email. That explains it.

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Hi Justin et al, Enclosed is v5 of this patch which now passes the CirrusCI checks for all supported OSes. I went ahead and reworked the test a bit so it's a little more amenable to the OS-agnostic approach for testing. Best, David v5-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
Re: [PATCHES] Post-special page storage TDE support
Looking into some CF bot failures which didn't show up locally. Will send a v3 when resolved.
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Tue, Nov 8, 2022 at 4:45 PM Justin Pryzby wrote:
>
> On Mon, Nov 07, 2022 at 05:01:01PM -0600, David Christensen wrote:
> > Hi Justin et al,
> >
> > Enclosed is v5 of this patch which now passes the CirrusCI checks for
> > all supported OSes. I went ahead and reworked the test a bit so it's a
> > little more amenable to the OS-agnostic approach for testing.
>
> Great, thanks.
>
> This includes the changes that I'd started a few months ago.
> Plus adding the test which was missing for meson.

Cool, will review, thanks.

> +format:
> LSN.TSOID.DBOID.RELNODE.BLKNO
>
> I'd prefer if the abbreviations were "reltablespace" and "datoid"

Sure, no issues there.

> Also, should the test case call pg_relation_filenode() rather than using
> relfilenode directly? Is it a problem that the test code assumes
> pagesize=8192?

Both good points. Is pagesize just exposed via `current_setting('block_size')` or is there a different approach?

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Enclosed is v6, which squashes your refactor and adds the additional recent suggestions. Thanks! v6-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 9, 2022 at 6:30 AM Bharath Rupireddy wrote:
>
> On Wed, Nov 9, 2022 at 5:08 AM David Christensen wrote:
> >
> > Enclosed is v6, which squashes your refactor and adds the additional
> > recent suggestions.
>
> Thanks for working on this feature. Here are some comments for now. I
> haven't looked at the tests; I will take another look at the code and
> tests after these and all other comments are addressed.
>
> 1. For ease of review, please split the test patch to 0002.

Per later discussion, it seems like new feature tests are fine in the same patch, yes?

> 2. I'm unable to understand the use-case for the --fixup-fpi option.
> pg_waldump is supposed to be just a WAL reader, and must not return any
> modified information; with the --fixup-fpi option, the patch violates this
> principle, i.e., it sets the page LSN and returns. Without actually
> replaying the WAL record on the page, how is it correct to just set
> the LSN? How will it be useful? ISTM, we must ignore this option
> unless there's a strong use-case.

How I was envisioning this was for cases like extreme surgery for corrupted pages, where you extract the page from WAL but it has lsn and checksum set so you could do something like `dd if=fixup-block of=relation ...`, so it *simulates* the replay of said full-page blocks in cases where for some reason you can't replay the intermediate records; since this is always a full-page block, it's capturing what would be the snapshot, so you could manually insert it somewhere as needed without needing to replay (say if dealing with an incomplete or corrupted WAL stream).

> 3.
> +if (fork >= 0 && fork <= MAX_FORKNUM)
> +{
> +	if (fork)
> +		sprintf(forkname, "_%s", forkNames[fork]);
> +	else
> +		forkname[0] = 0;
> +}
> +else
> +	pg_fatal("found invalid fork number: %u", fork);
>
> Isn't the above complex? What's the problem with something like below?
> Why do we need the if (fork) - else block?
>
> if (fork >= 0 && fork <= MAX_FORKNUM)
> 	sprintf(forkname, "_%s", forkNames[fork]);
> else
> 	pg_fatal("found invalid fork number: %u", fork);

This was to suppress any suffix for main forks, but yes, could simplify and include the `_` in the suffix name. Will include such a change (a rough sketch follows at the end of this message).

> 3.
> +char	page[BLCKSZ] = {0};
> I think when writing to a file, we need PGAlignedBlock rather than a
> simple char array of bytes; see the description around PGAlignedBlock
> for why it is so.

Easy enough change, and makes sense.

> 4.
> +if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
> Why pg_pwrite(), why not just fwrite()? If fwrite() is used, you can
> avoid fileno() system calls, no? Do you need the file position to
> remain the same after writing, hence pg_pwrite()?

I don't recall the original motivation, TBH.

> 5.
> +if (!RestoreBlockImage(record, block_id, page))
> +	continue;
> +
> +/* we have our extracted FPI, let's save it now */
> After extracting the page from the WAL record, do we need to perform a
> checksum on it?

That is there in fixup mode (or should be). Are you thinking this should also be set if not in fixup mode? That defeats the purpose of the raw page extract, which is to see *exactly* what the WAL stream has.

> 6.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Use pg_dir_create_mode instead of hard-coded 0700?

Sure.

> 7.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> +fsync(fileno(OPF));
> +fclose(OPF);
> Since you're creating the directory in case it's not available, you
> need to fsync the directory too?

Sure.

> 8.
> +case 'W':
> +case 'X':
> +	config.fixup_fpw = (option == 'X');
> +	config.save_fpw_path = pg_strdup(optarg);
> +	break;
> Just set config.fixup_fpw = false before the switch block starts,
> like the other variables, and then perhaps doing like below is more
> readable:
> case 'W':
> 	config.save_fpw_path = pg_strdup(optarg);
> case 'X':
> 	config.fixup_fpw = true;
> 	config.save_fpw_path = pg_strdup(optarg);

Like separate opt processing with their own `break` statement? Probably a bit more readable/conventional.

> 9.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Should we use pg_mkdir_p() instead of mkdir()?

Sure.

Thanks,

David
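Here is the sketch promised under item 3 above; the buffer sizing and exact shape are my assumptions rather than the patch's actual code:

/* sketch: build the filename fork suffix, empty for the main fork */
char	forkname[FORKNAMECHARS + 2];	/* '_' + fork name + '\0' */

if (fork < 0 || fork > MAX_FORKNUM)
	pg_fatal("found invalid fork number: %u", fork);

if (fork == MAIN_FORKNUM)
	forkname[0] = '\0';
else
	snprintf(forkname, sizeof(forkname), "_%s", forkNames[fork]);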
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
> > 6.
> > +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> > Use pg_dir_create_mode instead of hard-coded 0700?
>
> I think I thought of that when I first looked at the patch ... but I'm
> not sure, since it says:
>
> src/include/common/file_perm.h-/* Modes for creating directories and files IN THE DATA DIRECTORY */
> src/include/common/file_perm.h:extern PGDLLIMPORT int pg_dir_create_mode;

Looks like it's pretty evenly split in src/bin:

$ git grep -o -E -w '(pg_mkdir_p|mkdir)' '**.c' | sort | uniq -c
   3 initdb/initdb.c:mkdir
   3 initdb/initdb.c:pg_mkdir_p
   1 pg_basebackup/bbstreamer_file.c:mkdir
   2 pg_basebackup/pg_basebackup.c:pg_mkdir_p
   1 pg_dump/pg_backup_directory.c:mkdir
   1 pg_rewind/file_ops.c:mkdir
   4 pg_upgrade/pg_upgrade.c:mkdir

So if that is the preferred approach I'll go ahead and use it; a rough sketch follows below.

> I was wondering if there's any reason to do "CREATE DATABASE". The vast
> majority of TAP tests don't.
>
> $ git grep -ho 'safe_psql[^ ]*' '*pl' | sort | uniq -c | sort -nr | head
>    1435 safe_psql('postgres',
>     335 safe_psql(
>      23 safe_psql($connect_db,

If there was a reason, I don't recall offhand; I will test removing it and if things still work will consider it good enough.

David
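The directory-creation sketch mentioned above, assuming pg_mkdir_p()'s mkdir -p semantics (where an already-existing directory is not an error):

/* sketch: create the output directory with data-directory permissions */
if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
	pg_fatal("could not create directory \"%s\": %m",
			 config.save_fpw_path);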
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 9, 2022 at 2:08 PM David Christensen wrote:
> Justin sez:
> > I was wondering if there's any reason to do "CREATE DATABASE". The vast
> > majority of TAP tests don't.
> >
> > $ git grep -ho 'safe_psql[^ ]*' '*pl' | sort | uniq -c | sort -nr | head
> >    1435 safe_psql('postgres',
> >     335 safe_psql(
> >      23 safe_psql($connect_db,
>
> If there was a reason, I don't recall offhand; I will test removing it
> and if things still work will consider it good enough.

Things blew up when I did that; rather than hunt it down, I just left it in. :-)

Enclosed is v7, with the changes suggested thus far.

v7-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Thu, Nov 10, 2022 at 2:33 AM Bharath Rupireddy wrote:
>
> On Thu, Nov 10, 2022 at 1:31 AM David Christensen wrote:
>
> Thanks for providing the v7 patch, please see my comments and responses below.

Hi Bharath,

Thanks for the feedback.

> > > 2. I'm unable to understand the use-case for the --fixup-fpi option.
> > > pg_waldump is supposed to be just a WAL reader, and must not return any
> > > modified information; with the --fixup-fpi option, the patch violates this
> > > principle, i.e., it sets the page LSN and returns. Without actually
> > > replaying the WAL record on the page, how is it correct to just set
> > > the LSN? How will it be useful? ISTM, we must ignore this option
> > > unless there's a strong use-case.
> >
> > How I was envisioning this was for cases like extreme surgery for
> > corrupted pages, where you extract the page from WAL but it has lsn
> > and checksum set so you could do something like `dd if=fixup-block
> > of=relation ...`, so it *simulates* the replay of said full-page blocks
> > in cases where for some reason you can't replay the intermediate
> > records; since this is always a full-page block, it's capturing what
> > would be the snapshot, so you could manually insert it somewhere as needed
> > without needing to replay (say if dealing with an incomplete or
> > corrupted WAL stream).
>
> Recovery sets the page LSN after it replayed the WAL record on the
> page, right? Recovery does this - base_page/FPI +
> apply_WAL_record_and_then_set_applied_WAL_record's_LSN =
> new_version_of_page. Essentially, in your patch, you are just setting
> the WAL record LSN with the page contents being the base page's. I'm
> still not sure what's the real use-case here. We don't have an
> independent function in postgres that, given a base page and a WAL
> record, just replays the WAL record and outputs the new version of the
> page, so I think what you do in the patch with the fixup option seems
> wrong to me.

Well, if it's not the same output, then I guess you're right and there's not a use for the `--fixup` mode. By the same token, I'd say calculating/setting the checksum also wouldn't need to be done; we should just include the page as it appears in the WAL stream.

> > > > > 5.
> > > > > +if (!RestoreBlockImage(record, block_id, page))
> > > > > +	continue;
> > > > > +
> > > > > +/* we have our extracted FPI, let's save it now */
> > > > > After extracting the page from the WAL record, do we need to perform a
> > > > > checksum on it?
>
> I think you just need to do the following; this will ensure the
> authenticity of the page that pg_waldump returns.
> if (((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blk))
> {
> 	pg_fatal("page checksum failed");
> }

The WAL already has a checksum, so I'm not certain this makes sense on its own. Also, I'm inclined to make it a warning if it doesn't match rather than a fatal. (I'd also have to verify that the checksum is properly set on the page prior to copying the FPI into WAL, which I'm pretty sure it is, but not certain.)

> > > case 'W':
> > > 	config.save_fpw_path = pg_strdup(optarg);
> > > case 'X':
> > > 	config.fixup_fpw = true;
> > > 	config.save_fpw_path = pg_strdup(optarg);
> >
> > Like separate opt processing with their own `break` statement?
> > Probably a bit more readable/conventional.
>
> Yes.

Moot with the removal of the --fixup mode.

> Some more comments:
>
> 1.
> +PGAlignedBlock	zerobuf;
> Essentially, it's not a zero buffer, so please rename the variable to
> something like 'buf' or 'page_buf' or some other?

Sure.

> 2.
> +if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
> Replace pg_pwrite() with fwrite() and avoid the fileno() system calls;
> that should suffice here. AFAICS, we don't need pg_pwrite.

Sure.

> 3.
> +if (config.save_fpw_path != NULL)
> +{
> +	/* Create the dir if it doesn't exist */
> +	if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
> I think you still need pg_check_dir() here; how about something like below?

I was assuming pg_mkdir_p() acted just like mkdir -p, where it's just an idempotent action, so an existing dir is just treated the same. What's the benefit here? I would assume that if a non-dir file existed at that path, or other permissions issues arose, we'd just get an error from pg_mkdir_p(). (Will review the code there and confirm.)

> 4.
> +/* Create the dir if it doesn
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Fri, Nov 11, 2022 at 8:15 AM Justin Pryzby wrote:
>
> On Wed, Nov 09, 2022 at 02:37:29PM -0600, David Christensen wrote:
> > On Wed, Nov 9, 2022 at 2:08 PM David Christensen wrote:
> > > Justin sez:
> > > > I was wondering if there's any reason to do "CREATE DATABASE". The vast
> > > > majority of TAP tests don't.
> > >
> > > If there was a reason, I don't recall offhand; I will test removing it
> > > and if things still work will consider it good enough.
> >
> > Things blew up when I did that; rather than hunt it down, I just left it in. :-)
>
> +$primary->safe_psql('db1', <
> It worked for me when I removed the 3 references to db1.
> That's good for efficiency of the test.

I did figure that out later; fixed in git.

> > +my $blocksize = 8192;
> I think this should be just "my $blocksize;" rather than setting a value
> which is later overwritten.

Yep. Fixed in git.

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Fri, Nov 11, 2022 at 4:57 AM Bharath Rupireddy wrote:
> > Well, if it's not the same output, then I guess you're right and there's
> > not a use for the `--fixup` mode. By the same token, I'd say
> > calculating/setting the checksum also wouldn't need to be done; we
> > should just include the page as it appears in the WAL stream.
>
> Let's hear from others; we may be missing something here. I recommend
> keeping the --fixup patch as 0002, so in case we decide to discard it,
> it's easier; however, I'll leave that to you.

I've whacked it in `git` for now; I can resurrect it if people consider it useful.

> > > > > 5.
> > > > > +if (!RestoreBlockImage(record, block_id, page))
> > > > > +	continue;
> > > > > +
> > > > > +/* we have our extracted FPI, let's save it now */
> > > > > After extracting the page from the WAL record, do we need to perform a
> > > > > checksum on it?
> > >
> > > I think you just need to do the following; this will ensure the
> > > authenticity of the page that pg_waldump returns.
> > > if (((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blk))
> > > {
> > > 	pg_fatal("page checksum failed");
> > > }
> >
> > The WAL already has a checksum, so I'm not certain this makes sense on its
> > own. Also, I'm inclined to make it a warning if it doesn't match rather
> > than a fatal. (I'd also have to verify that the checksum is properly
> > set on the page prior to copying the FPI into WAL, which I'm pretty
> > sure it is, but not certain.)
>
> How about having it as an Assert()?

Based on empirical testing, the checksums don't match, so asserting/alerting on each block extracted seems next to useless; I'm going to just remove that.

> > > 5.
> > > +fsync(fileno(OPF));
> > > +fclose(OPF);
> > > I think just the fsync() isn't enough; you still need fsync_fname()
> > > and/or fsync_parent_path(), perhaps after the for (block_id = 0; block_id
> > > <= XLogRecMaxBlockId(record); block_id++) loop.
> >
> > I'm not sure I get the value of the fsyncs here; if you are using this
> > tool in this capacity you're by definition doing some sort of
> > transient investigative steps. Since the WAL was fsync'd, you could
> > always rerun/recreate as needed in the unlikely event of an OS crash
> > in the middle of this investigation. Since this is outside the
> > purview of the database operations proper (unlike, say, initdb), it seems
> > like it's unnecessary (or definitely shouldn't need to be selectable).
> > My thoughts are that if we're going to fsync, just do the fsyncs
> > unconditionally rather than complicate the interface further.
>
> -1 for fsync() per file created, as it can create a lot of sync load on
> production servers, impacting performance. How about just syncing the
> directory at the end, assuming it doesn't cost as much as an fsync() per
> FPI file created would?

I can fsync the dir if that's a useful compromise.

> > > 4.
> > > +$primary->append_conf('postgresql.conf', "wal_level='replica'");
> > > +$primary->append_conf('postgresql.conf', 'archive_mode=on');
> > > +$primary->append_conf('postgresql.conf', "archive_command='/bin/false'");
> > > Why do you need to set wal_level to replica? Out of the box your
> > > cluster comes with replica only, no?
> > > And why do you need archive_mode on and set the command to do nothing?
> > > Why is archiving needed for testing your feature in the first place?
> >
> > I think it had shown "minimal" in my testing; I was purposefully
> > failing archives so the WAL would stick around. Maybe a custom
> > archive command that just copied a single WAL file into a known
> > location so I could use that instead of the current approach would
> > work, though I'm not sure how Windows support would work with that. Open
> > to other ideas to more cleanly get a single WAL file that isn't the
> > last one. (Earlier versions of this test were using /all/ of the
> > generated WAL files rather than a single one, so maybe I am
> > overcomplicating things for the single WAL file case.)
>
> Typically we create a physical replication slot at the beginning so
> that the server keeps the WAL required for you in pg_wal itself; for
> instance, please see pg_walinspect:
>
> -- Make sure checkpoints don't interfere with the test.
> SELECT 'init' FROM
> pg_create_physical_replication_slot('regress_pg_walinspect_slot',
> true, false);

Will see if I can get something like this to work; I'm currently stopping the server before running the file-based tests, but I suppose there's no reason to do so, so a temporary slot that holds it around until the test is complete is probably fine.

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Enclosed is v8, which uses the replication slot method to retain WAL as well as fsync'ing the output directory when everything is done. v8-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Tue, Nov 15, 2022 at 4:41 AM Bharath Rupireddy wrote:
>
> On Tue, Nov 15, 2022 at 1:29 AM David Christensen wrote:
> >
> > Enclosed is v8, which uses the replication slot method to retain WAL
> > as well as fsync'ing the output directory when everything is done.
>
> Thanks. It's mostly in good shape. However, a few more comments:
>
> 1.
> +if it does not exist. The images saved will be subject to the same
> +filtering and limiting criteria as display records, but in this
> +mode pg_waldump will not output any other
> +information.
> May I know what's the intention of the statement 'The images saved ...'?
> If it's not necessary and doesn't convey anything useful to the user,
> can we remove it?

Basically I mean that if you're limiting to a specific relation or rmgr type, etc., it only saves those FPIs. (So filtering is applied first, before considering whether to save the FPI or not.)

> 2.
> +#include "storage/checksum.h"
> +#include "storage/checksum_impl.h"
> I think we don't need the above includes as we got rid of verifying
> page checksums. The patch compiles without them for me.

Good catch.

> 3.
> +char *save_fpw_path;
> Can we rename the above variable to save_fpi_path, just to be in sync
> with what we expose to the user, the option name 'save-fpi'?

Sure.

> 4.
> +if (config.save_fpw_path != NULL)
> +{
> +	/* Fsync our output directory */
> +	fsync_fname(config.save_fpw_path, true);
> +}
> I guess adding a comment there as to why we aren't fsyncing every
> file that gets created, but only the directory once at the end, would
> help clarify doubts that other members might get while looking at the
> code.

Can do.

> 5.
> +if (config.save_fpw_path != NULL)
> +{
> +	/* Fsync our output directory */
> +	fsync_fname(config.save_fpw_path, true);
> +}
> So, are we sure that we don't want to fsync for the time_to_stop exit(0)
> cases, say when CTRL+C'ed? Looks like we handle time_to_stop safely,
> meaning exiting with return code 0; shouldn't we fsync the directory?

We can. Like I've said before, since these aren't production parts of the cluster I don't personally have much of an opinion on whether fsync() is appropriate at all, so I don't have strong feelings here.

> 6.
> +else if (config.save_fpw_path)
> Let's use the same convention to check non-NULLness,
> config.save_fpw_path != NULL.

Good catch.

> 7.
> +CHECKPOINT;
> +SELECT pg_switch_wal();
> +UPDATE test_table SET a = a + 1;
> +SELECT pg_switch_wal();
> I don't think switching WAL after the checkpoint is necessary here,
> because the checkpoint ensures all the WAL gets flushed to disk.
> Please remove it.

The point is to ensure we have a clean WAL segment that we know will contain the relation we are filtering by. I will test whether this still holds without the extra pg_switch_wal(), but that's the rationale.

> PS: I've seen the following code:
> +my $walfile = [sort { $a <=> $b } glob("$waldir/00*")]->[1]; # we
> want the second WAL file, which will be a complete WAL file with
> full-page writes for our specific relation.

I don't understand the question.

> 8.
> +$node->safe_psql('postgres', <
> +EOF
> Why is EOF used here? Can't we do something like below to execute
> multiple statements?
> $node->safe_psql(
> 	'postgres', qq[
> 	SELECT data FROM pg_logical_slot_get_changes('regression_slot1', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> 	SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> 	SELECT data FROM pg_logical_slot_get_changes('regression_slot3', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> 	SELECT data FROM pg_logical_slot_get_changes('regression_slot4', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> ]);
>
> Same here:
> +$node->safe_psql('postgres', <
> +SELECT pg_drop_replication_slot('regress_pg_waldump_slot');
> +EOQ

As a long-time perl programmer, heredocs seem more natural and easier to read than a string that accomplishes the same function. If there is an established project style I'll stick with it, but it just rolled out that way. :-)

> 9.
> +my $walfile = [sort { $a <=> $b } glob("
Re: Moving forward with TDE
> On Nov 15, 2022, at 1:08 PM, Jacob Champion wrote:
>
> On Mon, Oct 24, 2022 at 9:29 AM David Christensen wrote:
>> I would love to open a discussion about how to move forward and get
>> some of these features built out. The historical threads here are
>> quite long and complicated; is there a "current state" other than the
>> wiki that reflects the general thinking on this feature? Any major
>> developments in direction that would not be reflected in the code from
>> May 2021?
>
> I don't think the patchset here has incorporated the results of the
> discussion [1] that happened at the end of 2021. For example, it looks
> like AES-CTR is still in use for the pages, which I thought was
> already determined to be insufficient.

Good to know about the next steps, thanks.

> The following next steps were proposed in that thread:
>
>> 1. modify temporary file I/O to use a more centralized API
>> 2. modify the existing cluster file encryption patch to use XTS with an
>> IV that uses more than the LSN
>> 3. add XTS regression test code like CTR
>> 4. create WAL encryption code using CTR
>
> Does this patchset need review before those steps are taken (or was
> there additional conversation/work that I missed)?

This was just a refresh of the old patches on the wiki to work as written on HEAD. If there are known TODOs here, then that work still needs to be done. I was going to take 2) and Stephen was going to work on 3); I am not sure about the other two but will review the thread you pointed to.

Thanks for pointing that out.

David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
Enclosed is v9.

- code style consistency (FPI instead of FPW) internally
- cleanup of no-longer-needed checksum-related pieces from code and tests
- test cleanup/simplification
- other comment cleanup

Passes all CI checks.

Best,

David

v9-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch Description: Binary data
ScanSourceDatabasePgClass
Hi Robert,

In 9c08aea6a you introduced the block-by-block strategy for creating a copy of the database. In the main loop, this utilizes this call:

buf = ReadBufferWithoutRelcache(rlocator, MAIN_FORKNUM, blkno,
                                RBM_NORMAL, bstrategy, false);

Here, the last parameter is "false" for the permanence factor of this relation. Since we know that pg_class is in fact a permanent relation, this ends up causing issues for the TDE patches that I am working on updating, due to using the opposite value when calculating the page's IV and thus failing the decryption when trying to create a database based on template0.

Is there a reason why this needs to be "false" here? I recognize that this routine is accessing the table outside of a formal connection, so there might be more subtle effects that I am not aware of. If so, this should be documented. If it's an oversight, I think we should change it to "true" to match the actual permanence state of the relation.

I did test changing it to true and didn't notice any adverse effects in `make installcheck-world`, but let me know if there is more to this story than meets the eye. (I did review the original discussion at https://www.postgresql.org/message-id/flat/CA%2BTgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T%3DCYCvRyFHywSBUQ%40mail.gmail.com and did not notice any discussion of this specific parameter choice.)

Thanks,

David
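For clarity, the proposed one-line change would be (just a sketch of the suggestion above, not a committed patch):

/* pg_class is a permanent relation, so pass permanent = true */
buf = ReadBufferWithoutRelcache(rlocator, MAIN_FORKNUM, blkno,
                                RBM_NORMAL, bstrategy, true);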
Re: Moving forward with TDE
Hi Jacob,

Thanks, I've added this patch in my tree [1]. (For now, I'm just adding fixes and the like atop the original separate patches, but will eventually get things winnowed down into probably the same 12 parts the originals were reviewed in.)

Best,

David

[1] https://github.com/pgguru/postgres/tree/tde
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 16, 2022 at 3:30 AM Bharath Rupireddy wrote:
> 1.
> -if (config.filter_by_fpw && !XLogRecordHasFPW(xlogreader_state))
> +if (config.filter_by_fpw && !XLogRecordHasFPI(xlogreader_state))
> These changes are not related to this feature, hence renaming those
> variables/function names must be dealt with separately. If required,
> another patch can be submitted to change filter_by_fpw to
> filter_by_fpi and XLogRecordHasFPW() to XLogRecordHasFPI().

Not required; I can revert the changes unrelated to this specific patch. (I'd written the original ones for it, so didn't really think anything of it... :-))

> 2.
> +/* We fsync our output directory only; since these files are not part
> + * of the production database we do not require the performance hit
> + * that fsyncing every FPI would entail, so are doing this as a
> + * compromise. */
> The commenting style doesn't match the standard that we follow
> elsewhere in postgres; please refer to other multi-line comments.

Will fix.

> 3.
> +	fsync_fname(config.save_fpi_path, true);
> +}
> It looks like fsync_fname()/fsync() in general isn't recursive, in the
> sense that it doesn't fsync the files under the directory, but the
> directory only. So, the idea of a directory fsync doesn't seem worth it.
> We either 1) get rid of fsync entirely, or 2) fsync all the files after
> they are created and the directory at the end, or 3) do option (2) with a
> --no-sync option similar to its friends. Since option (2) is a no-go,
> we can either choose option (1) or option (3). My vote at this point
> is for option (1).

Agree to remove.

> 4.
> +($walfile_name, $blocksize) = split '\|' =>
> $node->safe_psql('postgres', "SELECT pg_walfile_name(pg_switch_wal()),
> current_setting('block_size')");
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> I think there's something wrong with this, no? pg_switch_wal() can, at
> times, return the end+1 of the prior segment (see below snippet) and I'm
> not sure if such a case can happen here.
>
>  * The return value is either the end+1 address of the switch record,
>  * or the end+1 address of the prior segment if we did not need to
>  * write a switch record because we are already at segment start.
>  */
> XLogRecPtr
> RequestXLogSwitch(bool mark_unimportant)

I think this approach is pretty common for getting the walfile name, no? While there might be an edge case here, since the rest of the test is a controlled environment I'm inclined to not worry about it; this would require the changes prior to this point to exactly fill a WAL segment, which strikes me as so unlikely as to be impossible in this specific scenario.

> 5.
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> +ok(-f $walfile, "Got a WAL file");
> Is this checking if the WAL file is present or not in PGDATA/pg_wal?
> If yes, I think this isn't required, as pg_switch_wal() ensures that
> the WAL is written and flushed to disk.

You are correct; probably another artifact of the earlier version. That said, I'm not sure I see the harm in keeping it as a sanity check.

> 6.
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> Isn't "pgdata" hardcoded here? I think you might need to do the following:
> $node->data_dir . '/pg_wal/' . $walfile_name;

Can fix.

> 7.
> +# save filename for later verification
> +$files{$file}++;
>
> +# validate that we ended up with some FPIs saved
> +ok(keys %files > 0, 'verify we processed some files');
> Why do we need to store filenames in a hash when we later just check
> its size? Can't we use a boolean (file_found) or an int
> variable (file_count) to verify that we found the file?

Another artifact; we were comparing the files output between two separate lists of arbitrary numbers of pages being written out and verifying the raw/fixup versions had the same lists.

> 8.
> +$node->safe_psql('postgres', <
> +SELECT 'init' FROM
> pg_create_physical_replication_slot('regress_pg_waldump_slot', true,
> false);
> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> +CHECKPOINT; -- required to force FPI for next writes
> +UPDATE test_table SET a = a + 1;
> +EOF
> The EOF with append_conf() is being used in 4 files, while elsewhere in
> the TAP test files (more than 100?) qq[] or quotes is being used. I
> have no strong opinion here; I'll leave it to the other reviewers or
> committer.

I'm inclined to leave it just for (personal) readability, but can change if there's a strong consensus against.

> > > 11.
> > > +# verify filename formats matches w/--save-fpi
> > > +for my $fullpath (glob "$tmp_folder/raw/*")
> > > Do we need to look for the exact match of the file that gets created
> > > in the save-fpi path? While checking for this is great, it makes the
> > > test code non-portable (may not work on Windows or other platforms,
> > > no?) and complex? This way, you can get rid of get_block_info(
Re: Moving forward with TDE
Hi Dilip, Thanks for the feedback here. I will review the docs changes and add to my tree. Best, David
[PATCH] expand the units that pg_size_pretty supports on output
Hi folks,

Enclosed is a patch that expands the unit output for pg_size_pretty(numeric) going up to yottabytes; I reworked the existing numeric output code to account for the larger number of units we're using rather than just adding nesting levels.

There are also a few other places that could benefit from expanded units, but this is a standalone starting point.

Best,

David

0001-Expand-the-units-that-pg_size_pretty-numeric-knows-a.patch Description: Binary data
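For a sense of the reworked shape, a table-driven loop rather than one nesting level per unit might look something like this (an illustrative sketch only; the real code operates on numeric, since int64 cannot even represent a yottabyte):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Sketch: pick the largest unit at which the value still reads >= 10240. */
static const char *const size_units[] = {
	"bytes", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
};

static void
size_pretty(int64_t size, char *buf, size_t buflen)
{
	int			unit = 0;
	int			nunits = sizeof(size_units) / sizeof(size_units[0]);

	while (llabs(size) >= 10 * 1024 && unit < nunits - 1)
	{
		size = (size + 512) / 1024;		/* divide by 1024, with rounding */
		unit++;
	}
	snprintf(buf, buflen, "%lld %s", (long long) size, size_units[unit]);
}

The point of the table is that adding ZB and YB becomes a two-entry change rather than two more levels of conditionals.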
Issue in recent pg_stat_statements?
-hackers,

So in doing some recent work on pg_stat_statements, I noticed that while the regression test still passes on HEAD, it appears that 4f0b096 (per git bisect) changed/broke how this works compared to historical versions. Essentially, when doing a fresh install of pg_stat_statements on a fresh db (outside of the regression framework), it's not returning any rows from the view. I didn't see any related documentation changes, so as far as I know, this should still be recording all statements as per normal.

My full steps to reproduce from a clean CentOS 7 install are attached. I have also been able to reproduce this on OS X and Fedora 33. The TL;DR is:

CREATE EXTENSION pg_stat_statements;
CREATE TABLE foo (a int, b text);
INSERT INTO foo VALUES (1,'a');
SELECT * FROM foo;
SELECT * FROM pg_stat_statements; -- returns nothing

Settings for pg_stat_statements:

postgres=# select name, setting from pg_settings where name like 'pg_stat_statements%';
               name                | setting
-----------------------------------+---------
 pg_stat_statements.max            | 5000
 pg_stat_statements.save           | on
 pg_stat_statements.track          | top
 pg_stat_statements.track_planning | off
 pg_stat_statements.track_utility  | on
(5 rows)

Is this an expected change, or is this in fact broken? In previous revisions, this was showing the INSERT and SELECT at the very least. I'm unclear as to why the regression test is still passing, so want to verify that I'm not doing something wrong in the testing.

Best,

David

#!/bin/bash -e

sudo yum install -y git
sudo yum groupinstall -y "Development Tools"
sudo yum install -y readline-devel zlib-devel 'perl(IPC::Run)' 'perl(Test::More)' 'perl(Time::HiRes)'

mkdir -p ~/Checkouts
cd ~/Checkouts
git clone https://github.com/postgres/postgres postgresql

cd ~/Checkouts/postgresql
git checkout 4f0b096

make distclean || true
./configure --enable-cassert --enable-debug CFLAGS="-ggdb -O0 -g3 -fno-omit-frame-pointer" --enable-tap-tests
make -j10 && \
( \
cd contrib/pg_stat_statements/; \
make check \
)

cd tmp_install/usr/local/pgsql

export PGUSER=postgres
export PGHOST=/tmp/
export LD_LIBRARY_PATH=./lib:$LD_LIBRARY_PATH
PATH=$HOME/tmp_install/usr/local/pgsql:$PATH

bin/initdb -D data -U postgres
bin/pg_ctl -D data -l logfile start
bin/psql -c 'ALTER SYSTEM SET shared_preload_libraries = $$pg_stat_statements$$'
bin/pg_ctl -D data -l logfile restart
bin/psql <
Re: Issue in recent pg_stat_statements?
> > Is this an expected change, or is this in fact broken? In previous
> > revisions, this was showing the INSERT and SELECT at the very least. I'm
> > unclear as to why the regression test is still passing, so want to verify
> > that I'm not doing something wrong in the testing.
>
> Yes, you want to look into the queryid functionality. See
>
> https://www.postgresql.org/message-id/flat/35457b09-36f8-add3-1d07-6034fa585ca8%40oss.nttdata.com
>
> Interface changes may still be coming in 14 for that. Or warnings.

Hmm, I'm unclear as to why you would potentially want to use pg_stat_statements *without* this functionality. At the very least, it violates POLA; I spent the better part of a day thinking this was a bug, since the expected behavior seemed so obvious that I wouldn't have expected anything different.

In any case, this discussion is better had on a different thread. Thanks at least for explaining what I was seeing.

Best,

David
Re: Issue in recent pg_stat_statements?
On Mon, Apr 26, 2021 at 12:18 PM Julien Rouhaud wrote:
> On Mon, Apr 26, 2021 at 11:40 PM David Christensen wrote:
> > >
> > > > Is this an expected change, or is this in fact broken? In previous
> > > > revisions, this was showing the INSERT and SELECT at the very least. I'm
> > > > unclear as to why the regression test is still passing, so want to verify
> > > > that I'm not doing something wrong in the testing.
> > >
> > > Yes, you want to look into the queryid functionality. See
> > > https://www.postgresql.org/message-id/flat/35457b09-36f8-add3-1d07-6034fa585ca8%40oss.nttdata.com
> > >
> > > Interface changes may still be coming in 14 for that. Or warnings.
> >
> > Hmm, I'm unclear as to why you would potentially want to use
> > pg_stat_statements *without* this functionality.
>
> Using pg_stat_statements with different query_id semantics without
> having to fork pg_stat_statements.

I can see that argument for allowing alternatives, but the current default of nothing seems particularly non-useful, so some sensible default value would seem to be in order, or I can predict a whole mess of future user complaints.
Re: [PATCH] expand the units that pg_size_pretty supports on output
A second patch to teach the same units to pg_size_bytes().

Best,

David

On Wed, Apr 14, 2021 at 11:13 AM David Christensen <david.christen...@crunchydata.com> wrote:

> Hi folks,
>
> Enclosed is a patch that expands the unit output for
> pg_size_pretty(numeric) going up to yottabytes; I reworked the existing
> numeric output code to account for the larger number of units we're using
> rather than just adding nesting levels.
>
> There are also a few other places that could benefit from expanded units,
> but this is a standalone starting point.
>
> Best,
>
> David

0001-Expand-the-supported-units-in-pg_size_bytes-to-cover.patch Description: Binary data
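The parsing side is conceptually the inverse: after consuming the numeric prefix, look the suffix up in a unit table and scale by the matching power of 1024. A hedged sketch of such a lookup table (illustrative names only, not the patch's actual structures):

/* sketch: unit suffixes for pg_size_bytes(), multiplier = 1024^exponent */
struct size_unit
{
	const char *name;
	int			exponent;
};

static const struct size_unit size_bytes_units[] = {
	{"bytes", 0}, {"kB", 1}, {"MB", 2}, {"GB", 3}, {"TB", 4},
	{"PB", 5}, {"EB", 6}, {"ZB", 7}, {"YB", 8},
};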
[PATCH] comment fixes for delayChkptFlags
Enclosed is a trivial fix for a typo and misnamed field I noted when doing some code review. Best, David 0001-fix-comment-typos-for-delayChkptFlags.patch Description: Binary data
Re: [PATCHES] Post-special page storage TDE support
> > Explicitly
> > locking (assuming you stay in your lane) should only need to guard
> > against access from other
> > backends of this type if using shared buffers, so will be use-case
> > dependent.
>
> I'm not sure what you mean here?

I'm mainly pointing out that the specific code that manages this feature is the only code that has to worry about modifying said page region.

> > This does have a runtime overhead due to moving some offset
> > calculations from compile time to
> > runtime. It is thought that the utility of this feature will outweigh
> > the costs here.
>
> Have you done some benchmarking to give an idea of how much overhead we're
> talking about?

Not yet, but I am going to work on this. I suspect the current code could be improved, but I will try to get some sort of measurement of the additional overhead.

> > Candidates for page features include 32-bit or 64-bit checksums,
> > encryption tags, or additional
> > per-page metadata.
> >
> > While we are not currently getting rid of the pd_checksum field, this
> > mechanism could be used to
> > free up that 16 bits for some other purpose.
>
> IIUC there's a hard requirement of initdb-time initialization, as there's
> otherwise no guarantee that you will find enough free space in each page at
> runtime. It seems like a very hard requirement for a full replacement of the
> current checksum approach (even if I agree that the current implementation
> limitations are far from ideal), especially since there's no technical reason
> that would prevent us from dynamically enabling data-checksums without doing
> all the work when the cluster is down.

As implemented, that is correct; we are currently assuming this specific feature mechanism is set at initdb time only. Checksums are not the primary motivation here, but were something that I could use for an immediate illustration of the feature.

That said, presumably you could define a way to set the features per-relation (say with a template field in pg_class) which would propagate to a relation on rewrite, so there could be ways to handle things incrementally, were this an overall goal.

Thanks for looking,

David
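To illustrate the kind of calculation that moves from compile time to runtime in this discussion (a hypothetical sketch; the names are illustrative and not necessarily what the patches use):

/* Hypothetical sketch: usable page space becomes a runtime quantity. */
extern int	reserved_page_size;		/* fixed at initdb, read from pg_control */

static inline Size
page_usable_space(void)
{
	/* previously the compile-time constant BLCKSZ - SizeOfPageHeaderData */
	return BLCKSZ - SizeOfPageHeaderData - reserved_page_size;
}

The overhead question above is essentially whether replacing such constants with a load and a subtraction (or a fastdiv/fastmod, where division is involved) is measurable in hot paths.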
Re: [PATCHES] Post-special page storage TDE support
Hi Matthias,

> Did you read the related thread with related discussion from last June, "Re:
> better page-level checksums" [0]? In that I argued that space at the end of a
> page is already allocated for the AM, and that reserving variable space at
> the end of the page for non-AM usage is wasting the AM's performance
> potential.

Yes, I had read parts of that thread among others, but have given it a re-read. I can see the point you're making here, and agree that if we can allocate between pd_special and pd_upper that could make sense. I am a little unclear as to what performance impacts there would be for the AM if this additional space were ahead of or behind the page special area; it seems like if this is something that needs to live on the page *somewhere*, just being aligned correctly would be sufficient from the AM's standpoint. Considering that I am trying to make this have zero storage impact if these features are not active, the impact on a cluster with no additional features would be moot from a storage perspective, no?

> Apart from that: Is this variable-sized 'metadata' associated with smgr
> infrastructure only, or is it also available for AM features? If not, then
> this is a strong -1. The amount of tasks smgr needs to do on a page is
> generally much less than the amount of tasks an AM needs to do; so in my view
> the AM has priority in prime page real estate, not smgr or related
> infrastructure.

I will confess to a slightly wobbly understanding of the delineation of responsibility here. I was under the impression that modifying any consumer of PageHeaderData would be sufficient to cover all AMs for the types of cluster-wide options we'd be concerned about (say extended checksums, multiple page encryption schemes, or other per-page information we haven't yet anticipated). Reading smgr/README and the various access/*/README has not made the distinction clear to me yet.

> re: PageFeatures
> I'm not sure I understand the goal, nor the reasoning. Shouldn't this be part
> of the storage manager (smgr) implementation / can't this be part of the smgr
> of the relation?

For at least the feature cases I'm anticipating, this would apply to any disk page that may have user data, set (at least initially) at initdb time, so it should apply to any pages in the cluster, regardless of AM.

> re: use of pd_checksum
> I mentioned this in the above-mentioned thread too, in [1], that we could use
> pd_checksum as an extra area marker for this storage-specific data, which
> would be located between pd_upper and pd_special.

I do think that we could indeed use this as an additional in-page pointer, but at least for this version I was keeping things backwards-compatible. Peter G (I think) also made some good points about how to include the various status bits on the page somehow in terms of making a page completely self-contained.

> Re: patch contents
>
> 0001:
> > + specialSize = MAXALIGN(specialSize) + reserved_page_size;
>
> This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or
> an assertion that reserved_page_size is MAXALIGNED, would be better.

It is currently aligned via the space calculation return value, but I agree that folding it into an assert or reworking it explicitly is clearer.

> > PageValidateSpecialPointer(Page page)
> > {
> >     Assert(page);
> > -   Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> > +   Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ);
>
> This check is incorrect. With your code it would allow pd_special past the
> end of the block. If you want to put the reserved_space_size effectively
> inside the special area, this check should instead be:
>
> +   Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));
>
> Or, equally valid
>
> +   Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);

Yup, I think I inverted my logic there; thanks.

> > + * +-+-++-+
> > + * | ... tuple2 tuple1 | "special space" | "reserved" |
> > + * +---++-+
>
> Could you fix the table display if / when you revise the patchset? It seems
> to me that the corners don't line up with the column borders.

Sure thing.

> 0002:
> > Make the output of "select_views" test stable
> > Changing the reserved_page_size has resulted in non-stable results for this
> > test.
>
> This makes sense; what kind of instability are we talking about? Are there
> different results for runs with the same binary, or is this across
> compilations?

When running with the same compilation/initdb settings, the test results are stable, but they differ depending on which options you chose, so `make installcheck` output will fail when testing a cluster with different options vs. upstream HEAD without these patches, etc.

> 0003 and up were not yet reviewed in depth.

Thanks, I appreciate the feedback so far.
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Nov 4, 2022, at 9:02 AM, Justin Pryzby wrote:
>
> On Fri, Nov 04, 2022 at 11:52:59AM +0900, Ian Lawrence Barwick wrote:
>> 2022年5月3日(火) 8:45 David Christensen:
>>>
>>> ...and pushing a couple fixups pointed out by cfbot, so here's v4.
>>
>> cfbot reports the patch no longer applies [1]. As CommitFest 2022-11 is
>> currently underway, this would be an excellent time to update the patch.
>
> More important than needing to be rebased, the patch has never passed
> its current tests on windows.
>
> As I recall, that's due to relying on "cp". And "rsync", which
> shouldn't be assumed to exist by regression tests.

I will work on supporting Windows compatibility here. Is there some list of guidelines for what you can and can't use? I don't have a Windows machine available to develop on.

Was it failing on Windows? I was attempting to skip it as I recall.

Best,

David
Re: [PATCH] expand the units that pg_size_pretty supports on output
New versions attached to address the initial CF feedback and rebase on HEAD as of now.

0001-Expand-the-units-that-pg_size_pretty-numeric-knows-a.patch - expands the units that pg_size_pretty() can handle, up to YB.
0002-Expand-the-supported-units-in-pg_size_bytes-to-cover.patch - expands the units that pg_size_bytes() can handle, up to YB.

0001-Expand-the-units-that-pg_size_pretty-numeric-knows-a.patch Description: Binary data
0002-Expand-the-supported-units-in-pg_size_bytes-to-cover.patch Description: Binary data