Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:
> From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
> From: Yura Sokolov 
> Date: Mon, 21 Feb 2022 08:49:03 +0300
> Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.
> 
> Acquiring two partition locks leads to a complex dependency chain that hurts
> at high concurrency levels.
> 
> There is no need to hold both locks simultaneously. The buffer is pinned, so
> other processes cannot select it for eviction. If the tag is cleared and the
> buffer is removed from the old partition, other processes will not find it.
> Therefore it is safe to release the old partition lock before acquiring
> the new partition lock.

Yes, the current design is pretty nonsensical. It leads to really absurd stuff
like holding the relation extension lock while we write out old buffer
contents etc.



> +  * We have pinned buffer and we are single pinner at the moment so there
> +  * is no other pinners.

Seems redundant.


> We hold buffer header lock and exclusive partition
> +  * lock if tag is valid. Given these statements it is safe to clear tag
> +  * since no other process can inspect it to the moment.
> +  */

Could we share code with InvalidateBuffer here? It's not quite the same code,
but nearly the same.


> +  * The usage_count starts out at 1 so that the buffer can survive one
> +  * clock-sweep pass.
> +  *
> +  * We use direct atomic OR instead of Lock+Unlock since no other backend
> +  * could be interested in the buffer. But StrategyGetBuffer,
> +  * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
> +  * compare tag, and UnlockBufHdr does raw write to state. So we have to
> +  * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?


> +  * Note that we write tag unlocked. It is also safe since there is always
> +  * check for BM_VALID when tag is compared.



>*/
>   buf->tag = newTag;
> - buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> -BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
> -BUF_USAGECOUNT_MASK);
>   if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
> - buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> + new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
>   else
> - buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> -
> - UnlockBufHdr(buf, buf_state);
> + new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
>  
> - if (oldPartitionLock != NULL)
> + buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> + while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomically OR in arbitrary bits. If somebody else has
locked the buffer header in this moment, it'll lead to completely bogus
results, because unlocking overwrites concurrently written contents (which
there shouldn't be any, but here there are)...

And or'ing contents in also doesn't make sense because it doesn't work to
actually unset any contents?

Why don't you just use LockBufHdr/UnlockBufHdr?
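
(For reference, the conventional pattern being suggested is roughly the
following; an illustrative sketch against the existing bufmgr API, reusing
buf/buf_state/new_bits from the patch context, not code from any posted
patch:)

	buf_state = LockBufHdr(buf);	/* spins until BM_LOCKED is clear */
	buf_state |= new_bits;		/* safe: no concurrent header writers now */
	UnlockBufHdr(buf, buf_state);	/* single atomic write releases the lock */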

Greetings,

Andres Freund




Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-25 Thread Kyotaro Horiguchi
At Fri, 25 Feb 2022 16:52:30 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Ugh! Wait for a moment. Something's wrong.

Sorry, what was wrong was my working directory.  It was broken by my
bogus operation. All the files of the corresponding versions apply correctly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Kyotaro Horiguchi
At Fri, 25 Feb 2022 00:04:55 -0800, Andres Freund  wrote in 
> Why don't you just use LockBufHdr/UnlockBufHdr?

FWIW, v2 looked fine to me in regards to this point.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Yura Sokolov
Hello, Simon.

On Fri, 25/02/2022 at 04:35 +, Simon Riggs wrote:
> On Mon, 21 Feb 2022 at 08:06, Yura Sokolov  wrote:
> > Good day, Kyotaro Horiguchi and hackers.
> > 
> > On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> > > At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov 
> > >  wrote in
> > > > Hello, all.
> > > > 
> > > > I thought about patch simplification, and tested version
> > > > without BufTable and dynahash api change at all.
> > > > 
> > > > It performs surprisingly well. It is just a bit worse
> > > > than v1 since there is more contention around dynahash's
> > > > freelist, but most of improvement remains.
> > > > 
> > > > I'll finish benchmarking and will attach graphs with
> > > > next message. Patch is attached here.
> > > 
> > > Thanks for the new patch.  The patch as a whole looks fine to me. But
> > > some comments needs to be revised.
> > 
> > Thank you for review and remarks.
> 
> v3 gets the buffer partition locking right, well done, great results!
> 
> In v3, the comment at line 1279 still implies we take both locks
> together, which is not now the case.
> 
> Dynahash actions are still possible. You now have the BufTableDelete
> before the BufTableInsert, which opens up the possibility I discussed
> here:
> http://postgr.es/m/CANbhV-F0H-8oB_A+m=55hP0e0QRL=rdddqusxmtft6jprdx...@mail.gmail.com
> (Apologies for raising a similar topic, I hadn't noticed this thread
> before; thanks to Horiguchi-san for pointing this out).
> 
> v1 had a horrible API (sorry!) where you returned the entry and then
> explicitly re-used it. I think we *should* make changes to dynahash,
> but not with the API you proposed.
> 
> Proposal for new BufTable API
> BufTableReuse() - similar to BufTableDelete() but does NOT put entry
> back on freelist, we remember it in a private single item cache in
> dynahash
> BufTableAssign() - similar to BufTableInsert() but can only be
> executed directly after BufTableReuse(), fails with ERROR otherwise.
> Takes the entry from single item cache and re-assigns it to new tag
> 
> In dynahash we have two new modes that match the above
> HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
> places entry on the single item cache, avoiding freelist
> HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
> uses the entry from the single item cache, rather than asking freelist
> This last call can fail if someone else already inserted the tag, in
> which case it adds the single item cache entry back onto freelist
> 
> Notice that single item cache is not in shared memory, so on abort we
> should give it back, so we probably need an extra API call for that
> also to avoid leaking an entry.

Why is there a need for this? How could a backend be forced to abort
between BufTableReuse and BufTableAssign in this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but maybe I'm missing
something.

> 
> Doing it this way allows us to
> * avoid touching freelists altogether in the common path - we know we
> are about to reassign the entry, so we do remember it - no contention
> from other backends, no borrowing etc..
> * avoid sharing the private details outside of the dynahash module
> * allows us to use the same technique elsewhere that we have
> partitioned hash tables
> 
> This approach is cleaner than v1, but should also perform better
> because there will be a 1:1 relationship between a buffer and its
> dynahash entry, most of the time.

Thank you for the suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it into the v4 patch? Perhaps it could be a separate
commit in the new version.
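
For concreteness, a rough sketch of how the proposed pair of calls might look
at the call site (the names and return conventions are assumptions based on
the description above, not code from any posted patch):

	/* Remove the old mapping, keeping the dynahash entry in the
	 * backend-private single-item cache instead of the freelist. */
	BufTableReuse(&oldTag, oldHash);

	/* ... release the old partition lock, take the new one ... */

	/* Re-assign the cached entry to the new tag.  As with
	 * BufTableInsert(), a non-negative result would mean somebody else
	 * inserted the tag first, and the cached entry goes back to the
	 * freelist inside BufTableAssign(). */
	existing_id = BufTableAssign(&newTag, newHash, buf->buf_id);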


> 
> With these changes, I think we will be able to *reduce* the number of
> freelists for partitioned dynahash from 32 to maybe 8, as originally
> speculated by Robert in 2016:
>
> https://www.postgresql.org/message-id/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
> since the freelists will be much less contended with the above approach
> 
> It would be useful to see performance with a higher number of connections, >400.
> 
> --
> Simon Riggs    http://www.EnterpriseDB.com/

--

regards,
Yura Sokolov





Re: PATCH: add "--config-file=" option to pg_rewind

2022-02-25 Thread Gunnar "Nick" Bluth

Am 24.02.22 um 14:46 schrieb Daniel Gustafsson:

On 24 Feb 2022, at 14:43, Gunnar Nick Bluth  wrote:



That looks just as good to me, and it already has tests, so I'll happily step 
down!


Actually, I think this looks like a saner approach.  Putting a config setting
in two places (postgresql.conf and on the commandline for pg_rewind) is a recipe
for them diverging.


Well, well,

reverted https://commitfest.postgresql.org/37/3562/ back to "needs 
review" and moved it to "Replication & Recovery".


Patch v2 includes:
* doc improvements as suggested
* tests shamelessly copied & adapted from 
https://commitfest.postgresql.org/37/3213/


Since the need is pretty obvious (I *think* the Debian-ish userbase is 
quite significant alone), I'd love to see one of the approaches making 
it into 15!


Best regards,
--
Gunnar "Nick" Bluth

Eimermacherweg 106
D-48159 Münster

Mobil +49 172 8853339
Email: gunnar.bl...@pro-open.de
__
"Ceterum censeo SystemD esse delendam" - Catodiff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 33e6bb64ad..31a1a1dedb 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -241,6 +241,17 @@ PostgreSQL documentation
   
  
 
+ 
+  --config-file=filepath
+  
+   
+If the postgresql.conf of your target cluster is not in the
+target pgdata and you want to use the -c / --restore-target-wal option,
+provide a (relative or absolute) path to the postgresql.conf using this option.
+   
+  
+ 
+
  
   --debug
   
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index efb82a4034..d57a3a6852 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -60,6 +60,7 @@ char	   *datadir_target = NULL;
 char	   *datadir_source = NULL;
 char	   *connstr_source = NULL;
 char	   *restore_command = NULL;
+char	   *config_file = NULL;
 
 static bool debug = false;
 bool		showprogress = false;
@@ -86,6 +87,7 @@ usage(const char *progname)
 	printf(_("Options:\n"));
 	printf(_("  -c, --restore-target-wal   use restore_command in target configuration to\n"
 			 " retrieve WAL files from archives\n"));
+	printf(_("  --config-file=FILE path to postgresql.conf if outside target-pgdata (for -c)\n"));
 	printf(_("  -D, --target-pgdata=DIRECTORY  existing data directory to modify\n"));
 	printf(_("  --source-pgdata=DIRECTORY  source data directory to synchronize with\n"));
 	printf(_("  --source-server=CONNSTRsource server to synchronize with\n"));
@@ -120,6 +122,7 @@ main(int argc, char **argv)
 		{"no-sync", no_argument, NULL, 'N'},
 		{"progress", no_argument, NULL, 'P'},
 		{"debug", no_argument, NULL, 3},
+		{"config-file", required_argument, NULL, 5},
 		{NULL, 0, NULL, 0}
 	};
 	int			option_index;
@@ -204,6 +207,10 @@ main(int argc, char **argv)
 			case 4:
 no_ensure_shutdown = true;
 break;
+
+			case 5:
+config_file = pg_strdup(optarg);
+break;
 		}
 	}
 
@@ -1051,9 +1058,16 @@ getRestoreCommand(const char *argv0)
 	 * Build a command able to retrieve the value of GUC parameter
 	 * restore_command, if set.
 	 */
-	snprintf(postgres_cmd, sizeof(postgres_cmd),
+	if (config_file == NULL)
+	{
+		snprintf(postgres_cmd, sizeof(postgres_cmd),
 			 "\"%s\" -D \"%s\" -C restore_command",
 			 postgres_exec_path, datadir_target);
+	} else {
+		snprintf(postgres_cmd, sizeof(postgres_cmd),
+			 "\"%s\" -D \"%s\" --config_file=\"%s\" -C restore_command",
+			 postgres_exec_path, datadir_target, config_file);
+	}
 
 	if (!pipe_read_line(postgres_cmd, cmd_output, sizeof(cmd_output)))
 		exit(1);
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index db9201f38e..3c395ece12 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -190,5 +190,6 @@ in primary, before promotion
 run_test('local');
 run_test('remote');
 run_test('archive');
+run_test('archive_cli');
 
 done_testing();
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 5651602858..94d413dcb6 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -208,6 +208,69 @@ sub promote_standby
 	return;
 }
 
+sub run_pg_rewind_archive
+{
+	my $test_mode = shift;
+	my $restore_command;
+	my $primary_pgdata  = $node_primary->data_dir;
+
+	# Remove the existing archive directory and move all WAL
+	# segments from the old primary to the archives.  These
+	# will be used by pg_rewind.
+	rmtree($node_primary->archive_dir);
+	PostgreSQL::Test::RecursiveCopy::copypath($node_primary->data_dir . '/pg_wal',
+		$node_primary->archive_dir);
+
+	# Fast way to remove entire directory content
+	rmtree($node_primary->data_dir . '/pg_wal');
+	mkdir($node_primary->data_dir . '/pg_wal');
+
+	# Make sure that directories have the

Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Simon Riggs
On Fri, 25 Feb 2022 at 09:24, Yura Sokolov  wrote:

> > This approach is cleaner than v1, but should also perform better
> > because there will be a 1:1 relationship between a buffer and its
> > dynahash entry, most of the time.
>
> Thank you for the suggestion. Yes, it is much clearer than my initial proposal.
>
> Should I incorporate it into the v4 patch? Perhaps it could be a separate
> commit in the new version.

I don't insist that you do that, but since the API changes are a few
hours work ISTM better to include in one patch for combined perf
testing. It would be better to put all changes in this area into PG15
than to split it across multiple releases.

> Why is there a need for this? How could a backend be forced to abort
> between BufTableReuse and BufTableAssign in this code path? I don't
> see any CHECK_FOR_INTERRUPTS on the way, but maybe I'm missing
> something.

Sounds reasonable.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Yura Sokolov
Hello, Andres

On Fri, 25/02/2022 at 00:04 -0800, Andres Freund wrote:
> Hi,
> 
> On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:
> > From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
> > From: Yura Sokolov 
> > Date: Mon, 21 Feb 2022 08:49:03 +0300
> > Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.
> > 
> > Acquiring two partition locks leads to a complex dependency chain that hurts
> > at high concurrency levels.
> > 
> > There is no need to hold both locks simultaneously. The buffer is pinned, so
> > other processes cannot select it for eviction. If the tag is cleared and the
> > buffer is removed from the old partition, other processes will not find it.
> > Therefore it is safe to release the old partition lock before acquiring
> > the new partition lock.
> 
> Yes, the current design is pretty nonsensical. It leads to really absurd stuff
> like holding the relation extension lock while we write out old buffer
> contents etc.
> 
> 
> 
> > +* We have pinned buffer and we are single pinner at the moment so there
> > +* is no other pinners.
> 
> Seems redundant.
> 
> 
> > We hold buffer header lock and exclusive partition
> > +* lock if tag is valid. Given these statements it is safe to clear tag
> > +* since no other process can inspect it to the moment.
> > +*/
> 
> Could we share code with InvalidateBuffer here? It's not quite the same code,
> but nearly the same.
> 
> 
> > +* The usage_count starts out at 1 so that the buffer can survive one
> > +* clock-sweep pass.
> > +*
> > +* We use direct atomic OR instead of Lock+Unlock since no other backend
> > +* could be interested in the buffer. But StrategyGetBuffer,
> > +* Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
> > +* compare tag, and UnlockBufHdr does raw write to state. So we have to
> > +* spin if we found buffer locked.
> 
> So basically the first half of the paragraph is wrong, because no, we
> can't?

Logically, there are no backends that could be interested in the buffer.
Physically, they do LockBufHdr/UnlockBufHdr just to check that they are not
interested.

> > +* Note that we write tag unlocked. It is also safe since there is always
> > +* check for BM_VALID when tag is compared.
> 
> 
> >  */
> > buf->tag = newTag;
> > -   buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> > -  BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
> > -  BUF_USAGECOUNT_MASK);
> > if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
> > -   buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > +   new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > else
> > -   buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > -
> > -   UnlockBufHdr(buf, buf_state);
> > +   new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> >  
> > -   if (oldPartitionLock != NULL)
> > +   buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> > +   while (unlikely(buf_state & BM_LOCKED))
> 
> I don't think it's safe to atomically OR in arbitrary bits. If somebody else has
> locked the buffer header in this moment, it'll lead to completely bogus
> results, because unlocking overwrites concurrently written contents (which
> there shouldn't be any, but here there are)...

That is why there is a safety loop for the case where buf->state was locked just
after the first optimistic atomic_fetch_or. 99.999% of the time this loop will
have no work to do. But in case another backend did lock buf->state, the loop
waits until it releases the lock and retries the atomic_fetch_or.

> And or'ing contents in also doesn't make sense because it doesn't work to
> actually unset any contents?

Sorry, I didn't understand that sentence :((

> Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
the single write in this version (if the optimistic case succeeds).

But I thought to use Lock+UnlockBufHdr instead of the safety loop:

buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
if (unlikely(buf_state & BM_LOCKED))
{
	/* Rare case: somebody holds the header lock; serialize with them. */
	buf_state = LockBufHdr(buf);
	UnlockBufHdr(buf, buf_state | new_bits);
}

I agree the code is cleaner this way. Will do in the next version.

--

regards,
Yura Sokolov





Re: logical replication empty transactions

2022-02-25 Thread Peter Smith
Hi. Here are my review comments for the v19 patch.

==

1. Commit message

The current logical replication behavior is to send every transaction to
subscriber even though the transaction is empty (because it does not
contain changes from the selected publications).

SUGGESTION
"to subscriber even though" --> "to the subscriber even if"

~~~

2. Commit message

This patch addresses the above problem by postponing the BEGIN message
until the first change. While processing a COMMIT message,
if there is no other change for that transaction,
do not send COMMIT message. It means that pgoutput will
skip BEGIN/COMMIT messages for transactions that are empty.

SUGGESTION
"if there is" --> "if there was"
"do not send COMMIT message" --> "do not send the COMMIT message"
"It means that pgoutput" --> "This means that pgoutput"

~~~

3. Commit message

Shouldn't there be some similar description about using a lazy send
mechanism for STREAM START?

~~~

4. src/backend/replication/pgoutput/pgoutput.c - typedef struct PGOutputTxnData

+/*
+ * Maintain a per-transaction level variable to track whether the
+ * transaction has sent BEGIN. BEGIN is only sent when the first
+ * change in a transaction is processed. This makes it possible
+ * to skip transactions that are empty.
+ */
+typedef struct PGOutputTxnData
+{
+   bool sent_begin_txn;/* flag indicating whether BEGIN has been sent */
+   bool sent_stream_start; /* flag indicating if stream start has been sent */
+   bool sent_any_stream;   /* flag indicating if any stream has been sent */
+} PGOutputTxnData;
+

The struct comment looks stale because it doesn't mention anything
about the similar lazy send mechanism for STREAM_START.

~~~

5. src/backend/replication/pgoutput/pgoutput.c - pgoutput_begin_txn

 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+ PGOutputTxnData*txndata = MemoryContextAllocZero(ctx->context,
+ sizeof(PGOutputTxnData));
+
+ txndata->sent_begin_txn = false;
+ txn->output_plugin_private = txndata;
+}

You don’t need to assign the other members 'sent_stream_start',
'sent_any_stream' because you are doing MemoryContextAllocZero anyway,
but for the same reason you did not really need to assign the
'sent_begin_txn' flag either.

I guess for consistency maybe it is better to (a) set all of them or
(b) set none of them. I prefer (b).
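
A sketch of what option (b) would look like, purely for illustration (reusing
the function quoted above):

static void
pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	/* All flags start out false courtesy of MemoryContextAllocZero() */
	txn->output_plugin_private = MemoryContextAllocZero(ctx->context,
			sizeof(PGOutputTxnData));
}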

~~~

6. src/backend/replication/pgoutput/pgoutput.c - pgoutput_begin

I feel the 'pgoutput_begin' function is not well named. It makes some
of the code where it is called look quite confusing.

For streaming there is:
1. pgoutput_stream_start (does not send)
2. pgoutput_send_stream_start (does send)
so it is very clear.

OTOH there are
3. pgoutput_begin_txn (does not send)
4. pgoutput_begin (does send)

For consistency I think the 'pgoutput_begin' name should be changed to
include "send" verb
1. pgoutput_begin_txn (does not send)
2. pgoutput_send_begin_txn (does send)

~~~

7. src/backend/replication/pgoutput/pgoutput.c - maybe_send_schema

@@ -594,6 +663,20 @@ maybe_send_schema(LogicalDecodingContext *ctx,
  if (schema_sent)
  return;

+   /* set up txndata */
+   txndata = toptxn->output_plugin_private;
+
+   /*
+* Before we send schema, make sure that STREAM START/BEGIN/BEGIN PREPARE
+* is sent. If not, send now.
+*/
+   if (in_streaming && !txndata->sent_stream_start)
+   pgoutput_send_stream_start(ctx, toptxn);
+   else if (txndata && !txndata->sent_begin_txn)
+   {
+   pgoutput_begin(ctx, toptxn);
+   }
+

How come the in_streaming case is not checking for a NULL txndata
before referencing it? Even if it is OK to do that, some more comments
or assertions might help for this piece of code.
(Stop-Press: see later comments #9, #10)

~~~

8. src/backend/replication/pgoutput/pgoutput.c - maybe_send_schema

@@ -594,6 +663,20 @@ maybe_send_schema(LogicalDecodingContext *ctx,
  if (schema_sent)
  return;

+   /* set up txndata */
+   txndata = toptxn->output_plugin_private;
+
+   /*
+* Before we send schema, make sure that STREAM START/BEGIN/BEGIN PREPARE
+* is sent. If not, send now.
+*/

What part of this code is doing anything about "BEGIN PREPARE" ?

~~~

9. src/backend/replication/pgoutput/pgoutput.c - pgoutput_change

@@ -1183,6 +1267,15 @@ pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
  Assert(false);
  }

+ /* If streaming, send STREAM START if we haven't yet */
+ if (in_streaming && (txndata && !txndata->sent_stream_start))
+ pgoutput_send_stream_start(ctx, txn);
+ /*
+ * Output BEGIN if we haven't yet, unless streaming.
+ */
+ else if (!in_streaming && (txndata && !txndata->sent_begin_txn))
+ pgoutput_begin(ctx, txn);
+

The above code fragment looks more like what I was expecting should
be in 'maybe_send_schema'.

If you expand it out (and tweak the comments) it can become much less
complex looking IMO

e.g.

if (in_streaming)
{
/* If streaming, send STREAM START if we haven't yet */
if (txndata && !txndata-
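
(The example is truncated here in the archive; a plausible completion, based
on the flags and helpers discussed above and offered as an assumption rather
than the original text, would be:)

if (in_streaming)
{
	/* If streaming, send STREAM START if we haven't yet */
	if (txndata && !txndata->sent_stream_start)
		pgoutput_send_stream_start(ctx, txn);
}
else
{
	/* Otherwise, send BEGIN if we haven't yet */
	if (txndata && !txndata->sent_begin_txn)
		pgoutput_begin(ctx, txn);
}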

Commitfest manager for 2022-03

2022-02-25 Thread Julien Rouhaud
Hi,

The final commitfest for pg15 will start in a few days, and I didn't see any
discussion on it or anyone volunteering to be a CFM.

I thought it would be a good idea to send this reminder now to avoid the same
situation as the last commitfest, and unnecessary pain for the CFM(s).

Is there any volunteer?

For the record there are already 246 active patches registered.




Re: PROXY protocol support

2022-02-25 Thread Magnus Hagander
On Tue, Nov 16, 2021 at 12:03 AM Jacob Champion  wrote:
>
> On Thu, 2021-11-04 at 12:03 +0100, Magnus Hagander wrote:
> > Thanks for the pointer, PFA a rebase.
>
> I think the Unix socket handling needs the same "success" fix that you
> applied to the TCP socket handling above it:
>
> > @@ -1328,9 +1364,23 @@ PostmasterMain(int argc, char *argv[])
> > ereport(WARNING,
> > (errmsg("could not create Unix-domain socket in 
> > directory \"%s\"",
> > socketdir)));
> > +
> > +   if (ProxyPortNumber)
> > +   {
> > +   socket = StreamServerPort(AF_UNIX, NULL,
> > + (unsigned short) ProxyPortNumber,
> > + socketdir,
> > + ListenSocket, MAXLISTEN);
> > +   if (socket)
> > +   socket->isProxy = true;
> > +   else
> > +   ereport(WARNING,
> > +   (errmsg("could not create Unix-domain PROXY 
> > socket for \"%s\"",
> > +   socketdir)));
> > +   }
> > }
> >
> > -   if (!success && elemlist != NIL)
> > +   if (socket == NULL && elemlist != NIL)
> > ereport(FATAL,
> > (errmsg("could not create any Unix-domain sockets")));
>
> Other than that, I can find nothing else to improve, and I think this
> is ready for more eyes than mine. :)

Here's another rebase on top of the AF_UNIX patch.



> To tie off some loose ends from upthread:
>
> I didn't find any MAXLISTEN documentation either, so I guess it's only
> a documentation issue if someone runs into it, heh.
>
> I was not able to find any other cases (besides ident) where using
> daddr instead of laddr would break things. I am going a bit snow-blind
> on the patch, though, and there's a lot of auth code.

Yeah, that's definitely a good reason for more eyes on it.


> A summary of possible improvements talked about upthread, for a future
> v2:
>
> - SQL functions to get the laddr info (scoped to superusers, somehow),
> if there's a use case for them
>
> - Setting up PROXY Unix socket permissions separately from the "main"
> socket
>
> - Allowing PROXY-only communication (disable the "main" port)

These all seem useful, but I'm liking the idea of putting them in a
v2, to avoid expanding the scope too much.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 02f0489112..a3ff09b3ac 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -353,6 +353,15 @@ hostnogssenc  database  user
 
+  
+   If  is enabled and the
+   connection is made through a proxy server using the PROXY
+   protocol, the actual IP address of the client will be used
+   for matching. If a connection is made through a proxy server
+   not using the PROXY protocol, the IP address of the
+   proxy server will be used.
+  
+
   
These fields do not apply to local records.
   
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7ed8c82a9d..e0847b6347 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -682,6 +682,56 @@ include_dir 'conf.d'
   
  
 
+ 
+  proxy_port (integer)
+  
+   proxy_port configuration parameter
+  
+  
+  
+   
+The TCP port the server listens on for PROXY connections, disabled by
+default. If set to a number, PostgreSQL
+will listen on this port on the same addresses as for regular
+connections, but expect all connections to use the PROXY protocol to
+identify the client.  This parameter can only be set at server start.
+   
+   
+If a proxy connection is made over this port, and the proxy is listed
+in , the actual client address
+will be considered as the address of the client, instead of listing
+all connections as coming from the proxy server.
+   
+   
+ The <ulink url="http://www.haproxy.org/download/1.9/doc/proxy-protocol.txt">PROXY</ulink>
+ protocol is maintained by HAProxy,
+ and supported in many proxies and load
+ balancers. PostgreSQL supports version 2
+ of the protocol.
+   
+  
+ 
+
+ 
+  proxy_servers (string)
+  
+   proxy_servers configuration parameter
+  
+  
+  
+   
+A comma-separated list of one or more IP addresses, CIDR specifications or the
+literal unix, indicating which proxy servers to trust when
+connecting on the port specified in .
+   
+   
+If a proxy connection is made from an IP address not covered by this
+list, the connection will be rejected. By default no proxy is trusted
+and all proxy conne

Re: Some optimisations for numeric division

2022-02-25 Thread Dean Rasheed
On Wed, 23 Feb 2022 at 22:55, Tom Lane  wrote:
>
> Dean Rasheed  writes:
>
> > One thought that occurred to me was that it's a bit silly that
> > exp_var() and ln_var() have to use a NumericVar for what could just be
> > an int, if we had a div_var_int() function that could divide by an
> > int. Then both div_var() and div_var_fast() could hand off to it for
> > one and two digit divisors.
>
> Oooh, that seems like a good idea.
>

OK, I've replaced the 0003 patch with one that does that instead. The
div_var_int() API is slightly different in that it also accepts a
divisor weight argument, but the alternative would have been for the
callers to have to adjust both the result weight and rscale, which
would have been uglier.
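
For concreteness, the signature would presumably look something like this (a
sketch inferred from the description above; the posted patch may differ):

static void
div_var_int(const NumericVar *var, int ival, int ival_weight,
			NumericVar *result, int rscale, bool round);

so that div_var() and div_var_fast() can hand off a one- or two-digit divisor
as a single int together with its weight.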

There's a large block of code in div_var() that needs re-indenting,
but I think it would be better to leave that to a later pgindent run.

The performance results are quite pleasing. It's slightly faster than
the old one-digit div_var() code because it manages to avoid some
digit array copying, and for two digit divisors it's much faster:

CREATE TEMP TABLE div_test(x numeric, y numeric);
SELECT setseed(0);
INSERT INTO div_test
  SELECT (SELECT (((x%9)+1)||string_agg((random()*9)::int::text, ''))::numeric
  FROM generate_series(1,50)),
 (SELECT ('1.'||string_agg((random()*9)::int::text,
'')||(x%10)||'e3')::numeric
  FROM generate_series(1,6))
  FROM generate_series(1,5000) g(x);
select * from div_test limit 10;

SELECT sum(t1.x/t2.y) FROM div_test t1, div_test t2;

Time: 11600.034 ms  (HEAD)
Time: 9890.788 ms   (with 0002)
Time: 6202.851 ms   (with 0003)

And obviously it'll be a larger relative gain for div_var_fast(),
since that was slower to begin with in such cases.

This makes me think that it might also be worthwhile to follow this
with a similar div_var_int64() function on platforms with 128-bit
integers, which could then be used to handle 3- and 4-digit divisors,
which are probably quite common in practice.

Attached is the updated patch series (0001 and 0002 unchanged).

Regards,
Dean
From 481b7c0044afbe72adff1961624a1bb515791b96 Mon Sep 17 00:00:00 2001
From: Dean Rasheed 
Date: Sun, 20 Feb 2022 17:15:49 +
Subject: [PATCH 2/3] Simplify the inner loop of numeric division in div_var().

In the standard numeric division algorithm, the inner loop multiplies
the divisor by the next quotient digit and subtracts that from the
working dividend. As suggested by the original code comment, the
separate "carry" and "borrow" variables (from the multiplication and
subtraction steps respectively) can be folded together into a single
variable. Doing so significantly improves performance, as well as
simplifying the code.
---
 src/backend/utils/adt/numeric.c | 36 ++---
 1 file changed, 15 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index fc43d2a456..909f4eed74 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8605,31 +8605,25 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 			/* As above, need do nothing more when quotient digit is 0 */
 			if (qhat > 0)
 			{
+NumericDigit *dividend_j = &dividend[j];
+
 /*
  * Multiply the divisor by qhat, and subtract that from the
- * working dividend.  "carry" tracks the multiplication,
- * "borrow" the subtraction (could we fold these together?)
+ * working dividend.  The multiplication and subtraction are
+ * folded together here, noting that qhat <= NBASE (since it
+ * might be one too large), and so the intermediate result
+ * "tmp_result" is in the range [-NBASE^2, NBASE - 1], and
+ * "borrow" is in the range [0, NBASE].
  */
-carry = 0;
 borrow = 0;
 for (i = var2ndigits; i >= 0; i--)
 {
-	carry += divisor[i] * qhat;
-	borrow -= carry % NBASE;
-	carry = carry / NBASE;
-	borrow += dividend[j + i];
-	if (borrow < 0)
-	{
-		dividend[j + i] = borrow + NBASE;
-		borrow = -1;
-	}
-	else
-	{
-		dividend[j + i] = borrow;
-		borrow = 0;
-	}
+	int			tmp_result;
+
+	tmp_result = dividend_j[i] - borrow - divisor[i] * qhat;
+	borrow = (NBASE - 1 - tmp_result) / NBASE;
+	dividend_j[i] = tmp_result + borrow * NBASE;
 }
-Assert(carry == 0);
 
 /*
  * If we got a borrow out of the top dividend digit, then
@@ -8645,15 +8639,15 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	carry = 0;
 	for (i = var2ndigits; i >= 0; i--)
 	{
-		carry += dividend[j + i] + divisor[i];
+		carry += dividend_j[i] + divisor[i];
 		if (carry >= NBASE)
 		{
-			dividend[j + i] = carry - NBASE;
+			dividend_j[i] = carry - NBASE;
 			carry = 1;
 		}
 		else
 		{
-			dividend[j + i] = carry;
+			dividend_j[i] = carry;
 			carry = 0;
 		}
 	}
-- 
2.26.2

From 41732ad9a

Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-02-25 Thread Bharath Rupireddy
On Wed, Feb 16, 2022 at 9:04 AM Ashutosh Sharma  wrote:
> I don't think that's the use case of this patch. Unless there is some
> other valid reason, I would suggest you remove it.

Removed the function pg_verify_raw_wal_record. Robert and Greg also
voted for removal upthread.

> > > Should we add a function that returns the pointer to the first and
> > > probably the last WAL record in the WAL segment? This would help users
> > > to inspect the wal records in the entire wal segment if they wish to
> > > do so.
> >
> > Good point. One can do this already with pg_get_wal_records_info and
> > pg_walfile_name_offset. Usually, the LSN format itself can give an
> > idea about the WAL file it is in.
> >
> > postgres=# select lsn, pg_walfile_name_offset(lsn) from
> > pg_get_wal_records_info('0/500', '0/5FF') order by lsn asc
> > limit 1;
> > lsn|pg_walfile_name_offset
> > ---+---
> >  0/538 | (00010005,56)
> > (1 row)
> >
> > postgres=# select lsn, pg_walfile_name_offset(lsn) from
> > pg_get_wal_records_info('0/500', '0/5FF') order by lsn desc
> > limit 1;
> > lsn|   pg_walfile_name_offset
> > ---+-
> >  0/5C0 | (00010005,16777152)
> > (1 row)
> >
>
> The workaround you are suggesting is not very user friendly and FYKI
> pg_wal_records_info simply hangs at times when we specify the higher
> and lower limit of lsn in a wal file.
>
> To make things easier for the end users I would suggest we add a
> function that can return a valid first and last lsn in a walfile. The
> output of this function can be used to inspect the wal records in the
> entire wal file if they wish to do so and I am sure they will. So, it
> should be something like this:
>
> select first_valid_lsn, last_valid_lsn from
> pg_get_first_last_valid_wal_record('wal-segment-name');
>
> And above function can directly be used with pg_get_wal_records_info() like
>
> select 
> pg_get_wal_records_info(pg_get_first_last_valid_wal_record('wal-segment'));
>
> I think this is a pretty basic ASK that we expect to be present in the
> module like this.

Added a new function that returns the first and last valid WAL record
LSN of a given WAL file.

> > > +PG_FUNCTION_INFO_V1(pg_get_raw_wal_record);
> > > +PG_FUNCTION_INFO_V1(pg_get_first_valid_wal_record_lsn);
> > > +PG_FUNCTION_INFO_V1(pg_verify_raw_wal_record);
> > > +PG_FUNCTION_INFO_V1(pg_get_wal_record_info);
> > > +PG_FUNCTION_INFO_V1(pg_get_wal_records_info);
> > >
> > > I think we should allow all these functions to be executed in wait and
> > > *nowait* mode. If a user specifies nowait mode, the function should
> > > return if no WAL data is present, rather than waiting for new WAL data
> > > to become available, default behaviour could be anything you like.
> >
> > Currently, pg_walinspect uses read_local_xlog_page which waits in the
> > while(1) loop if a future LSN is specified. As read_local_xlog_page is
> > an implementation of XLogPageReadCB, which doesn't have a wait/nowait
> > parameter, if we really need a wait/nowait mode behaviour, we need to
> > do extra things(either add a backend-level global wait variable, set
> > before XLogReadRecord, if set, read_local_xlog_page can just exit
> > without waiting and reset after the XLogReadRecord or add an extra
> > bool wait variable to XLogReaderState and use it in
> > read_local_xlog_page).
> >
>
> I am not asking to do any changes in the backend code. Please check -
> how pg_waldump does this when a user requests to stop once the endptr
> has reached. If not for all functions at least for a few functions we
> can do this if it is doable.

I've added a new function read_local_xlog_page_2 (similar to
read_local_xlog_page but works in wait and no wait mode) and the
callers can specify whether to wait or not wait using private_data.
Actually, I wanted to use the private_data structure of
read_local_xlog_page but the logical decoding already has context as
private_data, that is why I had to have a new function. I know it
creates a bit of duplicate code, but it's cleaner than using
backend-local variables or additional flags in XLogReaderState or
adding wait/no-wait boolean to page_read callback. Any other
suggestions are welcome here.

With this, I'm able to have wait/no wait versions for any functions.
But for now, I'm having wait/no wait for two functions
(pg_get_wal_records_info and pg_get_wal_stats) for which it makes more
sense.
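
(To illustrate the private_data approach described above, a sketch along these
lines; the struct and field names are assumptions and not necessarily what the
attached patch uses:)

typedef struct ReadLocalXLogPageNoWaitPrivate
{
	bool		no_wait;		/* caller asked for no-wait behaviour */
	bool		end_of_wal;		/* set when we hit the end of WAL */
} ReadLocalXLogPageNoWaitPrivate;

/* inside the page-read callback, instead of looping until WAL appears: */
if (private->no_wait && read_upto <= targetPagePtr)
{
	private->end_of_wal = true;
	return -1;					/* nothing to read; let the caller stop */
}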

> > > +Datum
> > > +pg_get_wal_records_info(PG_FUNCTION_ARGS)
> > > +{
> > > +#define PG_GET_WAL_RECORDS_INFO_COLS 10
> > >
> > >
> > > We could probably have another variant of this function that would
> > > work even if the end pointer is not specified, in which case the
> > > default end pointer would be the last WAL record in the WAL segment.
> > > Currently it mandates the use of an end pointer which slightly reduces
> > > flexibility.
> >
> > Last WAL record in 

Re: logical replication empty transactions

2022-02-25 Thread Ajin Cherian
On Fri, Feb 18, 2022 at 9:27 PM Amit Kapila  wrote:
>
>
> Yeah, I think there could be multiple ways (a) We can send such a keep
> alive in WalSndUpdateProgress() itself by using ctx->write_location.
> For this, we need to modify WalSndKeepalive() to take sentPtr as
> input. (b) set some flag in WalSndUpdateProgress() and then do it
> somewhere in WalSndLoop probably in WalSndKeepaliveIfNecessary, or
> maybe there is another better way.
>

Thanks for the suggestion, Amit and Osumi-san. I experimented with both
suggestions but finally decided to use
(a): modifying WalSndKeepalive() to take an LSN optionally as input and
passing in the ctx->write_location.

I also verified that if I block the WalSndKeepalive() in
WalSndWaitForWal, then my new code sends the keepalive
when skipping transactions and the syncrep gets back feedback.
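
(For reference, option (a) amounts to something like the following sketch; the
signature and body are inferred from the description above and the existing
walsender code, so the attached patch may differ:)

static void
WalSndKeepalive(bool requestReply, XLogRecPtr writePtr)
{
	elog(DEBUG2, "sending replication keepalive");

	/* construct the keepalive message... */
	resetStringInfo(&output_message);
	pq_sendbyte(&output_message, 'k');
	pq_sendint64(&output_message,
				 XLogRecPtrIsInvalid(writePtr) ? sentPtr : writePtr);
	pq_sendint64(&output_message, GetCurrentTimestamp());
	pq_sendbyte(&output_message, requestReply ? 1 : 0);

	/* ... and send it wrapped in CopyData */
	pq_putmessage_noblock('d', output_message.data, output_message.len);
}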

I will address comments from Peter and Wang in my next patch update.

regards,
Ajin Cherian
Fujitsu Australia


v20-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Add LZ4 compression in pg_dump

2022-02-25 Thread Georgios
Hi,

please find attached a patchset which adds lz4 compression in pg_dump.

The first commit does the heavy lifting required for additional compression 
methods.
It expands testing coverage for the already supported gzip compression. Commit
bf9aa490db introduced cfp in compress_io.{c,h} with the intent of unifying
compression-related code and allowing for the introduction of additional archive
formats. However, pg_backup_archiver.c was not using that API. This commit
teaches pg_backup_archiver.c about cfp and uses it throughout.

Furthermore, compression was chosen based on the value of the level passed
as an argument during the invocation of pg_dump, or on some hardcoded defaults.
This does not scale to more than one compression method. Now the method used for
compression can be explicitly requested during command invocation, or set by
hardcoded defaults. It is then stored in the relevant structs and passed to the
relevant functions, alongside the compression level, which has lost its special
meaning. The method for compression is not yet stored in the actual archive.
This is done in the next commit, which does introduce a new method.

The previously named CompressionAlgorithm enum is renamed to
CompressionMethod so that it better matches similar variables found throughout
the code base.

In a fashion similar to the pg_basebackup binary, the method for compression
is passed using the already existing -Z/--compress parameter of pg_dump. The
legacy format and behaviour are maintained. Additionally, the user can explicitly
pass a requested method and optionally the level to be used after a
colon, e.g. --compress=gzip:6
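
(By way of illustration, parsing such a value could look like the sketch
below; the helper name, the CompressionMethod values, and the parsing details
are hypothetical rather than code from this patch, and the usual
string.h/stdlib.h includes are assumed:)

/* Parse "method[:level]", e.g. "gzip:6" (illustrative only) */
static void
parse_compress_options(const char *opt, CompressionMethod *method, int *level)
{
	const char *sep = strchr(opt, ':');
	size_t		len = sep ? (size_t) (sep - opt) : strlen(opt);

	if (sep != NULL)
		*level = atoi(sep + 1);		/* level given after the colon */

	if (len == 4 && strncmp(opt, "gzip", len) == 0)
		*method = COMPRESSION_GZIP;
	else if (len == 3 && strncmp(opt, "lz4", len) == 0)
		*method = COMPRESSION_LZ4;
	else if (len == 4 && strncmp(opt, "none", len) == 0)
		*method = COMPRESSION_NONE;
}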

The second commit adds LZ4 compression in pg_dump and pg_restore.

Within compress_io.{c,h} there are two distinct APIs exposed, the streaming API
and a file API. The first one is aimed at inlined use cases, and thus simple
lz4.h calls can be used directly. The second one generates output, or parses
input, which can be read/generated via the lz4 utility.

In the latter case, the API uses an opaque wrapper around a file stream,
which is acquired via fopen() or gzopen() respectively. It then provides
wrappers around fread(), fwrite(), fgets(), fgetc(), feof(), and fclose(), or
their gz equivalents. However, the LZ4F API does not provide this functionality,
so it has been implemented locally.

In order to maintain API compatibility, a new structure, LZ4File, is
introduced. It is responsible for keeping state and any not-yet-consumed
generated content. The latter is required when the generated decompressed
output exceeds the caller's buffer capacity.
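
(A rough sketch of what such a wrapper might hold, using the LZ4F context
types from lz4frame.h; the field names are assumptions for illustration, not
the actual definition from the patch:)

typedef struct LZ4File
{
	FILE	   *fp;				/* underlying file stream */
	LZ4F_compressionContext_t cctx;		/* LZ4 frame compression context */
	LZ4F_decompressionContext_t dctx;	/* LZ4 frame decompression context */
	char	   *buffer;			/* decompressed but not yet consumed data */
	size_t		buflen;			/* amount of valid data in buffer */
	size_t		pos;			/* current read position within buffer */
	bool		inited;			/* frame header handled yet? */
	bool		compressing;	/* opened for writing vs. reading */
} LZ4File;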

Custom compressed archives now need to store the compression method in their
header. This requires a bump in the version number. The level of compression is
still stored in the dump, though admittedly it is of no apparent use.

The series is authored by me. Rachel Heaton helped out with the expansion
of the testing coverage, testing on different platforms and providing debug
information on those, as well as native-speaker wording.

Cheers,
//GeorgiosFrom 8dfe12b08378571add6e1d178bed97a61e988593 Mon Sep 17 00:00:00 2001
From: Georgios Kokolatos 
Date: Mon, 22 Nov 2021 09:58:49 +
Subject: [PATCH v1 1/2] Prepare pg_dump for additional compression methods

This commit does the heavy lifting required for additional compression methods.
It expands testing coverage for the already supported gzip compression. Commit
bf9aa490db introduced cfp in compress_io.{c,h} with the intent of unifying
compression-related code and allowing for the introduction of additional archive
formats. However, pg_backup_archiver.c was not using that API. This commit
teaches pg_backup_archiver.c about cfp and uses it throughout.

Furthermore, compression was chosen based on the value of the level passed
as an argument during the invocation of pg_dump, or on some hardcoded defaults.
This does not scale to more than one compression method. Now the method used for
compression can be explicitly requested during command invocation, or set by
hardcoded defaults. It is then stored in the relevant structs and passed to the
relevant functions, alongside the compression level, which has lost its special
meaning. The method for compression is not yet stored in the actual archive.
This is done in the next commit, which does introduce a new method.

The previously named CompressionAlgorithm enum is renamed to
CompressionMethod so that it better matches similar variables found throughout
the code base.

In a fashion similar to the pg_basebackup binary, the method for compression
is passed using the already existing -Z/--compress parameter of pg_dump. The
legacy format and behaviour are maintained. Additionally, the user can explicitly
pass a requested method and optionally the level to be used after a colon,
e.g. --compress=gzip:6
---
 doc/src/sgml/ref/pg_dump.sgml |  30 +-
 src/bin/pg_dump/Makefile  |   2 +
 src/bin/pg_dump/compress_io.c |

Re: Typo in pgbench messages.

2022-02-25 Thread Fabien COELHO



Hello Daniel,

I think that the break of typographical rules is intentional to allow 
such simplistic low-level stream handling through pipes or scripts. I'd 
prefer that the format is not changed. Maybe a comment could be added 
to explain the reason behind it.


That doesn't sound like an overwhelmingly convincing argument to print 
some messages with 'X %' and others with 'X%'.


Indeed. The no-space % are for database loading progress and the final
report, which I happen not to process through pipes and which are more
user-facing interfaces/reports. The stream-piping need applies more to
repeated lines such as those output by progress, which happen to avoid
percentages anyway, so the question does not arise.


So fine with me wrt having a more homogeneous report.

--
Fabien.




Re: Some optimisations for numeric division

2022-02-25 Thread Dean Rasheed
On Fri, 25 Feb 2022 at 10:45, Dean Rasheed  wrote:
>
> Attached is the updated patch series (0001 and 0002 unchanged).
>

And another update following feedback from the cfbot.

Regards,
Dean
From 41732ad9a44dcd12e52d823fb59cb23cce4fe217 Mon Sep 17 00:00:00 2001
From: Dean Rasheed 
Date: Sun, 20 Feb 2022 12:45:01 +
Subject: [PATCH 1/3] Apply auto-vectorization to the inner loop of
 div_var_fast().

This loop is basically the same as the inner loop of mul_var(), which
was auto-vectorized in commit 8870917623, but the compiler will only
consider auto-vectorizing the div_var_fast() loop if the assignment
target div[qi + i] is replaced by div_qi[i], where div_qi = &div[qi].

Additionally, since the compiler doesn't know that qdigit is
guaranteed to fit in a 16-bit NumericDigit, cast it to NumericDigit
before multiplying to make the resulting auto-vectorized code more
efficient.
---
 src/backend/utils/adt/numeric.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 3208789f75..fc43d2a456 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8908,13 +8908,22 @@ div_var_fast(const NumericVar *var1, const NumericVar *var2,
 			 * which would make the new value simply div[qi] mod vardigits[0].
 			 * The lower-order terms in qdigit can change this result by not
 			 * more than about twice INT_MAX/NBASE, so overflow is impossible.
+			 *
+			 * This inner loop is the performance bottleneck for division, so
+			 * code it in the same way as the inner loop of mul_var() so that
+			 * it can be auto-vectorized.  We cast qdigit to NumericDigit
+			 * before multiplying to allow the compiler to generate more
+			 * efficient code (using 16-bit multiplication), which is safe
+			 * since we know that the quotient digit is off by at most one, so
+			 * there is no overflow risk.
 			 */
 			if (qdigit != 0)
 			{
 int			istop = Min(var2ndigits, div_ndigits - qi + 1);
+int		   *div_qi = &div[qi];
 
 for (i = 0; i < istop; i++)
-	div[qi + i] -= qdigit * var2digits[i];
+	div_qi[i] -= ((NumericDigit) qdigit) * var2digits[i];
 			}
 		}
 
-- 
2.26.2

From 481b7c0044afbe72adff1961624a1bb515791b96 Mon Sep 17 00:00:00 2001
From: Dean Rasheed 
Date: Sun, 20 Feb 2022 17:15:49 +
Subject: [PATCH 2/3] Simplify the inner loop of numeric division in div_var().

In the standard numeric division algorithm, the inner loop multiplies
the divisor by the next quotient digit and subtracts that from the
working dividend. As suggested by the original code comment, the
separate "carry" and "borrow" variables (from the multiplication and
subtraction steps respectively) can be folded together into a single
variable. Doing so significantly improves performance, as well as
simplifying the code.
---
 src/backend/utils/adt/numeric.c | 36 ++---
 1 file changed, 15 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index fc43d2a456..909f4eed74 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8605,31 +8605,25 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 			/* As above, need do nothing more when quotient digit is 0 */
 			if (qhat > 0)
 			{
+NumericDigit *dividend_j = &dividend[j];
+
 /*
  * Multiply the divisor by qhat, and subtract that from the
- * working dividend.  "carry" tracks the multiplication,
- * "borrow" the subtraction (could we fold these together?)
+ * working dividend.  The multiplication and subtraction are
+ * folded together here, noting that qhat <= NBASE (since it
+ * might be one too large), and so the intermediate result
+ * "tmp_result" is in the range [-NBASE^2, NBASE - 1], and
+ * "borrow" is in the range [0, NBASE].
  */
-carry = 0;
 borrow = 0;
 for (i = var2ndigits; i >= 0; i--)
 {
-	carry += divisor[i] * qhat;
-	borrow -= carry % NBASE;
-	carry = carry / NBASE;
-	borrow += dividend[j + i];
-	if (borrow < 0)
-	{
-		dividend[j + i] = borrow + NBASE;
-		borrow = -1;
-	}
-	else
-	{
-		dividend[j + i] = borrow;
-		borrow = 0;
-	}
+	int			tmp_result;
+
+	tmp_result = dividend_j[i] - borrow - divisor[i] * qhat;
+	borrow = (NBASE - 1 - tmp_result) / NBASE;
+	dividend_j[i] = tmp_result + borrow * NBASE;
 }
-Assert(carry == 0);
 
 /*
  * If we got a borrow out of the top dividend digit, then
@@ -8645,15 +8639,15 @@ div_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	carry = 0;
 	for (i = var2ndigits; i >= 0; i--)
 	{
-		carry += dividend[j + i] + divisor[i];
+		carry += dividend_j[i] + divisor[i];
 		if (carry >= NBASE)
 		{
-			dividend[j + i] = carry - NBASE;
+			dividend_j[i] = carry - NBASE;
 			carry = 1;
 			

RE: Optionally automatically disable logical replication subscriptions on error

2022-02-25 Thread osumi.takami...@fujitsu.com
On Friday, February 25, 2022 12:57 PM Amit Kapila  
wrote:
> On Thu, Feb 24, 2022 at 6:30 PM Masahiko Sawada
>  wrote:
> >
> > On Thu, Feb 24, 2022 at 8:08 PM Amit Kapila 
> wrote:
> > >
> > > On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada
>  wrote:
> > > >
> > > > Here are some comments:
> > > >
> > > > Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?
> > > >
> > >
> > > I have given this comment to move the related code to separate
> > > functions to slightly simplify ApplyWorkerMain() code but if you
> > > don't like it we can move it back. I am not sure I like the new
> > > function names in the patch though.
> >
> > Okay, I'm fine with moving this code but perhaps we can find a better
> > function name as "Wrapper" seems slightly odd to me.
> >
> 
> Agreed.
> 
> > For example,
> > start_table_sync_start() and start_apply_changes() or something (it
> > seems we use the snake case for static functions in worker.c).
> >
> 
> I am fine with something on these lines, how about start_table_sync() and
> start_apply() respectively?
Adopted. (If we come up with better names, we can change those then)

Kindly have a look at attached the v22.
It also incorporates other improvements to the TAP test,
refinement of the DisableSubscriptionOnError function, and so on.

Best Regards,
Takamichi Osumi



v22-0001-Optionally-disable-subscriptions-on-error.patch
Description: v22-0001-Optionally-disable-subscriptions-on-error.patch


RE: Optionally automatically disable logical replication subscriptions on error

2022-02-25 Thread osumi.takami...@fujitsu.com
On Thursday, February 24, 2022 8:09 PM Amit Kapila 
> On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada
>  wrote:
> > +   /*
> > +* Log the error that caused DisableSubscriptionOnError to be
> called. We
> > +* do this immediately so that it won't be lost if some other 
> > internal
> > +* error occurs in the following code.
> > +*/
> > +   EmitErrorReport();
> > +   AbortOutOfAnyTransaction();
> > +   FlushErrorState();
> >
> > Do we need to hold interrupts during cleanup here?
> >
> 
> I think so. We do prevent interrupts via
> HOLD_INTERRUPTS/RESUME_INTERRUPTS during cleanup.
Fixed.

Kindly have a look at v22 shared in [1].

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread David Christensen
> Cool.  I think we can report an error instead of reading wal files,
> if the tablespace, database, or relation is invalid.  Is there any
> WAL record that has an invalid tablespace, database, or relation OID?

The only sort of validity check we could do here is range checking for the 
underlying data types (which we certainly could/should add if it’s known to 
never be valid for the underlying types); non-existence of objects is a no-go, 
since that depends purely on the WAL range you are looking at and you’d have 
to, you know, scan it to see if it existed before marking as invalid. :)
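
(For what such range checking could look like, a hypothetical sketch; none of
this is from the posted patch. OID_MAX is the maximum value of the unsigned
32-bit Oid type:)

char	   *endptr;
unsigned long val = strtoul(optarg, &endptr, 10);

if (*endptr != '\0' || val == 0 || val > OID_MAX)
{
	pg_log_error("invalid relation OID: \"%s\"", optarg);
	exit(1);
}
relid = (Oid) val;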

Thanks,

David





RE: Optionally automatically disable logical replication subscriptions on error

2022-02-25 Thread osumi.takami...@fujitsu.com
On Thursday, February 24, 2022 4:50 PM Masahiko Sawada  
wrote:
> On Tue, Feb 22, 2022 at 3:03 PM Peter Smith 
> wrote:
> >
> > ~~~
> >
> > 1. worker.c - comment
> >
> > + subform = (Form_pg_subscription) GETSTRUCT(tup);
> > +
> > + /*
> > + * We would not be here unless this subscription's disableonerr field
> > + was
> > + * true, but check whether that field has changed in the interim.
> > + */
> > + if (!subform->subdisableonerr)
> > + {
> > + heap_freetuple(tup);
> > + table_close(rel, RowExclusiveLock);
> > + CommitTransactionCommand();
> > + return false;
> > + }
> >
> > I felt that comment belongs above the subform assignment because that
> > is the only reason we are getting the subform again.
> 
> IIUC if we return false here, the same error will be emitted twice.
> And I'm not sure this check is really necessary. It would work only when the
> subdisableonerr is set to false concurrently, but doesn't work for the opposite
> cases. I think we can check
> MySubscription->disableonerr and then just update the tuple.
Addressed. I followed your advice and deleted the check.


Kindly have a look at v22 shared in [1].


[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



RE: Optionally automatically disable logical replication subscriptions on error

2022-02-25 Thread osumi.takami...@fujitsu.com
On Wednesday, February 23, 2022 6:52 PM Tang, Haiying/唐 海英 
 wrote:
> I have a comment on v21 patch.
> 
> I wonder if we really need subscription s2 in 028_disable_on_error.pl. I 
> think for
> subscription s2, we only tested some normal cases (which could be tested with
> s1), and didn't test any error case, which means it wouldn't be automatically
> disabled.
> Is there any reason for creating subscription s2?
Removed the subscription s2.

This has reduced the code amount of TAP tests.
Kindly have a look at the v22 shared in [1].

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi



RE: Optionally automatically disable logical replication subscriptions on error

2022-02-25 Thread osumi.takami...@fujitsu.com
On Tuesday, February 22, 2022 3:03 PM Peter Smith  wrote:
> Here are a couple more review comments for v21.
> 
> ~~~
> 
> 1. worker.c - comment
> 
> + subform = (Form_pg_subscription) GETSTRUCT(tup);
> +
> + /*
> + * We would not be here unless this subscription's disableonerr field
> + was
> + * true, but check whether that field has changed in the interim.
> + */
> + if (!subform->subdisableonerr)
> + {
> + heap_freetuple(tup);
> + table_close(rel, RowExclusiveLock);
> + CommitTransactionCommand();
> + return false;
> + }
> 
> I felt that comment belongs above the subform assignment because that is the
> only reason we are getting the subform again.
This part has been removed along with the modification
that we just disable the subscription in the main processing
when we get an error.

 
> ~~
> 
> 2. worker.c - subform->oid
> 
> + /* Notify the subscription will be no longer valid */ ereport(LOG,
> + errmsg("logical replication subscription \"%s\" will be disabled due
> to an error",
> +MySubscription->name));
> +
> + LockSharedObject(SubscriptionRelationId, subform->oid, 0,
> AccessExclusiveLock);
> 
> Can't we just use MySubscription->oid here? We really only needed that
> subform to get new option values.
Fixed.


> ~~
> 
> 3. worker.c - whitespace
> 
> Your pg_indent has also changed some whitespace for parts of worker.c that
> are completely unrelated to this patch. You might want to revert those 
> changes.
Fixed.

Kindly have a look at v22 that took in all your comments.
It's shared in [1].

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



Re: should vacuum's first heap pass be read-only?

2022-02-25 Thread Dilip Kumar
On Mon, Feb 7, 2022 at 10:06 PM Robert Haas  wrote:
>
> On Fri, Feb 4, 2022 at 4:12 PM Peter Geoghegan  wrote:
> > I had imagined that we'd
> > want to do heap vacuuming in the same way as today with the dead TID
> > conveyor belt stuff -- it just might take several VACUUM operations
> > until we are ready to do a round of heap vacuuming.
>
> I am trying to understand exactly what you are imagining here. Do you
> mean we'd continue to lazy_scan_heap() at the start of every vacuum,
> and lazy_vacuum_heap_rel() at the end? I had assumed that we didn't
> want to do that, because we might already know from the conveyor belt
> that there are some dead TIDs that could be marked unused, and it
> seems strange to just ignore that knowledge at a time when we're
> scanning the heap anyway. However, on reflection, that approach has
> something to recommend it, because it would be somewhat simpler to
> understand what's actually being changed. We could just:
>
> 1. Teach lazy_scan_heap() that it should add TIDs to the conveyor
> belt, if we're using one, unless they're already there, but otherwise
> work as today.
>
> 2. Teach lazy_vacuum_heap_rel() that it, if there is a conveyor belt,
> it should try to clear from the indexes all of the dead TIDs that are
> eligible.
>
> 3. If there is a conveyor belt, use some kind of magic to decide when
> to skip vacuuming some or all indexes. When we skip one or more
> indexes, the subsequent lazy_vacuum_heap_rel() can't possibly mark as
> unused any of the dead TIDs we found this time, so we should just skip
> it, unless somehow there are TIDs on the conveyor belt that were
> already ready to be marked unused at the start of this VACUUM, in
> which case we can still handle those.

Based on this discussion, IIUC, we are saying that we will continue to
do lazy_scan_heap() every time, as we do now.  And we will conditionally
skip the index vacuum for all or some of the indexes, and then, based on
how much index vacuuming was done, we will conditionally do
lazy_vacuum_heap_rel().  Is my understanding correct?

IMHO, if we do the heap scan every time, then we will get the same dead
items again that we had previously collected in the conveyor belt.  I
agree that we will not add them to the conveyor belt again, but why do
we want to store them in the conveyor belt at all when we are going to
redo the whole scan anyway?

I think (without global indexes) the main advantage of using the
conveyor belt is that if we skip the index scan for some of the indexes,
we can save the dead items somewhere, so that we have them available for
a future index vacuum without scanning the heap again.  But if we are
going to rescan the heap next time before doing any index vacuuming,
then why store them at all?

IMHO, what we should do is: if there are not many new dead tuples in
the heap (total dead tuples based on the statistics minus the existing
items in the conveyor belt), then we should conditionally skip the heap
scanning (first pass) and jump directly to the index vacuuming for some
or all of the indexes, based on how bloated each index is.  A rough
sketch of that decision is below.
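
(Illustrative pseudocode only; the function and variable names are
invented for the example, not taken from the patch:)

    if (est_dead_tuples - conveyor_belt_items < new_dead_threshold)
    {
        /* few new dead tuples: work off the conveyor belt contents */
        vacuum_indexes_from_conveyor_belt(rel, only_bloated_indexes);
    }
    else
    {
        /* enough new garbage to justify a fresh heap scan */
        lazy_scan_heap(rel);    /* also adds the new TIDs to the belt */
    }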

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Expose JIT counters/timing in pg_stat_statements

2022-02-25 Thread Magnus Hagander
Here's a patch to add the sum of timings for JIT counters to
pg_stat_statements, as a way to follow-up on if JIT is doing a good or
a bad job in a configuration.

I decided to only store the total time for the timings, since there
are 4 different timings and storing max/min/etc for each one of them
would lead to a bit too much data. This can of course be reconsidered,
but I think that's a reasonable tradeoff.

Another option I guess could be to store the max/min/etc, but only
store for the total jit time instead of for each individual one. Maybe
that'd actually be more useful?
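
Either way, as a sketch of how the new columns could be consumed once
the attached patch is applied (column names as added by the patch):

    SELECT queryid,
           jit_functions,
           jit_generation_time + jit_inlining_time +
           jit_optimization_time + jit_emission_time AS jit_total_time,
           total_exec_time
    FROM pg_stat_statements
    ORDER BY jit_total_time DESC
    LIMIT 10;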


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 7fabd96f38..d9eacfb364 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,7 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
+DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.9--1.10.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql b/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql
new file mode 100644
index 00..19b16874b4
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql
@@ -0,0 +1,62 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.10'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(boolean);
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements(boolean);
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+OUT userid oid,
+OUT dbid oid,
+OUT toplevel bool,
+OUT queryid bigint,
+OUT query text,
+OUT plans int8,
+OUT total_plan_time float8,
+OUT min_plan_time float8,
+OUT max_plan_time float8,
+OUT mean_plan_time float8,
+OUT stddev_plan_time float8,
+OUT calls int8,
+OUT total_exec_time float8,
+OUT min_exec_time float8,
+OUT max_exec_time float8,
+OUT mean_exec_time float8,
+OUT stddev_exec_time float8,
+OUT rows int8,
+OUT shared_blks_hit int8,
+OUT shared_blks_read int8,
+OUT shared_blks_dirtied int8,
+OUT shared_blks_written int8,
+OUT local_blks_hit int8,
+OUT local_blks_read int8,
+OUT local_blks_dirtied int8,
+OUT local_blks_written int8,
+OUT temp_blks_read int8,
+OUT temp_blks_written int8,
+OUT blk_read_time float8,
+OUT blk_write_time float8,
+OUT wal_records int8,
+OUT wal_fpi int8,
+OUT wal_bytes numeric,
+OUT jit_functions int8,
+OUT jit_generation_time float8,
+OUT jit_inlining_time float8,
+OUT jit_optimization_time float8,
+OUT jit_emission_time float8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_10'
+LANGUAGE C STRICT VOLATILE PARALLEL SAFE;
+
+CREATE VIEW pg_stat_statements AS
+  SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 38d92a89cc..b802f67db2 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -52,6 +52,7 @@
 #include "common/hashfn.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
+#include "jit/jit.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "optimizer/planner.h"
@@ -121,7 +122,8 @@ typedef enum pgssVersion
 	PGSS_V1_2,
 	PGSS_V1_3,
 	PGSS_V1_8,
-	PGSS_V1_9
+	PGSS_V1_9,
+	PGSS_V1_10
 } pgssVersion;
 
 typedef enum pgssStoreKind
@@ -189,6 +191,11 @@ typedef struct Counters
 	int64		wal_records;	/* # of WAL records generated */
 	int64		wal_fpi;		/* # of WAL full page images generated */
 	uint64		wal_bytes;		/* total amount of WAL generated in bytes */
+	int64		jit_functions;	/* total number of JIT functions emitted */
+	double		jit_generation_time;	/* total time to generate jit code */
+	double		jit_inlining_time;	/* total time to inline jit code */
+	double		jit_optimization_time;	/* total time to optimize jit code */
+	double		jit_emission_time;	/* total time to emit jit code */
 } Counters;
 
 /*
@@ -302,6 +309,7 @@ PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
 PG_FUNCTION_INFO_V1(pg_stat_statemen

Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread Japin Li


On Fri, 25 Feb 2022 at 20:48, David Christensen 
 wrote:
>> Cool.  I think we can report an error instead of reading wal files,
>> if the tablespace, database, or relation is invalid.  Is there any
>> WAL record that has an invalid tablespace, database, or relation OID?
>
> The only sort of validity check we could do here is range checking for the 
> underlying data types
> (which we certainly could/should add if it’s known to never be valid for the 
> underlying types);

By invalid OID I mean values such as negative numbers and zero; for those
parameters we do not need to read the WAL files, since they are always invalid.

> non-existence of objects is a no-go, since that depends purely on the WAL 
> range you are
> looking at and you’d have to, you know, scan it to see if it existed before 
> marking as invalid. :)
>

Agreed.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Add id's to various elements in protocol.sgml

2022-02-25 Thread Peter Eisentraut

On 25.02.22 06:36, Brar Piening wrote:

Yes, that would be possible. In fact appending a link and optionally
adding a tiny bit of CSS like I show below does the trick.

The major problem in that regard would probably be my lack of
XSLT/docbook skills but if no one can jump in here, I can see if I can
make it work.


I think that kind of stuff would be added in via the web site 
stylesheets, so you wouldn't have to deal with XSLT at all.





Re: convert libpq uri-regress tests to tap test

2022-02-25 Thread Peter Eisentraut

On 24.02.22 16:17, Tom Lane wrote:

I think that having t/ directories contain only Perl test scripts
is a good convention that we should stick to.  Peter's proposal
of a separate test/ subdirectory for C test scaffolding is
probably fine.


I wonder if there are any conventions in the Perl community about where 
to put test support files relative to the "t" directory.






Re: Expose JIT counters/timing in pg_stat_statements

2022-02-25 Thread Julien Rouhaud
Hi,

On Fri, Feb 25, 2022 at 02:06:29PM +0100, Magnus Hagander wrote:
> Here's a patch to add the sum of timings for JIT counters to
> pg_stat_statements, as a way to follow-up on if JIT is doing a good or
> a bad job in a configuration.

+1, it seems like something quite useful.

> I decided to only store the total time for the timings, since there
> are 4 different timings and storing max/min/etc for each one of them
> would lead to a bit too much data. This can of course be reconsidered,
> but I think that's a reasonable tradeoff.

I think the cumulated total time is enough.  Looking at the patch, I think we
should also cumulate the number of times JIT was triggered, and
probably the same for each other main operation (optimization and inlining).
Otherwise the values may be wrong and look artificially low.




Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread Bharath Rupireddy
On Fri, Feb 25, 2022 at 12:36 AM David Christensen
 wrote:
>
> Greetings,
>
> This patch adds the ability to specify a RelFileNode and optional BlockNum to 
> limit output of
> pg_waldump records to only those which match the given criteria.  This should 
> be more performant
> than `pg_waldump | grep` as well as more reliable given specific variations 
> in output style
> depending on how the blocks are specified.
>
> This currently affects only the main fork, but we could presumably add the 
> option to filter by fork
> as well, if that is considered useful.

Thanks for the patch. This is not adding something that users can't do
right now, but it definitely improves the usability of pg_waldump as
it avoids external filtering. Also, it can give the stats/info at
table and block level. So, +1 from my side.
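
For reference, usage with the new options would presumably look
something like this (the OIDs below are made up for illustration):

    pg_waldump --relation=1663/16384/16385 --block=0 \
        $PGDATA/pg_wal/000000010000000000000001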

I have some comments on the patch:

1) Let's use capitalized "OID" as is the case elsewhere in the documentation.
+specified via tablespace oid, database oid, and relfilenode separated

2) Good idea to specify an example:
+by slashes.
Something like, "by slashes, for instance, tablespace OID/database OID/relfilenode".

3) Crossing 80 char limit
+/*
+ * Boolean to return whether the given WAL record matches a specific
relation and optional block
+ */
+static bool
+XLogRecordMatchesRelationBlock(XLogReaderState *record, RelFileNode
matchRnode, BlockNumber matchBlock)

+ pg_log_error("could not parse block number \"%s\"", optarg);
+ pg_log_error("could not parse relation from \"%s\" (expecting
\"spc/dat/rel\")", optarg);

4) How about (expecting \"tablespace OID/database OID/relation OID\")?
Let's be clear.
+ pg_log_error("could not parse relation from \"%s\" (expecting
\"spc/dat/rel\")", optarg);

5) I would also see a need for "filter by FPW" i.e. list all WAL
records with "FPW".

6) How about "--block option requires --relation option" or some other
better phrasing?
+ pg_log_error("cannot filter by --block without also filtering --relation");

7) Extra new line between } and return false;
+ return true;
+ }
+ return false;
+}

8) Can we make these for-loop local variables instead of function-level
variables?
+ RelFileNode rnode;
+ ForkNumber forknum;
+ BlockNumber blk;

Regards,
Bharath Rupireddy.




Re: Readd use of TAP subtests

2022-02-25 Thread Peter Eisentraut

On 24.02.22 16:00, Andres Freund wrote:

I've incidentally played with subtests yesterday, when porting
src/interfaces/libpq/test/regress.pl to a tap test. Unfortunately it seems
that subtests aren't actually specified in the tap format, and that different
libraries generate different output formats. The reason this matters somewhat
is that meson's testrunner can parse tap and give nicer progress / error
reports. But since subtests aren't in the spec it can't currently parse
them...


Ok that's good to know.  What exactly happens when it tries to parse 
them?  Does it not count them or does it fail somehow?  The way the 
output is structured


t/001_basic.pl ..
# Subtest: vacuumlo --help
ok 1 - exit code 0
ok 2 - goes to stdout
ok 3 - nothing to stderr
1..3
ok 1 - vacuumlo --help

it appears that it should be able to parse it nonetheless and should 
just count the non-indented lines.





Re: set TESTDIR from perl rather than Makefile

2022-02-25 Thread Andrew Dunstan


On 2/24/22 20:17, Justin Pryzby wrote:
> On Mon, Feb 21, 2022 at 07:00:54AM -0500, Andrew Dunstan wrote:
>> On 2/19/22 18:53, Justin Pryzby wrote:
>>> On Sat, Feb 19, 2022 at 05:41:49PM -0600, Justin Pryzby wrote:
 I rebased and fixed the check-guc script to work, made it work with vpath
 builds, and cleaned it up some.
>>> I also meant to also attach it.
>> This is going to break a bunch of stuff as written.
>>
>> First, it's not doing the same thing. The current system sets TESTDIR to
>> be the parent of the directory that holds the test. e.g. for
>> src/bin/pg_ctl/t/001_start_stop.pl it's src/bin/pg_ctl in the build
>> tree, not the 't' subdirectory. This patch apparently sets it to the 't'
>> subdirectory. That will break anything that goes looking for log files
>> in the current location, like the buildfarm client, and possibly some CI
>> setups as well.
> Yes, thanks.
>
> I changed the patch to use ENV{CURDIR} || dirname(dirname($0)).  If I'm not
> wrong, that seems to be doing the right thing.
>
>> Also, unless I'm mistaken it appears to to the wrong thing for vpath
>> builds:
>>
>> my $test_dir = File::Spec->rel2abs(dirname($0));
>>
>> is completely wrong for vpaths, since that will place it in the source
>> tree, not the build tree.
>>
>> Last, and for no explained reason that I can see, the patch undoes
>> commit f4ce6c4d3a, but only for msvc builds. Even if that's justified it
>> appears to have no relevance to this patch.
> Andres' idea is that perl should set TESTDIR and PATH.  Here I commented out
> PATH, and had the improbable issue that nothing seemed to be breaking,
> including the pipeline test under msvc.  It'd be helpful to know what
> configuration breaks so I can test that it's broken and then test that
> it's fixed when set from within perl...


I'm fairly sure what's broken here is the MSVC install procedure, which
is massively overeager. We should fix that rather than take it for granted.


>
> I got busy here, and may not be able to progress this for awhile.



You have fixed the vpath issue. But the changes in vcregress.pl look
wrong to me.


-    $ENV{TESTDIR} = "$dir";


If we set PG_SUBDIR in the Makefile shouldn't we set it here too? Yes it
probably doesn't matter as our MSVC build system doesn't support vpath
builds, but it's good to maintain as much equivalence between the two as
possible. (Supporting vpath builds for MSVC might be a nice
student/intern project)


 my $module = basename $dir;
-    # add the module build dir as the second element in the PATH
-    $ENV{PATH} =~ s!;!;$topdir/$Config/$module;!;


See above comment re msvc install. If we're not reverting f4ce6c4d3a in
the Makefile we shouldn't be reverting here either.


+    # $ENV{VCREGRESS_MODE} = $Config;


Why is this commented out line here?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Readd use of TAP subtests

2022-02-25 Thread Andrew Dunstan


On 2/25/22 08:39, Peter Eisentraut wrote:
> On 24.02.22 16:00, Andres Freund wrote:
>> I've incidentally played with subtests yesterday, when porting
>> src/interfaces/libpq/test/regress.pl to a tap test. Unfortunately it
>> seems
>> that subtests aren't actually specified in the tap format, and that
>> different
>> libraries generate different output formats. The reason this matters
>> somewhat
>> is that meson's testrunner can parse tap and give nicer progress / error
>> reports. But since subtests aren't in the spec it can't currently parse
>> them...
>
> Ok that's good to know.  What exactly happens when it tries to parse
> them?  Does it not count them or does it fail somehow?  The way the
> output is structured
>
> t/001_basic.pl ..
> # Subtest: vacuumlo --help
>     ok 1 - exit code 0
>     ok 2 - goes to stdout
>     ok 3 - nothing to stderr
>     1..3
> ok 1 - vacuumlo --help
>
> it appears that it should be able to parse it nonetheless and should
> just count the non-indented lines.


AIUI TAP consumers are supposed to ignore lines they don't understand.
The Node TAP setup produces output like this, so perl is hardly alone
here.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-25 Thread Greg Stark
On Tue, 25 Jan 2022 at 01:32, Andres Freund  wrote:
>
> Hi,
>
> I was looking the shared memory stats patch again.

Can you point me to this thread? I looked for it but couldn't find it.


-- 
greg




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-25 Thread Nitin Jadhav
> Thank you Alvaro and Matthias for your views. I understand your point
> of not updating the progress-report flag here as it just checks
> whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
> based on that but it doesn't change the checkpoint flags. I will
> modify the code but I am a bit confused here. As per Alvaro, we need
> to make the progress-report flag change in whatever is the place that
> *requests* an immediate checkpoint. I feel this gives information
> about the upcoming checkpoint not the current one. So updating here
> provides wrong details in the view. The flags available during
> CreateCheckPoint() will remain same for the entire checkpoint
> operation and we should show the same information in the view till it
> completes. So just removing the above piece of code (modified in
> ImmediateCheckpointRequested()) in the patch will make it correct. My
> opinion about maintaining a separate field to show upcoming checkpoint
> flags is it makes the view complex. Please share your thoughts.

I have modified the code accordingly.
---

> I think the use of capitals in CHECKPOINT and CHECKPOINTER in the
> documentation is excessive.

Fixed. Here the word CHECKPOINT represents the command/checkpoint
operation. If we treat it as a checkpoint operation, I agree we should
use lowercase, but if we treat it as a command, then I think uppercase
is recommended (see
https://www.postgresql.org/docs/14/sql-checkpoint.html). Is it ok to
always use lowercase here?
---

> (Same for terms such as MULTIXACT and
> others in those docs; we typically use those in lowercase when
> user-facing; and do we really use term CLOG anymore? Don't we call it
> "commit log" nowadays?)

I have observed the CLOG term in the existing documentation. Anyway, I
have changed MULTIXACT to multixact, SUBTRANS to subtransaction, and
CLOG to commit log.
---

> +   Whenever the checkpoint operation is running, the
> +   pg_stat_progress_checkpoint view will contain a
> +   single row indicating the progress of the checkpoint. The tables below
>
> Maybe it should show a single row, unless the checkpointer isn't running at
> all (like in single user mode).

Nice thought. Can we add an additional checkpoint phase like 'Idle'?
Idle is ON whenever the checkpointer process is running and there is
no ongoing checkpoint. Thoughts?
---

> +   Process ID of a CHECKPOINTER process.
>
> It's *the* checkpointer process.

Fixed.
---

> pgstatfuncs.c has a whitespace issue (tab-space).

I have verified with 'git diff --check' and also manually. I did not
find any issue. Kindly mention the specific code which has an issue.
---

> I suppose the functions should set provolatile.

Fixed.
---

> > I am storing the checkpoint start timestamp in the st_progress_param[]
> > and this gets set only once during the checkpoint (at the start of the
> > checkpoint). I have added function
> > pg_stat_get_progress_checkpoint_elapsed() which calculates the elapsed
> > time and returns a string. This function gets called whenever
> > pg_stat_progress_checkpoint view is queried. Kindly refer v2 patch and
> > share your thoughts.
>
> I dislike the lack of access to the actual value of the checkpoint
> start / checkpoint elapsed field.
>
> As a user, if I query the pg_stat_progress_* views, my terminal or
> application can easily interpret an `interval` value and cast it to
> string, but the opposite is not true: the current implementation for
> pg_stat_get_progress_checkpoint_elapsed loses precision. This is why
> we use typed numeric fields in effectively all other places instead of
> stringified versions of the values: oid fields, counters, etc are all
> rendered as bigint in the view, so that no information is lost and
> interpretation is trivial.

Displaying start time of the checkpoint.
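
As an illustration (a sketch; it assumes the view exposes the raw start
timestamp in a start_time column), a client can then compute the elapsed
time itself without any loss of precision:

    SELECT now() - start_time AS elapsed
    FROM pg_stat_progress_checkpoint;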
---

> > I understand that the log based reporting is very costly and very
> > frequent updates are not advisable.  I am planning to use the existing
> > infrastructure of 'log_startup_progress_interval' which provides an
> > option for the user to configure the interval between each progress
> > update. Hence it avoids frequent updates to server logs. This approach
> > is used only during shutdown and end-of-recovery cases because we
> > cannot access pg_stat_progress_checkpoint view during those scenarios.
>
> I see; but log_startup_progress_interval seems to be exclusively
> consumed through the ereport_startup_progress macro. Why put
> startup/shutdown logging on the same path as the happy flow of normal
> checkpoints?

You mean to say that while updating the progress of the checkpoint, we
should call pgstat_progress_update_param() and then call
ereport_startup_progress()?

> I think that, instead of looking to what might at some point be added,
> it is better to use the currently available functions instead, and
> move to new functions if and when the log-based reporting requires it.

Makes sense. Removing checkpoint_progress_update_param() and
checkpoint_progress_end(). I would like to
Size functions inconsistent results

2022-02-25 Thread Fabrízio de Royes Mello
Hi all,

While doing some work using our functions [1] for calculating relation
sizes, I noticed an inconsistency between pg_total_relation_size and
calculating everything separately; have a look at this example:

fabrizio=# create table test_size (id bigserial primary key, toast_column
text);
CREATE TABLE

fabrizio=# insert into test_size (toast_column)

  select repeat('X'::text, pg_size_bytes('1MB')::integer)
  from generate_series(1,1000);
INSERT 0 1000

fabrizio=# with relations as (
  select schemaname, relname, relid
  from pg_stat_user_tables
  where relname = 'test_size'
),
sizes as (
  select
schemaname,
r.relname,

pg_total_relation_size(relid) AS total_bytes,

pg_relation_size(relid, 'main') +
pg_relation_size(relid, 'init') +
pg_relation_size(relid, 'fsm') +
pg_relation_size(relid, 'vm') AS heap_bytes,
pg_indexes_size(relid) AS index_bytes,
pg_table_size(reltoastrelid) AS toast_bytes
  from relations r
  join pg_class on pg_class.oid = r.relid
)
select
  total_bytes, heap_bytes, index_bytes, toast_bytes,
  (total_bytes = (heap_bytes+index_bytes+toast_bytes)) as "Equal?",
  (total_bytes - (heap_bytes+index_bytes+toast_bytes)) as "Diff"
from sizes;
 total_bytes | heap_bytes | index_bytes | toast_bytes | Equal? |  Diff
-------------+------------+-------------+-------------+--------+--------
    14000128 |      90112 |       40960 |    13688832 | f      | 180224
(1 row)

I want to calculate the HEAP, INDEXES and TOAST (including its indexes)
sizes separately, but it seems a bit inconsistent with
pg_total_relation_size.

Is it correct or am I missing something?

Regards,

[1]
https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-DBSIZE

-- 
Fabrízio de Royes Mello


Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)

2022-02-25 Thread Bharath Rupireddy
On Thu, Jan 6, 2022 at 1:29 PM SATYANARAYANA NARLAPURAM
 wrote:
>
> Consider a cluster formation where we have a Primary(P), Sync Replica(S1), 
> and multiple async replicas for disaster recovery and read scaling (within 
> the region and outside the region). In this setup, S1 is the preferred 
> failover target in an event of the primary failure. When a transaction is 
> committed on the primary, it is not acknowledged to the client until the 
> primary gets an acknowledgment from the sync standby that the WAL is flushed 
> to the disk (assume the synchronous_commit configuration is remote_flush). 
> However, the walsenders corresponding to the async replicas on the primary don't 
> wait for that flush acknowledgment and send the WAL to the 
> async standbys (and to any logical replication/decoding clients). So it is 
> possible for the async replicas and logical clients to get ahead of the sync replica. 
> If a failover is initiated in such a scenario, to bring the formation into a 
> healthy state we have to either
>
>  run the pg_rewind on the async replicas for them to reconnect with the new 
> primary or
> collect the latest WAL across the replicas and feed the standby.
>
> Both these operations are involved, error prone, and can cause multiple 
> minutes of downtime if done manually. In addition, there is a window where 
> the async replicas can show the data that was neither acknowledged to the 
> client nor committed on standby. Logical clients if they are ahead may need 
> to reseed the data as no easy rewind option for them.
>
> I would like to propose a GUC send_Wal_after_quorum_committed which, when set 
> to ON, makes the walsenders corresponding to async standbys and logical replication 
> workers wait until the LSN is quorum-committed on the primary before sending 
> it to the standby. This not only simplifies the post failover steps but 
> avoids unnecessary downtime for the async replicas. Thoughts?

Thanks Satya and others for the inputs. Here's the v1 patch that
basically allows async wal senders to wait until the sync standbys
report their flush lsn back to the primary. Please let me know your
thoughts.
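
To give an idea of the shape of the change (a sketch only, not the
exact patch; the GUC and helper names here are illustrative):

    /* In an async walsender's send loop: do not stream past the quorum. */
    if (send_wal_after_quorum_committed && !am_sync_walsender)
    {
        XLogRecPtr  quorumFlushPtr = SyncRepGetQuorumFlushLsn();

        if (SendRqstPtr > quorumFlushPtr)
            SendRqstPtr = quorumFlushPtr;   /* effectively wait for the quorum */
    }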

I've done pgbench testing to see if the patch causes any problems. I
ran the tests twice; there isn't much difference in transactions per
second (tps), although there is a delay in the async standby receiving
the WAL, which, after all, is the feature we are pursuing.

[1]
HEAD or WITHOUT PATCH:
./pgbench -c 10 -t 500 -P 10 testdb
transaction type: 
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 500
number of transactions actually processed: 5000/5000
latency average = 247.395 ms
latency stddev = 74.409 ms
initial connection time = 13.622 ms
tps = 39.713114 (without initial connection time)

PATCH:
./pgbench -c 10 -t 500 -P 10 testdb
transaction type: 
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 500
number of transactions actually processed: 5000/5000
latency average = 251.757 ms
latency stddev = 72.846 ms
initial connection time = 13.025 ms
tps = 39.315862 (without initial connection time)

TEST SETUP:
primary in region 1
async standby 1 in the same region as that of the primary region 1
i.e. close to primary
sync standby 1 in region 2
sync standby 2 in region 3
an archive location in a region different from the primary and
standbys regions, region 4
Note that I intentionally kept sync standbys in regions far from
primary because it allows sync standbys to receive WAL a bit late by
default, which works well for our testing.

PGBENCH SETUP:
./psql -d postgres -c "drop database testdb"
./psql -d postgres -c "create database testdb"
./pgbench -i -s 100 testdb
./psql -d testdb -c "\dt"
./psql -d testdb -c "SELECT pg_size_pretty(pg_database_size('testdb'))"
./pgbench -c 10 -t 500 -P 10 testdb

Regards,
Bharath Rupireddy.


v1-0001-Allow-async-standbys-wait-for-sync-replication.patch
Description: Binary data


Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-25 Thread Nitin Jadhav
> +   if ((ckpt_flags &
> +(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
> +   {
>
> This code (present at multiple places) looks a little ugly to me, what
> we can do instead is add a macro probably named IsShutdownCheckpoint()
> which does the above check and use it in all the functions that have
> this check. See below:
>
> #define IsShutdownCheckpoint(flags) \
>  (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY) != 0)
>
> And then you may use this macro like:
>
> if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
>return;

Good suggestion. In the v3 patch, I have removed the corresponding
code as these checks are not required. Hence this suggestion is not
applicable now.
---

> pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT,
> InvalidOid);
> +
> +   val[0] = XLogCtl->InsertTimeLineID;
> +   val[1] = flags;
> +   val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
> +   val[3] = CheckpointStats.ckpt_start_t;
> +
> +   pgstat_progress_update_multi_param(4, index, val);
> +   }
>
> Any specific reason for recording the timelineID in checkpoint stats
> table? Will this ever change in our case?

The timelineID is used to decide whether the current operation is a
checkpoint or a restartpoint. There is a field in the view to display
this information.

Thanks & Regards,
Nitin Jadhav

On Wed, Feb 23, 2022 at 9:46 PM Ashutosh Sharma  wrote:
>
> +   if ((ckpt_flags &
> +(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) == 0)
> +   {
>
> This code (present at multiple places) looks a little ugly to me, what
> we can do instead is add a macro probably named IsShutdownCheckpoint()
> which does the above check and use it in all the functions that have
> this check. See below:
>
> #define IsShutdownCheckpoint(flags) \
>   (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY) != 0)
>
> And then you may use this macro like:
>
> if (IsBootstrapProcessingMode() || IsShutdownCheckpoint(flags))
> return;
>
> This change can be done in all these functions:
>
> +void
> +checkpoint_progress_start(int flags)
>
> --
>
> + */
> +void
> +checkpoint_progress_update_param(int index, int64 val)
>
> --
>
> + * Stop reporting progress of the checkpoint.
> + */
> +void
> +checkpoint_progress_end(void)
>
> ==
>
> +
> pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT,
> InvalidOid);
> +
> +   val[0] = XLogCtl->InsertTimeLineID;
> +   val[1] = flags;
> +   val[2] = PROGRESS_CHECKPOINT_PHASE_INIT;
> +   val[3] = CheckpointStats.ckpt_start_t;
> +
> +   pgstat_progress_update_multi_param(4, index, val);
> +   }
>
> Any specific reason for recording the timelineID in checkpoint stats
> table? Will this ever change in our case?
>
> --
> With Regards,
> Ashutosh Sharma.
>
> On Wed, Feb 23, 2022 at 6:59 PM Nitin Jadhav
>  wrote:
> >
> > > I will make use of pgstat_progress_update_multi_param() in the next
> > > patch to replace multiple calls to checkpoint_progress_update_param().
> >
> > Fixed.
> > ---
> >
> > > > The other progress tables use [type]_total as column names for counter
> > > > targets (e.g. backup_total for backup_streamed, heap_blks_total for
> > > > heap_blks_scanned, etc.). I think that `buffers_total` and
> > > > `files_total` would be better column names.
> > >
> > > I agree and I will update this in the next patch.
> >
> > Fixed.
> > ---
> >
> > > How about this "The checkpoint is started because max_wal_size is 
> > > reached".
> > >
> > > "The checkpoint is started because checkpoint_timeout expired".
> > >
> > > "The checkpoint is started because some operation forced a checkpoint".
> >
> > I have used the above description. Kindly let me know if any changes
> > are required.
> > ---
> >
> > > > > +  checkpointing CommitTs pages
> > > >
> > > > CommitTs -> Commit time stamp
> > >
> > > I will handle this in the next patch.
> >
> > Fixed.
> > ---
> >
> > > There are more scenarios where you can have a backend requesting a 
> > > checkpoint
> > > and waiting for its completion, and there may be more than one backend
> > > concerned, so I don't think that storing only one / the first backend pid 
> > > is
> > > ok.
> >
> > Thanks for this information. I am not considering backend_pid.
> > ---
> >
> > > I think all the information should be exposed.  Only knowing why the 
> > > current
> > > checkpoint has been triggered without any further information seems a bit
> > > useless.  Think for instance for cases like [1].
> >
> > I have supported all possible checkpoint kinds. Added
> > pg_stat_get_progress_checkpoint_kind() to convert the flags (int) to a
> > string representing a combination of flags and also checking for the
> > flag update in ImmediateCheckpointRequested() which checks whether
> > CHECKPOINT_IMMEDIATE flag is set or not. I did not find any other
> > cases where the 

Fix typo in logicalfuncs.c - :%s/private date/Private data

2022-02-25 Thread Bharath Rupireddy
Hi,

Here's a tiny patch to do $subject.

Regards,
Bharath Rupireddy.


v1-0001-Fix-typo-in-logicalfuncs.c.patch
Description: Binary data


Re: Size functions inconsistent results

2022-02-25 Thread Japin Li


On Fri, 25 Feb 2022 at 22:58, Fabrízio de Royes Mello  
wrote:
> Hi all,
>
> While doing some work using our functions [1] for calculate relations size
> I noticed an inconsistency between pg_total_relation_size and calculate
> everything separately, have a look in this example:
>
> fabrizio=# create table test_size (id bigserial primary key, toast_column
> text);
> CREATE TABLE
>
> fabrizio=# insert into test_size (toast_column)
>
>   select repeat('X'::text, pg_size_bytes('1MB')::integer)
>   from generate_series(1,1000);
> INSERT 0 1000
>
> fabrizio=# with relations as (
>   select schemaname, relname, relid
>   from pg_stat_user_tables
>   where relname = 'test_size'
> ),
> sizes as (
>   select
> schemaname,
> r.relname,
>
> pg_total_relation_size(relid) AS total_bytes,
>
> pg_relation_size(relid, 'main') +
> pg_relation_size(relid, 'init') +
> pg_relation_size(relid, 'fsm') +
> pg_relation_size(relid, 'vm') AS heap_bytes,
> pg_indexes_size(relid) AS index_bytes,
> pg_table_size(reltoastrelid) AS toast_bytes
>   from relations r
>   join pg_class on pg_class.oid = r.relid
> )
> select
>   total_bytes, heap_bytes, index_bytes, toast_bytes,
>   (total_bytes = (heap_bytes+index_bytes+toast_bytes)) as "Equal?",
>   (total_bytes - (heap_bytes+index_bytes+toast_bytes)) as "Diff"
> from sizes;
>  total_bytes | heap_bytes | index_bytes | toast_bytes | Equal? |  Diff
> -------------+------------+-------------+-------------+--------+--------
>     14000128 |      90112 |       40960 |    13688832 | f      | 180224
> (1 row)
>
> I want to calculate separately HEAP, INDEXES and TOAST (including indexes)
> sizes but it seems it's a bit inconsistent with pg_total_relation_size.
>
> Is it correct or am I missing something?
>

I think you forgot the index size of the toast table.

with relations as (
  select schemaname, relname, relid
  from pg_stat_user_tables
  where relname = 'test_size'
),
sizes as (
  select
schemaname,
r.relname,

pg_total_relation_size(relid) AS total_bytes,

pg_relation_size(relid, 'main') +
pg_relation_size(relid, 'init') +
pg_relation_size(relid, 'fsm') +
pg_relation_size(relid, 'vm') AS heap_bytes,
pg_indexes_size(relid) AS index_bytes,
pg_table_size(reltoastrelid) + pg_indexes_size(reltoastrelid) AS toast_bytes
  from relations r
  join pg_class on pg_class.oid = r.relid
)
select
  total_bytes, heap_bytes, index_bytes, toast_bytes,
  (total_bytes = (heap_bytes+index_bytes+toast_bytes)) as "Equal?",
  (total_bytes - (heap_bytes+index_bytes+toast_bytes)) as "Diff"
from sizes;

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Add parameter jit_warn_above_fraction

2022-02-25 Thread Magnus Hagander
This patch adds a configuration parameter jit_warn_above_fraction that
will cause a warning to be logged if the fraction of time spent on
doing JIT is bigger than the specified one. For example, this can be
used to track down those cases where JIT ends up taking 90% of the
query runtime because of bad estimates...
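
For illustration, a setting like

    jit_warn_above_fraction = 0.5   # warn when JIT exceeds half the runtime

would, for a query spending 4500 of its 5000 ms on JIT, produce a log
line like (message text as in this version of the patch):

    WARNING:  JIT time was 4500 ms of 5000 ms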

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7ed8c82a9d..7c3bc56227 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6592,6 +6592,21 @@ local0.*/var/log/postgresql

  
 
+ 
+  jit_warn_above_fraction (floating point)
+  
+   jit_warn_above_fraction configuration parameter
+  
+  
+   
+
+ Causes a warning to be written to the log if the time spent on JIT (see )
+ goes above this fraction of the total query runtime.
+ A value of 0 (the default)disables the warning.
+
+   
+ 
+
  
   log_startup_progress_interval (integer)
   
diff --git a/src/backend/tcop/fastpath.c b/src/backend/tcop/fastpath.c
index d429aa4663..a1bff893a3 100644
--- a/src/backend/tcop/fastpath.c
+++ b/src/backend/tcop/fastpath.c
@@ -196,6 +196,7 @@ HandleFunctionRequest(StringInfo msgBuf)
 	struct fp_info *fip;
 	bool		callit;
 	bool		was_logged = false;
+	int			msecs;
 	char		msec_str[32];
 
 	/*
@@ -305,7 +306,7 @@ HandleFunctionRequest(StringInfo msgBuf)
 	/*
 	 * Emit duration logging if appropriate.
 	 */
-	switch (check_log_duration(msec_str, was_logged))
+	switch (check_log_duration(&msecs, msec_str, was_logged))
 	{
 		case 1:
 			ereport(LOG,
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3c7d08209f..c0487ea67f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -961,7 +961,9 @@ exec_simple_query(const char *query_string)
 	bool		save_log_statement_stats = log_statement_stats;
 	bool		was_logged = false;
 	bool		use_implicit_block;
+	int			msecs;
 	char		msec_str[32];
+	int64		jit_time = 0;
 
 	/*
 	 * Report query to various monitoring facilities.
@@ -1220,6 +1222,16 @@ exec_simple_query(const char *query_string)
 
 		receiver->rDestroy(receiver);
 
+		/* Collect JIT timings in case it's active */
+		if (jit_enabled && jit_warn_above_fraction > 0 && portal->queryDesc && portal->queryDesc->estate->es_jit)
+		{
+			jit_time +=
+INSTR_TIME_GET_MILLISEC(portal->queryDesc->estate->es_jit->instr.generation_counter) +
+INSTR_TIME_GET_MILLISEC(portal->queryDesc->estate->es_jit->instr.inlining_counter) +
+INSTR_TIME_GET_MILLISEC(portal->queryDesc->estate->es_jit->instr.optimization_counter) +
+INSTR_TIME_GET_MILLISEC(portal->queryDesc->estate->es_jit->instr.emission_counter);
+		}
+
 		PortalDrop(portal, false);
 
 		if (lnext(parsetree_list, parsetree_item) == NULL)
@@ -1290,7 +1302,7 @@ exec_simple_query(const char *query_string)
 	/*
 	 * Emit duration logging if appropriate.
 	 */
-	switch (check_log_duration(msec_str, was_logged))
+	switch (check_log_duration(&msecs, msec_str, was_logged))
 	{
 		case 1:
 			ereport(LOG,
@@ -1306,6 +1318,16 @@ exec_simple_query(const char *query_string)
 			break;
 	}
 
+	if (jit_enabled && jit_warn_above_fraction > 0)
+	{
+		if (jit_time > msecs * jit_warn_above_fraction)
+		{
+			ereport(WARNING,
+	(errmsg("JIT time was %ld ms of %d ms",
+			jit_time, msecs)));
+		}
+	}
+
 	if (save_log_statement_stats)
 		ShowUsage("QUERY STATISTICS");
 
@@ -1333,6 +1355,7 @@ exec_parse_message(const char *query_string,	/* string to execute */
 	CachedPlanSource *psrc;
 	bool		is_named;
 	bool		save_log_statement_stats = log_statement_stats;
+	int			msecs;
 	char		msec_str[32];
 
 	/*
@@ -1548,7 +1571,7 @@ exec_parse_message(const char *query_string,	/* string to execute */
 	/*
 	 * Emit duration logging if appropriate.
 	 */
-	switch (check_log_duration(msec_str, false))
+	switch (check_log_duration(&msecs, msec_str, false))
 	{
 		case 1:
 			ereport(LOG,
@@ -1595,6 +1618,7 @@ exec_bind_message(StringInfo input_message)
 	MemoryContext oldContext;
 	bool		save_log_statement_stats = log_statement_stats;
 	bool		snapshot_set = false;
+	int			msecs;
 	char		msec_str[32];
 	ParamsErrorCbData params_data;
 	ErrorContextCallback params_errcxt;
@@ -2007,7 +2031,7 @@ exec_bind_message(StringInfo input_message)
 	/*
 	 * Emit duration logging if appropriate.
 	 */
-	switch (check_log_duration(msec_str, false))
+	switch (check_log_duration(&msecs, msec_str, false))
 	{
 		case 1:
 			ereport(LOG,
@@ -2053,6 +2077,7 @@ exec_execute_message(const char *portal_name, long max_rows)
 	bool		is_xact_command;
 	bool		execute_is_fetch;
 	bool		was_logged = false;
+	int			msecs;
 	char		msec_str[32];
 	ParamsErrorCbData params_data;
 	ErrorContextCallback params_errcxt;
@@ -2244,7 +2269,7 @@ exec_execute_message(const char *portal_name, long max_rows)
 	/*
 	 * Emit duration logging if appropriate.

Re: Size functions inconsistent results

2022-02-25 Thread Fabrízio de Royes Mello
On Fri, Feb 25, 2022 at 12:10 PM Japin Li  wrote:
>
>
> I think you forgot the index size of the toast table.
>
> with relations as (
>   select schemaname, relname, relid
>   from pg_stat_user_tables
>   where relname = 'test_size'
> ),
> sizes as (
>   select
> schemaname,
> r.relname,
>
> pg_total_relation_size(relid) AS total_bytes,
>
> pg_relation_size(relid, 'main') +
> pg_relation_size(relid, 'init') +
> pg_relation_size(relid, 'fsm') +
> pg_relation_size(relid, 'vm') AS heap_bytes,
> pg_indexes_size(relid) AS index_bytes,
> pg_table_size(reltoastrelid) + pg_indexes_size(reltoastrelid) AS
toast_bytes
>   from relations r
>   join pg_class on pg_class.oid = r.relid
> )
> select
>   total_bytes, heap_bytes, index_bytes, toast_bytes,
>   (total_bytes = (heap_bytes+index_bytes+toast_bytes)) as "Equal?",
>   (total_bytes - (heap_bytes+index_bytes+toast_bytes)) as "Diff"
> from sizes;
>

Ahh perfect... thanks... makes sense, because pg_table_size doesn't
compute the index sizes. Now it works:

fabrizio=# with relations as (
  select schemaname, relname, relid
  from pg_stat_user_tables
  where relname = 'test_size'
),
sizes as (
  select
schemaname,
r.relname,

pg_total_relation_size(relid) AS total_bytes,

pg_relation_size(relid, 'main') +
pg_relation_size(relid, 'init') +
pg_relation_size(relid, 'fsm') +
pg_relation_size(relid, 'vm') AS heap_bytes,
pg_indexes_size(relid) AS index_bytes,
pg_table_size(reltoastrelid) + pg_indexes_size(reltoastrelid) AS
toast_bytes
  from relations r
  join pg_class on pg_class.oid = r.relid
)
select
  total_bytes, heap_bytes, index_bytes, toast_bytes,
  (total_bytes = (heap_bytes+index_bytes+toast_bytes)) as "Equal?",
  (total_bytes - (heap_bytes+index_bytes+toast_bytes)) as "Diff"
from sizes;
 total_bytes | heap_bytes | index_bytes | toast_bytes | Equal? | Diff
-------------+------------+-------------+-------------+--------+------
    14622720 |      65536 |       40960 |    14516224 | t      |    0
(1 row)

Regards,

--
Fabrízio de Royes Mello


Re: Expose JIT counters/timing in pg_stat_statements

2022-02-25 Thread Magnus Hagander
On Fri, Feb 25, 2022 at 2:33 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Fri, Feb 25, 2022 at 02:06:29PM +0100, Magnus Hagander wrote:
> > Here's a patch to add the sum of timings for JIT counters to
> > pg_stat_statements, as a way to follow-up on if JIT is doing a good or
> > a bad job in a configuration.
>
> +1, it seems like something quite useful.

Given the amount of time often spent debugging JIT, getting more
insight is going to make it easier to tune it, instead of what
unfortunately many people do today: just turn it off.


> > I decided to only store the total time for the timings, since there
> > are 4 different timings and storing max/min/etc for each one of them
> > would lead to a bit too much data. This can of course be reconsidered,
> > but I think that's a reasonable tradeoff.
>
> I think the cumulated total time is enough.  Looking at the patch, I think we
> should also cumulate the number of time jit was triggered, and
> probably the same for each other main operation (optimization and inlining).
> Otherwise the values may be wrong and look artificially low.

So just to be clear, you're basically thinking:

jit_count = count of entries where jit_functions>0
jit_functions = 
jit_optimizations = count of entries where time spent on jit_optimizations > 0

etc?

So we count the times with min/max like other times for the total one,
but instead add a counter for each of the details?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-25 Thread Nitin Jadhav
> > Thank you Alvaro and Matthias for your views. I understand your point
> > of not updating the progress-report flag here as it just checks
> > whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
> > based on that but it doesn't change the checkpoint flags. I will
> > modify the code but I am a bit confused here. As per Alvaro, we need
> > to make the progress-report flag change in whatever is the place that
> > *requests* an immediate checkpoint. I feel this gives information
> > about the upcoming checkpoint not the current one. So updating here
> > provides wrong details in the view. The flags available during
> > CreateCheckPoint() will remain same for the entire checkpoint
> > operation and we should show the same information in the view till it
> > completes.
>
> I'm not sure what Matthias meant, but as far as I know there's no fundamental
> difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
> and there's also no scheduling for multiple checkpoints.
>
> Yes, the flags will remain the same but checkpoint.c will test both the passed
> flags and the shmem flags to see whether a delay should be added or not, which
> is the only difference in checkpoint processing for this flag.  See the call 
> to
> ImmediateCheckpointRequested() which will look at the value in shmem:
>
>/*
> * Perform the usual duties and take a nap, unless we're behind 
> schedule,
> * in which case we just try to catch up as quickly as possible.
> */
>if (!(flags & CHECKPOINT_IMMEDIATE) &&
>!ShutdownRequestPending &&
>!ImmediateCheckpointRequested() &&
>IsCheckpointOnSchedule(progress))

I understand that the checkpointer considers the passed flags as well as
the shmem flags, and that if the CHECKPOINT_IMMEDIATE flag is set, it
affects the current checkpoint operation (no further delay) but does not change
the current flag value. Should we display this change in the kind
field of the view or not? Please share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Fri, Feb 25, 2022 at 12:33 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Fri, Feb 25, 2022 at 12:23:27AM +0530, Nitin Jadhav wrote:
> > > I think the change to ImmediateCheckpointRequested() makes no sense.
> > > Before this patch, that function merely inquires whether there's an
> > > immediate checkpoint queued.  After this patch, it ... changes a
> > > progress-reporting flag?  I think it would make more sense to make the
> > > progress-report flag change in whatever is the place that *requests* an
> > > immediate checkpoint rather than here.
> > >
> > > > diff --git a/src/backend/postmaster/checkpointer.c 
> > > > b/src/backend/postmaster/checkpointer.c
> > > > +ImmediateCheckpointRequested(int flags)
> > > >  if (cps->ckpt_flags & CHECKPOINT_IMMEDIATE)
> > > > +{
> > > > +updated_flags |= CHECKPOINT_IMMEDIATE;
> > >
> > > I don't think that these changes are expected behaviour. Under in this
> > > condition; the currently running checkpoint is still not 'immediate',
> > > but it is going to hurry up for a new, actually immediate checkpoint.
> > > Those are different kinds of checkpoint handling; and I don't think
> > > you should modify the reported flags to show that we're going to do
> > > stuff faster than usual. Maybe maintiain a seperate 'upcoming
> > > checkpoint flags' field instead?
> >
> > Thank you Alvaro and Matthias for your views. I understand your point
> > of not updating the progress-report flag here as it just checks
> > whether the CHECKPOINT_IMMEDIATE is set or not and takes an action
> > based on that but it doesn't change the checkpoint flags. I will
> > modify the code but I am a bit confused here. As per Alvaro, we need
> > to make the progress-report flag change in whatever is the place that
> > *requests* an immediate checkpoint. I feel this gives information
> > about the upcoming checkpoint not the current one. So updating here
> > provides wrong details in the view. The flags available during
> > CreateCheckPoint() will remain same for the entire checkpoint
> > operation and we should show the same information in the view till it
> > completes.
>
> I'm not sure what Matthias meant, but as far as I know there's no fundamental
> difference between checkpoint with and without the CHECKPOINT_IMMEDIATE flag,
> and there's also no scheduling for multiple checkpoints.
>
> Yes, the flags will remain the same but checkpoint.c will test both the passed
> flags and the shmem flags to see whether a delay should be added or not, which
> is the only difference in checkpoint processing for this flag.  See the call 
> to
> ImmediateCheckpointRequested() which will look at the value in shmem:
>
> /*
>  * Perform the usual duties and take a nap, unless we're behind 
> schedule,
>  * in which case we just try to catch up as quickly as possible.
>  */
> if (!(flags & CHECKPOINT_IMMEDIATE) &&
>   

Re: Fix overflow in justify_interval related functions

2022-02-25 Thread Joseph Koshakow
Just checking because I'm not very familiar with the process,
are there any outstanding items that I need to do for this patch?

- Joe Koshakow




Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread David Christensen
On Fri, Feb 25, 2022 at 7:33 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Thanks for the patch. This is not adding something that users can't do
> right now, but it definitely improves the usability of pg_waldump as
> it avoids external filtering. Also, it can give the stats/info at
> table and block level. So, +1 from my side.
>

Thanks for the feedback; I will be incorporating most of this into a new
version, with a couple of responses below.


> 3) Crossing 80 char limit
>

This is neither here nor there, but have we as a project considered
increasing that to something more modern?  I know a lot of current projects
consider 132 to be a more reasonable limit.  (I will reduce it to that
for now, but consider this a vote towards increasing the limit.)


> 5) I would also see a need for "filter by FPW" i.e. list all WAL
> records with "FPW".
>

Yes, that wouldn't be too hard to add; I can add it to the next
version.  We probably ought to also make the fork number specifiable as
well. Another thought could be some wildcard value in the relation part,
so `1234/23456/*` could filter WAL to a specific database only, say, or
some other multi-value specifier, like `--block=1234,123456,121234`.  (I
honestly consider this to be more advanced than we'd need to support in
this patch, but it probably wouldn't be too hard to add.)

Thanks,

David


Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread David Christensen
On Fri, Feb 25, 2022 at 7:08 AM Japin Li  wrote:

>
> On Fri, 25 Feb 2022 at 20:48, David Christensen <
> david.christen...@crunchydata.com> wrote:
> >> Cool.  I think we can report an error instead of reading wal files,
> >> if the tablespace, database, or relation is invalid.  Is there any
> >> WAL record that has an invalid tablespace, database, or relation OID?
> >
> > The only sort of validity check we could do here is range checking for
> the underlying data types
> > (which we certainly could/should add if it’s known to never be valid for
> the underlying types);
>
> By invalid OID I mean values such as negative numbers and zero; for those
> parameters we do not need to read the WAL files, since they are always invalid.
>

Agreed.  Can add some additional range validation to the parsed values.

David


Re: Expose JIT counters/timing in pg_stat_statements

2022-02-25 Thread Julien Rouhaud
On Fri, Feb 25, 2022 at 04:19:27PM +0100, Magnus Hagander wrote:
> 
> So just to be clear, you're basically thinking:
> 
> jit_count = count of entries where jit_functions>0
> jit_functions = 
> jit_optimizations = count of entries where time spent on jit_optimizations > 0
> 
> etc?

Yes exactly, so 3 new fields on top of the one you already added.

> So we count the times with min/max like other times for the total one,
> but instead add a counter for each of the details?

I don't understand this one.  Did you mean we *don't* count times with min/max?
If that's the case then yes :)




Re: Expose JIT counters/timing in pg_stat_statements

2022-02-25 Thread Magnus Hagander
On Fri, Feb 25, 2022 at 4:40 PM Julien Rouhaud  wrote:
>
> On Fri, Feb 25, 2022 at 04:19:27PM +0100, Magnus Hagander wrote:
> >
> > So just to be clear, you're basically thinking:
> >
> > jit_count = count of entries where jit_functions>0
> > jit_functions = 
> > jit_optimizations = count of entries where time spent on jit_optimizations 
> > > 0
> >
> > etc?
>
> Yes exactly, so 3 new fields on top of the one you already added.
>
> > So we count the times with min/max like other times for the total one,
> > but instead add a counter for each of the details?
>
> I don't understand this one.  Did you mean we *don't* count times with 
> min/max?
> If that's the case then yes :)

Hmm. Yeah, that was a bit unclear. I mean we track total time with
min/max, and detailed time not at all. For details, we only track
count, not time.
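
Concretely, with the proposed counters (a sketch; the column names below
are only the ones floated in this thread, not final), one could then see
how often e.g. the optimization step actually runs for a statement:

    SELECT queryid,
           jit_count,
           round(100.0 * jit_optimizations / nullif(jit_count, 0), 1)
               AS optimized_pct
    FROM pg_stat_statements
    ORDER BY jit_count DESC
    LIMIT 10;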

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Add parameter jit_warn_above_fraction

2022-02-25 Thread Julien Rouhaud
Hi,

On Fri, Feb 25, 2022 at 04:16:01PM +0100, Magnus Hagander wrote:
> This patch adds a configuration parameter jit_warn_above_fraction that
> will cause a warning to be logged if the fraction of time spent on
> doing JIT is bigger than the specified one. For example, this can be
> used to track down those cases where JIT ends up taking 90% of the
> query runtime because of bad estimates...

I think that's tremendously useful, huge +1.

Just a few minor nits:

+ A value of 0 (the default)disables the warning.

missing space

+   ereport(WARNING,
+   (errmsg("JIT time was %ld ms of %d ms",
+   jit_time, msecs)));

"JIT time" may a bit obscure for users, how about "JIT total processing time"?"

+   gettext_noop("Sets the fraction of query time spent on 
JIT before writing"
+"a warning to the log."),
+   gettext_noop("Write a message tot he server log if more 
than this"
+"fraction of the query runtime 
is spent on JIT."
+"Zero turns off the warning.")

missing spaces in the concatenated strings.

The rest of the patch looks good to me.




Re: Add parameter jit_warn_above_fraction

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 16:16:01 +0100, Magnus Hagander wrote:
> This patch adds a configuration parameter jit_warn_above_fraction that
> will cause a warning to be logged if the fraction of time spent on
> doing JIT is bigger than the specified one. For example, this can be
> used to track down those cases where JIT ends up taking 90% of the
> query runtime because of bad estimates...

Hm. Could it make sense to do this as an auto_explain feature?

Greetings,

Andres Freund




Re: Add parameter jit_warn_above_fraction

2022-02-25 Thread Justin Pryzby
On Fri, Feb 25, 2022 at 04:16:01PM +0100, Magnus Hagander wrote:
> + {
> + {"jit_warn_above_fraction", PGC_SUSET, LOGGING_WHEN,
> + gettext_noop("Sets the fraction of query time spent on JIT before writing"
> +  "a warning to the log."),
> + gettext_noop("Write a message tot he server log if more than this"
> +  "fraction of the query runtime is spent on JIT."
> +  "Zero turns off the warning.")
> + },
> + &jit_warn_above_fraction,
> + 0.0, 0.0, 1.0,
> + NULL, NULL, NULL
> + },

Should be PGC_USERSET ?

+   gettext_noop("Write a message tot he server log if more 
than this"  
   

to the

+	if (jit_enabled && jit_warn_above_fraction > 0)
+	{
+		int64		jit_time =
+			INSTR_TIME_GET_MILLISEC(portal->queryDesc->estate->es_jit->instr.generation_counter) +
+			INSTR_TIME_GET_MILLISEC(portal->queryDesc->estate->es_jit->instr.inlining_counter) +
+			INSTR_TIME_GET_MILLISEC(portal->queryDesc->estate->es_jit->instr.optimization_counter) +
+			INSTR_TIME_GET_MILLISEC(portal->queryDesc->estate->es_jit->instr.emission_counter);
+
+		if (jit_time > msecs * jit_warn_above_fraction)
+		{
+			ereport(WARNING,
+					(errmsg("JIT time was %ld ms of %d ms",
+							jit_time, msecs)));
+		}
+	}


I think it should be a NOTICE (or less?)

Is it more useful if this is applied combined with log_min_duration_statement ?

It's easy to imagine a query for which the planner computes a high cost, but
actually runs quickly.  You might get a bunch of WARNINGs that the query took
10 MS and JIT was 75% of that, even if you don't care about queries that take
less than 10 SEC.
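
For instance, something along these lines (a sketch only; jit_warn_min_duration
is a hypothetical GUC, not something in the posted patch):

	/* Sketch: additionally gate the warning on a minimum query duration.
	 * jit_warn_min_duration is hypothetical; jit_time and msecs are
	 * computed as in the patch above. */
	if (jit_enabled && jit_warn_above_fraction > 0 &&
		msecs >= jit_warn_min_duration &&
		jit_time > msecs * jit_warn_above_fraction)
		ereport(NOTICE,
				(errmsg("JIT time was %ld ms of %d ms",
						jit_time, msecs)));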

I should say that this is already available by processing the output of
autoexplain.

-- 
Justin




Re: Readd use of TAP subtests

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 14:39:15 +0100, Peter Eisentraut wrote:
> On 24.02.22 16:00, Andres Freund wrote:
> > I've incidentally played with subtests yesterdays, when porting
> > src/interfaces/libpq/test/regress.pl to a tap test. Unfortunately it seems
> > that subtests aren't actually specified in the tap format, and that 
> > different
> > libraries generate different output formats. The reason this matters 
> > somewhat
> > is that meson's testrunner can parse tap and give nicer progress / error
> > reports. But since subtests aren't in the spec it can't currently parse
> > them...
>
> Ok that's good to know.  What exactly happens when it tries to parse them?
> Does it not count them or does it fail somehow?  The way the output is
> structured

It says that it can't parse a line of the tap output:
16:06:55 MALLOC_PERTURB_=156 /usr/bin/perl /tmp/meson-test/build/../subtest.pl
--- output ---
stdout:
# Subtest: a
ok 1 - a: a
ok 2 - a: b
1..2
ok 1 - a
1..1
stderr:

TAP parsing error: unexpected input at line 4
--


> t/001_basic.pl ..
> # Subtest: vacuumlo --help
> ok 1 - exit code 0
> ok 2 - goes to stdout
> ok 3 - nothing to stderr
> 1..3
> ok 1 - vacuumlo --help
>
> it appears that it should be able to parse it nonetheless and should just
> count the non-indented lines.

It looks like it's not ignoring indented lines...

Greetings,

Andres Freund




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-02-25 Thread Jacob Champion
On Thu, 2022-02-24 at 20:44 +, Jacob Champion wrote:
> That simplifies things. PFA a smaller v2; if pgaudit can't make use of
> libpq-be.h for some reason, then I guess we can tailor a fix to that
> use case.

Ha, opr_sanity caught my use of cstring. I'll work on a fix later
today.

--Jacob


Re: Add parameter jit_warn_above_fraction

2022-02-25 Thread Magnus Hagander
On Fri, Feb 25, 2022 at 5:20 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-02-25 16:16:01 +0100, Magnus Hagander wrote:
> > This patch adds a configuration parameter jit_warn_above_fraction that
> > will cause a warning to be logged if the fraction of time spent on
> > doing JIT is bigger than the specified one. For example, this can be
> > used to track down those cases where JIT ends up taking 90% of the
> > query runtime because of bad estimates...
>
> Hm. Could it make sense to do this as an auto_explain feature?

It could be. But I was looking for something a lot more "light weight"
than having to install an extension. But yes, if we wanted to, we
could certainly change jit_warn_above_fraction to be
auto_explain.log_min_jit_fraction or something like that, and do
basically the same thing. But then, we could also have
log_min_duration_statement be part of auto_explain instead, so it's
all about where to draw the line :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Readd use of TAP subtests

2022-02-25 Thread Jacob Champion
On Fri, 2022-02-25 at 09:43 -0500, Andrew Dunstan wrote:
> AIUI TAP consumers are supposed to ignore lines they don't understand.

It's undefined behavior [1]:

> Any output that is not a version, a plan, a test line, a YAML block,
> a diagnostic or a bail out is incorrect. How a harness handles the
> incorrect line is undefined. Test::Harness silently ignores incorrect
> lines, but will become more stringent in the future. TAP::Harness
> reports TAP syntax errors at the end of a test run.

--Jacob

[1] https://testanything.org/tap-version-13-specification.html


Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-25 Thread Julien Rouhaud
On Fri, Feb 25, 2022 at 08:53:50PM +0530, Nitin Jadhav wrote:
> >
> > I'm not sure what Matthias meant, but as far as I know there's no 
> > fundamental
> > difference between checkpoint with and without the CHECKPOINT_IMMEDIATE 
> > flag,
> > and there's also no scheduling for multiple checkpoints.
> >
> > Yes, the flags will remain the same but checkpoint.c will test both the 
> > passed
> > flags and the shmem flags to see whether a delay should be added or not, 
> > which
> > is the only difference in checkpoint processing for this flag.  See the 
> > call to
> > ImmediateCheckpointRequested() which will look at the value in shmem:
> >
> >/*
> > * Perform the usual duties and take a nap, unless we're behind 
> > schedule,
> > * in which case we just try to catch up as quickly as possible.
> > */
> >if (!(flags & CHECKPOINT_IMMEDIATE) &&
> >!ShutdownRequestPending &&
> >!ImmediateCheckpointRequested() &&
> >IsCheckpointOnSchedule(progress))
> 
> I understand that the checkpointer considers flags as well as the
> shmem flags and if CHECKPOINT_IMMEDIATE flag is set, it affects the
> current checkpoint operation (No further delay) but does not change
> the current flag value. Should we display this change in the kind
> field of the view or not? Please share your thoughts.

I think the fields should be added.  It's good to know that a checkpoint was
triggered due to normal activity and should be spread, and then something
upgraded it to an immediate checkpoint.  If you're desperately waiting for the
end of a checkpoint for some reason and ask for an immediate checkpoint, you'll
certainly be happy to see that the checkpointer is aware of it.

But maybe I missed something in the code, so let's wait for Matthias' input
about it.




Re: Readd use of TAP subtests

2022-02-25 Thread Jacob Champion
On Fri, 2022-02-25 at 16:35 +, Jacob Champion wrote:
> On Fri, 2022-02-25 at 09:43 -0500, Andrew Dunstan wrote:
> > AIUI TAP consumers are supposed to ignore lines they don't understand.
> 
> It's undefined behavior [1]:

And of course the minute I send this I notice that I've linked the v13
spec instead of the original... sorry. Assuming Perl isn't marking its
tests as version 13, you are correct:

> Any output line that is not a version, a plan, a test line, a
> diagnostic or a bail out is considered an “unknown” line. A TAP
> parser is required to not consider an unknown line as an error but
> may optionally choose to capture said line and hand it to the test
> harness, which may have custom behavior attached. This is to allow
> for forward compatability. Test::Harness silently ignores incorrect
> lines, but will become more stringent in the future. TAP::Harness
> reports TAP syntax errors at the end of a test run.

--Jacob


Re: Expose JIT counters/timing in pg_stat_statements

2022-02-25 Thread Magnus Hagander
On Fri, Feb 25, 2022 at 4:41 PM Magnus Hagander  wrote:
>
> On Fri, Feb 25, 2022 at 4:40 PM Julien Rouhaud  wrote:
> >
> > On Fri, Feb 25, 2022 at 04:19:27PM +0100, Magnus Hagander wrote:
> > >
> > > So just to be clear, you're basically thinking:
> > >
> > > jit_count = count of entries where jit_functions>0
> > > jit_functions = 
> > > jit_optimizations = count of entries where time spent on jit_optimizations > 0
> > >
> > > etc?
> >
> > Yes exactly, so 3 new fields on top of the one you already added.
> >
> > > So we count the times with min/max like other times for the total one,
> > > but instead add a counter for each of the details?
> >
> > I don't understand this one.  Did you mean we *don't* count times with 
> > min/max?
> > If that's the case then yes :)
>
> Hmm. Yeah, that was a bit unclear. I mean we track total time with
> min/max, and detailed time not at all. For details, we only track
> count, not time.

Per some off-list discussion with Julien, we have clearly been talking
in slightly different terms. So let's summarize the options into what
they'd actually be:

Option 0: what is in the patch now

Option 1:
jit_count - number of executions using jit
total_jit_time - for sum of functions+inlining+optimization+emission time
min_jit_time - for sum of functions+inlining+optimization+emission time
max_jit_time - for sum of functions+inlining+optimization+emission time
mean_jit_time - for sum of functions+inlining+optimization+emission time
stddev_jit_time - for sum of functions+inlining+optimization+emission time
jit_functions - number of functions
jit_inlining_count - number of executions where inlining happened
jit_optimization_count - number of executions where optimization happened
jit_emission_count - number of executions where emission happened

Option 2:
jit_count
jit_functions
jit_generation_time
jit_inlining_count
jit_inlining_time
jit_optimization_count
jit_optimization_time
jit_emission_count
jit_emission_time


(We can bikeshed on the exact names of the fields once we have decided
which option is preferred)
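
As a rough sketch (names purely illustrative, per the bikeshedding caveat
above), option 2 could land as a per-entry counters struct along these lines:

	#include <stdint.h>

	/* Sketch of option 2's per-entry JIT counters; names are illustrative. */
	typedef struct JitCounters
	{
		int64_t		jit_count;				/* executions that used JIT at all */
		int64_t		jit_functions;			/* total functions JIT-compiled */
		double		jit_generation_time;	/* ms spent generating code */
		int64_t		jit_inlining_count;		/* executions where inlining ran */
		double		jit_inlining_time;		/* ms spent inlining */
		int64_t		jit_optimization_count;	/* executions where optimization ran */
		double		jit_optimization_time;	/* ms spent optimizing */
		int64_t		jit_emission_count;		/* executions where emission ran */
		double		jit_emission_time;		/* ms spent emitting code */
	} JitCounters;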

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Expose JIT counters/timing in pg_stat_statements

2022-02-25 Thread Dmitry Dolgov
> On Fri, Feb 25, 2022 at 04:19:27PM +0100, Magnus Hagander wrote:
> On Fri, Feb 25, 2022 at 2:33 PM Julien Rouhaud  wrote:
> >
> > Hi,
> >
> > On Fri, Feb 25, 2022 at 02:06:29PM +0100, Magnus Hagander wrote:
> > > Here's a patch to add the sum of timings for JIT counters to
> > > pg_stat_statements, as a way to follow-up on if JIT is doing a good or
> > > a bad job in a configuration.
> >
> > +1, it seems like something quite useful.
>
> Given the amount of time often spent debugging JIT -- getting more
> insight is going to make it easier to tune it instead of like what
> unfortunately many people do and just turn it off..

Indeed, that sounds convenient, although I wonder how exactly one would use it
to tune JIT? I'm curious because I'm used to situations where one
single long query takes much longer than expected due to JIT issues --
but it seems the target of this patch is situations where there are a
lot of long queries using JIT, and it's easier to analyze them via pgss?

> > > I decided to only store the total time for the timings, since there
> > > are 4 different timings and storing max/min/etc for each one of them
> > > would lead to a bit too much data. This can of course be reconsidered,
> > > but I think that's a reasonable tradeoff.
> >
> > I think the cumulated total time is enough.  Looking at the patch, I think 
> > we
> > should also cumulate the number of times jit was triggered, and
> > probably the same for each other main operation (optimization and inlining).
> > Otherwise the values may be wrong and look artificially low.
>
> So just to be clear, you're basically thinking:
>
> jit_count = count of entries where jit_functions>0
> jit_functions = 
> jit_optimizations = count of entries where time spent on jit_optimizations > 0

One interesting not-very-relevant for the patch thing I've noticed while
reading it, is that there seems to be no way to find out what fraction
of time jit_tuple_deforming is taking alone, it's sort of merged
together with jit_expressions in generation_counter.




Re: Readd use of TAP subtests

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 09:43:20 -0500, Andrew Dunstan wrote:
> AIUI TAP consumers are supposed to ignore lines they don't understand.

Are they?

In http://testanything.org/tap-version-13-specification.html there's:

"Lines written to standard output matching /^(not )?ok\b/ must be interpreted
as test lines. [...] All other lines must not be considered test output."

That says that all other lines aren't "test output". But what does that mean?
It certainly doesn't mean they can just be ignored, because obviously
^(TAP version|#|1..|Bail out) isn't to be ignored.

And then there's:


"
Anything else

  Any output that is not a version, a plan, a test line, a YAML block, a
  diagnostic or a bail out is incorrect. How a harness handles the incorrect
  line is undefined. Test::Harness silently ignores incorrect lines, but will
  become more stringent in the future. TAP::Harness reports TAP syntax errors at
  the end of a test run.
"

If I were to implement a tap parser, this wouldn't make me ignore lines.


Contrasting to that:
http://testanything.org/tap-specification.html

"
Anything else

  A TAP parser is required to not consider an unknown line as an error but may
  optionally choose to capture said line and hand it to the test harness,
  which may have custom behavior attached. This is to allow for forward
  compatability. Test::Harness silently ignores incorrect lines, but will
  become more stringent in the future. TAP::Harness reports TAP syntax errors
  at the end of a test run.
"

I honestly don't know what to make of that. Parsers are supposed to ignore
unknown lines, Test::Harness silently ignores, but also "TAP::Harness reports
TAP syntax errors"?  This seems like a self contradictory mess.

Greetings,

Andres Freund




Re: Expose JIT counters/timing in pg_stat_statements

2022-02-25 Thread Julien Rouhaud
On Fri, Feb 25, 2022 at 05:38:45PM +0100, Magnus Hagander wrote:
>
> Per some off-list discussion with Julien, we have clearly been talking
> in slightly different terms. So let's summarize the options into what
> they'd actually be:
>
> Option 0: what is in the patch now
>
> Option 1:
> jit_count - number of executions using jit
> total_jit_time - for sum of functions+inlining+optimization+emission time
> min_jit_time - for sum of functions+inlining+optimization+emission time
> max_jit_time - for sum of functions+inlining+optimization+emission time
> mean_jit_time - for sum of functions+inlining+optimization+emission time
> stddev_jit_time - for sum of functions+inlining+optimization+emission time
> jit_functions - number of functions
> jit_inlining_count - number of executions where inlining happened
> jit_optimization_count - number of executions where optimization happened
> jit_emission_count - number of executions where emission happened
>
> Option 2:
> jit_count
> jit_functions
> jit_generation_time
> jit_inlining_count
> jit_inlining_time
> jit_optimization_count
> jit_optimization_time
> jit_emission_count
> jit_emission_time

I'm for option 2, I think it's important to have the timing details for
inlining and optimization and be able to compute correct stats.




Mingw task for Cirrus CI

2022-02-25 Thread Melih Mutlu
Hi All,

I've been working on adding a Windows+MinGW environment to the cirrus-ci tasks
(the discussion about CI is here [1]).
It uses MSYS2 to set up the environment. UCRT is chosen as the C standard
library, instead of MSVCRT.
The task configures postgres with the features that are available in MSYS2 (see
available packages [2]), and tap tests are enabled.
I already added the necessary docker image; you can find the related PR at
[3] and a successful cirrus-ci run with these changes at [4].
The attached patch adds a task that runs on Windows with MinGW for cirrus-ci.

However, I cannot run configure with --with-python, --with-perl or
--with-tcl.
There are two issues I encountered while trying to enable these, e.g. for
--with-python:

1-  python_ldlibrary is set to "libpython3.9.dll.a". So the related line in
configure cannot retrieve "ldlibrary"
2-  src/pl/plpython/Makefile looks under "C:/Windows/system32/" for
PYTHONDLL. gendef cannot open that file, giving an error like " failed to
open()" when creating a .def file.

To overcome those, I added the correct pattern to extract ldlibrary by
appending  "-e 's/\.dll.a$//'" at the end of the related line.
Then, I tried to use the python dll from MSYS instead of windows/system32 by
changing the PYTHONDLL path to "/ucrt64/bin/libpython3.9.dll". I'm not sure if
this is the correct dll.
Here is the diff of these two changes:

diff --git a/configure b/configure
index f3cb5c2b51..42ea580442 100755
--- a/configure
+++ b/configure
@@ -10536,7 +10536,7 @@ python_libdir=`${PYTHON} -c "import sysconfig; print(' '.join(filter(None,syscon
 python_ldlibrary=`${PYTHON} -c "import sysconfig; print(' '.join(filter(None,sysconfig.get_config_vars('LDLIBRARY'))))"`
 
 # If LDLIBRARY exists and has a shlib extension, use it verbatim.
-ldlibrary=`echo "${python_ldlibrary}" | sed -e 's/\.so$//' -e 's/\.dll$//' -e 's/\.dylib$//' -e 's/\.sl$//'`
+ldlibrary=`echo "${python_ldlibrary}" | sed -e 's/\.so$//' -e 's/\.dll$//' -e 's/\.dylib$//' -e 's/\.sl$//' -e 's/\.dll.a$//'`
 if test -e "${python_libdir}/${python_ldlibrary}" -a x"${python_ldlibrary}" != x"${ldlibrary}"
 then
 	ldlibrary=`echo "${ldlibrary}" | sed "s/^lib//"`
diff --git a/src/pl/plpython/Makefile b/src/pl/plpython/Makefile
index a83ae8865c..4254ef94d7 100644
--- a/src/pl/plpython/Makefile
+++ b/src/pl/plpython/Makefile
@@ -61,7 +61,8 @@ INCS =plpython.h \
 ifeq ($(PORTNAME), win32)

 pytverstr=$(subst .,,${python_version})
-PYTHONDLL=$(subst \,/,$(WINDIR))/system32/python${pytverstr}.dll
+#PYTHONDLL=$(subst \,/,$(WINDIR))/system32/python${pytverstr}.dll
+PYTHONDLL=/ucrt64/bin/libpython3.9.dll

 OBJS += libpython${pytverstr}.a



In the end, make check-world still fails, even though I was able to run
configure and make without any obvious error.
I see a bunch of errors in tests like:
+ERROR:  language "plpython3u" does not exist
+HINT:  Use CREATE EXTENSION to load the language into the database.

Here is the logs from failed ci run:
https://api.cirrus-ci.com/v1/artifact/task/4645682031099904/log/build/src/pl/plpython/regression.diffs


Any thoughts on how postgres can be built with --with-python etc. on mingw?

Best,
Melih


[1]
https://www.postgresql.org/message-id/flat/20211001222752.wrz7erzh4cajvgp6%40alap3.anarazel.de

[2] https://packages.msys2.org/package/

[3] https://github.com/anarazel/pg-vm-images/pull/8

[4] https://cirrus-ci.com/build/4999469182746624


Re: Add parameter jit_warn_above_fraction

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 17:28:41 +0100, Magnus Hagander wrote:
> On Fri, Feb 25, 2022 at 5:20 PM Andres Freund  wrote:
> > On 2022-02-25 16:16:01 +0100, Magnus Hagander wrote:
> > > This patch adds a configuration parameter jit_warn_above_fraction that
> > > will cause a warning to be logged if the fraction of time spent on
> > > doing JIT is bigger than the specified one. For example, this can be
> > > used to track down those cases where JIT ends up taking 90% of the
> > > query runtime because of bad estimates...
> >
> > Hm. Could it make sense to do this as an auto_explain feature?
> 
> It could be. But I was looking for something a lot more "light weight"
> than having to install an extension. But yes, if we wanted to, we
> could certainly change jit_warn_above_fraction to be
> auto_explain.log_min_jit_fraction or something like that, and do
> basically the same thing.  But then, we could also have
> log_min_duration_statement be part of auto_explain instead, so it's
> all about where to draw the line :)

I guess it feels a tad on the "too narrow/specific" side of things for the
general code. We don't have log_min_duration_{parsing,planning,execution}
either.  But I also get it. So I just wanted to raise it ;)

Greetings,

Andres Freund




Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 12:51:22 +0300, Yura Sokolov wrote:
> > > +  * The usage_count starts out at 1 so that the buffer can survive one
> > > +  * clock-sweep pass.
> > > +  *
> > > +  * We use direct atomic OR instead of Lock+Unlock since no other backend
> > > +  * could be interested in the buffer. But StrategyGetBuffer,
> > > +  * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them 
> > > to
> > > +  * compare tag, and UnlockBufHdr does raw write to state. So we have to
> > > +  * spin if we found buffer locked.
> > 
> > So basically the first half of of the paragraph is wrong, because no, we
> > can't?
> 
> Logically, there are no backends that could be interested in the buffer.
> Physically they do LockBufHdr/UnlockBufHdr just to check they are not
> interested.

Yea, but that's still being interested in the buffer...


> > > +  * Note that we write tag unlocked. It is also safe since there is 
> > > always
> > > +  * check for BM_VALID when tag is compared.
> > 
> > 
> > >*/
> > >   buf->tag = newTag;
> > > - buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> > > -BM_CHECKPOINT_NEEDED | BM_IO_ERROR | 
> > > BM_PERMANENT |
> > > -BUF_USAGECOUNT_MASK);
> > >   if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == 
> > > INIT_FORKNUM)
> > > - buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > > + new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > >   else
> > > - buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > > -
> > > - UnlockBufHdr(buf, buf_state);
> > > + new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > >  
> > > - if (oldPartitionLock != NULL)
> > > + buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> > > + while (unlikely(buf_state & BM_LOCKED))
> > 
> > I don't think it's safe to atomic in arbitrary bits. If somebody else has
> > locked the buffer header in this moment, it'll lead to completely bogus
> > results, because unlocking overwrites concurrently written contents (which
> > there shouldn't be any, but here there are)...
> 
> That is why there is a safety loop for the case where buf->state was locked just
> after the first optimistic atomic_fetch_or. 99.999% of the time this loop will
> not have a job. But in case another backend did lock buf->state, the loop waits
> until it releases the lock and retries atomic_fetch_or.

> > And or'ing contents in also doesn't make sense because we it doesn't work to
> > actually unset any contents?
> 
> Sorry, I didn't understand sentence :((


You're OR'ing multiple bits into buf->state. LockBufHdr() only ORs in
BM_LOCKED. ORing BM_LOCKED is fine:
Either the buffer is not already locked, in which case it just sets the
BM_LOCKED bit, acquiring the lock. Or it doesn't change anything, because
BM_LOCKED already was set.

But OR'ing in multiple bits is *not* fine, because it'll actually change the
contents of ->state while the buffer header is locked.
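
Spelled out as a schematic interleaving (illustration only, not actual code
paths; A and B are concurrent backends):

	uint32		old;

	old = LockBufHdr(buf);				/* A: header locked, old is A's snapshot */
										/* B: pg_atomic_fetch_or_u32(&buf->state,
										 *      BM_TAG_VALID | BUF_USAGECOUNT_ONE);
										 *    B's bits are now set in ->state */
	UnlockBufHdr(buf, old | BM_DIRTY);	/* A: plain write of A's computed value;
										 *    B's bits are silently lost */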


> > Why don't you just use LockBufHdr/UnlockBufHdr?
> 
> This pair makes two atomic writes to memory. Two writes are heavier than
> one write in this version (if the optimistic case succeeds).

UnlockBufHdr doesn't use a locked atomic op. It uses a write barrier and an
unlocked write.
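
For reference, UnlockBufHdr is (roughly):

	#define UnlockBufHdr(desc, s)	\
		do {	\
			pg_write_barrier(); \
			pg_atomic_write_u32(&(desc)->state, (s) & (~BM_LOCKED)); \
		} while (0)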

Greetings,

Andres Freund




Re: should vacuum's first heap pass be read-only?

2022-02-25 Thread Peter Geoghegan
On Fri, Feb 25, 2022 at 5:06 AM Dilip Kumar  wrote:
> Based on this discussion, IIUC, we are saying that now we will do the
> lazy_scan_heap every time like we are doing now.  And we will
> conditionally skip the index vacuum for all or some of the indexes and
> then based on how much index vacuum is done we will conditionally do
> the lazy_vacuum_heap_rel().  Is my understanding correct?

I can only speak for myself, but that sounds correct to me. IMO what
we really want here is to create lots of options for VACUUM. To do the
work of index vacuuming when it is most convenient, based on very
recent information about what's going on in each index. There at some
specific obvious ways that it might help. For example, it would be
nice if the failsafe could not really skip index vacuuming -- it could
just put it off until later, after relfrozenxid has been advanced to a
safe value.

Bear in mind that the cost of lazy_scan_heap is often vastly less than
the cost of vacuuming all indexes -- and so doing a bit more work
there than theoretically necessary is not necessarily a problem.
Especially if you have something like UUID indexes, where there is no
natural locality. Many tables have 10+ indexes. Even large tables.

> IMHO, if we are doing the heap scan every time and then we are going
> to get the same dead items again which we had previously collected in
> the conveyor belt.  I agree that we will not add them again into the
> conveyor belt but why do we want to store them in the conveyor belt
> when we want to redo the whole scanning again?

I don't think we want to, exactly. Maybe it's easier to store
redundant TIDs than to avoid storing them in the first place (we can
probably just accept some redundancy). There is a trade-off to be made
there. I'm not at all sure of what the best trade-off is, though.

> I think (without global indexes) the main advantage of using the
> conveyor belt is that if we skip the index scan for some of the
> indexes then we can save the dead items somewhere so that without
> scanning the heap again we have those dead items to do the index
> vacuum sometime in the future

Global indexes are important in their own right, but ISTM that they
have similar needs to other things anyway. Having this flexibility is
even more important with global indexes, but the concepts themselves
are similar. We want options and maximum flexibility, everywhere.

> but if you are going to rescan the heap
> again next time before doing any index vacuuming then why do we want to
> store them anyway?

It all depends, of course. The decision needs to be made using a cost
model. I suspect it will be necessary to try it out, and see.

--
Peter Geoghegan




Re: Readd use of TAP subtests

2022-02-25 Thread Andrew Dunstan


On 2/25/22 11:41, Andres Freund wrote:
> Hi,
>
> On 2022-02-25 09:43:20 -0500, Andrew Dunstan wrote:
>> AIUI TAP consumers are supposed to ignore lines they don't understand.
> Are they?
>
> In http://testanything.org/tap-version-13-specification.html there's:
>
> "Lines written to standard output matching /^(not )?ok\b/ must be interpreted
> as test lines. [...] All other lines must not be considered test output."
>
> That says that all other lines aren't "test output". But what does that mean?
> It certainly doesn't mean they can just be ignored, because obviously
> ^(TAP version|#|1..|Bail out) isn't to be ignored.



I don't think we're following spec 13.


>
> And then there's:
>
>
> "
> Anything else
>
>   Any output that is not a version, a plan, a test line, a YAML block, a
>   diagnostic or a bail out is incorrect. How a harness handles the incorrect
>   line is undefined. Test::Harness silently ignores incorrect lines, but will
>   become more stringent in the future. TAP::Harness reports TAP syntax errors 
> at
>   the end of a test run.
> "
>
> If I were to to implement a tap parser this wouldn't make me ignore lines.
>
>
> Contrasting to that:
> http://testanything.org/tap-specification.html
>
> "
> Anything else
>
>   A TAP parser is required to not consider an unknown line as an error but may
>   optionally choose to capture said line and hand it to the test harness,
>   which may have custom behavior attached. This is to allow for forward
>   compatability. Test::Harness silently ignores incorrect lines, but will
>   become more stringent in the future. TAP::Harness reports TAP syntax errors
>   at the end of a test run.
> "
>
> I honestly don't know what to make of that. Parsers are supposed to ignore
> unknown lines, Test::Harness silently ignores, but also "TAP::Harness reports
> TAP syntax errors"?  This seems like a self contradictory mess.
>

I agree it's a mess. Both of these "specs" are incredibly loose. I guess
that happens when the spec comes after the implementation.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Jeff Davis
On Thu, 2022-02-24 at 19:47 -0800, Andres Freund wrote:
> Why is it restricted to that? You could do sasl negotiation as well
> from what
> I can see? And that'd theoretically also allow to negotiate whether
> the client
> supports different ways of doing auth?  Not saying that that's easy,
> but I
> don't think it's a fundamental restriction.

Good point! It would only work with enhanced clients though -- maybe in
the future we'd make libpq pluggable with new auth methods?

> We have several useful authentication technologies built ontop of
> plaintext
> exchange. Radius, Ldap, Pam afaics could be implemented as an
> extension?

Yes, and it means that we won't have to extend that list in core in the
future when new methods become popular.

Regards,
Jeff Davis






Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Jeff Davis
On Thu, 2022-02-24 at 20:47 -0500, Tom Lane wrote:
> ... and, since we can't readily enforce that the client only sends
> those cleartext passwords over suitably-encrypted connections, this
> could easily be a net negative for security.  Not sure that I think
> it's a good idea.

I don't understand your point. Can't you just use "hostssl" rather than
"host"?

Also there are some useful cases that don't really require SSL, like
when the client and host are on the same machine, or if you have a
network secured some other way.

Regards,
Jeff Davis






Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Tom Lane
Jeff Davis  writes:
> On Thu, 2022-02-24 at 20:47 -0500, Tom Lane wrote:
>> ... and, since we can't readily enforce that the client only sends
>> those cleartext passwords over suitably-encrypted connections, this
>> could easily be a net negative for security.  Not sure that I think
>> it's a good idea.

> I don't understand your point. Can't you just use "hostssl" rather than
> "host"?

My point is that sending cleartext passwords over the wire is an
insecure-by-definition protocol that we shouldn't be encouraging
more use of.

regards, tom lane




Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 09:33:45 -0800, Jeff Davis wrote:
> On Thu, 2022-02-24 at 20:47 -0500, Tom Lane wrote:
> > ... and, since we can't readily enforce that the client only sends
> > those cleartext passwords over suitably-encrypted connections, this
> > could easily be a net negative for security.  Not sure that I think
> > it's a good idea.
> 
> I don't understand your point. Can't you just use "hostssl" rather than
> "host"?

And the extension could check Port->ssl_in_use before 
sendAuthRequest(AUTH_REQ_PASSWORD)
if it wanted to restrict it.
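
E.g. something shaped like this in a custom handler (a sketch only -- whether
the hook can reach the Port and sendAuthRequest() is exactly the kind of API
detail that isn't settled yet):

	/* Sketch: refuse cleartext exchange on unencrypted TCP connections. */
	if (!port->ssl_in_use &&
		!IS_AF_UNIX(port->raddr.addr.ss_family))
		ereport(FATAL,
				(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
				 errmsg("cleartext password exchange requires an encrypted or local connection")));

	sendAuthRequest(port, AUTH_REQ_PASSWORD, NULL, 0);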

Greetings,

Andres Freund




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-25 Thread Matthias van de Meent
On Fri, 25 Feb 2022 at 17:35, Julien Rouhaud  wrote:
>
> On Fri, Feb 25, 2022 at 08:53:50PM +0530, Nitin Jadhav wrote:
> > >
> > > I'm not sure what Matthias meant, but as far as I know there's no 
> > > fundamental
> > > difference between checkpoint with and without the CHECKPOINT_IMMEDIATE 
> > > flag,
> > > and there's also no scheduling for multiple checkpoints.
> > >
> > > Yes, the flags will remain the same but checkpoint.c will test both the 
> > > passed
> > > flags and the shmem flags to see whether a delay should be added or not, 
> > > which
> > > is the only difference in checkpoint processing for this flag.  See the 
> > > call to
> > > ImmediateCheckpointRequested() which will look at the value in shmem:
> > >
> > >/*
> > > * Perform the usual duties and take a nap, unless we're behind 
> > > schedule,
> > > * in which case we just try to catch up as quickly as possible.
> > > */
> > >if (!(flags & CHECKPOINT_IMMEDIATE) &&
> > >!ShutdownRequestPending &&
> > >!ImmediateCheckpointRequested() &&
> > >IsCheckpointOnSchedule(progress))
> >
> > I understand that the checkpointer considers flags as well as the
> > shmem flags and if CHECKPOINT_IMMEDIATE flag is set, it affects the
> > current checkpoint operation (No further delay) but does not change
> > the current flag value. Should we display this change in the kind
> > field of the view or not? Please share your thoughts.
>
> I think the fields should be added.  It's good to know that a checkpoint was
> triggered due to normal activity and should be spread, and then something
> upgraded it to an immediate checkpoint.  If you're desperately waiting for the
> end of a checkpoint for some reason and ask for an immediate checkpoint, 
> you'll
> certainly be happy to see that the checkpointer is aware of it.
>
> But maybe I missed something in the code, so let's wait for Matthias' input
> about it.

The point I was trying to make was "If cps->ckpt_flags is
CHECKPOINT_IMMEDIATE, we hurry up to start the new checkpoint that is
actually immediate". That doesn't mean that this checkpoint was
created with IMMEDIATE or running using IMMEDIATE, only that optional
delays are now being skipped instead.

To let the user detect _why_ the optional delays are now being
skipped, I propose not to report this currently running checkpoint's
"flags | CHECKPOINT_IMMEDIATE", but to add reporting of the next
checkpoint's flags; which would allow the detection and display of the
CHECKPOINT_IMMEDIATE we're actually hurrying for (plus some more
interesting information flags).

-Matthias

PS. I just noticed that the checkpoint flags are also being parsed and
stringified twice in LogCheckpointStart; and adding another duplicate
in the current code would put that at 3 copies of effectively the same
code. Do we maybe want to deduplicate that into macros, similar to
LSN_FORMAT_ARGS?
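
(E.g., hypothetically, so the format string and arguments are written once:

	/* Hypothetical helpers to stringify checkpoint flags exactly once. */
	#define CKPT_FLAGS_FORMAT "%s%s%s%s%s%s%s%s"
	#define CKPT_FLAGS_ARGS(flags) \
		(((flags) & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : ""), \
		(((flags) & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : ""), \
		(((flags) & CHECKPOINT_IMMEDIATE) ? " immediate" : ""), \
		(((flags) & CHECKPOINT_FORCE) ? " force" : ""), \
		(((flags) & CHECKPOINT_WAIT) ? " wait" : ""), \
		(((flags) & CHECKPOINT_CAUSE_XLOG) ? " wal" : ""), \
		(((flags) & CHECKPOINT_CAUSE_TIME) ? " time" : ""), \
		(((flags) & CHECKPOINT_FLUSH_ALL) ? " flush-all" : "")

would keep the call sites in sync.)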




Re: convert libpq uri-regress tests to tap test

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-24 08:46:23 -0800, Andres Freund wrote:
> I'm mildly inclined to only do 0001 and 0002 for now. We'd not lose msvc
> coverage, because it already doesn't build the test. Once we've ironed that
> stuff out, we could do 0003?

From what I can see in the buildfarm client, we'd not lose (nor gain) any
buildfarm coverage either. It doesn't run the test today.

Greetings,

Andres Freund




Re: [PATCH] Add reloption for views to enable RLS

2022-02-25 Thread Dean Rasheed
On Fri, 18 Feb 2022 at 14:57, Laurenz Albe  wrote:
>
> Here is a new version, with improved documentation and the option renamed
> to "check_permissions_owner".  I just prefer the shorter form.
>

Re-reading this thread, I think I preferred the name
"security_invoker". The main objection seemed to come from the
potential confusion with SECURITY INVOKER/DEFINER functions, but I
think that's really a different thing. As long as the documentation
for the default behaviour is clear (which I think it was), then it
should be easy to explain how a security invoker view behaves
differently. Also, there's value in using the same terminology as
other databases, because many users will already be familiar with the
feature from those databases.

Some other review comments:

1). This new comment:

+   
+Be aware that USAGE privileges on schemas containing
+the underlying base relations are not checked.
+   

is not entirely accurate. It's more accurate to say that a user
creating or replacing a view must have CREATE privileges on the schema
containing the view and USAGE privileges on any schemas referred to in
the view query, whereas a user using the view only needs USAGE
privileges on the schema containing the view.

(Note that, for the view creator, USAGE is required on any schema
referred to in the query -- e.g., schemas containing functions as well
as base relations.)

2). The patch is adding a new field to RangeTblEntry which seems to be
unnecessary -- it's set, and copied around, but never read, so it
should just be removed.

3). Looking at this change:

-setRuleCheckAsUser((Node *) rule->actions, relation->rd_rel->relowner);
-setRuleCheckAsUser(rule->qual, relation->rd_rel->relowner);
+if (!(relation->rd_rel->relkind == RELKIND_VIEW
+  && !RelationSubqueryCheckPermsOwner(relation)))
+{
+setRuleCheckAsUser((Node *) rule->actions, relation->rd_rel->relowner);
+setRuleCheckAsUser(rule->qual, relation->rd_rel->relowner);
+}

I think it should call setRuleCheckAsUser() in all cases. It might be
true that the rule fetched has checkAsUser set to InvalidOid
throughout its action and quals, but it seems unwise to rely on that
-- better to code defensively and explicitly set it in all cases.
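
Concretely, something shaped like this (a sketch, not the patch as posted):

	/* Always call setRuleCheckAsUser, varying only the user id;
	 * InvalidOid means "check permissions as the invoking user". */
	Oid			check_as_user = relation->rd_rel->relowner;

	if (relation->rd_rel->relkind == RELKIND_VIEW &&
		!RelationSubqueryCheckPermsOwner(relation))
		check_as_user = InvalidOid;

	setRuleCheckAsUser((Node *) rule->actions, check_as_user);
	setRuleCheckAsUser(rule->qual, check_as_user);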

4). In the same code block, I think the new behaviour should be
applied to SELECT rules only. The view may have other non-SELECT rules
(just as a table may have non-SELECT rules), created using CREATE
RULE, but their actions are independent of the view definition.
Currently their permissions are checked as the view/table owner, and
if anyone wanted to change that, it should be an option on the rule,
not the view (just as triggers can be made security definer or
invoker, depending on how the trigger function is defined).

(Note: I'm not suggesting that anyone actually spend any time adding
such an option to rules. Given all the pitfalls associated with rules,
I think their use should be discouraged, and no development effort
should be expended enhancing them.)

5). In the same function, the block of code that fetches rules and
triggers has been moved. I think it would be worth adding a comment to
explain why it's now important to extract the reloptions *before*
fetching the relation's rules and triggers.

6). The second set of tests added to rowsecurity.sql seems to have
nothing to do with RLS, and probably belong in updatable_views.sql,
and I think it would be worth adding a few more tests for things like
views on top of views.

Regards,
Dean




Re: Some optimisations for numeric division

2022-02-25 Thread Tom Lane
Dean Rasheed  writes:
> And another update following feedback from the cfbot.

This patchset LGTM.  On my machine there doesn't seem to be any
measurable performance change for the numeric regression test,
but numeric_big gets about 15% faster.

The only additional thought I have, based on your comments about
0001, is that maybe in mul_var we should declare var1digit as
being NumericDigit not int.  I tried that here and didn't see
any material change in the generated assembly code (using gcc
8.5.0), so you're right that modern gcc doesn't need any help
there; but I wonder if it could help on other compiler versions.

I'll mark this RFC.

regards, tom lane




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-25 Thread Julien Rouhaud
On Fri, Feb 25, 2022 at 06:49:42PM +0100, Matthias van de Meent wrote:
>
> The point I was trying to make was "If cps->ckpt_flags is
> CHECKPOINT_IMMEDIATE, we hurry up to start the new checkpoint that is
> actually immediate". That doesn't mean that this checkpoint was
> created with IMMEDIATE or running using IMMEDIATE, only that optional
> delays are now being skipped instead.

Ah, I now see what you mean.

> To let the user detect _why_ the optional delays are now being
> skipped, I propose not to report this currently running checkpoint's
> "flags | CHECKPOINT_IMMEDIATE", but to add reporting of the next
> checkpoint's flags; which would allow the detection and display of the
> CHECKPOINT_IMMEDIATE we're actually hurrying for (plus some more
> interesting information flags.

I'm still not convinced that's a sensible approach.  The next checkpoint will
be displayed in the view as CHECKPOINT_IMMEDIATE, so you will then know about
it.  I'm not sure that having that specific information in the view is
going to help, especially if users have to understand "a slow checkpoint is
actually fast even if it's displayed as slow if the next checkpoint is going to
be fast".  Saying "it's timed" (which imply slow) and "it's fast" is maybe
still counter intuitive, but at least have a better chance to see there's
something going on and refer to the doc if you don't get it.




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-25 Thread Julien Rouhaud
On Sat, Feb 26, 2022 at 02:30:36AM +0800, Julien Rouhaud wrote:
> On Fri, Feb 25, 2022 at 06:49:42PM +0100, Matthias van de Meent wrote:
> >
> > The point I was trying to make was "If cps->ckpt_flags is
> > CHECKPOINT_IMMEDIATE, we hurry up to start the new checkpoint that is
> > actually immediate". That doesn't mean that this checkpoint was
> > created with IMMEDIATE or running using IMMEDIATE, only that optional
> > delays are now being skipped instead.
> 
> Ah, I now see what you mean.
> 
> > To let the user detect _why_ the optional delays are now being
> > skipped, I propose not to report this currently running checkpoint's
> > "flags | CHECKPOINT_IMMEDIATE", but to add reporting of the next
> > checkpoint's flags; which would allow the detection and display of the
> > CHECKPOINT_IMMEDIATE we're actually hurrying for (plus some more
> > interesting information flags.
> 
> I'm still not convinced that's a sensible approach.  The next checkpoint will
> be displayed in the view as CHECKPOINT_IMMEDIATE, so you will then know about
> it.  I'm not sure that having that specific information in the view is
> going to help, especially if users have to understand "a slow checkpoint is
> actually fast even if it's displayed as slow if the next checkpoint is going to
> be fast".  Saying "it's timed" (which implies slow) and "it's fast" is maybe
> still counter-intuitive, but at least you have a better chance to see there's
> something going on and refer to the docs if you don't get it.

Just to be clear, I do think that it's worthwhile to add some information that
some backends are waiting for that next checkpoint.  As discussed before, an
int for the number of backends looks like enough information to me.




Re: Commitfest manager for 2022-03

2022-02-25 Thread Greg Stark
I would like to volunteer.

On Fri, 25 Feb 2022 at 05:31, Julien Rouhaud  wrote:
>
> Hi,
>
> The final commitfest for pg15 will start in a few days, and I didn't see any
> discussion on it or anyone volunteering to be a CFM.
>
> I thought it would be a good idea to send this reminder now and avoid the same
> situation as the last commitfest, to avoid unnecessary pain for the CFM(s).
>
> Is there any volunteer?
>
> For the record there are already 246 active patches registered.
>
>


-- 
greg




Re: trigger example for plsample

2022-02-25 Thread Chapman Flack
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

This patch is straightforward, does what it says, and passes the tests.

Regarding the duplication of code between plsample_func_handler and
plsample_trigger_handler, perhaps that's for the best for now, as 3554 in
the same commitfest also touches plsample, so merge conflicts may be
minimized by not doing more invasive refactoring.

That would leave low-hanging fruit for a later patch that could refactor
plsample to reduce the duplication (maybe adding a validator at the same
time? That would also duplicate some of the checks in the existing handlers.)

I am not sure that structuring the trigger handler with separate compile and
execute steps is worth the effort for a simple example like plsample. The main
plsample_func_handler is not so structured.

It's likely that many real PLs will have some notion of compilation separate 
from
execution. But those will also have logic to do the compilation only once, and
somewhere to cache the result of that for reuse across calls, and those kinds of
details might make plsample's basic skeleton more complex than needed.

I know that in just looking at expected/plsample.out, I was a little distracted 
by
seeing multiple "compile" messages for the same trigger function in the same
session and wondering why that was.

So maybe it would be simpler and less distracting to assume that the PL targeted
by plsample is one that just has a simple interpreter that works from the 
source text.

Regards,
-Chap

The new status of this patch is: Waiting on Author


Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Jonathan S. Katz

On 2/25/22 12:39 PM, Tom Lane wrote:
> Jeff Davis  writes:
>> On Thu, 2022-02-24 at 20:47 -0500, Tom Lane wrote:
>>> ... and, since we can't readily enforce that the client only sends
>>> those cleartext passwords over suitably-encrypted connections, this
>>> could easily be a net negative for security.  Not sure that I think
>>> it's a good idea.
>
>> I don't understand your point. Can't you just use "hostssl" rather than
>> "host"?
>
> My point is that sending cleartext passwords over the wire is an
> insecure-by-definition protocol that we shouldn't be encouraging
> more use of.

This is my general feeling as well. We just spent a bunch of effort
adding, refining, and making SCRAM the default method. I think doing
anything that would drive more use of sending plaintext passwords, even
over TLS, is counter to that.

I do understand the arguments for it (e.g. systems that require checking
password complexity), but I wonder if it's better for us to delegate
that to an external auth system. Regardless, I can get behind Andres'
point to "check Port->ssl_in_use before sendAuthRequest(AUTH_REQ_PASSWORD)".

I'm generally in favor of being able to support additional
authentication methods; the first one coming to mind is supporting OIDC.
Having a pluggable auth infrastructure could possibly make such efforts
easier. I'm definitely intrigued.

Jonathan




Re: C++ Trigger Framework

2022-02-25 Thread Bruce Momjian
On Tue, Feb 22, 2022 at 11:33:05PM +0200, Shmuel Kamensky wrote:
> Hi Nathan,
> 
> Thanks for the reply. I did indeed read that document and it's a great place 
> to
> get started. But it doesn't quite answer my questions. Do you personally have
> experience writing software in C++ that interacts with Postgres?

As far as I know, it should just work like C.  I know you can compile
the backend server using a C++ compiler because we get bug reports when
we break it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Using operators to do query hints

2022-02-25 Thread Bruce Momjian
On Tue, Feb 22, 2022 at 04:12:15PM -0500, Greg Stark wrote:
> I've been playing with an idea I had a while back. Basically that it
> would be useful to have some "noop" operators that are used purely to
> influence the planner.
> 
> For context I've suggested in the past that there are two categories of hints:
> 
> 1 Hints that override the planner's decisions with explicit planning
> decisions. These we don't like for a variety of reasons, notably
> because users don't have very good knowledge of all the plan types
> available and new plan types come out in the future.
> 
> 2 Hints that just influence the estimates that the planner uses to
> make its decisions. These seem more acceptable as it amounts to
> admitting the users know the distribution of their data and the
> behaviour of their functions better than the statistics. Also, there
> are plenty of cases where the statistics are really just wild guesses.
> They still allow the planner to make decisions about what join orders
> or types and so on given the updated data.

#2 is an interesting distinction.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Two noncritical bugs of pg_waldump

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-14 18:18:47 +0900, Kyotaro Horiguchi wrote:
> > If I give an empty file to the tool it complains as the follows.
> > 
> > > pg_waldump: fatal: could not read file "hoge": No such file or directory
> > 
> > No, the file exists.  The cause is that it reads uninitialized errno to
> > detect errors from the system call.  read(2) is defined to set errno
> > always when it returns -1 and doesn't otherwise. Thus it seems to me
> > that it is better to check that the return value is less than zero
> > than to clear errno before the call to read().
> 
> So I post a patch contains only the indisputable part.

Thanks for the report and fix. Pushed. This was surprisingly painful, all but
one branch had conflicts...

Greetings,

Andres Freund




Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Jeff Davis
On Fri, 2022-02-25 at 12:39 -0500, Tom Lane wrote:
> My point is that sending cleartext passwords over the wire is an
> insecure-by-definition protocol that we shouldn't be encouraging
> more use of.

We can require custom auth entries in pg_hba.conf to also specify
local, hostssl or hostgssenc.

It might annoy people who have a network secured at some other layer,
or who have the client on the same machine as the host. We could allow
plain "host" if someone specifies "customplain" explicitly.

Regards,
Jeff Davis






Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread David Christensen
Bharath Rupireddy  writes:

> On Fri, Feb 25, 2022 at 12:36 AM David Christensen
>  wrote:
>>
>> Greetings,
>>
>> This patch adds the ability to specify a RelFileNode and optional BlockNum 
>> to limit output of
>> pg_waldump records to only those which match the given criteria.  This 
>> should be more performant
>> than `pg_waldump | grep` as well as more reliable given specific variations 
>> in output style
>> depending on how the blocks are specified.
>>
>> This currently affects only the main fork, but we could presumably add the 
>> option to filter by fork
>> as well, if that is considered useful.
>
> Thanks for the patch. This is not adding something that users can't do
> right now, but definitely improves the usability of the pg_waldump as
> it avoids external filterings. Also, it can give the stats/info at
> table and block level. So, +1 from my side.

Attached is V2 incorporating the feedback from this email, as well as making
the ForkNumber and FPW specifiable options.

Best,

David

>From 1b04f04317d364006371bdab0db9086f79138b25 Mon Sep 17 00:00:00 2001
From: David Christensen 
Date: Fri, 25 Feb 2022 12:52:56 -0600
Subject: [PATCH] Add additional filtering options to pg_waldump

This feature allows you to only output records that are targeting a specific RelFileNode and optional
BlockNumber within this relation, while specifying which ForkNum you want to filter to.

We also add the independent ability to filter via Full Page Write.
---
 doc/src/sgml/ref/pg_waldump.sgml |  48 
 src/bin/pg_waldump/pg_waldump.c  | 128 ++-
 2 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
index 5735a161ce..f157175764 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -100,6 +100,44 @@ PostgreSQL documentation
   
  
 
+ 
+  -k block
+  --block=block
+  
+   
+Display only records touching the given block. (Requires also
+providing the relation via --relation.)
+   
+  
+ 
+
+ 
+  --fork=fork
+  
+   
+When using the --relation filter, output only records
+from the given fork.  The valid values here are 0
+for the main fork, 1 for the Free Space
+Map, 2 for the Visibility Map,
+and 3 for the Init fork.  If unspecified, defaults
+to the main fork.
+   
+  
+ 
+
+ 
+  -l tbl/db/rel
+  --relation=tbl/db/rel
+  
+   
+Display only records touching the given relation.  The relation is
+specified via tablespace OID, database OID, and relfilenode separated
+by slashes, for instance, 1234/12345/12345.  This
+is the same format used for relations in the WAL dump output.
+   
+  
+ 
+
  
   -n limit
   --limit=limit
@@ -183,6 +221,16 @@ PostgreSQL documentation

  
 
+ 
+   -w
+   --fullpage
+   
+   
+   Filter records to only those which have full page writes.
+   
+   
+ 
+
  
   -x xid
   --xid=xid
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..a527cd4dc6 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -55,6 +55,12 @@ typedef struct XLogDumpConfig
 	bool		filter_by_rmgr_enabled;
 	TransactionId filter_by_xid;
 	bool		filter_by_xid_enabled;
+	RelFileNode filter_by_relation;
+	bool		filter_by_relation_enabled;
+	BlockNumber	filter_by_relation_block;
+	bool		filter_by_relation_block_enabled;
+	ForkNumber	filter_by_relation_forknum;
+	bool		filter_by_fpw;
 } XLogDumpConfig;
 
 typedef struct Stats
@@ -394,6 +400,56 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	return count;
 }
 
+/*
+ * Boolean to return whether the given WAL record matches a specific relation and optional block
+ */
+static bool
+XLogRecordMatchesRelationBlock(XLogReaderState *record, RelFileNode matchRnode, BlockNumber matchBlock, ForkNumber matchFork)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		RelFileNode rnode;
+		ForkNumber	forknum;
+		BlockNumber blk;
+
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk);
+
+		if (forknum == matchFork &&
+			matchRnode.spcNode == rnode.spcNode &&
+			matchRnode.dbNode == rnode.dbNode &&
+			matchRnode.relNode == rnode.relNode &&
+			(matchBlock == InvalidBlockNumber || matchBlock == blk))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Boolean to return whether the given WAL record contains a full page write
+ */
+static bool
+XLogRecordHasFPW(XLogReaderState *record)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		if (XL

Re: Fix overflow in justify_interval related functions

2022-02-25 Thread Nathan Bossart
On Fri, Feb 25, 2022 at 10:30:57AM -0500, Joseph Koshakow wrote:
> Just checking because I'm not very familiar with the process,
> are there any outstanding items that I need to do for this patch?

Unless someone has additional feedback, I don't think so.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Tom Lane
Jeff Davis  writes:
> On Fri, 2022-02-25 at 12:39 -0500, Tom Lane wrote:
>> My point is that sending cleartext passwords over the wire is an
>> insecure-by-definition protocol that we shouldn't be encouraging
>> more use of.

> We can require custom auth entries in pg_hba.conf to also specify
> local, hostssl or hostgssenc.

That's just a band-aid, though.  It does nothing for the other
reasons not to want cleartext passwords, notably that you might
be giving your password to a compromised server.

I'm happy to add support for custom auth methods if they can use
a protocol that's safer than cleartext-password.  But if that's the
only feasible option, then we're just encouraging people to use
insecure methods.

I also have in mind here that there has been discussion of giving
libpq a feature to refuse, on the client side, to send cleartext
passwords.  (I don't recall right now if that's been mentioned
publicly or only among the security team, but it's definitely
under consideration.)

So I think this proposal needs more thought.  A server-side hook
alone is not a useful improvement IMO; it's closer to being an
attractive nuisance.

regards, tom lane




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-25 Thread Euler Taveira
On Fri, Feb 25, 2022, at 11:52 AM, Greg Stark wrote:
> On Tue, 25 Jan 2022 at 01:32, Andres Freund  wrote:
> >
> > Hi,
> >
> > I was looking the shared memory stats patch again.
> 
> Can you point me to this thread? I looked for it but couldn't find it.
https://postgr.es/m/20180629.173418.190173462.horiguchi.kyot...@lab.ntt.co.jp


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-22 18:06:24 -0800, Andres Freund wrote:
> On 2022-02-18 15:14:15 -0800, Andres Freund wrote:
> > I'm running out of ideas for how to try to reproduce this. I think we might
> > need some additional debugging information to get more information from the
> > buildfarm.
> 
> > I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing 
> > --verbose
> > to pg_basebackup in $node_primary3->backup(...).
> >
> > It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
> > ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().
> 
> Planning to commit something like the attached.

This did provide us a bit more detail.

Seems to suggest something is holding a problematic lock in a way that I do not 
understand yet:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 
019_replslot_limit.pl LOG:  received replication command: 
CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( 
RESERVE_WAL)
...
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 
019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to 
make
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 
019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 
019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin

last message from 1997084 until the immediate shutdown. Just checking for
temporary replication slots that need to be dropped requires
ReplicationSlotControlLock. Actually dropping also requires
ReplicationSlotAllocationLock.
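
(For reference, the cleanup loop has roughly this shape -- paraphrased from
memory rather than quoted verbatim:

	restart:
		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
		for (int i = 0; i < max_replication_slots; i++)
		{
			ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

			if (s->in_use && s->active_pid == MyProcPid)
			{
				LWLockRelease(ReplicationSlotControlLock);
				/* the actual drop takes ReplicationSlotAllocationLock */
				ReplicationSlotDropPtr(s);
				goto restart;
			}
		}
		LWLockRelease(ReplicationSlotControlLock);
)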

1997084 does have a temporary replication slot to clean up.


2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:35] 
019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to 
make
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:36] 
019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:37] 
019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:38] 
019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: done

1997083 succeeds in doing the cleanup. It does not have a temporary
replication slot, which makes it look like ReplicationSlotAllocationLock
somehow hasn't been released.


2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:39] 
019_replslot_limit.pl DEBUG:  shmem_exit(0): 7 on_shmem_exit callbacks to make
...
2022-02-23 09:09:53.076 EST [2022-02-23 09:09:52 EST 1997072:87] LOG:  received 
immediate shutdown request
...
2022-02-23 09:09:53.095 EST [2022-02-23 09:09:52 EST 1997072:90] DEBUG:  server 
process (PID 1997084) exited with exit code 2

It's *possible*, but not likely, that somehow 1997084 just doesn't get
scheduled for a prolonged amount of time.


We could be more certain if we shut down the cluster in fast rather than
immediate mode. So I'm thinking of doing something like

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
	my ($stdout, $stderr);
	$node_primary3->psql('postgres',
						 "\\a\\t\nSELECT * FROM pg_stat_activity",
						 stdout => \$stdout, stderr => \$stderr);
	diag $stdout, $stderr;
	$node_primary3->stop('fast');
	$node_standby3->stop('fast');
	die "could not determine walsender pid, can't continue";
}

Does that make sense? Better ideas?

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 16:25:01 -0300, Euler Taveira wrote:
> On Fri, Feb 25, 2022, at 11:52 AM, Greg Stark wrote:
> > On Tue, 25 Jan 2022 at 01:32, Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > I was looking the shared memory stats patch again.
> > 
> > Can you point me to this thread? I looked for it but couldn't find it.

> https://postgr.es/m/20180629.173418.190173462.horiguchi.kyot...@lab.ntt.co.jp

I'll post a rebased version as soon as this is resolved... I have a local one,
but it just works by nuking a bunch of tests / #ifdefing out code related to
this.

Greetings,

Andres Freund




Re: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)

2022-02-25 Thread Nathan Bossart
On Fri, Feb 25, 2022 at 08:31:37PM +0530, Bharath Rupireddy wrote:
> Thanks Satya and others for the inputs. Here's the v1 patch that
> basically allows async wal senders to wait until the sync standbys
> report their flush lsn back to the primary. Please let me know your
> thoughts.

I haven't had a chance to look too closely yet, but IIUC this adds a new
function that waits for synchronous replication.  This new function
essentially spins until the synchronous LSN has advanced.

I don't think it's a good idea to block sending any WAL like this.  AFAICT
it is possible that there will be a lot of synchronously replicated WAL
that we can send, and it might just be the last several bytes that cannot
yet be replicated to the asynchronous standbys. I believe this patch will
cause the server to avoid sending _any_ WAL until the synchronous LSN
advances.

Perhaps we should instead just choose the SendRqstPtr based on the current
synchronous LSN.  Presumably there are other things we'd need to consider,
but in general, I think we ought to send as much WAL as possible for a
given call to XLogSendPhysical().
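
Roughly what I have in mind, sketched against XLogSendPhysical() -- here
am_async_walsender and GetSyncFlushRecPtr() are made-up names for
illustration, only GetFlushRecPtr() is the existing API:

	SendRqstPtr = GetFlushRecPtr(NULL);

	/* sketch: clamp the request pointer instead of blocking outright */
	if (am_async_walsender)
	{
		XLogRecPtr	syncFlushPtr = GetSyncFlushRecPtr();	/* hypothetical */

		if (syncFlushPtr < SendRqstPtr)
			SendRqstPtr = syncFlushPtr;
	}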

> I've done pgbench testing to see if the patch causes any problems. I
> ran tests two times, there isn't much difference in the txns per
> seconds (tps), although there's a delay in the async standby receiving
> the WAL, after all, that's the feature we are pursuing.

I'm curious what a longer pgbench run looks like when the synchronous
replicas are in the same region.  That is probably a more realistic
use-case.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Jacob Champion
On Fri, 2022-02-25 at 13:40 -0500, Jonathan S. Katz wrote:
> I'm generally in favor of being able to support additional 
> authentication methods; the first one that comes to mind is OIDC. 
> Having a pluggable auth infrastructure could possibly make such efforts 
> easier. I'm definitely intrigued.

I'm hoping to dust off my OAuth patch and see if it can be ported on
top of this proposal.

--Jacob


Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 14:10:39 -0500, Tom Lane wrote:
> I'm happy to add support for custom auth methods if they can use
> a protocol that's safer than cleartext-password.  But if that's the
> only feasible option, then we're just encouraging people to use
> insecure methods.

It looks like scram can be used without much trouble. All the necessary
infrastructure to implement it without duplication appears to be public.  The
plugin would need to get a secret from wherever and call
  CheckSASLAuth(&pg_be_scram_mech, port, shadow_pass, logdetail);

or if it needs to do something more complicated than CheckSASLAuth(), it can
use AUTH_REQ_SASL{,_FIN,_CONT} itself.

Of course whether it's viable depends on what the custom auth method wants to
do. But it's not a restriction of the authentication plugin model.
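
For illustration, a handler could look roughly like this -- CheckSASLAuth()
and pg_be_scram_mech are the real names from above, while the hook signature
and my_fetch_secret() are assumptions:

	static int
	my_custom_auth(Port *port)
	{
		const char *logdetail = NULL;
		char	   *shadow_pass;

		/* hypothetical: look up a SCRAM secret in an external store */
		shadow_pass = my_fetch_secret(port->user_name);
		if (shadow_pass == NULL)
			return STATUS_ERROR;

		return CheckSASLAuth(&pg_be_scram_mech, port, shadow_pass, &logdetail);
	}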

Greetings,

Andres Freund




Re: Add index scan progress to pg_stat_progress_vacuum

2022-02-25 Thread Nathan Bossart
On Wed, Feb 23, 2022 at 10:41:36AM -0800, Peter Geoghegan wrote:
> On Wed, Feb 23, 2022 at 10:02 AM Imseih (AWS), Sami  
> wrote:
>> If the failsafe kicks in midway through a vacuum, the number indexes_total 
>> will not be reset to 0. If INDEX_CLEANUP is turned off, then the value will 
>> be 0 at the start of the vacuum.
> 
> The way that this works with num_index_scans is that we "round up"
> when there has been non-zero work in lazy_vacuum_all_indexes(), but
> not if the precheck in lazy_vacuum_all_indexes() fails. That seems
> like a good model to generalize from here. Note that this makes
> INDEX_CLEANUP=off affect num_index_scans in much the same way as a
> VACUUM where the failsafe kicks in very early, during the initial heap
> pass. That is, if the failsafe kicks in before we reach lazy_vacuum()
> for the first time (which is not unlikely), or even in the
> lazy_vacuum_all_indexes() precheck, then num_index_scans will remain
> at 0, just like INDEX_CLEANUP=off.
> 
> The actual failsafe WARNING shows num_index_scans, possibly before it
> gets incremented one last time (by "rounding up"). So it's reasonably
> clear how this all works from that context (assuming that the
> autovacuum logging stuff, which reports num_index_scans, outputs a
> report for a table where the failsafe kicked in).

I am confused.  If the failsafe kicks in midway through a vacuum, I
(perhaps naively) would expect indexes_total and indexes_processed to go to
zero, and I'd expect to no longer see the "vacuuming indexes" and "cleaning
up indexes" phases.  Otherwise, how would I know that we are now skipping
indexes?  Of course, you won't have any historical context about the index
work done before failsafe kicked in, but IMO it is misleading to still
include it in the progress view.
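
In code terms, I'd expect something vaguely like this where the failsafe
triggers (the PROGRESS_VACUUM_INDEXES_* names are assumed from the patch
under discussion; the placement is mine):

	if (lazy_check_wraparound_failsafe(vacrel))
	{
		const int	progress_index[] = {
			PROGRESS_VACUUM_INDEXES_TOTAL,
			PROGRESS_VACUUM_INDEXES_PROCESSED
		};
		const int64 progress_val[] = {0, 0};

		/* indexes will be skipped from here on; stop advertising them */
		pgstat_progress_update_multi_param(2, progress_index, progress_val);
	}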

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Commitfest manager for 2022-03

2022-02-25 Thread David Steele

Hi Greg,

On 2/25/22 12:39, Greg Stark wrote:
> I would like to volunteer.
>
> On Fri, 25 Feb 2022 at 05:31, Julien Rouhaud  wrote:
>> Is there any volunteer?

Since I've done the March CF for the last seven years, I thought it 
would be good if I chimed in. I've been agonizing a bit because I have 
travel planned during March and while I *could* do it I don't think I'd 
be able to do a good job.


I've been hoping somebody would volunteer, so I'm all in favor of you
being CFM.


Regards,
-David




Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Tom Lane
Andres Freund  writes:
> Seems to suggest something is holding a problematic lock in a way that I do 
> not understand yet:

> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
> 2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 
> 019_replslot_limit.pl LOG:  received replication command: 
> CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( 
> RESERVE_WAL)
> ...
> 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 
> 019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to 
> make
> 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 
> 019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
> 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 
> 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin

> last message from 1997084 until the immediate shutdown.

Hmm.  Maybe put a couple more debug messages into ReplicationSlotCleanup
and/or ReplicationSlotDropPtr?  It doesn't seem very clear where in that
sequence it's hanging up.
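
Something of this shape, say -- the wording is illustrative only:

	/* e.g. at the top of ReplicationSlotDropPtr() */
	elog(DEBUG1, "replication slot drop: %s: begin", NameStr(slot->data.name));

	/* ... and again after the rename/rmtree steps ... */
	elog(DEBUG1, "replication slot drop: %s: done", NameStr(slot->data.name));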

> We could be more certain if we shut down the cluster in fast rather than
> immediate mode. So I'm thinking of doing something like

Does that risk an indefinite hangup of the buildfarm run?

regards, tom lane




Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Seems to suggest something is holding a problematic lock in a way that I do 
> > not understand yet:
> 
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
> > 2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 
> > 019_replslot_limit.pl LOG:  received replication command: 
> > CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( 
> > RESERVE_WAL)
> > ...
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 
> > 019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks 
> > to make
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 
> > 019_replslot_limit.pl DEBUG:  replication slot exit hook, without active 
> > slot
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 
> > 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin
> 
> > last message from 1997084 until the immediate shutdown.
> 
> Hmm.  Maybe put a couple more debug messages into ReplicationSlotCleanup
> and/or ReplicationSlotDropPtr?  It doesn't seem very clear where in that
> sequence it's hanging up.

Yea, was thinking that as well.


I'm also wondering whether it's worth adding an assert, or at least a WARNING,
that no LWLocks are still held, at the tail end of ShutdownPostgres?  I don't
want to add an LWLockReleaseAll() yet, before I understand what's actually
happening.
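
Sketched, with the two slot locks as the obvious suspects -- a general
"no LWLocks held" check would need more than the public LWLockHeldByMe():

	/* at the tail of ShutdownPostgres(), sketch only */
	if (LWLockHeldByMe(ReplicationSlotControlLock) ||
		LWLockHeldByMe(ReplicationSlotAllocationLock))
		elog(WARNING, "replication slot LWLock still held at backend shutdown");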


> > We could be more certain if we shut down the cluster in fast rather than
> > immediate mode. So I'm thinking of doing something like
> 
> Does that risk an indefinite hangup of the buildfarm run?

I think not. The pg_ctl stop -m fast should time out after PGCTLTIMEOUT,
$self->_update_pid(-1) should notice it's not dead, and the END{} block
should then shut it down in immediate mode.

Greetings,

Andres Freund




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-02-25 Thread Jacob Champion
On Fri, 2022-02-25 at 16:28 +, Jacob Champion wrote:
> Ha, opr_sanity caught my use of cstring. I'll work on a fix later
> today.

Fixed in v3.

--Jacob
From 2fde60a6bc3739f1894c8c264120e4fa0f04df64 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 14 Feb 2022 08:10:53 -0800
Subject: [PATCH v3] Add API to retrieve authn_id from SQL

The authn_id field in MyProcPort is currently only accessible to the
backend itself.  Add a SQL function, session_authn_id(), to expose the
field to triggers that may want to make use of it.
---
 src/backend/utils/adt/name.c  | 12 +++-
 src/include/catalog/catversion.h  |  2 +-
 src/include/catalog/pg_proc.dat   |  3 +++
 src/test/authentication/t/001_password.pl | 11 +++
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/name.c b/src/backend/utils/adt/name.c
index e8bba3670c..b892d25c29 100644
--- a/src/backend/utils/adt/name.c
+++ b/src/backend/utils/adt/name.c
@@ -23,6 +23,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
+#include "libpq/libpq-be.h"
 #include "libpq/pqformat.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -257,7 +258,7 @@ namestrcmp(Name name, const char *str)
 
 
 /*
- * SQL-functions CURRENT_USER, SESSION_USER
+ * SQL-functions CURRENT_USER, SESSION_USER, SESSION_AUTHN_ID
  */
 Datum
 current_user(PG_FUNCTION_ARGS)
@@ -271,6 +272,15 @@ session_user(PG_FUNCTION_ARGS)
 	PG_RETURN_DATUM(DirectFunctionCall1(namein, CStringGetDatum(GetUserNameFromId(GetSessionUserId(), false;
 }
 
+Datum
+session_authn_id(PG_FUNCTION_ARGS)
+{
+	if (!MyProcPort || !MyProcPort->authn_id)
+		PG_RETURN_NULL();
+
+	PG_RETURN_TEXT_P(cstring_to_text(MyProcPort->authn_id));
+}
+
 
 /*
  * SQL-functions CURRENT_SCHEMA, CURRENT_SCHEMAS
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 1addb568ef..14194afe1c 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*			mmddN */
-#define CATALOG_VERSION_NO	202202221
+#define CATALOG_VERSION_NO	202202251
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7f1ee97f55..3ddcbae55e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1508,6 +1508,9 @@
 { oid => '746', descr => 'session user name',
   proname => 'session_user', provolatile => 's', prorettype => 'name',
   proargtypes => '', prosrc => 'session_user' },
+{ oid => '9774', descr => 'session authenticated identity',
+  proname => 'session_authn_id', provolatile => 's', prorettype => 'text',
+  proargtypes => '', prosrc => 'session_authn_id' },
 
 { oid => '744',
   proname => 'array_eq', prorettype => 'bool',
diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl
index 3e3079c824..2aa28ed547 100644
--- a/src/test/authentication/t/001_password.pl
+++ b/src/test/authentication/t/001_password.pl
@@ -82,6 +82,11 @@ test_role($node, 'scram_role', 'trust', 0,
 test_role($node, 'md5_role', 'trust', 0,
 	log_unlike => [qr/connection authenticated:/]);
 
+my $res = $node->safe_psql('postgres',
+	"SELECT session_authn_id() IS NULL;"
+);
+is($res, 't', "users with trust authentication have NULL authn_id");
+
 # For plain "password" method, all users should also be able to connect.
 reset_pg_hba($node, 'password');
 test_role($node, 'scram_role', 'password', 0,
@@ -91,6 +96,12 @@ test_role($node, 'md5_role', 'password', 0,
 	log_like =>
 	  [qr/connection authenticated: identity="md5_role" method=password/]);
 
+$res = $node->safe_psql('postgres',
+	"SELECT session_authn_id();",
+	connstr  => "user=md5_role"
+);
+is($res, 'md5_role', "users with md5 authentication have authn_id matching role name");
+
 # For "scram-sha-256" method, user "scram_role" should be able to connect.
 reset_pg_hba($node, 'scram-sha-256');
 test_role(
-- 
2.25.1



Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2022-02-25 Thread Nathan Bossart
On Mon, Jan 31, 2022 at 04:25:21PM +0900, Michael Paquier wrote:
> At the end of the day, it may be better to just let this stuff be.
> Another argument for doing nothing is that this could cause
> hard-to-spot conflicts when it comes to back-patch something.

This one has been quiet for a while.  Should we mark it as
returned-with-feedback?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] pg_permissions

2022-02-25 Thread Chapman Flack
I would be happy to review this patch, but a look through the email leaves me
thinking it may still be waiting on a C implementation of pg_get_acl(). Is that
right? And perhaps a view rename to pg_privileges, following Peter's comment?

Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-02-25 Thread Andres Freund
On 2022-02-25 20:19:24 +, Jacob Champion wrote:
> From 2fde60a6bc3739f1894c8c264120e4fa0f04df64 Mon Sep 17 00:00:00 2001
> From: Jacob Champion 
> Date: Mon, 14 Feb 2022 08:10:53 -0800
> Subject: [PATCH v3] Add API to retrieve authn_id from SQL

> The authn_id field in MyProcPort is currently only accessible to the
> backend itself.  Add a SQL function, session_authn_id(), to expose the
> field to triggers that may want to make use of it.

Looks to me like authn_id isn't synchronized to parallel workers right now. So
the function will return the wrong thing when executed as part of a parallel
query.
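
For context, fixing that would presumably mean serializing the field into the
parallel DSM, vaguely like the following -- PARALLEL_KEY_AUTHN_ID and the
placement in InitializeParallelDSM() are made up:

	if (MyProcPort && MyProcPort->authn_id)
	{
		char	   *authn_id_space;

		authn_id_space = shm_toc_allocate(pcxt->toc,
										  strlen(MyProcPort->authn_id) + 1);
		strcpy(authn_id_space, MyProcPort->authn_id);
		shm_toc_insert(pcxt->toc, PARALLEL_KEY_AUTHN_ID, authn_id_space);
	}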

I don't think we should add further functions not prefixed with pg_.


Perhaps a few tests for less trivial authn_ids could be worthwhile?
E.g. certificate DNs.

Greetings,

Andres Freund



