Re: corruption of WAL page header is never reported

2021-07-19 Thread Yugo NAGATA
On Mon, 19 Jul 2021 15:14:41 +0900 (JST)
Kyotaro Horiguchi  wrote:

> Hello.
> 
> At Sun, 18 Jul 2021 04:55:05 +0900, Yugo NAGATA  wrote 
> in 
> > Hello,
> > 
> > I found that any corruption of a WAL page header found during recovery is
> > never reported in log messages. If a WAL page header is broken, it is
> > detected in XLogReaderValidatePageHeader called from XLogPageRead, but the
> > error messages are always reset and never reported.
> 
> Good catch!  Currently recovery stops without showing any reason if it
> is stopped by page-header errors.
> 
> > I attached a patch to fix it in this way.
> 
> However, it is a kind of a roof-over-a-roof.  What we should do is
> just omit the check in XLogPageRead while in standby mode.

Your patch doesn't fix the issue that the error message is never reported in
standby mode. When a WAL page header is broken, the standby would silently 
repeat
retrying forever.

I think we have to let users know the corruption of WAL page header even in
standby mode, not? A corruption of WAL record header is always reported,
by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)
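
For illustration, a minimal sketch of the kind of change being discussed here
(the attached patch is authoritative; this only shows the idea of reporting
the message left in errormsg_buf before XLogPageRead resets it):

	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
	{
		/* Surface the validator's message instead of silently discarding it. */
		if (xlogreader->errormsg_buf[0])
			ereport(emode_for_corrupt_record(emode, EndRecPtr),
					(errmsg_internal("%s", xlogreader->errormsg_buf)));

		/* reset any error XLogReaderValidatePageHeader() might have set */
		xlogreader->errormsg_buf[0] = '\0';
		goto next_record_is_invalid;
	}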


Regards,
Yugo Nagata


-- 
Yugo NAGATA 




Re: Introduce pg_receivewal gzip compression tests

2021-07-19 Thread Michael Paquier
On Fri, Jul 16, 2021 at 02:08:57PM +0900, Michael Paquier wrote:
> This behavior is rather debatable, and it would be more instinctive to
> me to just skip any business related to the pre-padding if compression
> is enabled, at the cost of one extra callback in WalWriteMethod to
> grab the compression level (dir_open_for_write() skips that for
> compression) to allow receivelog.c to handle that.  But at the same
> time few users are going to care about that, as pg_receivewal is most
> likely always run with the same set of options, so complicating this code is
> not really appealing either.

I have chewed on that over the weekend, and skipping the padding logic
if we are in compression mode in open_walfile() makes sense, so
attached is a patch that I'd like to backpatch.

Another advantage of this patch is that the handling of ".gz" is reduced to
one code path instead of four.  That makes the introduction of new
compression methods a bit easier.

A second thing that was really confusing is that the name of the WAL
segment generated in this code path completely ignored the type of
compression.  This led to a confusing error message when failing to
open a segment for writing, where we'd mention a .partial file rather
than a .gz.partial file.  The versions of zlib I used on Windows
looked buggy, so I cannot conclude there, but I am sure that this
should allow bowerbird to handle the test correctly.
--
Michael
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 3952a3f943..7af9009320 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -88,26 +88,29 @@ static bool
 open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 {
 	Walfile*f;
-	char		fn[MAXPGPATH];
+	char	   *fn;
 	ssize_t		size;
 	XLogSegNo	segno;
 
 	XLByteToSeg(startpoint, segno, WalSegSz);
 	XLogFileName(current_walfile_name, stream->timeline, segno, WalSegSz);
 
-	snprintf(fn, sizeof(fn), "%s%s", current_walfile_name,
-			 stream->partial_suffix ? stream->partial_suffix : "");
+	/* Note that this considers the compression used if necessary */
+	fn = stream->walmethod->get_file_name(current_walfile_name,
+		  stream->partial_suffix);
 
 	/*
 	 * When streaming to files, if an existing file exists we verify that it's
 	 * either empty (just created), or a complete WalSegSz segment (in which
 	 * case it has been created and padded). Anything else indicates a corrupt
-	 * file.
+	 * file. Compressed files have no need for padding, so just ignore this
+	 * case.
 	 *
 	 * When streaming to tar, no file with this name will exist before, so we
 	 * never have to verify a size.
 	 */
-	if (stream->walmethod->existsfile(fn))
+	if (stream->walmethod->compression() == 0 &&
+		stream->walmethod->existsfile(fn))
 	{
 		size = stream->walmethod->get_file_size(fn);
 		if (size < 0)
diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl
index 158f7d176f..17fd71a450 100644
--- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl
+++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl
@@ -72,13 +72,11 @@ $primary->command_ok(
 my @partial_wals = glob "$stream_dir/*\.partial";
 is(scalar(@partial_wals), 1, "one partial WAL segment was created");
 
-# Check ZLIB compression if available.  On Windows, some old versions
-# of zlib can cause some instabilities with this test, so disable it
-# for now.
+# Check ZLIB compression if available.
 SKIP:
 {
-	skip "postgres was not built with ZLIB support, or Windows is involved", 5
-	  if (!check_pg_config("#define HAVE_LIBZ 1") || $windows_os);
+	skip "postgres was not built with ZLIB support", 5
+	  if (!check_pg_config("#define HAVE_LIBZ 1"));
 
 	# Generate more WAL worth one completed, compressed, segment.
 	$primary->psql('postgres', 'SELECT pg_switch_wal();');
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index a15bbb20e7..3c0da70a04 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -68,20 +68,32 @@ dir_getlasterror(void)
 	return strerror(errno);
 }
 
+static char *
+dir_get_file_name(const char *pathname, const char *temp_suffix)
+{
+	char *tmppath = pg_malloc0(MAXPGPATH * sizeof(char));
+
+	snprintf(tmppath, MAXPGPATH, "%s%s%s",
+			 pathname, dir_data->compression > 0 ? ".gz" : "",
+			 temp_suffix ? temp_suffix : "");
+
+	return tmppath;
+}
+
 static Walfile
 dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_size)
 {
 	static char tmppath[MAXPGPATH];
+	char	   *filename;
 	int			fd;
 	DirectoryMethodFile *f;
 #ifdef HAVE_LIBZ
 	gzFile		gzfp = NULL;
 #endif
 
-	snprintf(tmppath, sizeof(tmppath), "%s/%s%s%s",
-			 dir_data->basedir, pathname,
-			 dir_data->compression > 0 ? ".gz" : "",
-			 temp_suffix ? temp_suffix : "");
+	filename = dir_get_file_name(pathname, temp_suffix);
+	snprintf(tmppath, sizeof(tmppath), "%s/%s",
+			 dir_data->basedir, filename);
 
 	/*
 	 * Open a file for non-compres

Re: Failure with 004_logrotate in prairiedog

2021-07-19 Thread Kyotaro Horiguchi
At Sun, 18 Jul 2021 12:32:20 +0900, Michael Paquier  wrote 
in 
> Hi all,
> 
> prairiedog has failed in a way that seems a bit obscure to me:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29
> 
> Here are the details of the failure:
> server signaled to rotate log file
> could not read
> "/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles":
> No such file or directory at t/004_logrotate.pl line 78
> ### Stopping node "primary" using mode immediate
> 
> update_metainfo_datafile() creates a temporary file renamed to
> current_logfiles with rename().  It should be atomic, though this
> error points out that this is not the case?  The previous steps of
> this test ensure that current_logfiles should exist.
> 
> We could use some eval blocks in this area, but a non-atomic rename()
> would cause problems in more areas.  Thoughts?

PostgresNode.logrotate() just invokes pg_ctl logrotate, which ends
with triggering log rotation by a signal.

When rotation happens, the metainfo file is first removed and then
re-created. If slurp_file in the metafile-checking loop hits that gap,
it fails with ENOENT.

For non-win32 platforms, the error is identifiable by $!{ENOENT}, but
I'm not sure how we can identify the error for createFile(). $!
doesn't work, and $^E returns a human-readable string in the platform
language.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: alter table set TABLE ACCESS METHOD

2021-07-19 Thread Michael Paquier
On Thu, Jul 15, 2021 at 10:49:23PM -0500, Justin Pryzby wrote:
> Also, there were two redundant checks for multiple SET ACCESS METHOD commands.
> But one of them wasn't hit if the ALTER was setting the current AM due to the
> no-op test.

Yep.

> I think it's better to fail in every case, and not just sometimes (especially
> if we were to use ERRCODE_SYNTAX_ERROR).

Looks rather fine.

-   if (tab->newTableSpace)
+   if (OidIsValid(tab->newTableSpace))
This has no need to be part of this patch.

/*
-* If we have ALTER TABLE  SET TABLESPACE provide a list of
-* tablespaces
+* Complete with list of tablespaces (for SET TABLESPACE) or table AMs (for
+* SET ACCESS METHOD).
 */
+   else if (Matches("ALTER", "TABLE", MatchAny, "SET", "ACCESS", "METHOD"))
+   COMPLETE_WITH_QUERY(Query_for_list_of_table_access_methods);
else if (Matches("ALTER", "TABLE", MatchAny, "SET", "TABLESPACE"))
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
Nit, there is no need to merge both blocks here.  Let's keep them
separate.

+-- negative test
[...]
+-- negative test
Those descriptions could be better, and describe what they prevent
(i.e. no multiple SET ACCESS METHOD subcommands, and not allowed on
partitioned tables).

> I included my 2ndary patch allowing to set the AM of partitioned table, same 
> as
> for a tablespace.

I would suggest not hiding this topic within a thread unrelated to
it, as this is not going to ease the feedback around it.  Let's start
a new thread if you feel this is necessary.

Jeff, you proposed to commit this patch upthread.  Are you planning to
look at that and do so?
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] logical decoding of two-phase transactions

2021-07-19 Thread Peter Smith
On Mon, Jul 19, 2021 at 4:41 PM Amit Kapila  wrote:
>
> On Mon, Jul 19, 2021 at 9:19 AM Peter Smith  wrote:
> >
> > On Mon, Jul 19, 2021 at 12:43 PM Amit Kapila  
> > wrote:
> > >
> > > On Mon, Jul 19, 2021 at 1:55 AM Tom Lane  wrote:
> > > >
> > > > Amit Kapila  writes:
> > > > > Pushed.
> > > >
> > > > I think you'd be way better off making the gid fields be "char *"
> > > > and pstrdup'ing the result of pq_getmsgstring.  Another possibility
> > > > perhaps is to use strlcpy, but I'd only go that way if it's important
> > > > to constrain the received strings to 200 bytes.
> > > >
> > >
> > > I think it is important to constrain length to 200 bytes for this case
> > > as here we receive a prepared transaction identifier which according
> > > to docs [1] has a max length of 200 bytes. Also, in
> > > ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with
> > > 200 as max length to copy prepare transaction identifier. So, I think
> > > it is better to use strlcpy here unless you or Peter feels otherwise.
> > >
> >
> > OK. I have implemented this reported [1] potential buffer overrun
> > using the constraining strlcpy, because the GID limitation of 200
> > bytes is already mentioned in the documentation [2].
> >
>
> This will work but I think it is better to use sizeof of the gid buffer, as we
> are doing in ParseCommitRecord() and ParseAbortRecord(). Tomorrow, if
> due to some unforeseen reason we change the size of the gid buffer to
> be different from GIDSIZE, it will still work seamlessly.
>

Modified as requested. PSA patch v2.
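
For reference, a minimal sketch of the pattern discussed above (the struct
and function names here are only illustrative, not taken from the attached
patch): bound the copy of the received GID by the size of the destination
buffer, as ParseCommitRecord() and ParseAbortRecord() do.

#include "postgres.h"

#include "lib/stringinfo.h"
#include "libpq/pqformat.h"

/* hypothetical container; the real code uses its own structs */
typedef struct PreparedTxnData
{
	char		gid[200];		/* GID is documented as at most 200 bytes */
} PreparedTxnData;

static void
read_gid(StringInfo in, PreparedTxnData *data)
{
	/* strlcpy() bounded by the destination buffer truncates safely */
	strlcpy(data->gid, pq_getmsgstring(in), sizeof(data->gid));
}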

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v2-0001-Fix-potential-buffer-overruns.patch
Description: Binary data


Re: badly calculated width of emoji in psql

2021-07-19 Thread Michael Paquier
On Wed, Jul 07, 2021 at 06:03:34PM +, Jacob Champion wrote:
> I would guess that's the key issue here. If we choose a particular
> width for emoji characters, is there anything keeping a terminal's font
> from doing something different anyway?

I'd say that we are doing our best in guessing what it should be,
then.  One cannot predict how fonts are designed.

> We could also keep the fragments as-is and generate a full interval
> table, like common/unicode_combining_table.h. It looks like there's
> roughly double the number of emoji intervals as combining intervals, so
> hopefully adding a second binary search wouldn't be noticeably slower.

Hmm.  Such things have a cost, and this one sounds costly with a
limited impact.  What do we gain except a better visibility with psql?

> In your opinion, would the current one-line patch proposal make things
> strictly better than they are today, or would it have mixed results?
> I'm wondering how to help this patch move forward for the current
> commitfest, or if we should maybe return with feedback for now.

Based on the following list, it seems to me that [u+1f300,u+0x1faff]
won't capture everything, like the country flags:
http://www.unicode.org/emoji/charts/full-emoji-list.html
--
Michael


signature.asc
Description: PGP signature


Re: Failure with 004_logrotate in prairiedog

2021-07-19 Thread Michael Paquier
On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote:
> When rotation happens, the metainfo file is once removed then
> created. If slurp_file in the metafile-checking loop hits the gap, the
> slurp_file fails with ENOENT.

I can read the following code, as of update_metainfo_datafile():
if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0)
ereport(LOG,
(errcode_for_file_access(),
 errmsg("could not rename file \"%s\" to \"%s\": %m",
LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE)));

This creates a temporary file that gets renamed to current_logfiles.
--
Michael


signature.asc
Description: PGP signature


Re: Toast compression method options

2021-07-19 Thread Dilip Kumar
On Sun, Jul 18, 2021 at 9:15 PM Dilip Kumar  wrote:
>
> On Wed, Jul 14, 2021 at 5:35 PM vignesh C  wrote:
> >
> > On Thu, May 6, 2021 at 7:24 PM Dilip Kumar  wrote
> >
> > The patch does not apply on Head anymore, could you rebase and post a
> > patch. I'm changing the status to "Waiting for Author".
>
> Okay, I will rebase and send it by next week.

I have rebased the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From a3afdc1e5a800662441cc6daab2666fafb6ee6ea Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Mon, 19 Jul 2021 11:21:03 +0530
Subject: [PATCH v2] Toast compression method options

---
 doc/src/sgml/ref/alter_table.sgml |   7 +-
 doc/src/sgml/ref/create_table.sgml|   5 +-
 src/backend/access/brin/brin_tuple.c  |   7 +-
 src/backend/access/common/indextuple.c|   7 +-
 src/backend/access/common/reloptions.c|  64 
 src/backend/access/common/toast_compression.c | 170 +++-
 src/backend/access/common/toast_internals.c   |   6 +-
 src/backend/access/table/toast_helper.c   |   7 +-
 src/backend/bootstrap/bootparse.y |   1 +
 src/backend/catalog/heap.c|  15 +-
 src/backend/catalog/index.c   |  43 -
 src/backend/catalog/toasting.c|   1 +
 src/backend/commands/cluster.c|   1 +
 src/backend/commands/foreigncmds.c|  44 -
 src/backend/commands/tablecmds.c  | 223 +-
 src/backend/nodes/copyfuncs.c |  16 +-
 src/backend/nodes/equalfuncs.c|  14 +-
 src/backend/nodes/outfuncs.c  |  14 +-
 src/backend/parser/gram.y |  28 +++-
 src/backend/parser/parse_utilcmd.c|   3 +-
 src/include/access/toast_compression.h|   8 +-
 src/include/access/toast_helper.h |   2 +
 src/include/access/toast_internals.h  |   2 +-
 src/include/catalog/heap.h|   2 +
 src/include/catalog/pg_attribute.h|   3 +
 src/include/commands/defrem.h |   7 +-
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/parsenodes.h|  15 +-
 src/test/regress/expected/compression.out |  33 
 src/test/regress/expected/compression_1.out   |  36 +
 src/test/regress/expected/misc_sanity.out |   3 +-
 src/test/regress/sql/compression.sql  |  15 ++
 32 files changed, 674 insertions(+), 129 deletions(-)

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 1c7f489..a224ada 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -54,7 +54,7 @@ ALTER TABLE [ IF EXISTS ] name
 ALTER [ COLUMN ] column_name SET ( attribute_option = value [, ... ] )
 ALTER [ COLUMN ] column_name RESET ( attribute_option [, ... ] )
 ALTER [ COLUMN ] column_name SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
-ALTER [ COLUMN ] column_name SET COMPRESSION compression_method
+ALTER [ COLUMN ] column_name SET COMPRESSION compression_method [ WITH (compression_options) ]
 ADD table_constraint [ NOT VALID ]
 ADD table_constraint_using_index
 ALTER CONSTRAINT constraint_name [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
@@ -386,7 +386,7 @@ WITH ( MODULUS numeric_literal, REM
 

 
- SET COMPRESSION compression_method
+ SET COMPRESSION compression_method [ WITH (compression_options) ]
 
 
  
@@ -409,7 +409,8 @@ WITH ( MODULUS numeric_literal, REM
   addition, compression_method
   can be default, which selects the default behavior of
   consulting the  setting
-  at the time of data insertion to determine the method to use.
+  at the time of data insertion to determine the method to use.  The
+  compression options can be specified using WITH
  
 

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 15aed2f..e380f59 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -289,7 +289,7 @@ WITH ( MODULUS numeric_literal, REM

 

-COMPRESSION compression_method
+COMPRESSION compression_method [ WITH (compression_options) ]
 
  
   The COMPRESSION clause sets the compression method
@@ -308,7 +308,8 @@ WITH ( MODULUS numeric_literal, REM
   can be default to explicitly specify the default
   behavior, which is to consult the
setting at the time of
-  data insertion to determine the method to use.
+  data insertion to determine the method to use.  The compression options
+  can be specified using WITH
  
 

diff --git a/src/backend/access/brin/brin_tuple.c b/src/backend/access/brin/brin_tuple.c
index 09e563b..08b9f2c 100644
--- a/src/backend/access/brin/brin_tuple.c
+++ b/src/backend/access/brin/brin_tuple.c
@@ -38,6 +38,7

Re: Update maintenance_work_mem/autovacuum_work_mem to reflect the 1GB limitation with VACUUM

2021-07-19 Thread Masahiko Sawada
On Wed, Jul 7, 2021 at 8:44 PM David Rowley  wrote:
>
> On Sun, 4 Jul 2021 at 22:38, David Rowley  wrote:
> > I could do with a 2nd opinion about if we should just adjust the
> > maximum value for the autovacuum_work_mem GUC to 1GB in master.
> >
> > I'm also not sure if since we'd not backpatch the GUC max value
> > adjustment if we need to document the upper limit in the manual.
>
> I was just looking at this again and I see that GIN indexes are able
> to use more than 1GB of memory during VACUUM.

I think you meant that autovacuums can use more than 1GB of memory
while cleaning up a gin pending list (in ginInsertCleanup()). The
description updated by that commit is not true as of now, as you
pointed out, but IIUC it uses maintenance_work_mem *in addition to* the
same amount of memory used by lazy vacuum. This memory usage seems rather
weird to me. Is it worth considering having gin pending list cleanup
use work_mem instead of maintenance_work_mem in autovacuum cases too,
like btree indexes do? If we do that, the description will become
true, although we might need to update the work_mem section somewhat.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: corruption of WAL page header is never reported

2021-07-19 Thread Kyotaro Horiguchi
At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA  wrote in 
> Your patch doesn't fix the issue that the error message is never reported in
> standby mode. When a WAL page header is broken, the standby would silently 
> repeat
> retrying forever.

Ok, I see your point and agree to that.

> I think we have to let users know the corruption of WAL page header even in
> standby mode, not? A corruption of WAL record header is always reported,
> by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)

However, I'm still of the opinion that we don't need to check that
while in standby mode.

How about the attached?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From fe23fbf51569eb7a305bd9e619e918aec0b87bf7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 19 Jul 2021 14:49:34 +0900
Subject: [PATCH] Don't hide xlog page-header errors while recovery

The commit 0668719801 intended to handle the case of standby mode but
it inadvertently catches the failure also while not in standby mode.
Change the condition so that the check applies only while in standby mode.

In addition to that, it is reasonable to show the error message even in
standby mode. Show the error message if any.

Author: Yugo NAGATA 
Reviewed-by: Kyotaro Horiguchi 
---
 src/backend/access/transam/xlog.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ee9515139..221e7d1f07 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12317,8 +12317,17 @@ retry:
 	 * Validating the page header is cheap enough that doing it twice
 	 * shouldn't be a big deal from a performance point of view.
 	 */
-	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+	if (StandbyMode &&
+		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
+		/*
+		 * in this case we consume this error right now then retry immediately,
+		 * the message is already translated
+		 */
+		if (xlogreader->errormsg_buf[0])
+			ereport(emode_for_corrupt_record(emode, EndRecPtr),
+	(errmsg_internal("%s", xlogreader->errormsg_buf)));
+
 		/* reset any error XLogReaderValidatePageHeader() might have set */
 		xlogreader->errormsg_buf[0] = '\0';
 		goto next_record_is_invalid;
-- 
2.27.0



Re: Skipping logical replication transactions on subscriber side

2021-07-19 Thread Amit Kapila
On Mon, Jul 19, 2021 at 12:10 PM Masahiko Sawada  wrote:
>
> On Sat, Jul 17, 2021 at 12:02 AM Masahiko Sawada  
> wrote:
> >
> > 1. How to clear apply worker errors. IIUC we've discussed that once
> > the apply worker skipped the transaction we leave the error entry
> > itself but clear its fields except for some fields such as failure
> > counts. But given that the stats messages could be lost, how can we
> > ensure to clear those error details? For table sync workers’ error, we
> > can have autovacuum workers periodically check entires of
> > pg_subscription_rel and clear the error entry if the table sync worker
> > completes table sync (i.g., checking if srsubstate = ‘r’). But there
> > is no such information for the apply workers and subscriptions. In
> > addition to sending the message clearing the error details just after
> > skipping the transaction, I thought that we can have apply workers
> > periodically send the message clearing the error details but it seems
> > not good.
>
> I think that the motivation behind the idea of leaving error entries
> and clearing some of their fields is that users can check if the error
> is successfully resolved and the worker is working fine. But we can
> check it also in another way, for example, checking
> pg_stat_subscription view. So is it worth considering leaving the
> apply worker errors as they are?
>

I think so. Basically, we will send the clear message after skipping
the xact, but I think it is fine if that message is lost. At worst, it
will be displayed as the last error details. If there is another error
it will be overwritten, or probably we should have a function *_reset()
which allows the user to reset a particular subscription's error info.

-- 
With Regards,
Amit Kapila.




Re: corruption of WAL page header is never reported

2021-07-19 Thread Kyotaro Horiguchi
me> However, I'm still of the opinion that we don't need to check that
me> while in standby mode.

Of course it's a typo for "while not in standby mode".

sorry..

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] logical decoding of two-phase transactions

2021-07-19 Thread Amit Kapila
On Mon, Jul 19, 2021 at 1:00 PM Peter Smith  wrote:
>
> On Mon, Jul 19, 2021 at 4:41 PM Amit Kapila  wrote:
> >
> > >
> > > OK. I have implemented this reported [1] potential buffer overrun
> > > using the constraining strlcpy, because the GID limitation of 200
> > > bytes is already mentioned in the documentation [2].
> > >
> >
> > This will work but I think it is better to use sizeof of the gid buffer, as we
> > are doing in ParseCommitRecord() and ParseAbortRecord(). Tomorrow, if
> > due to some unforeseen reason we change the size of the gid buffer to
> > be different from GIDSIZE, it will still work seamlessly.
> >
>
> Modified as requested. PSA patch v2.
>

LGTM. I'll push this tomorrow unless Tom or someone else has any comments.

-- 
With Regards,
Amit Kapila.




Re: Why ALTER SUBSCRIPTION ... SET (slot_name='none') requires subscription disabled?

2021-07-19 Thread Amit Kapila
On Fri, Jul 16, 2021 at 2:12 PM Japin Li  wrote:
>
>
> On Fri, 16 Jul 2021 at 14:06, Amit Kapila  wrote:
> > On Fri, Jul 9, 2021 at 8:20 AM Japin Li  wrote:
> >>
> >> On Thu, 08 Jul 2021 at 18:17, Amit Kapila  wrote:
> >> > On Thu, Jul 8, 2021 at 3:43 PM Japin Li  wrote:
> >>
> >> Please consider reviewing the v3 patch. v3-0001 adds slot_name verification in
> >> parse_subscription_options() and comments on why we need to disable the
> >> subscription when setting slot_name to NONE.
> >>
> >
> > I think we should back-patch this bug-fix down to v10, where it was introduced,
> > and update the comments only in HEAD. So, accordingly, I moved the
> > changes into two patches and changed the comments a bit. Can you
> > please test the first patch in back-branches? I'll also do it
> > separately.
> >
>
> I tried to back-patch from v10 stable to v14 stable, and attached two new patches:
> one for PG10 & PG11 stable, and the other for PG12 to PG14 stable.
> The v4 patch can be applied on HEAD. This modification looks good to me.
>

The patch you prepared for v14 was not getting applied cleanly, so I
did the required modifications and then pushed.

> How do we back-patch to back-branches? I tried to use cherry-pick, but it
> doesn't always work (without a doubt, it might be due to some differences
> between branches).
>

Yeah, we need to adjust the patch as per the back-branches code.

-- 
With Regards,
Amit Kapila.




Re: Added schema level support for publication.

2021-07-19 Thread Greg Nancarrow
On Fri, Jul 16, 2021 at 8:13 PM vignesh C  wrote:
>
> Modified.
>
> Thanks for the comments, these issues are fixed as part of the v12 patch 
> posted at [1].
> [1]  - 
> https://www.postgresql.org/message-id/CALDaNm3V9ny5dJM8nofLGJ3zDuDG0gS2dX%2BAhDph--U5y%2B4VbQ%40mail.gmail.com
>

There seems to be a problem with ALTER PUBLICATION ... SET TABLE ...
After that command, it still regards it as an empty (e) publication,
so I can then ALTER PUBLICATION ... ADD SCHEMA ...

e.g.

test_pub=# create schema myschema;
CREATE SCHEMA
test_pub=# CREATE TABLE myschema.test (key int, value text, data jsonb);
CREATE TABLE
test_pub=# create publication pub1;
CREATE PUBLICATION
test_pub=# \dRp+ pub1
                                Publication pub1
 Owner | All tables | Inserts | Updates | Deletes | Truncates | Via root | Pubtype
-------+------------+---------+---------+---------+-----------+----------+---------
 gregn | f          | t       | t       | t       | t         | f        | e
(1 row)

test_pub=# alter publication pub1 set table myschema.test;
ALTER PUBLICATION
test_pub=# \dRp+ pub1
                                Publication pub1
 Owner | All tables | Inserts | Updates | Deletes | Truncates | Via root | Pubtype
-------+------------+---------+---------+---------+-----------+----------+---------
 gregn | f          | t       | t       | t       | t         | f        | e
(1 row)

test_pub=# alter publication pub1 add schema myschema;
ALTER PUBLICATION
test_pub=# \dRp+ pub1
                                Publication pub1
 Owner | All tables | Inserts | Updates | Deletes | Truncates | Via root | Pubtype
-------+------------+---------+---------+---------+-----------+----------+---------
 gregn | f          | t       | t       | t       | t         | f        | s
Schemas:
    "myschema"


Regards,
Greg Nancarrow
Fujitsu Australia




Re: corruption of WAL page header is never reported

2021-07-19 Thread Yugo NAGATA
On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
Kyotaro Horiguchi  wrote:

> At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA  wrote 
> in 
> > Your patch doesn't fix the issue that the error message is never reported in
> > standby mode. When a WAL page header is broken, the standby would silently 
> > repeat
> > retrying forever.
> 
> Ok, I see your point and agree to that.
> 
> > I think we have to let users know the corruption of WAL page header even in
> > standby mode, not? A corruption of WAL record header is always reported,
> > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)
> 
> However, I'm still of the opinion that we don't need to check that
> while in standby mode.
> 
> How about the attached?

On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
Kyotaro Horiguchi  wrote:

> me> However, I'm still of the opinion that we don't need to check that
> me> while in standby mode.
> 
> Of course it's a typo for "while not in standby mode".

Thanks for updating the patch. I agree with you.

I think it is nice to fix it to perform the check only during standby mode,
because it makes it a bit clearer why we check it immediately in XLogPageRead.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: corruption of WAL page header is never reported

2021-07-19 Thread Ranier Vilela
On Mon, 19 Jul 2021 at 06:15, Yugo NAGATA
wrote:

> On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
> Kyotaro Horiguchi  wrote:
>
> > At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA 
> wrote in
> > > Your patch doesn't fix the issue that the error message is never
> reported in
> > > standby mode. When a WAL page header is broken, the standby would
> silently repeat
> > > retrying forever.
> >
> > Ok, I see your point and agree to that.
> >
> > > I think we have to let users know the corruption of WAL page header
> even in
> > > standby mode, not? A corruption of WAL record header is always
> reported,
> > > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)
> >
> > However, I'm still of the opinion that we don't need to check that
> > while in standby mode.
> >
> > How about the attached?
>
> On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
> Kyotaro Horiguchi  wrote:
>
> > me> However, I'm still of the opinion that we don't need to check that
> > me> while in standby mode.
> >
> > Of course it's a typo for "while not in standby mode".
>
> Thanks for updating the patch. I agree with you.
>
> I think it is nice to fix it to perform the check only during standby mode,
> because it makes it a bit clearer why we check it immediately in XLogPageRead.
>
And as I had reviewed, your first patch was wrong, and now with the Kyotaro
version, to keep the same behavior, it is necessary to reset the error,
correct?

regards,
Ranier Vilela


Re: Replace remaining castNode(…, lfirst(…)) and friends calls with l*_node()

2021-07-19 Thread Dagfinn Ilmari Mannsåker
Peter Eisentraut  writes:

> On 15.07.21 13:12, Dagfinn Ilmari Mannsåker wrote:
>> ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
>>
>>> Hi hackers,
>>>
>>> We've had node-casting versions of the list extraction macros since
>>> 2017, but several cases of the written-out version have been added since
>>> then (even by Tom, who added the l*_node() macros).
>>>
>>> Here's a patch to convert the remaining ones.  The macros were
>>> back-patched to all supported branches, so this won't create any
>>> new back-patching hazards.
>>
>> Added to the 2021-09 commit fest: https://commitfest.postgresql.org/34/3253/
>
> committed

Thanks!

- ilmari




Re: row filtering for logical replication

2021-07-19 Thread Amit Kapila
On Sat, Jul 17, 2021 at 3:05 AM Alvaro Herrera  wrote:
>
> On 2021-Jul-16, Greg Nancarrow wrote:
>
> > On Fri, Jul 16, 2021 at 3:50 PM Amit Kapila  wrote:
> > >
> > > I am not so sure about different filters for old and new rows but it
> > > makes sense to by default apply the filter to both old and new rows.
> > > Then also provide a way for user to specify if the filter can be
> > > specified to just old or new row.
> >
> > I'm having some doubts and concerns about what is being suggested.
>
> Yeah.  I think the idea that some updates fail to reach the replica,
> leaving the downstream database in a different state than it would be if
> those updates had reached it, is unsettling.  It makes me wish we raised
> an error at UPDATE time if both rows would not pass the filter test in
> the same way -- that is, if the old row passes the filter, then the new
> row must be a pass as well.
>

Hmm, do you mean to say that we raise an error in walsender while
decoding if the old or new row doesn't match the filter clause? How would
walsender come out of that error? Even if, on seeing the error, the user
changed the filter clause for the publication, I think it would still see
the old one due to the historical snapshot and keep on getting the same
error. One idea could be that we use the current snapshot to read the
publications catalog table; then the user would probably change the
filter or do something to move forward from this error. The other
options could be:

a. Just log it and move to the next row
b. send to stats collector some info about this which can be displayed
in a view and then move ahead
c. just skip it like any other row that doesn't match the filter clause.

I am not sure if there is any use in sending a row if one of the
old/new rows doesn't match the filter. Because if the old row doesn't
match but the new one matches the criteria, we will anyway just throw
such a row away on the subscriber instead of applying it. OTOH, if the old
row matches but the new one doesn't, then it probably doesn't fit the analogy
that new rows should behave similarly to Inserts. I am of the opinion that
we should do either (a) or (c) when one of the old or new rows doesn't
match the filter clause.

> Maybe a second option is to have replication change any UPDATE into
> either an INSERT or a DELETE, if the old or the new row do not pass the
> filter, respectively.  That way, the databases would remain consistent.
>

I guess such things should be handled via conflict resolution on the
subscriber side.

-- 
With Regards,
Amit Kapila.




postgresql.conf.sample missing units

2021-07-19 Thread Pavel Luzanov

Hello,

I found that the start section of the postgresql.conf file is missing a 
description of two units: bytes (appeared in version 11) and 
microseconds (in version 12).


The attached patch makes these changes to the postgresql.conf.sample file.

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ccaaf63850..b242a7fc8b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -24,7 +24,8 @@
 # "postgres -c log_connections=on".  Some parameters can be changed at run time
 # with the "SET" SQL command.
 #
-# Memory units:  kB = kilobytes        Time units:  ms  = milliseconds
+# Memory units:  B  = bytes            Time units:  us  = microseconds
+#                kB = kilobytes                     ms  = milliseconds
 #                MB = megabytes                     s   = seconds
 #                GB = gigabytes                     min = minutes
 #                TB = terabytes                     h   = hours


Re: corruption of WAL page header is never reported

2021-07-19 Thread Yugo NAGATA
On Mon, 19 Jul 2021 06:32:28 -0300
Ranier Vilela  wrote:

> On Mon, 19 Jul 2021 at 06:15, Yugo NAGATA
> wrote:
> 
> > On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
> > Kyotaro Horiguchi  wrote:
> >
> > > At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA 
> > wrote in
> > > > Your patch doesn't fix the issue that the error message is never
> > reported in
> > > > standby mode. When a WAL page header is broken, the standby would
> > silently repeat
> > > > retrying forever.
> > >
> > > Ok, I see your point and agree to that.
> > >
> > > > I think we have to let users know the corruption of WAL page header
> > even in
> > > > standby mode, not? A corruption of WAL record header is always
> > reported,
> > > > by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)
> > >
> > > However, I'm still of the opinion that we don't need to check that
> > > while in standby mode.
> > >
> > > How about the attached?
> >
> > On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
> > Kyotaro Horiguchi  wrote:
> >
> > > me> However, I'm still of the opinion that we don't need to check that
> > > me> while in standby mode.
> > >
> > > Of course it's a typo for "while not in standby mode".
> >
> > Thanks for updating the patch. I agree with you.
> >
> > I think it is nice to fix it to perform the check only during standby mode,
> > because it makes it a bit clearer why we check it immediately in XLogPageRead.
> >
> And as I had reviewed, your first patch was wrong, and now with the Kyotaro
> version, to keep the same behavior, it is necessary to reset the error,
> correct?

Well, I think my first patch was not wrong. The difference from the latest
patch is just whether or not we perform the additional check when we are not
in standby mode. The behavior is basically the same, although which function
detects and prints the page-header error in cases of crash recovery is
different. On the other hand, in your patch, the error message was always
omitted in cases of crash recovery, and that seemed wrong to me.

Regards,
Yugo Nagata


-- 
Yugo NAGATA 




Re: badly calculated width of emoji in psql

2021-07-19 Thread Pavel Stehule
On Mon, 19 Jul 2021 at 9:46, Michael Paquier
wrote:

> On Wed, Jul 07, 2021 at 06:03:34PM +, Jacob Champion wrote:
> > I would guess that's the key issue here. If we choose a particular
> > width for emoji characters, is there anything keeping a terminal's font
> > from doing something different anyway?
>
> I'd say that we are doing our best in guessing what it should be,
> then.  One cannot predict how fonts are designed.
>
> > We could also keep the fragments as-is and generate a full interval
> > table, like common/unicode_combining_table.h. It looks like there's
> > roughly double the number of emoji intervals as combining intervals, so
> > hopefully adding a second binary search wouldn't be noticeably slower.
>
> Hmm.  Such things have a cost, and this one sounds costly with a
> limited impact.  What do we gain except a better visibility with psql?
>

The benefit is correct display. I checked the impact on the server side, and
ucs_wcwidth is used just for calculating the error position. Any other usage
is only in psql.

Moreover, I checked the unicode ranges, and I think that for common languages
the performance impact should be zero (because they typically use ucs <
0x1100). The possible (but very low) impact would be for some historic
languages or special symbols. It has no impact on ranges that currently return
display width 2, because the new range is at the end of the list.
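
For illustration, a minimal sketch of the interval-table idea mentioned
upthread (a second binary search next to the combining-character lookup); the
two ranges and all names below are only examples, not the proposed patch:

#include <stdbool.h>

typedef struct
{
	unsigned int first;
	unsigned int last;
} codepoint_interval;

/* would be generated from the Unicode emoji data files, kept sorted */
static const codepoint_interval emoji_intervals[] = {
	{0x1F300, 0x1F5FF},
	{0x1F900, 0x1FAFF}
};

/* binary search: is ucs inside one of the intervals? */
static bool
ucs_in_intervals(unsigned int ucs, const codepoint_interval *table, int n)
{
	int			lo = 0;
	int			hi = n - 1;

	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (ucs < table[mid].first)
			hi = mid - 1;
		else if (ucs > table[mid].last)
			lo = mid + 1;
		else
			return true;		/* emoji block: treat as display width 2 */
	}
	return false;
}

A caller would then decide the width with something like
ucs_in_intervals(ucs, emoji_intervals,
sizeof(emoji_intervals) / sizeof(emoji_intervals[0])) ? 2 : 1.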

I am not sure how widely PQdsplen is used outside psql, but I have no
reason to think it is not used, since developers will prefer this function
over built-in functionality in any development environment that supports
unicode. So in this case I have a strong opinion that we should prefer
correctness of the result over the current speed (note: I have experience
from pspg development, where this operation is really on the critical path,
and I tried some micro-optimizations without much effect - on a very big,
unusual result (very wide, very long, 100K rows) the difference was about
500 ms (on the pager side, which does nothing other than string operations
at that moment)).

Regards

Pavel

>
> > In your opinion, would the current one-line patch proposal make things
> > strictly better than they are today, or would it have mixed results?
> > I'm wondering how to help this patch move forward for the current
> > commitfest, or if we should maybe return with feedback for now.
>
> Based on the following list, it seems to me that [u+1f300,u+0x1faff]
> won't capture everything, like the country flags:
> http://www.unicode.org/emoji/charts/full-emoji-list.html
> --
> Michael
>


Re: Minimal logical decoding on standbys

2021-07-19 Thread Ibrar Ahmed
On Fri, Jul 16, 2021 at 1:07 PM Drouvot, Bertrand 
wrote:

> Hi Andres,
>
> On 6/22/21 12:38 PM, Drouvot, Bertrand wrote:
> > Hi Andres,
> >
> > On 6/14/21 7:41 AM, Drouvot, Bertrand wrote:
> >> Hi Andres,
> >>
> >> On 4/8/21 5:47 AM, Andres Freund wrote:
> >>> Hi,
> >>>
> >>> On 2021-04-07 13:32:18 -0700, Andres Freund wrote:
>  While working on this I found a, somewhat substantial, issue:
> 
>  When the primary is idle, on the standby logical decoding via
>  walsender
>  will typically not process the records until further WAL writes
>  come in
>  from the primary, or until a 10s lapsed.
> 
>  The problem is that WalSndWaitForWal() waits for the *replay* LSN to
>  increase, but gets woken up by walreceiver when new WAL has been
>  flushed. Which means that typically walsenders will get woken up at
>  the
>  same time that the startup process will be - which means that by the
>  time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
>  that the startup process already replayed the record and updated
>  XLogCtl->lastReplayedEndRecPtr.
> 
>  I think fixing this would require too invasive changes at this
>  point. I
>  think we might be able to live with 10s delay issue for one
>  release, but
>  it sure is ugly :(.
> >>> This is indeed pretty painful. It's a lot more regularly occurring if you
> >>> either have a slow disk, or you switch around the order of
> >>> WakeupRecovery() and WalSndWakeup() in XLogWalRcvFlush().
> >>>
> >>> - There's an issue about which timeline to use. If you use pg_recvlogical
> >>>   and you restart the server, you'll see errors like:
> >>>
> >>>pg_recvlogical: error: unexpected termination of replication
> >>> stream: ERROR:  requested WAL segment 0003 has
> >>> already been removed
> >>>
> >>>the real filename is 00010003 - i.e. the timeline is
> >>>0.
> >>>
> >>>This isn't too hard to fix, but definitely needs fixing.
> >>
> >> Thanks, nice catch!
> >>
> >> From what I have seen, we are not going through InitXLOGAccess() on a
> >> Standby and in some cases (like the one you mentioned)
> >> StartLogicalReplication() is called without IdentifySystem() being
> >> called previously: this leads to ThisTimeLineID still being set to 0.
> >>
> >> I am proposing a fix in the attached v18 by adding a check in
> >> StartLogicalReplication() and ensuring that ThisTimeLineID is retrieved.
> >>
> >>>
> >>> - ResolveRecoveryConflictWithLogicalSlots() is racy - potentially
> >>>leading us to drop a slot that has been created since we signalled a
> >>>recovery conflict.  See
> >>>
> https://www.postgresql.org/message-id/20210408020913.zzprrlvqyvlt5cyy%40alap3.anarazel.de
> >>>
> >>>for some very similar issues.
> >>
> >> I have rewritten this part by following the same logic as the one
> >> used in 96540f80f8 (the commit linked to the thread you mentioned).
> >>
> >>>
> >>> - Given the precedent of max_slot_wal_keep_size, I think it's wrong to
> >>>just drop the logical slots. Instead we should just mark them as
> >>>invalid, like InvalidateObsoleteReplicationSlots().
> >>
> >> Makes fully sense and done that way in the attached patch.
> >>
> >> I am setting the slot's data.xmin and data.catalog_xmin as
> >> InvalidTransactionId to mark the slot(s) as invalid in case of conflict.
> >>
> >>> - There's no tests covering timeline switches, what happens if
> >>> there's a
> >>>promotion if logical decoding is currently ongoing.
> >>
> >> I'll now work on the tests.
> >>
> >>>
> >>> - The way ResolveRecoveryConflictWithLogicalSlots() builds the error
> >>>message is not good (and I've complained about it before...).
> >>
> >> I changed it and made it more simple.
> >>
> >> I also removed the details around mentioning xmin or catalog xmin (as
> >> I am not sure of the added value and they are currently also not
> >> mentioned during standby recovery snapshot conflict).
> >>
> >>>
> >>> Unfortunately I think the things I have found are too many for me to
> >>> address within the given time. I'll send a version with a somewhat
> >>> polished set of the changes I made in the next few days...
> >>
> >> Thanks for the review and feedback.
> >>
> >> Please find enclosed v18 with the changes I worked on.
> >>
> >> I still need to have a look on the tests.
> >
> > Please find enclosed v19 that also contains the changes related to
> > your TAP tests remarks, mainly:
> >
> > - get rid of 024 and add more tests in 026 (025 has been used in the
> > meantime)
> >
> > - test that logical decoding actually produces useful and correct results
> >
> > - test standby promotion and logical decoding behavior once done
> >
> > - useless "use" removal
> >
> > - check_confl_logicalslot() function removal
> >
> > - rewrote make_slot_active() to make use of poll_query_until() and
> > timeout
> >
> > - remove the useless eval()
> >
> > - remove the "C

Re: Re[3]: On login trigger: take three

2021-07-19 Thread Ibrar Ahmed
On Wed, Jul 7, 2021 at 5:55 AM Greg Nancarrow  wrote:

> On Sun, Jul 4, 2021 at 1:21 PM vignesh C  wrote:
> >
> > CFBot shows the following failure:
> > # poll_query_until timed out executing this query:
> > # SELECT '0/3046250' <= replay_lsn AND state = 'streaming' FROM
> > pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
> > # expecting this output:
> > # t
> > # last actual query output:
> > # t
> > # with stderr:
> > # NOTICE: You are welcome!
> > # Looks like your test exited with 29 before it could output anything.
> > t/001_stream_rep.pl ..
> > Dubious, test returned 29 (wstat 7424, 0x1d00)
> >
>
> Thanks.
> I found that the patch was broken by commit f452aaf7d (the part
> "adjust poll_query_until to insist on empty stderr as well as a stdout
> match").
> So I had to remove a "RAISE NOTICE" (which was just an informational
> message) from the login trigger function, to satisfy the new
> poll_query_until expectations.
> Also, I updated a PG14 version check (now must check PG15 version).
>
> Regards,
> Greg Nancarrow
> Fujitsu Australia
>


The patch does not apply, and a rebase is required.

Hunk #1 FAILED at 93.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/sysviews.out.rej


I am not changing the status because it is a minor change and the
patch is already "Ready for Committer".


-- 
Ibrar Ahmed


Re: Column Filtering in Logical Replication

2021-07-19 Thread Ibrar Ahmed
On Tue, Jul 13, 2021 at 7:44 PM Rahila Syed  wrote:

>
> Hi Tomas,
>
> Thank you for your comments.
>
>
>>
>> >
>> > Currently, this capability is not included in the patch. If the table on
>> > the subscriber
>> > server has lesser attributes than that on the publisher server, it
>> > throws an error at the
>> > time of CREATE SUBSCRIPTION.
>> >
>>
>> That's a bit surprising, to be honest. I do understand the patch simply
>> treats the filtered columns as "unchanged" because that's the simplest
>> way to filter the *data* of the columns. But if someone told me we can
>> "filter columns" I'd expect this to work without the columns on the
>> subscriber.
>>
>> OK, I will look into adding this.
>
>
>>
>> > However, need to carefully consider situations in which a server
>> > subscribes to multiple
>> > publications,  each publishing a different subset of columns of a
>> table.
>
> Isn't that pretty much the same situation as for multiple subscriptions
>> each with a different set of I/U/D operations? IIRC we simply merge
>> those, so why not to do the same thing here and merge the attributes?
>>
>>
> Yeah, I agree with the solution to merge the attributes, similar to how
> operations are merged. My concern, from an implementation point of view,
> was also whether it will be a very drastic change. I have now had a look at
> how remote relation attributes are acquired for comparison with local
> attributes at the subscriber.
> It seems that the publisher will need to send the information about the
> filtered columns
> for each publication specified during CREATE SUBSCRIPTION.
> This will be read at the subscriber side which in turn updates its cache
> accordingly.
> Currently, the subscriber expects all attributes of a published relation
> to be present.
> I will add code for this in the next version of the patch.
>
>  To nitpick, I find "Bitmapset *att_list" a bit annoying, because it's
>
> not really a list ;-)
>
>
> I will make this change with the next version
>
>
>
>>  FWIW "make check" fails for me with this version, due to segfault in
>> OpenTableLists. Apparenly there's some confusion - the code expects the
>> list to contain PublicationTable nodes, and tries to extract the
>> RangeVar from the elements. But the list actually contains RangeVar, so
>> this crashes and burns. See the attached backtrace.
>>
>>
> Thank you for the report. This is fixed in the attached version; now all
> publication function calls accept the PublicationTableInfo list.
>
> Thank you,
> Rahila Syed
>
>
>

The patch does not apply, and a rebase is required.

Hunk #8 succeeded at 1259 (offset 99 lines).
Hunk #9 succeeded at 1360 (offset 99 lines).
1 out of 9 hunks FAILED -- saving rejects to file
src/backend/replication/pgoutput/pgoutput.c.rej
patching file src/include/catalog/pg_publication.h


Changing the status to "Waiting on Author"

-- 
Ibrar Ahmed


Re: shared-memory based stats collector

2021-07-19 Thread Ibrar Ahmed
On Wed, Apr 7, 2021 at 8:05 AM Kyotaro Horiguchi 
wrote:

> At Tue, 6 Apr 2021 09:32:16 -0700, Andres Freund 
> wrote in
> > Hi,
> >
> > On 2021-04-05 02:29:14 -0700, Andres Freund wrote:
> ..
> > I'm inclined to push patches
> > [PATCH v60 05/17] pgstat: split bgwriter and checkpointer messages.
> > [PATCH v60 06/17] pgstat: Split out relation stats handling from
> AtEO[Sub]Xact_PgStat() etc.
> > [PATCH v60 09/17] pgstat: xact level cleanups / consolidation.
> > [PATCH v60 10/17] pgstat: Split different types of stats into separate
> files.
> > [PATCH v60 12/17] pgstat: reorder file pgstat.c / pgstat.h contents.
>
> FWIW..
>
> 05 is a straight forward code-rearrange and reasonable to apply.
>
> 06 is same as above and it seems to make things cleaner.
>
> 09 mainly adds ensure_tabtat_xact_level() to remove repeated code
>   blocks in a straightforward way. I wonder if
>   pgstat_xact_stack_level_get() might better be
>   pgstat_get_xact_stack_level(), but I'm fine with the name in the
>   patch.
>
> 10 I found that the kind in "pgstat_kind" meant the placeholder for
>   specific types.  It looks good to separate them into smaller pieces.
>   It is also a simple rearrangement of code.
>
> > pgstat.c is very long, and it's hard to find an order that makes sense
> > and is likely to be maintained over time. Splitting the different
>
>   I deeply agree to "hard to find an order that makes sense".
>
> 12 I'm not sure how it looks after this patch (I failed to apply 09 on
>   my side), but it is also a simple rearrangement of code blocks.
>
> > to v14. They're just moving things around, so are fairly low risk. But
> > they're going to be a pain to maintain. And I think 10 and 12 make
> > pgstat.c a lot easier to understand.
>
> I think that pgstat.c doesn't get frequent back-patching.  It seems to
> me that at least 10 looks good.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
>
>
The patch does not apply, and requires a rebase.

1 out of 8 hunks FAILED -- saving rejects to file src/include/pgstat.h.rej
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 8758 (offset 34 lines).
patching file src/backend/postmaster/checkpointer.c
Hunk #3 succeeded at 496 with fuzz 1.
Hunk #4 FAILED at 576.
1 out of 6 hunks FAILED -- saving rejects to file
src/backend/postmaster/checkpointer.c.rej
patching file src/backend/postmaster/pgstat.c


I am changing the status to "Waiting on Author".

-- 
Ibrar Ahmed


Re: Improve join selectivity estimation using extended statistics

2021-07-19 Thread Ibrar Ahmed
On Mon, Mar 15, 2021 at 8:42 PM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 11.03.2021 03:47, Tomas Vondra wrote:
> > Hi Konstantin,
> >
> > Thanks for working on this! Using extended statistics to improve join
> > cardinality estimates was definitely on my radar, and this patch seems
> > like a good start.
> >
> > I had two basic ideas about how we might improve join estimates:
> >
> > (a) use per-table extended statistics to estimate join conditions
> >
> > (b) invent multi-table extended statistics (requires inventing how to
> > sample the tables in a coordinated way, etc.)
> >
> > This patch aims to do (a) which is perfectly reasonable - I think we can
> > achieve significant improvements this way. I have some ideas about (b),
> > but it seems harder and for a separate thread/patch.
> >
> >
> > The patch includes some *very* interesting ideas, but I think it's does
> > them too late and at the wrong level of abstraction. I mean that:
> >
> > 1) I don't think the code in clausesel.c should deal with extended
> > statistics directly - it requires far too much knowledge about different
> > types of extended stats, what clauses are supported by them, etc.
> > Allowing stats on expressions will make this even worse.
> >
> > Better do that in extended_stats.c, like statext_clauselist_selectivity.
> >
> > 2) in clauselist_selectivity_ext, functional dependencies are applied in
> > the part that processes remaining clauses, not estimated using extended
> > statistics. That seems a bit confusing, and I suspect it may lead to
> > issues - for example, it only processes the clauses incrementally, in a
> > particular order. That probably affects the  result, because it affects
> > which functional dependencies we can apply.
> >
> > In the example query that's not an issue, because it only has two Vars,
> > so it either can't apply anything (with one Var) or it can apply
> > everything (with two Vars). But with 3 or more Vars the order would
> > certainly matter, so it's problematic.
> >
> >
> > Moreover, it seems a bit strange that this considers dependencies only
> > on the inner relation. Can't that lead to issues with different join
> > orders producing different cardinality estimates?
> >
> >
> > I think a better approach would be to either modify the existing block
> > dealing with extended stats for a single relation to also handle join
> > conditions. Or perhaps we should invent a separate block, dealing with
> > *pairs* of relations? And it should deal with *all* join clauses for
> > that pair of relations at once, not one by one.
> >
> > As for the exact implementation, I'd imagine we call overall logic to be
> > something like (for clauses on two joined relations):
> >
> > - pick a subset of clauses with the same type of extended statistics on
> > both sides (MCV, ndistinct, ...), repeat until we can't apply more
> > statistics
> >
> > - estimate remaining clauses either using functional dependencies or in
> > the regular (old) way
> >
> >
> > As for how to use other types of extended statistics, I think eqjoinsel
> > could serve as an inspiration. We should look for an MCV list and
> > ndistinct stats on both sides of the join (possibly on some subset of
> > clauses), and then do the same thing eqjoinsel does, just with multiple
> > columns.
> >
> > Note: I'm not sure what to do when we find the stats only on one side.
> > Perhaps we should assume the other side does not have correlations and
> > use per-column statistics (seems reasonable), or maybe just not apply
> > anything (seems a bit too harsh).
> >
> > Anyway, if there are some non-estimated clauses, we could try applying
> > functional dependencies similarly to what this patch does. It's also
> > consistent with statext_clauselist_selectivity - that also tries to
> > apply MCV lists first, and only then we try functional dependencies.
> >
> >
> > BTW, should this still rely on oprrest (e.g. F_EQSEL). That's the
> > selectivity function for restriction (non-join) clauses, so maybe we
> > should be looking at oprjoin when dealing with joins? Not sure.
> >
> >
> > One bit that I find *very* interesting is the calc_joinrel_size_estimate
> > part, with this comment:
> >
> >/*
> > * Try to take in account functional dependencies between attributes
> > * of clauses pushed-down to joined relations and retstrictlist
> > * clause. Right now we consider only case of restrictlist consists of
> > * one clause.
> > */
> >
> > If I understand the comment and the code after it, it essentially tries
> > to apply extended statistics from both the join clauses and filters at
> > the relation level. That is, with a query like
> >
> >  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10
> >
> > we would be looking at statistics on t1(a,b), because we're interested
> > in estimating conditional probability distribution
> >
> > P(t1.a = a? | t1.b = 10)
> >
> > I think that's extremely interesting and powerfu

Re: row filtering for logical replication

2021-07-19 Thread Dilip Kumar
On Mon, Jul 19, 2021 at 3:12 PM Amit Kapila  wrote:
> a. Just log it and move to the next row
> b. send to stats collector some info about this which can be displayed
> in a view and then move ahead
> c. just skip it like any other row that doesn't match the filter clause.
>
> I am not sure if there is any use of sending a row if one of the
> old/new rows doesn't match the filter. Because if the old row doesn't
> match but the new one matches the criteria, we will anyway just throw
> such a row on the subscriber instead of applying it.

But at some point that will be true even if we skip the row based on
(a) or (c), right?  Suppose the OLD row did not satisfy the condition
but the NEW row does; even if we skip this operation, then in the next
operation on the same row, even if both the OLD and NEW rows satisfy
the filter, the operation will just be dropped by the subscriber,
because we did not send the row when it was first updated to a value
that satisfied the condition.  So basically, once a row is inserted
that did not satisfy the condition, then no matter how many updates we
do to that row, either it will be skipped by the publisher because the
OLD row did not satisfy the condition, or it will be skipped by the
subscriber as there is no matching row.

> > Maybe a second option is to have replication change any UPDATE into
> > either an INSERT or a DELETE, if the old or the new row do not pass the
> > filter, respectively.  That way, the databases would remain consistent.

Yeah, I think this is the best way to keep the data consistent.
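
To make that suggestion concrete, the decision table could look roughly like
the sketch below (the enum and helper are hypothetical, not taken from any
posted patch):

#include <stdbool.h>

typedef enum { RF_SKIP, RF_INSERT, RF_UPDATE, RF_DELETE } RowFilterAction;

/*
 * Hypothetical helper: decide what to publish for an UPDATE, given whether
 * the old and new row versions match the row filter.
 */
static RowFilterAction
filter_update_action(bool old_match, bool new_match)
{
    if (old_match && new_match)
        return RF_UPDATE;       /* both versions published: ordinary update */
    if (!old_match && new_match)
        return RF_INSERT;       /* row enters the published set */
    if (old_match && !new_match)
        return RF_DELETE;       /* row leaves the published set */
    return RF_SKIP;             /* neither version matches the filter */
}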

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: refactoring basebackup.c

2021-07-19 Thread tushar

On 7/8/21 9:26 PM, Robert Haas wrote:

Here at last is a new version.
If I try to perform pg_basebackup using the "-t server" option against
localhost vs. a remote machine, I can see a difference in backup size.

The data directory size is:

[edb@centos7tushar bin]$ du -sch data/
578M    data/
578M    total

-h=localhost

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/all_data2 -h localhost -Xnone --no-manifest -P -v

pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
NOTICE:  all required WAL segments have been archived
329595/329595 kB (100%), 1/1 tablespace
pg_basebackup: base backup completed

[edb@centos7tushar bin]$ du -sch /tmp/all_data2
322M    /tmp/all_data2
322M    total
[edb@centos7tushar bin]$

-h=remote

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/all_data2 -h  -Xnone --no-manifest -P -v

pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
NOTICE:  all required WAL segments have been archived
170437/170437 kB (100%), 1/1 tablespace
pg_basebackup: base backup completed

[edb@0 bin]$ du -sch /tmp/all_data2
167M    /tmp/all_data2
167M    total
[edb@0 bin]$

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Re: badly calculated width of emoji in psql

2021-07-19 Thread Laurenz Albe
On Mon, 2021-07-19 at 16:46 +0900, Michael Paquier wrote:
> > In your opinion, would the current one-line patch proposal make things
> > strictly better than they are today, or would it have mixed results?
> > I'm wondering how to help this patch move forward for the current
> > commitfest, or if we should maybe return with feedback for now.
> 
> Based on the following list, it seems to me that [u+1f300,u+0x1faff]
> won't capture everything, like the country flags:
> http://www.unicode.org/emoji/charts/full-emoji-list.html

That could be adapted; the question is if the approach as such is
desirable or not.  This is necessarily a moving target, at the rate
that emojis are created and added to Unicode.

My personal feeling is that something simple and perhaps imperfect
as my one-liner that may miss some corner cases would be ok, but
anything that saps more performance or is complicated would not
be worth the effort.

Yours,
Laurenz Albe





Re: row filtering for logical replication

2021-07-19 Thread Amit Kapila
On Wed, Jul 14, 2021 at 8:03 PM Tomas Vondra
 wrote:
>
> On 7/14/21 2:50 PM, Amit Kapila wrote:
> > On Wed, Jul 14, 2021 at 3:58 PM Tomas Vondra
> >
> > I think apart from the above, it might be good if we can find what
> > some other databases does in this regard?
> >
>
> Yeah, that might tell us what the users would like to do with it. I did
> some quick search, but haven't found much :-( The one thing I found is
> that Debezium [1] allows accessing both the "old" and "new" rows through
> value.before and value.after, and use both for filtering.
>

Okay, but does it apply a filter to both rows for an Update event?

>
> [1]
> https://wanna-joke.com/wp-content/uploads/2015/01/german-translation-comics-science.jpg
>

This link doesn't provide Debezium information, seems like a typo.

-- 
With Regards,
Amit Kapila.




Re: refactoring basebackup.c

2021-07-19 Thread Dilip Kumar
On Fri, Jul 16, 2021 at 12:43 PM Dilip Kumar  wrote:
>
> On Mon, Jul 12, 2021 at 5:51 PM tushar  wrote:
> >
> > On 7/8/21 9:26 PM, Robert Haas wrote:
> > > Here at last is a new version.
> > Please refer this scenario ,where backup target using
> > --server-compression is closing the server
> > unexpectedly if we don't provide -no-manifest option
> >
> > [tushar@localhost bin]$ ./pg_basebackup --server-compression=gzip4  -t
> > server:/tmp/data_1  -Xnone
> > NOTICE:  WAL archiving is not enabled; you must ensure that all required
> > WAL segments are copied through other means to complete the backup
> > pg_basebackup: error: could not read COPY data: server closed the
> > connection unexpectedly
> >  This probably means the server terminated abnormally
> >  before or while processing the request.
> >
>
> I think the problem is that bbsink_gzip_end_archive() is not
> forwarding the end request to the next bbsink.  The attached patch so
> fix it.

I was going through the patch; I think the refactoring made the base
backup code really clean and readable.  I have a few minor
suggestions.

v3-0003

1.
+Assert(sink->bbs_next != NULL);
+bbsink_begin_archive(sink->bbs_next, gz_archive_name);

I have noticed that the interface for forwarding the request to the next
bbsink is not uniform; for example, bbsink_gzip_begin_archive() is
calling bbsink_begin_archive(sink->bbs_next, gz_archive_name) to
forward the request to the next bbsink, whereas
bbsink_progress_begin_backup() is calling
bbsink_forward_begin_backup(sink). I think it will be good if we keep
the usage uniform.

2.
I have noticed that bbsink_copytblspc_* are not forwarding the
request to the next sink, that's probably because we assume this should
always be the last sink.  I agree that it's true for this patch, but the
commit message of the patch says that in future this might change, so
wouldn't it be good to keep the interface generic?  I mean,
bbsink_copytblspc_new() should take the next sink as an input and the
caller can pass it as NULL.  And the other APIs can also try to
forward the request if next is not NULL?

3.
It would make more sense to order the functions in
basebackup_progress.c the same way as in the other files, i.e.
bbsink_progress_begin_backup, bbsink_progress_archive_contents and
then bbsink_progress_end_archive; this will also be in sync with the
function pointer declarations in bbsink_ops.

v3-0005-
4.
+ *
+ * 'copystream' sends a starts a single COPY OUT operation and transmits
+ * all the archives and the manifest if present during the course of that

typo 'copystream' sends a starts a single COPY OUT -->  'copystream'
sends a single COPY OUT


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: 2021-07 CF now in progress

2021-07-19 Thread Ibrar Ahmed
On Mon, Jul 12, 2021 at 4:59 PM Ibrar Ahmed  wrote:

>
> Hackers,
>
> The Commitfest 2021-07 is now in progress. It is one of the biggest ones.
> The total number of patches in this commitfest is 342.
>
> Needs review: 204.
> Waiting on Author: 40.
> Ready for Committer: 18.
> Committed: 57.
> Moved to next CF: 3.
> Withdrawn: 15.
> Rejected: 3.
> Returned with Feedback: 2.
> Total: 342.
>
> If you are a patch author, please check http://commitfest.cputube.org to
> be sure your patch still applies, compiles, and passes tests.
>
> We need your involvement and participation in reviewing the patches. Let's
> try and make this happen.
>
> --
> Regards.
> Ibrar Ahmed
>


Over the past week, the statuses of 47 patches have been changed from
"Needs review". This still leaves us with 157 patches
requiring review. As always, your continued support is appreciated to get
us over the line.

Please look at the patches requiring review in the current commitfest.
Test, provide feedback where needed, and update the patch status.

Total: 342.

Needs review: 157.
Waiting on Author: 74.
Ready for Committer: 15.
Committed: 68.
Moved to next CF: 3.
Withdrawn: 19.
Rejected: 4.
Returned with Feedback: 2.


-- 
Ibrar Ahmed


RE: Skipping logical replication transactions on subscriber side

2021-07-19 Thread houzj.f...@fujitsu.com
On July 19, 2021 2:40 PM Masahiko Sawada  wrote:
> I've attached the updated version patch that incorporated all comments
> I got so far except for the clearing error details part I mentioned
> above. After getting a consensus on those parts, I'll incorporate the
> idea into the patches.

Hi Sawada-san,

I am interested in this feature.
After having a look at the patch, I have a few questions about it.
(Sorry in advance if I missed something)

1) In the 0002 patch, it introduces a new view called pg_stat_subscription_errors.
   Since it won't be cleaned automatically after we resolve the conflict, do we
   need a reset function to clean the statistics in it? Maybe something
   similar to pg_stat_reset_replication_slot, which cleans the
   pg_stat_replication_slots view.

2) For the 0003 patch: when I am faced with a conflict, I set skip_xid = xxx, and
   then I resolve the conflict. If I reset skip_xid after resolving the
   conflict, will the change (which caused the conflict before) be applied again?

3) For the 0003 patch: if the user sets skip_xid to a wrong xid which has not been
   assigned, then will the change be skipped when that xid is assigned in
   the future, even if it doesn't cause any conflicts?

Besides, it might be better to add some description of the patch in each patch's
commit message, which will make it easier for new reviewers to follow.


Best regards,
Houzj


Re: .ready and .done files considered harmful

2021-07-19 Thread Dipesh Pandit
Hi,

> I agree, I missed this part. The .history file should be given higher
> preference.
> I will take care of it in the next patch.

The archiver does not have access to shared memory, and the current timeline ID
is not available to the archiver. In order to keep track of a timeline switch we
have to push a notification from the backend to the archiver.  The backend can
send a signal to notify the archiver about the timeline change. The archiver can
register this notification and perform a full directory scan to make sure that
archiving history files takes precedence over archiving WAL files.

> If a history file is found we are not updating curFileTLI and
> nextLogSegNo, so it will attempt the previously found segment.  This
> is fine because it will not find that segment and it will rescan the
> directory.  But I think we can do better, instead of searching the
> same old segment in the previous timeline we can search that old
> segment in the new TL so that if the TL switch happened within the
> segment then we will find the segment and we will avoid the directory
> search.

This could have been done with the approach mentioned in patch v1, but now,
considering that archiving history files takes precedence over WAL files, we
cannot update "curFileTLI" whenever a history file is found.

> So everytime archiver will start with searching segno=0 in timeline=0.
> Instead of doing this can't we first scan the directory and once we
> get the first segment to archive then only we can start predicting the
> next wal segment?

Done.

> This comment is a bit confusing with the name of the variable
> nextLogSegNo.
> I think the name of the variable is appropriate here, but maybe we can
> reword
> the comment something like:

Done.

I have incorporated these changes and updated a new patch. PFA, patch v2.

Thanks,
Dipesh
From 22b0e8fb9c778fbfdc945a647f82f6bbd8d6ec0a Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimised by
maintaining the log segment number of current file which is being
archived and incrementing it by '1' to get the next WAL file.
Archiver can check the availability of next file and in case if the
file is not available then it should fall-back to directory scan to
get the oldest WAL file.

If there is a timeline switch then backend sends a notification to
archiver. Archiver registers the timeline switch and performs a full
directory scan to make sure that archiving history files takes
precedence over archiving WAL files
---
 src/backend/access/transam/xlog.c|   6 ++
 src/backend/access/transam/xlogarchive.c |  10 +++
 src/backend/postmaster/pgarch.c  | 124 ---
 src/include/access/xlogarchive.h |   1 +
 src/include/postmaster/pgarch.h  |   1 +
 5 files changed, 133 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c7c928f..40969a9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8130,6 +8130,12 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
+	 * If archiver is active, send notification that timeline has switched.
+	 */
+	if (XLogArchivingActive() && ArchiveRecoveryRequested)
+		XLogArchiveNotifyTLISwitch();
+
+	/*
 	 * If this was a promotion, request an (online) checkpoint now. This isn't
 	 * required for consistency, but the last restartpoint might be far back,
 	 * and in case of a crash, recovering from it might take a longer than is
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..1968872 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -507,6 +507,16 @@ XLogArchiveNotifySeg(XLogSegNo segno)
 }
 
 /*
+ * Signal archiver to notify timeline switch
+ */
+void
+XLogArchiveNotifyTLISwitch(void)
+{
+	if (IsUnderPostmaster)
+		PgArchNotifyTLISwitch();
+}
+
+/*
  * XLogArchiveForceDone
  *
  * Emit notification forcibly that an XLOG segment file has been successfully
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..e23b9bc 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -90,12 +90,22 @@ static PgArchData *PgArch = NULL;
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t ready_to_stop = false;
+static volatile sig_atomic_t timeline_switch = false;
+
+/*
+ * Log segment number to get next WAL file in a sequence.
+ */
+static XLogSegNo nextLogSegNo = 0;
+
+/* Flag to specify a full directory scan to find next log file */
+static bool dirScan = true;
 
 /* --
  * Local function forward declarations
  * --
  */
 static void pgarch_waken_stop(SIGNAL_ARGS);
+static void pgarch_timeline_switch(

Re: O_DIRECT on macOS

2021-07-19 Thread John Naylor
On Mon, Jul 19, 2021 at 12:55 AM Thomas Munro 
wrote:
>
> On Mon, Jul 19, 2021 at 4:42 PM Tom Lane  wrote:
> > prairiedog thinks that Assert is too optimistic about whether all
> > those flags exist.
>
> Fixed.
>
> (Huh, I received no -committers email for 2dbe8905.)

It didn't show up in the archives, either. Neither did your follow-up
04cad8f7bc.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: refactoring basebackup.c

2021-07-19 Thread tushar

On 7/16/21 12:43 PM, Dilip Kumar wrote:

I think the problem is that bbsink_gzip_end_archive() is not
forwarding the end request to the next bbsink.  The attached patch so
fix it.


Thanks Dilip. The reported issue seems to be fixed now with your patch.

[edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4  -t 
server:/tmp/data_2 -v  -Xnone -R

pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
NOTICE:  all required WAL segments have been archived
pg_basebackup: base backup completed
[edb@centos7tushar bin]$

OR

[edb@centos7tushar bin]$ ./pg_basebackup   -t server:/tmp/pv1 -Xnone   
--server-compression=gzip4 -r 1024  -P

NOTICE:  all required WAL segments have been archived
23133/23133 kB (100%), 1/1 tablespace
[edb@centos7tushar bin]$

Please refer to this scenario, where the -R option is working with '-t server'
but not with -Ft


--not working

[edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4  
-Ft  -D ccv   -Xnone  -R --no-manifest

pg_basebackup: error: unable to parse archive: base.tar.gz
pg_basebackup: only tar archives can be parsed
pg_basebackup: the -R option requires pg_basebackup to parse the archive
pg_basebackup: removing data directory "ccv"

--working

[edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4 -t   
server:/tmp/ccv    -Xnone  -R --no-manifest

NOTICE:  all required WAL segments have been archived
[edb@centos7tushar bin]$

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-19 Thread James Coleman
On Sat, Jul 17, 2021 at 4:36 AM David Rowley  wrote:
>
>  On Sat, 17 Jul 2021 at 01:14, James Coleman  wrote:
> > The only remaining question I have is whether or not costing needs to
> > change, given the very significant speedup for datum sort.
>
> I'm looking at cost_tuplesort and the only thing that I think might
> make sense would be to adjust how the input_bytes value is calculated.
> For now, that's done with the following function that's used in quite
> a number of places.
>
> static double
> relation_byte_size(double tuples, int width)
> {
> return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
> }
>
> It seems, at least in the case of Sort, that using SizeofHeapTupleHeader
> is just always wrong as it should be SizeofMinimalTupleHeader. I know
> that's also the case for Memoize too. I've not checked the other
> locations.
>
> The only thing I can really see that we might do would be not add the
> MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
> We'd need to pass down the number of attributes from
> create_sort_path() so we'd know when and when not to add that. I'm not
> saying that we should do this. I'm just saying that I don't really see
> what else we might do.
>
> I can imagine another patch might just want to do a complete overhaul
> of all locations that use relation_byte_size().  There are various
> things that function just does not account for. e.g, the fact that we
> allocate chunks in powers of 2 and that there's a chunk header added
> on.  Of course, "width" is just an estimate, so maybe trying to
> calculate something too precisely wouldn't be too wise. However,
> there's a bit of a chicken and the egg problem there as there'd be
> little incentive to improve "width" unless we started making more
> accurate use of the value.
>
> Anyway, none of the above takes into account that the Datum sort is
> just a little faster. The only thing that exists in the existing cost
> model that we could use to adjust the cost of an in-memory sort is the
> comparison_cost.  The problem there is that the comparison is exactly
> the same in both Datum and Tuple sorts. The only thing that really
> changes between Datum and Tuple sort is the fact that we don't make a
> MinimalTuple when doing a Datum sort.  The cost model, unfortunately,
> does not account for that.   That kinda makes me think that we should
> do nothing, as if we start to account for making MinimalTuples then
> we'll just penalise Tuple sorts and that might cause someone to be
> upset.

To be clear up front: I'm in favor of the patch, and I don't want to
put unnecessary stumbling blocks up for it getting committed. So if we
decide to proceed as is, that's fine with me.

But I'm not sure that the "cost model, unfortunately, does not account
for that" is entirely accurate. The end of cost_tuplesort contains a
cost "per extracted tuple". It does, however, note that it doesn't
charge cpu_tuple_cost, which maybe is what you'd want to fully
incorporate this into the model. But given this run_cost isn't about
accounting for comparison cost (that's been done earlier) which is the
part that'd be the same between tuple and datum sort, it seems to me
that we could lower the cpu_operator_cost here by something like 10%
if it's byref and 30% if it's byval?
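
As a rough sketch of that idea (all names here are hypothetical, and the
0.9/0.7 factors are just the 10%/30% discounts suggested above, not measured
values):

static double
sort_output_cost(double tuples, double cpu_operator_cost,
                 bool single_datum_sort, bool datum_is_byval)
{
    double      per_tuple = cpu_operator_cost;

    /*
     * Discount the per-output-tuple charge for the single-Datum path,
     * since no MinimalTuple has to be built there.
     */
    if (single_datum_sort)
        per_tuple *= datum_is_byval ? 0.7 : 0.9;

    return per_tuple * tuples;
}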

James




Re: Re[3]: On login trigger: take three

2021-07-19 Thread Greg Nancarrow
On Mon, Jul 19, 2021 at 8:18 PM Ibrar Ahmed  wrote:
>
> The patch does not apply, and rebase is required.
>

Attached a rebased patch (minor updates to the test code).

Regards,
Greg Nancarrow
Fujitsu Australia


v17-0001-on_client_connect_event_trigger.patch
Description: Binary data


Re: row filtering for logical replication

2021-07-19 Thread Tomas Vondra



On 7/19/21 1:00 PM, Dilip Kumar wrote:
> On Mon, Jul 19, 2021 at 3:12 PM Amit Kapila  wrote:
>> a. Just log it and move to the next row
>> b. send to stats collector some info about this which can be displayed
>> in a view and then move ahead
>> c. just skip it like any other row that doesn't match the filter clause.
>>
>> I am not sure if there is any use of sending a row if one of the
>> old/new rows doesn't match the filter. Because if the old row doesn't
>> match but the new one matches the criteria, we will anyway just throw
>> such a row on the subscriber instead of applying it.
> 
> But at some point that will be true even if we skip the row based on
> (a) or (c), right?  Suppose the OLD row did not satisfy the condition
> but the NEW row does; even if we skip this operation, then in the next
> operation on the same row, even if both the OLD and NEW rows satisfy
> the filter, the operation will just be dropped by the subscriber,
> because we did not send the row when it was first updated to a value
> that satisfied the condition.  So basically, once a row is inserted
> that did not satisfy the condition, then no matter how many updates we
> do to that row, either it will be skipped by the publisher because the
> OLD row did not satisfy the condition, or it will be skipped by the
> subscriber as there is no matching row.
> 

I have a feeling it's getting overly complicated, to the extent that
it'll be hard to explain to users and reason about. I don't think
there's a "perfect" solution for cases when the filter expression gives
different answers for old/new row - it'll always be surprising for some
users :-(

So maybe the best thing is to stick to the simple approach already used
e.g. by pglogical, which simply uses the new row when available (insert,
update) and the old one for deletes.

I think that behaves more or less sensibly and it's easy to explain.

All the other things (e.g. turning UPDATE to INSERT, advanced conflict
resolution etc.) will require a lot of other stuff, and I see them as
improvements of this simple approach.

>>> Maybe a second option is to have replication change any UPDATE into
>>> either an INSERT or a DELETE, if the old or the new row do not pass the
>>> filter, respectively.  That way, the databases would remain consistent.
> 
> Yeah, I think this is the best way to keep the data consistent.
> 

It'd also require REPLICA IDENTITY FULL, which seems like it'd add a
rather significant overhead.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: row filtering for logical replication

2021-07-19 Thread Tomas Vondra
On 7/19/21 1:30 PM, Amit Kapila wrote:
> On Wed, Jul 14, 2021 at 8:03 PM Tomas Vondra
>  wrote:
>>
>> On 7/14/21 2:50 PM, Amit Kapila wrote:
>>> On Wed, Jul 14, 2021 at 3:58 PM Tomas Vondra
>>>
>>> I think apart from the above, it might be good if we can find what
>>> some other databases does in this regard?
>>>
>>
>> Yeah, that might tell us what the users would like to do with it. I did
>> some quick search, but haven't found much :-( The one thing I found is
>> that Debezium [1] allows accessing both the "old" and "new" rows through
>> value.before and value.after, and use both for filtering.
>>
> 
> Okay, but does it apply a filter to both rows for an Update event?
> 
>>
>> [1]
>> https://wanna-joke.com/wp-content/uploads/2015/01/german-translation-comics-science.jpg
>>
> 
> This link doesn't provide Debezium information, seems like a typo.
> 

Uh, yeah - I copied a different link. I meant to send this one:

https://debezium.io/documentation/reference/configuration/filtering.html


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: speed up verifying UTF-8

2021-07-19 Thread Vladimir Sitnikov
Thank you,

It looks like it is important to have shrx for x86 which appears only when
-march=x86-64-v3 is used (see
https://github.com/golang/go/issues/47120#issuecomment-877629712 ).
Just in case: I know x86 would not use the fallback implementation, however,
the sole purpose of shift-based DFA is to fold all the data-dependent ops
into a single instruction.

An alternative idea: should we optimize for validation of **valid** inputs
rather than optimizing the worst case?
In other words, what if the implementation processes all characters always
and uses a slower method in case of validation failure?
I would guess it is more important to be faster with accepting valid input
rather than "faster to reject invalid input".

In the shift-DFA approach, it would mean the validation loop would be simpler
with fewer branches (see https://godbolt.org/z/hhMxhT6cf ):

static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end) {
uint64 class;
uint64 state = BGN;
while (s < end) { // clang unrolls the loop
class = ByteCategory[*s++];
state = class >> (state & DFA_MASK); // <-- note that AND is fused
into the shift operation
}
return (state & DFA_MASK) != ERR;
}

Note: GCC does not seem to unroll the "while (s < end)" loop on its own, so a
manually unrolled variant (an inner loop processing several bytes per
iteration) might be worth trying; its remainder loop and final check stay the
same:

while (s < end) {
class = ByteCategory[*s++];
state = class >> (state & DFA_MASK);
}
return (state & DFA_MASK) != ERR;
}



static int pg_utf8_verifystr2(const unsigned char *s, int len) {
if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid,
then just accept it
return s + len;
}
// slow path: the string is not valid, perform a slower analysis
return s + ;
}

Vladimir


Re: Addition of authenticated ID to pg_stat_activity

2021-07-19 Thread Aleksander Alekseev
Hi Michael,

> Just to make the discussion move on, attached is an updated version
> doing that.

The code seems OK, but I have mixed feelings about the way that the
VIEW currently works.

Here is what I get when a single user is connected via a UNIX socket:

43204 (master) =# select * from pg_stat_connection;
  pid  | authenticated_id | client_addr | client_hostname | client_port
---+--+-+-+-
 25806 |  | | |
 25808 |  | | |
 43204 |  | | |  -1
 25804 |  | | |
 25803 |  | | |
 25805 |  | | |
(6 rows)

I bet we could be more user-friendly than this. To begin with, the
documentation says:

+  
+   The pg_stat_connection view will have one row
+   per server process, showing information related to
+   the current connection of that process.
+  

[...]

+ 
+  
+   client_addr inet
+  
+  
+   IP address of the client connected to this backend.
+   If this field is null, it indicates either that the client is
+   connected via a Unix socket on the server machine or that this is an
+   internal process such as autovacuum.
+  
+ 

Any reason why we shouldn't simply exclude internal processes from the
view? Do they have a connection that the VIEW could show?

Secondly, maybe instead of magic constants like -1, we could use an
additional text column, e.g. connection_type: "unix"? Thirdly, not
sure if client_hostname is really needed, isn't address:port pair
enough to identify the client? Lastly, introducing a new GUC for
truncating values in a single view, which can only be set at server
start, doesn't strike me as a great idea. What is the worst-case
scenario if instead we will always allocate
`strlen(MyProcPort->authn_id)` and the user will truncate the result
manually if needed?

-- 
Best regards,
Aleksander Alekseev




Re: Question about non-blocking mode in libpq

2021-07-19 Thread Yugo NAGATA
On Tue, 13 Jul 2021 11:59:49 +0900
Yugo NAGATA  wrote:

> Hello,
> 
> During reading the documentation of libpq [1] , I found the following
> description:
> 
>  In the nonblocking state, calls to PQsendQuery, PQputline, PQputnbytes,
>  PQputCopyData, and PQendcopy will not block but instead return an error
>  if they need to be called again.
> 
> [1] https://www.postgresql.org/docs/devel/libpq-async.html
> 
> However, looking into the code, PQsendQuery seems not to return an error
> in non-blocking mode even if it is unable to send all data. In such cases,
> pqSendSome will return 1 but it doesn't cause an error. Moreover,
> we would not need to call PQsendQuery again. Instead, we need to call
> PQflush until it returns 0, as documented with regard to PQflush.
> 
> Do we need to fix the description of PQsetnonblocking?

I have further questions. Reading the following statement:

 "In the nonblocking state, calls to PQsendQuery, PQputline, PQputnbytes,
 PQputCopyData, and PQendcopy will not block" 

this seems to me to be a list of functions that could block in blocking
mode, but I wonder whether PQflush could also block, because it calls pqSendSome, right?

Also, in the last paragraph of the section, I can find the following:

 "After sending any command or data on a nonblocking connection, call PQflush. 
..."

However, ISTM we don't need to call PQflush in non-blocking mode and we can
call PQgetResult immediately, because PQgetResult internally calls pqFlush
until it returns 0 (or -1).

/*
 * If data remains unsent, send it.  Else we might be waiting for the
 * result of a command the backend hasn't even got yet.
 */
while ((flushResult = pqFlush(conn)) > 0) 
{
if (pqWait(false, true, conn))
{
flushResult = -1;
break;
}
}

Therefore, I wonder whether the last paragraph of this section is
now unnecessary, right?
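
For reference, a minimal sketch of the flow being discussed (submit in
non-blocking mode, drain the buffer with PQflush, then collect results; the
query and the error handling are placeholders):

#include <libpq-fe.h>

static void
send_query_nonblocking(PGconn *conn)
{
    PGresult   *res;

    if (PQsetnonblocking(conn, 1) == -1)
        return;                 /* could not switch to non-blocking mode */

    if (PQsendQuery(conn, "SELECT 1") == 0)
        return;                 /* dispatch failed */

    /* PQflush returns 1 while data remains unsent, 0 when done, -1 on error */
    while (PQflush(conn) == 1)
    {
        /* wait for PQsocket(conn) to become writable, e.g. with select()/poll() */
    }

    /* PQgetResult flushes any remaining data internally anyway */
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);
}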

-- 
Yugo NAGATA 




Re: O_DIRECT on macOS

2021-07-19 Thread Tom Lane
Thomas Munro  writes:
> On Mon, Jul 19, 2021 at 4:42 PM Tom Lane  wrote:
>> prairiedog thinks that Assert is too optimistic about whether all
>> those flags exist.

> Fixed.

Hmm ... we used to have to avoid putting #if constructs in the arguments
of macros (such as StaticAssertStmt).  Maybe that's not a thing anymore
with C99, and in any case this whole stanza is fairly platform-specific
so we may not run into a compiler that complains.  But my hindbrain wants
to see this done with separate statements, eg

#if defined(O_CLOEXEC)
StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
 "PG_O_DIRECT collides with O_CLOEXEC");
#endif

regards, tom lane




Re: Failure with 004_logrotate in prairiedog

2021-07-19 Thread Tom Lane
Kyotaro Horiguchi  writes:
> When rotation happens, the metainfo file is once removed then
> created. If slurp_file in the metafile-checking loop hits the gap, the
> slurp_file fails with ENOENT.

Oh!  Yeah, that's dumb, we should fix it to use rename().  Can't blame
platform's rename() if it's not being used.

regards, tom lane




Re: Failure with 004_logrotate in prairiedog

2021-07-19 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote:
>> When rotation happens, the metainfo file is once removed then
>> created. If slurp_file in the metafile-checking loop hits the gap, the
>> slurp_file fails with ENOENT.

> I can read the following code, as of update_metainfo_datafile():
> if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0)

Yeah, ignore my previous message.  There is an unlink up at the top
of the function, which fooled me in my caffeine-deprived state.
But that path is only taken when logging was just turned off, so
we must remove the now-irrelevant metafile.

regards, tom lane




Re: postgresql.conf.sample missing units

2021-07-19 Thread John Naylor
On Mon, Jul 19, 2021 at 5:44 AM Pavel Luzanov 
wrote:
>
> Hello,
>
> I found that the start section of the postgresql.conf file is missing a
> description of two units: bytes (appeared in version 11) and
> microseconds (in version 12).
>
> The attached patch makes these changes to the postgresql.conf.sample file.

Seems like an oversight. I'll commit this soon barring objections.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: refactoring basebackup.c

2021-07-19 Thread Dilip Kumar
On Mon, Jul 19, 2021 at 6:02 PM tushar  wrote:
>
> On 7/16/21 12:43 PM, Dilip Kumar wrote:
> > I think the problem is that bbsink_gzip_end_archive() is not
> > forwarding the end request to the next bbsink.  The attached patch so
> > fix it.
>
> Thanks Dilip. Reported issue seems to be fixed now with your patch

Thanks for the confirmation.

> Please refer this scenario ,where -R option is working with '-t server'
> but not with -Ft
>
> --not working
>
> [edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4
> -Ft  -D ccv   -Xnone  -R --no-manifest
> pg_basebackup: error: unable to parse archive: base.tar.gz
> pg_basebackup: only tar archives can be parsed
> pg_basebackup: the -R option requires pg_basebackup to parse the archive
> pg_basebackup: removing data directory "ccv"

As per the error message and code, if we are giving -R then we need to
inject the recovery configuration into the archive, and that is only supported
for tar archives; since you are enabling server compression the output is no
longer a plain tar archive, so it is giving an error.

> --working
>
> [edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4 -t
> server:/tmp/ccv-Xnone  -R --no-manifest
> NOTICE:  all required WAL segments have been archived
> [edb@centos7tushar bin]$

I am not sure why this is working; from the code I could not find whether,
when the backup target is the server, we are doing anything with the -R
option or just silently ignoring it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: speed up verifying UTF-8

2021-07-19 Thread John Naylor
On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

> It looks like it is important to have shrx for x86 which appears only
when -march=x86-64-v3 is used (see
https://github.com/golang/go/issues/47120#issuecomment-877629712 ).
> Just in case: I know x86 wound not use fallback implementation, however,
the sole purpose of shift-based DFA is to fold all the data-dependent ops
into a single instruction.

I saw mention of that instruction, but didn't understand how important it
was, thanks.

> An alternative idea: should we optimize for validation of **valid**
inputs rather than optimizing the worst case?
> In other words, what if the implementation processes all characters
always and uses a slower method in case of validation failure?
> I would guess it is more important to be faster with accepting valid
input rather than "faster to reject invalid input".

> static int pg_utf8_verifystr2(const unsigned char *s, int len) {
> if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid,
then just accept it
> return s + len;
> }
> // slow path: the string is not valid, perform a slower analysis
> return s + ;
> }

That might be workable. We have to be careful because in COPY FROM,
validation is performed on 64kB chunks, and the boundary could fall in the
middle of a multibyte sequence. In the SSE version, there is this comment:

+ /*
+ * NB: This check must be strictly greater-than, otherwise an invalid byte
+ * at the end might not get detected.
+ */
+ while (len > sizeof(__m128i))

...which should have more detail on this.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Added documentation for cascade and restrict option of drop statistics

2021-07-19 Thread vignesh C
On Mon, 19 Jul 2021, 09:16 Michael Paquier,  wrote:
>
> On Sun, Jul 18, 2021 at 12:37:48PM +0900, Michael Paquier wrote:
> > Indeed, good catch.  The other commands document that, so let's fix
> > it.
>
> Applied.

Thanks for committing this patch.

Regards,
Vignesh




Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors

2021-07-19 Thread Fabien COELHO



I attached the updated patch.


# About pgbench error handling v15

Patches apply cleanly. Compilation, global and local tests ok.

 - v15.1: refactoring is a definite improvement.
   Good, even if it is not very useful (see below).

   While restructuring, maybe predefined variables could be made readonly
   so that a script which would update them would fail, which would be a
   good thing. This is probably material for an independent patch.

 - v15.2: see detailed comments below

# Doc

Doc build is ok.

ISTM that "number of tries" line would be better placed between the 
#threads and #transactions lines. What do you think?


Aggregate logging description: "{ failures | ... }" seems misleading 
because it suggests we have one or the other, whereas it can also be 
empty. I suggest: "{ | failures | ... }" to show the empty case.


Having a full example with retries in the doc is a good thing, and 
illustrates in passing that running with a number of clients on a small 
scale does not make much sense because of the contention on 
tellers/branches. I'd wonder whether the number of tries is set too high, 
though; ISTM that an application should give up before 100? I like that
the feature is also limited by the latency limit.


Minor editing:

"there're" -> "there are".

"the --time" -> "the --time option".

The overall English seems good, but I'm not a native speaker. As I already 
said, a native
speaker proofreading would be nice.


From a technical writing point of view, maybe the documentation could be
improved a bit, but I'm not at ease on that subject. Some comments:

"The latency for failed transactions and commands is not computed separately." 
is unclear,
please use a positive sentence to tell what is true instead of what is not and 
the reader
has to guess. Maybe: "The latency figures include failed transactions which 
have reached
the maximum number of tries or the transaction latency limit.".

"The main report contains the number of failed transactions if it is non-zero." 
ISTM that
this is a pain for scripts which would like to process these reports data, 
because the data
may or may not be there. I'm sure to write such scripts, which explains my 
concern:-)

"If the total number of retried transactions is non-zero…" should it rather be "not 
one",
because zero means unlimited retries?

The section describing the various type of errors that can occur is a good 
addition.

Option "--report-latencies" changed to "--report-per-commands": I'm fine with 
this change.

# FEATURES

--failures-detailed: I'm not convinced that this option should not always be 
on, but
this is not very important, so let it be.

--verbose-errors: I still think this is only for debugging, but let it be.

Copying variables: ISTM that we should not need to save the variables 
states… no clearing, no copying should be needed. The restarted 
transaction simply overrides the existing variables which is what the 
previous version was doing anyway. The scripts should write their own 
variables before using them, and if they don't then it is the user 
problem. This is important for performance, because it means that after a 
client has executed all scripts once the variable array is stable and does 
not incur significant maintenance costs. The only thing that needs saving 
for retry is the pseudo-random generator state. This suggests simplifying 
or removing "RetryState".


# CODE

The semantics of "cnt" is changed. Ok, the overall counters and their 
relationships make sense, and it simplifies the reporting code. Good.


In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and
could be dealt with in the default case. We are only interested in
serialization/deadlock errors, which are fatal errors?


doRetry: for consistency, given the assert, ISTM that it should return 
false if duration has expired, by testing end_time or timer_exceeded.


checkTransactionStatus: this function does several things at once with 2
booleans, which makes it not very clear to me. Maybe it would be clearer if
it would just return an enum (in trans, not in trans, conn error, other 
error). Another reason to do that is that on connection error pgbench 
could try to reconnect, which would be an interesting later extension, so 
let's pave the way for that.  Also, I do not think that the function 
should print out a message, it should be the caller decision to do that.
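
Something along these lines would do (a sketch; the names are only a
suggestion, not taken from the patch):

typedef enum TStatus
{
    TSTATUS_IDLE,            /* no transaction in progress */
    TSTATUS_IN_TRANS,        /* inside a transaction block */
    TSTATUS_CONN_ERROR,      /* connection is broken; caller may reconnect */
    TSTATUS_OTHER_ERROR      /* unexpected status, caller reports the error */
} TStatus;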


verbose_errors: there is more or less repeated code under RETRY and 
FAILURE, which should be factored out in a separate function. The 
advanceConnectionFunction is long enough. Once this is done, there is no 
need for a getLatencyUsed function.


I'd put cleaning up the pipeline in a function. I do not understand why 
the pipeline mode is not exited in all cases, the code checks for the 
pipeline status twice in a few lines. I'd put this cleanup in the sync 
function as well, report to the caller (advanceConnectionState) if there 
was an error, which would be managed there.

Re: refactoring basebackup.c

2021-07-19 Thread Mark Dilger



> On Jul 8, 2021, at 8:56 AM, Robert Haas  wrote:
> 
> The interesting
> patches in terms of functionality are 0006 and 0007;

The difficulty in v3-0007 with pg_basebackup only knowing how to parse tar 
archives seems to be a natural consequence of not sufficiently abstracting out 
the handling of the tar format.  If the bbsink and bbstreamer abstractions 
fully encapsulated a set of parsing callbacks, then pg_basebackup wouldn't 
contain things like:

streamer = bbstreamer_tar_parser_new(streamer);

but instead would use the parser callbacks without knowledge of whether they 
were parsing tar vs. cpio vs. whatever.  It just seems really odd that 
pg_basebackup is using the extensible abstraction layer and then defeating the 
purpose by knowing too much about the format.  It might even be a useful 
exercise to write cpio support into this patch set rather than waiting until 
v16, just to make sure the abstraction layer doesn't have tar-specific 
assumptions left over.


printf(_("  -F, --format=p|t   output format (plain (default), 
tar)\n"));

printf(_("  -z, --gzip compress tar output\n"));
printf(_("  -Z, --compress=0-9 compress tar output with given 
compression level\n"));

This is the pre-existing --help output, not changed by your patch, but if you 
anticipate that other output formats will be supported in future releases, 
perhaps it's better not to write the --help output in such a way as to imply 
that -z and -Z are somehow connected with the choice of tar format?  Would 
changing the --help now make for less confusion later?  I'm just asking...

The new options to pg_basebackup should have test coverage in 
src/bin/pg_basebackup/t/010_pg_basebackup.pl, though I expect you are waiting 
to hammer out the interface before writing the tests.

> the rest is
> preparatory refactoring.

patch v3-0001:

The new function AppendPlainCommandOption writes too many spaces, which does no 
harm, but seems silly, resulting in lines like:

  LOG:  received replication command: BASE_BACKUP ( LABEL 'pg_basebackup base 
backup',  PROGRESS,  WAIT 0,  MANIFEST 'yes')


patch v3-0003:

The introduction of the sink abstraction seems incomplete, as basebackup.c 
still has knowledge of things like tar headers.  Calls like 
_tarWriteHeader(sink, ...) feel like an abstraction violation.  I expected 
perhaps this would get addressed in later patches, but it doesn't.

+ * 'bbs_buffer' is the buffer into which data destined for the bbsink
+ * should be stored. It must be a multiple of BLCKSZ.
+ *
+ * 'bbs_buffer_length' is the allocated length of the buffer.

The length must be a multiple of BLCKSZ, not the pointer.


patch-v3-0005:

+ * 'copystream' sends a starts a single COPY OUT operation and transmits

too many verbs.

+ * Regardless of which method is used, we sent a result set with

"is used" vs. "sent" verb tense mismatch.

+ * So we only check it after the number of bytes sine the last check reaches

typo.  s/sine/since/

-* (2) we need to inject backup_manifest or recovery configuration into it.
+* (2) we need to inject backup_manifest or recovery configuration into
+* it.

src/bin/pg_basebackup/pg_basebackup.c contains word wrap changes like the above 
which would better be left to a different commit, if done at all.

+   if (state.manifest_file !=NULL)

Need a space after !=


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Support for NSS as a libpq TLS backend

2021-07-19 Thread Jacob Champion
On Wed, 2021-06-23 at 15:48 +0200, Daniel Gustafsson wrote:
> Attached is a rebased version which incorporates your recent patchset for
> resource handling, as well as the connect_ok test patch.

With v38 I do see the "one-time function was previously called and
failed" message you mentioned before, as well as some PR_Assert()
crashes. Looks like it's just due to the placement of
SSL_ClearSessionCache(); gating it behind the conn->nss_context check
ensures that we don't call it if no NSS context actually exists. Patch
attached (0001).

--

Continuing my jog around the patch... client connections will crash if
hostaddr is provided rather than host, because SSL_SetURL can't handle
a NULL argument. I'm running with 0002 to fix it for the moment, but
I'm not sure yet if it does the right thing for IP addresses, which the
OpenSSL side has a special case for.

Early EOFs coming from the server don't currently have their own error
message, which leads to a confusingly empty

connection to server at "127.0.0.1", port 47447 failed: 

0003 adds one, to roughly match the corresponding OpenSSL message.

While I was fixing that I noticed that I was getting a "unable to
verify certificate" error message for the early EOF case, even with
sslmode=require. That error message is being printed to conn-
>errorMessage during pg_cert_auth_handler(), even if we're not
verifying certificates, and then that message is included in later
unrelated failures. 0004 patches that.

--Jacob
From a675f4b6808c695d3691ad8a1a5e004ed71312c3 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Tue, 15 Jun 2021 15:59:39 -0700
Subject: [PATCH 1/4] nss: move SSL_ClearSessionCache

Don't clear the cache if no context exists.
---
 src/interfaces/libpq/fe-secure-nss.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/interfaces/libpq/fe-secure-nss.c b/src/interfaces/libpq/fe-secure-nss.c
index 613e7d6a1d..87e18fdde2 100644
--- a/src/interfaces/libpq/fe-secure-nss.c
+++ b/src/interfaces/libpq/fe-secure-nss.c
@@ -141,8 +141,6 @@ pgtls_close(PGconn *conn)
 	 * All NSS references must be cleaned up before we close out the
 	 * context.
 	 */
-	SSL_ClearSessionCache();
-
 	if (conn->pr_fd)
 	{
 		PRStatus	status;
@@ -161,6 +159,10 @@ pgtls_close(PGconn *conn)
 	if (conn->nss_context)
 	{
 		SECStatus	status;
+
+		/* The session cache must be cleared, or we'll leak references. */
+		SSL_ClearSessionCache();
+
 		status = NSS_ShutdownContext(conn->nss_context);
 
 		if (status != SECSuccess)
-- 
2.25.1

From 2a23b00084538d9d9849a1ca46d054c1c84c53c3 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 19 Jul 2021 10:29:45 -0700
Subject: [PATCH 2/4] nss: handle NULL host

...in the case that we're using a hostaddr instead.
---
 src/interfaces/libpq/fe-secure-nss.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/interfaces/libpq/fe-secure-nss.c b/src/interfaces/libpq/fe-secure-nss.c
index 87e18fdde2..cc66f5bb8c 100644
--- a/src/interfaces/libpq/fe-secure-nss.c
+++ b/src/interfaces/libpq/fe-secure-nss.c
@@ -183,6 +183,7 @@ pgtls_open_client(PGconn *conn)
 	PRFileDesc *model;
 	NSSInitParameters params;
 	SSLVersionRange desired_range;
+	const char *host;
 
 #ifdef ENABLE_THREAD_SAFETY
 #ifdef WIN32
@@ -449,9 +450,11 @@ pgtls_open_client(PGconn *conn)
 
 	/*
 	 * Specify which hostname we are expecting to talk to for the ClientHello
-	 * SNI extension.
+	 * SNI extension. TODO: what if the host is an IP address?
 	 */
-	SSL_SetURL(conn->pr_fd, (conn->connhost[conn->whichhost]).host);
+	host = conn->connhost[conn->whichhost].host;
+	if (host && host[0])
+		SSL_SetURL(conn->pr_fd, host);
 
 	status = SSL_ForceHandshake(conn->pr_fd);
 
-- 
2.25.1

From 9caa50893fc3b8d3447ce14d4b93876aa7ffe91a Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 19 Jul 2021 12:14:00 -0700
Subject: [PATCH 3/4] nss: fix spurious "unable to verify certificate" message

That message gets written to conn->errorMessage even if sslmode isn't
verifying the certificate contents, which can lead to a confusing UX if
the connection later fails for some other reason.
---
 src/interfaces/libpq/fe-secure-nss.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/interfaces/libpq/fe-secure-nss.c b/src/interfaces/libpq/fe-secure-nss.c
index cc66f5bb8c..4490fb03e4 100644
--- a/src/interfaces/libpq/fe-secure-nss.c
+++ b/src/interfaces/libpq/fe-secure-nss.c
@@ -858,12 +858,6 @@ pg_cert_auth_handler(void *arg, PRFileDesc *fd, PRBool checksig, PRBool isServer
 		if (!pq_verify_peer_name_matches_certificate(conn))
 			status = SECFailure;
 	}
-	else
-	{
-		printfPQExpBuffer(&conn->errorMessage,
-		  libpq_gettext("unable to verify certificate: %s"),
-		  pg_SSLerrmessage(PR_GetError()));
-	}
 
 	CERT_DestroyCertificate(server_cert);
 	return status;
@@ -923,6 +917,8 @@ pg_bad_cert_handler(void *arg, PRFileDesc *fd)
 	if (!arg)
 		return SECFailure;
 
+	err = PORT_GetError();
+
 	

Avoid stack frame setup in performance critical routines using tail calls

2021-07-19 Thread Andres Freund
Hi,

We have a few routines that are taking up a meaningful share of nearly all
workloads. They are worth micro-optimizing, even though they are rarely the most
expensive parts of a profile. The memory allocation infrastructure is an
example of that.

When looking at a profile one can often see that a measurable percentage of
the time is spent doing stack frame setup in functions like palloc(),
AllocSetAlloc(). E.g. here's a perf profile annotating palloc(), the first
column showing the percentage of the time the relevant instruction was
sampled:

   │  void *
   │  palloc(Size size)
   │  {
 11.62 │push  %rbp
  5.89 │mov   %rsp,%rbp
 11.79 │push  %r12
   │mov   %rdi,%r12
  6.07 │push  %rbx
   │  /* duplicates MemoryContextAlloc to avoid increased overhead */
   │  void   *ret;
   │  MemoryContext context = CurrentMemoryContext;
   │mov   CurrentMemoryContext,%rbx
   │
   │  AssertArg(MemoryContextIsValid(context));
   │  AssertNotInCriticalSection(context);
   │
   │  if (!AllocSizeIsValid(size))
  5.86 │cmp   $0x3fff,%rdi
   │  → ja14fa5b 
   │  elog(ERROR, "invalid memory alloc request size %zu", size);
   │
   │  context->isReset = false;
 17.71 │movb  $0x0,0x4(%rbx)
   │
   │  ret = context->methods->alloc(context, size);
  5.98 │mov   0x10(%rbx),%rax
   │mov   %rdi,%rsi
   │mov   %rbx,%rdi
 35.08 │  → callq *(%rax)


The stack frame setup bit is the push ... bit.

At least on x86-64 unixoid systems, that overhead can be avoided in certain
circumstances! The simplest case is if the function doesn't do any function
calls of its own. If simple enough (i.e. no register spilling), the compiler
will just not set up a stack frame - nobody could need it.

The slightly more complicated case is that of a function that only does a
"tail call", i.e. the only function call is just before returning (there can
be multiple such paths though). If the function is simple enough, gcc/clang
will then not use the "call" instruction to call the function (which would
require a proper stack frame being set up), but instead just jump to the other
function. Which ends up reusing the current function's stack frame,
basically. When that called function returns using 'ret', it'll reuse the
location pushed onto the call stack by the caller of the "original" function,
and return to its caller. Having optimized away the need to maintain one stack
frame level, and one indirection when returning from the inner function (which
just would do its own ret).

For that to work, there obviously cannot be any instructions in the function
before calling the inner function. Which brings us back to the palloc example
from above.

As an experiment, if i change the code for palloc() to omit the if (ret == NULL)
check, the assembly (omitting source for brevity) from:

  61c9a0:   55  push   %rbp
  61c9a1:   48 89 e5mov%rsp,%rbp
  61c9a4:   41 54   push   %r12
  61c9a6:   49 89 fcmov%rdi,%r12
  61c9a9:   53  push   %rbx
  61c9aa:   48 8b 1d 2f f2 2a 00mov0x2af22f(%rip),%rbx# 
8cbbe0 
  61c9b1:   48 81 ff ff ff ff 3fcmp$0x3fff,%rdi
  61c9b8:   0f 87 9d 30 b3 ff   ja 14fa5b 
  61c9be:   c6 43 04 00 movb   $0x0,0x4(%rbx)
  61c9c2:   48 8b 43 10 mov0x10(%rbx),%rax
  61c9c6:   48 89 femov%rdi,%rsi
  61c9c9:   48 89 dfmov%rbx,%rdi
  61c9cc:   ff 10   callq  *(%rax)
  61c9ce:   48 85 c0test   %rax,%rax
  61c9d1:   0f 84 b9 30 b3 ff   je 14fa90 
  61c9d7:   5b  pop%rbx
  61c9d8:   41 5c   pop%r12
  61c9da:   5d  pop%rbp
  61c9db:   c3  retq

to

  61c8c0:   48 89 femov%rdi,%rsi
  61c8c3:   48 8b 3d 16 f3 2a 00mov0x2af316(%rip),%rdi# 
8cbbe0 
  61c8ca:   48 81 fe ff ff ff 3fcmp$0x3fff,%rsi
  61c8d1:   0f 87 c3 31 b3 ff   ja 14fa9a 
  61c8d7:   c6 47 04 00 movb   $0x0,0x4(%rdi)
  61c8db:   48 8b 47 10 mov0x10(%rdi),%rax
  61c8df:   ff 20   jmpq   *(%rax)

It's not hard to see why that would be faster, I think.


Of course, we cannot just omit that check. But I think this is an argument for
why it is not a great idea to have such a check in palloc() - it prevents the
use of the above optimization, and it adds a branch to a performance critical
function, though there already existing branches in aset.c etc that
specifically know about this case.

The code in palloc() does this check after context->methods->alloc() returns, which is exactly what prevents the tail call here.
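
A minimal sketch of where this leads, assuming the allocator itself does the
size validation and out-of-memory reporting (the flags argument and the helper
names follow the shape of the WIP memory-allocator patch posted in the nearby
slab-allocator thread, not a settled API):

void *
palloc(Size size)
{
    /* duplicates MemoryContextAlloc to avoid increased overhead */
    MemoryContext context = CurrentMemoryContext;

    context->isReset = false;

    /*
     * Size checking and failure reporting happen inside the allocator
     * (e.g. via MemoryContextCheckSize / MemoryContextAllocationFailure),
     * so the indirect call is the last thing in this function and the
     * compiler can emit it as a tail call (jmp) without a stack frame.
     */
    return context->methods->alloc(context, size, 0);
}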

Re: slab allocator performance issues

2021-07-19 Thread Andres Freund
Hi,

On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
> Sounds great! Thanks for investigating this and for the improvements.
> 
> It might be good to do some experiments to see how the changes affect memory
> consumption for practical workloads. I'm willing to spend soem time on that,
> if needed.

I've attached my changes. They're in a rough shape right now, but I
think good enough for an initial look.

Greetings,

Andres Freund
>From af4cd1f0b64cd52d7eab342493e3dfd6b0d8388e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 19 Jul 2021 12:55:51 -0700
Subject: [PATCH 1/2] WIP: optimize allocations by separating hot from cold
 paths.

---
 src/include/nodes/memnodes.h|   4 +-
 src/include/utils/memutils.h|  13 +
 src/backend/utils/mmgr/aset.c   | 537 ++--
 src/backend/utils/mmgr/generation.c |  22 +-
 src/backend/utils/mmgr/mcxt.c   | 179 +++---
 src/backend/utils/mmgr/slab.c   |  14 +-
 6 files changed, 354 insertions(+), 415 deletions(-)

diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index e6a757d6a07..8a42d2ff999 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -57,10 +57,10 @@ typedef void (*MemoryStatsPrintFunc) (MemoryContext context, void *passthru,
 
 typedef struct MemoryContextMethods
 {
-	void	   *(*alloc) (MemoryContext context, Size size);
+	void	   *(*alloc) (MemoryContext context, Size size, int flags);
 	/* call this free_p in case someone #define's free() */
 	void		(*free_p) (MemoryContext context, void *pointer);
-	void	   *(*realloc) (MemoryContext context, void *pointer, Size size);
+	void	   *(*realloc) (MemoryContext context, void *pointer, Size size, int flags);
 	void		(*reset) (MemoryContext context);
 	void		(*delete_context) (MemoryContext context);
 	Size		(*get_chunk_space) (MemoryContext context, void *pointer);
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ff872274d44..2f75b4cca46 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -147,6 +147,19 @@ extern void MemoryContextCreate(MemoryContext node,
 
 extern void HandleLogMemoryContextInterrupt(void);
 extern void ProcessLogMemoryContextInterrupt(void);
+extern void *MemoryContextAllocationFailure(MemoryContext context, Size size, int flags);
+
+extern void MemoryContextSizeFailure(MemoryContext context, Size size, int flags) pg_attribute_noreturn();
+
+static inline void
+MemoryContextCheckSize(MemoryContext context, Size size, int flags)
+{
+	if (unlikely(!AllocSizeIsValid(size)))
+	{
+		if (!(flags & MCXT_ALLOC_HUGE) || !AllocHugeSizeIsValid(size))
+			MemoryContextSizeFailure(context, size, flags);
+	}
+}
 
 /*
  * Memory-context-type-specific functions
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 77872e77bcd..00878354392 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -263,9 +263,9 @@ static AllocSetFreeList context_freelists[2] =
 /*
  * These functions implement the MemoryContext API for AllocSet contexts.
  */
-static void *AllocSetAlloc(MemoryContext context, Size size);
+static void *AllocSetAlloc(MemoryContext context, Size size, int flags);
 static void AllocSetFree(MemoryContext context, void *pointer);
-static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
+static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size, int flags);
 static void AllocSetReset(MemoryContext context);
 static void AllocSetDelete(MemoryContext context);
 static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
@@ -704,266 +704,10 @@ AllocSetDelete(MemoryContext context)
 	free(set);
 }
 
-/*
- * AllocSetAlloc
- *		Returns pointer to allocated memory of given size or NULL if
- *		request could not be completed; memory is added to the set.
- *
- * No request may exceed:
- *		MAXALIGN_DOWN(SIZE_MAX) - ALLOC_BLOCKHDRSZ - ALLOC_CHUNKHDRSZ
- * All callers use a much-lower limit.
- *
- * Note: when using valgrind, it doesn't matter how the returned allocation
- * is marked, as mcxt.c will set it to UNDEFINED.  In some paths we will
- * return space that is marked NOACCESS - AllocSetRealloc has to beware!
- */
-static void *
-AllocSetAlloc(MemoryContext context, Size size)
+static inline void *
+AllocSetAllocReturnChunk(AllocSet set, Size size, AllocChunk chunk, Size chunk_size)
 {
-	AllocSet	set = (AllocSet) context;
-	AllocBlock	block;
-	AllocChunk	chunk;
-	int			fidx;
-	Size		chunk_size;
-	Size		blksize;
-
-	AssertArg(AllocSetIsValid(set));
-
-	/*
-	 * If requested size exceeds maximum for chunks, allocate an entire block
-	 * for this request.
-	 */
-	if (size > set->allocChunkLimit)
-	{
-		chunk_size = MAXALIGN(size);
-		blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
-		block = (AllocBlock) malloc(blksize);
-		if (block == NULL)
-			return NULL;
-
-		context->mem_allocated += blksize;
-
-		block->aset = set;
-		block-

Re: slab allocator performance issues

2021-07-19 Thread Ranier Vilela
Em seg., 19 de jul. de 2021 às 17:56, Andres Freund 
escreveu:

> Hi,
>
> On 2021-07-18 19:23:31 +0200, Tomas Vondra wrote:
> > Sounds great! Thanks for investigating this and for the improvements.
> >
> > It might be good to do some experiments to see how the changes affect
> memory
> > consumption for practical workloads. I'm willing to spend soem time on
> that,
> > if needed.
>
> I've attached my changes. They're in a rough shape right now, but I
> think good enough for an initial look.
>
Hi Andres, I took a look.

Perhaps you would agree with me that in the vast majority of cases, malloc
will not fail.
So it makes more sense to test:
if (ret != NULL)
than
if (ret == NULL)

which might help branch prediction.
As a bonus, this change also makes it possible
to reduce the scope of some variables.

Example:
+static void * pg_noinline
+AllocSetAllocLarge(AllocSet set, Size size, int flags)
+{
+    AllocBlock  block;
+    Size        chunk_size;
+    Size        blksize;
+
+    /* check size, only allocation path where the limits could be hit */
+    MemoryContextCheckSize(&set->header, size, flags);
+
+    AssertArg(AllocSetIsValid(set));
+
+    chunk_size = MAXALIGN(size);
+    blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+    block = (AllocBlock) malloc(blksize);
+    if (block != NULL)
+    {
+        AllocChunk  chunk;
+
+        set->header.mem_allocated += blksize;
+
+        block->aset = set;
+        block->freeptr = block->endptr = ((char *) block) + blksize;
+
+        /*
+         * Stick the new block underneath the active allocation block, if any,
+         * so that we don't lose the use of the space remaining therein.
+         */
+        if (set->blocks != NULL)
+        {
+            block->prev = set->blocks;
+            block->next = set->blocks->next;
+            if (block->next)
+                block->next->prev = block;
+            set->blocks->next = block;
+        }
+        else
+        {
+            block->prev = NULL;
+            block->next = NULL;
+            set->blocks = block;
+        }
+
+        chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
+        chunk->size = chunk_size;
+
+        return AllocSetAllocReturnChunk(set, size, chunk, chunk_size);
+    }
+
+    return NULL;
+}

regards,
Ranier Vilela


Micro-optimizations to avoid some strlen calls.

2021-07-19 Thread Ranier Vilela
Hi,

There are some places where repeated strlen calls add unnecessary overhead.
This patch tries to fix that.

Pass check-world at linux ubuntu (20.04) 64 bits.

regards,
Ranier Vilela
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index c14ca27c5e..2036072990 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2443,11 +2443,11 @@ ChooseIndexColumnNames(List *indexElems)
 			if (lc2 == NULL)
 break;			/* found nonconflicting name */
 
-			sprintf(nbuf, "%d", i);
+			nlen = sprintf(nbuf, "%d", i);
 
 			/* Ensure generated names are shorter than NAMEDATALEN */
 			nlen = pg_mbcliplen(origname, strlen(origname),
-NAMEDATALEN - 1 - strlen(nbuf));
+NAMEDATALEN - 1 - nlen);
 			memcpy(buf, origname, nlen);
 			strcpy(buf + nlen, nbuf);
 			curname = buf;
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 8cc23ef7fb..4619f3887e 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -3014,6 +3014,8 @@ PerformRadiusTransaction(const char *server, const char *secret, const char *por
 	uint8		encryptedpassword[RADIUS_MAX_PASSWORD_LENGTH];
 	uint8	   *md5trailer;
 	int			packetlength;
+	size_t  secretlen;
+	size_t		passwdlen;
 	pgsocket	sock;
 
 #ifdef HAVE_IPV6
@@ -3077,15 +3079,17 @@ PerformRadiusTransaction(const char *server, const char *secret, const char *por
 	 * and then: e[i] = p[i] XOR MD5(secret + e[i-1]) for the following ones
 	 * (if necessary)
 	 */
-	encryptedpasswordlen = ((strlen(passwd) + RADIUS_VECTOR_LENGTH - 1) / RADIUS_VECTOR_LENGTH) * RADIUS_VECTOR_LENGTH;
-	cryptvector = palloc(strlen(secret) + RADIUS_VECTOR_LENGTH);
-	memcpy(cryptvector, secret, strlen(secret));
+	secretlen = strlen(secret);
+	passwdlen = strlen(passwd);
+	encryptedpasswordlen = ((passwdlen + RADIUS_VECTOR_LENGTH - 1) / RADIUS_VECTOR_LENGTH) * RADIUS_VECTOR_LENGTH;
+	cryptvector = palloc(secretlen + RADIUS_VECTOR_LENGTH);
+	memcpy(cryptvector, secret, secretlen);
 
 	/* for the first iteration, we use the Request Authenticator vector */
 	md5trailer = packet->vector;
 	for (i = 0; i < encryptedpasswordlen; i += RADIUS_VECTOR_LENGTH)
 	{
-		memcpy(cryptvector + strlen(secret), md5trailer, RADIUS_VECTOR_LENGTH);
+		memcpy(cryptvector + secretlen, md5trailer, RADIUS_VECTOR_LENGTH);
 
 		/*
 		 * .. and for subsequent iterations the result of the previous XOR
@@ -3093,7 +3097,7 @@ PerformRadiusTransaction(const char *server, const char *secret, const char *por
 		 */
 		md5trailer = encryptedpassword + i;
 
-		if (!pg_md5_binary(cryptvector, strlen(secret) + RADIUS_VECTOR_LENGTH, encryptedpassword + i))
+		if (!pg_md5_binary(cryptvector, secretlen + RADIUS_VECTOR_LENGTH, encryptedpassword + i))
 		{
 			ereport(LOG,
 	(errmsg("could not perform MD5 encryption of password")));
@@ -3104,7 +3108,7 @@ PerformRadiusTransaction(const char *server, const char *secret, const char *por
 
 		for (j = i; j < i + RADIUS_VECTOR_LENGTH; j++)
 		{
-			if (j < strlen(passwd))
+			if (j < passwdlen)
 encryptedpassword[j] = passwd[j] ^ encryptedpassword[j];
 			else
 encryptedpassword[j] = '\0' ^ encryptedpassword[j];
@@ -3286,7 +3290,7 @@ PerformRadiusTransaction(const char *server, const char *secret, const char *por
 		 * Verify the response authenticator, which is calculated as
 		 * MD5(Code+ID+Length+RequestAuthenticator+Attributes+Secret)
 		 */
-		cryptvector = palloc(packetlength + strlen(secret));
+		cryptvector = palloc(packetlength + secretlen);
 
 		memcpy(cryptvector, receivepacket, 4);	/* code+id+length */
 		memcpy(cryptvector + 4, packet->vector, RADIUS_VECTOR_LENGTH);	/* request
@@ -3295,10 +3299,10 @@ PerformRadiusTransaction(const char *server, const char *secret, const char *por
 		if (packetlength > RADIUS_HEADER_LENGTH)	/* there may be no
 	 * attributes at all */
 			memcpy(cryptvector + RADIUS_HEADER_LENGTH, receive_buffer + RADIUS_HEADER_LENGTH, packetlength - RADIUS_HEADER_LENGTH);
-		memcpy(cryptvector + packetlength, secret, strlen(secret));
+		memcpy(cryptvector + packetlength, secret, secretlen);
 
 		if (!pg_md5_binary(cryptvector,
-		   packetlength + strlen(secret),
+		   packetlength + secretlen,
 		   encryptedpassword))
 		{
 			ereport(LOG,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 122c2b05bd..7c8069cbe1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4277,10 +4277,11 @@ static void
 report_fork_failure_to_client(Port *port, int errnum)
 {
 	char		buffer[1000];
+	size_t		buflen;
 	int			rc;
 
 	/* Format the error message packet (always V2 protocol) */
-	snprintf(buffer, sizeof(buffer), "E%s%s\n",
+	buflen = snprintf(buffer, sizeof(buffer), "E%s%s\n",
 			 _("could not fork new process for connection: "),
 			 strerror(errnum));
 
@@ -4291,7 +4292,7 @@ report_fork_failure_to_client(Port *port, int errnum)
 	/* We'll retry after EINTR, but ignore all other failures 

Re: O_DIRECT on macOS

2021-07-19 Thread Thomas Munro
On Tue, Jul 20, 2021 at 2:13 AM Tom Lane  wrote:
> Hmm ... we used to have to avoid putting #if constructs in the arguments
> of macros (such as StaticAssertStmt).  Maybe that's not a thing anymore
> with C99, and in any case this whole stanza is fairly platform-specific
> so we may not run into a compiler that complains.  But my hindbrain wants
> to see this done with separate statements, eg
>
> #if defined(O_CLOEXEC)
> StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
>  "PG_O_DIRECT collides with O_CLOEXEC");
> #endif

Ok, done.

While I was here again, I couldn't resist trying to extend this to
Solaris, since it looked so easy.  I don't have access, but I tested
on Illumos by undefining O_DIRECT.  Thoughts?
From 3016ede1dfd972badac65954d6e908f77e3c134b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 19 Jul 2021 21:47:16 +
Subject: [PATCH 1/2] Support direct I/O on Solaris.

Extend the change done for macOS by commit 2dbe8905 to cover Solaris.
Note that this doesn't affect Illumos (it gained O_DIRECT).

Discussion: https://postgr.es/m/CA%2BhUKG%2BADiyyHe0cun2wfT%2BSVnFVqNYPxoO6J9zcZkVO7%2BNGig%40mail.gmail.com
---
 src/backend/storage/file/fd.c | 19 +++
 src/bin/pg_test_fsync/pg_test_fsync.c | 27 +--
 src/include/storage/fd.h  |  9 ++---
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5d5e8ae94e..78ac2caa8f 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1057,11 +1057,13 @@ BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	int			fd;
 
 tryAgain:
-#ifdef PG_O_DIRECT_USE_F_NOCACHE
+#if defined(PG_O_DIRECT_USE_F_NOCACHE) || \
+	defined(PG_O_DIRECT_USE_DIRECTIO_ON)
 
 	/*
 	 * The value we defined to stand in for O_DIRECT when simulating it with
-	 * F_NOCACHE had better not collide with any of the standard flags.
+	 * an extra system call had better not collide with any of the standard
+	 * flags.
 	 */
 	StaticAssertStmt((PG_O_DIRECT &
 	  (O_APPEND |
@@ -1089,10 +1091,19 @@ tryAgain:
 
 	if (fd >= 0)
 	{
-#ifdef PG_O_DIRECT_USE_F_NOCACHE
+#if defined(PG_O_DIRECT_USE_F_NOCACHE) || \
+	defined(PG_O_DIRECT_USE_DIRECTIO_ON)
 		if (fileFlags & PG_O_DIRECT)
 		{
-			if (fcntl(fd, F_NOCACHE, 1) < 0)
+			int			rc;
+
+#if defined(PG_O_DIRECT_USE_F_NOCACHE)
+			rc = fcntl(fd, F_NOCACHE, 1);
+#endif
+#if defined(PG_O_DIRECT_USE_DIRECTIO_ON)
+			rc = directio(fd, DIRECTIO_ON);
+#endif
+			if (rc < 0)
 			{
 int			save_errno = errno;
 
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index fef31844fa..f5e3868c10 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -221,6 +221,8 @@ handle_args(int argc, char *argv[])
 	printf(_("O_DIRECT supported on this platform for open_datasync and open_sync.\n"));
 #elif defined(F_NOCACHE)
 	printf(_("F_NOCACHE supported on this platform for open_datasync and open_sync.\n"));
+#elif defined(DIRECTIO_ON)
+	printf(_("DIRECTIO_ON supported on this platform for open_datasync and open_sync.\n"));
 #else
 	printf(_("Direct I/O is not supported on this platform.\n"));
 #endif
@@ -271,14 +273,27 @@ open_direct(const char *path, int flags, mode_t mode)
 
 	fd = open(path, flags, mode);
 
-#if !defined(O_DIRECT) && defined(F_NOCACHE)
-	if (fd >= 0 && fcntl(fd, F_NOCACHE, 1) < 0)
+#if !defined(O_DIRECT) && \
+	(defined(F_NOCACHE) || \
+	 defined(DIRECTIO_ON))
+	if (fd >= 0)
 	{
-		int			save_errno = errno;
+		int			rc;
 
-		close(fd);
-		errno = save_errno;
-		return -1;
+#if defined(F_NOCACHE)
+		rc = fcntl(fd, F_NOCACHE, 1);
+#endif
+#if defined(DIRECTIO_ON)
+		rc = directio(fd, DIRECTIO_ON);
+#endif
+		if (rc < 0)
+		{
+			int			save_errno = errno;
+
+			close(fd);
+			errno = save_errno;
+			return -1;
+		}
 	}
 #endif
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2d843eb992..b04988f818 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -82,15 +82,18 @@ extern int	max_safe_fds;
 /*
  * O_DIRECT is not standard, but almost every Unix has it.  We translate it
  * to the appropriate Windows flag in src/port/open.c.  We simulate it with
- * fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper.  We use the name
- * PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * extra calls on macOS and Solaris inside fd.c's open() wrapper.  We use the
+ * name PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a
+ * good idea on a Unix).
  */
 #if defined(O_DIRECT)
 #define		PG_O_DIRECT O_DIRECT
 #elif defined(F_NOCACHE)
 #define		PG_O_DIRECT 0x8000
 #define		PG_O_DIRECT_USE_F_NOCACHE
+#elif defined(DIRECTIO_ON)
+#define		PG_O_DIRECT 0x8000
+#define		PG_O_DIRECT_USE_DIRECTIO_ON
 #else
 #define		PG_O_DIRECT 0
 #endif
-- 
2.30.2

From 7c99892354500f0

Re: OpenSSL 3.0.0 compatibility

2021-07-19 Thread Daniel Gustafsson
> On 3 Jul 2021, at 17:00, Peter Eisentraut  
> wrote:
> 
> On 12.03.21 08:51, Peter Eisentraut wrote:
>> On 11.03.21 11:41, Daniel Gustafsson wrote:
>>> .. and apply the padding changes as proposed in a patch upthread like this 
>>> (these
>>> work for all OpenSSL versions I've tested, and I'm rather more puzzled as to
>>> why we got away with not having them in the past):
>> Yes, before proceeding with this, we should probably understand why these 
>> changes are effective and why they haven't been required in the past.
> 
> I took another look at this with openssl-3.0.0-beta1.  The issue with the 
> garbled padding output is still there.  What I found is that pgcrypto has 
> been using the encryption and decryption API slightly incorrectly.  You are 
> supposed to call EVP_DecryptUpdate() followed by EVP_DecryptFinal_ex() (and 
> similarly for encryption), but pgcrypto doesn't do the second one.  (To be 
> fair, this API was added to OpenSSL after pgcrypto first appeared.)  The 
> "final" functions take care of the padding.  We have been getting away with 
> it like this because we do the padding manually elsewhere.

That does make a lot of sense; following the code and API docs, I concur with
your findings.

> ..apparently, something has changed in OpenSSL 3.0.0 in that if padding is 
> enabled in OpenSSL, EVP_DecryptUpdate() doesn't flush the last normal block 
> until the "final" function is called.

Skimming the code I wasn't able to find anything off the cuff, but there has
been work done to postpone/move padding for constant-time operations, so maybe
it's related to that.

> Your proposed fix was to explicitly disable padding, and then this problem 
> goes away.  You can still call the "final" functions, but they won't do 
> anything, except check that there is no more data left, but we already check 
> that elsewhere.

In earlier versions, _Final also closed the context to ensure nothing can leak
from there, but I'm not sure if 1.0.1 counts as earlier.  Still, calling it
from the finish function seems like a good idea.

> Another option is to throw out our own padding code and let OpenSSL handle 
> it.  See attached demo patch.  But that breaks the non-OpenSSL code in 
> internal.c, so we'd have to re-add the padding code there.  So this isn't 
> quite as straightforward an option.

While the PX cleanup is nice, since we can't get rid of all the padding we'd
simply shift the complexity to another place, and I'd be wary of introducing
bugs into quite stable code.

> (At least, with the patch we can confirm that the OpenSSL padding works 
> consistently with our own implementation.)

+1

> So I think your proposed patch is sound and a good short-term and low-risk 
> solution

The attached 0001 disables the padding.  I've tested this with OpenSSL 1.0.1,
1.0.2, 1.1.1 and Git HEAD at e278127cbfa2709d.
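
For readers following along, the call pattern under discussion looks roughly
like this (illustrative only, not the literal pgcrypto code; key/iv and the
in/out buffers are assumed to be set up elsewhere, and error handling is
omitted):

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

    EVP_DecryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv);
    EVP_CIPHER_CTX_set_padding(ctx, 0);      /* keep pgcrypto's own padding in charge */
    EVP_DecryptUpdate(ctx, out, &outlen, in, inlen);
    EVP_DecryptFinal_ex(ctx, out + outlen, &finallen);  /* emits nothing with padding off */
    EVP_CIPHER_CTX_free(ctx);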

Another aspect of OpenSSL 3 compatibility is that of legacy cipher support, and
as we concluded upthread it's best to leave that to the user to define in
openssl.cnf.  The attached 0002 adds alternative output files for 3.0.0
installations without the legacy provider loaded, as well as adds a note in the
pgcrypto docs to enable it in case DES is needed.  It does annoy me a bit that
we don't load the openssl.cnf file for 1.0.1 if we start mentioning it in the
docs for other versions, but it's probably not worth the effort to fix it given
the lack of complaints so far (it needs a call to OPENSSL_config(NULL) guarded
by HAVE_ macros for 1.0.1).

--
Daniel Gustafsson   https://vmware.com/



0002-Add-alternative-output-for-OpenSSL-3-without-legacy-.patch
Description: Binary data


0001-Disable-OpenSSL-EVP-digest-padding-in-pgcrypto.patch
Description: Binary data


Transactions and indexes

2021-07-19 Thread Chris Cleveland
Noob here.

I'm trying to implement a new type of index access method. I don't think I
can use the built-in storage manager and buffer manager efficiently because
files with 8k blocks aren't going to work. I really need to implement a
log-structured file.

I'm confused on how to handle transactions and visibility. I don't see
anything in the index access method functions (am*()) that tells me when to
commit or roll back new index entries, or which transaction we're currently
in so I can know whether recently-added index entries should be visible to
the current scan. I'm guessing that all that magically happens in the
storage and buffer managers.

So... how do I handle this? Is there some way for me to implement my own
storage manager that manages visibility?

I'd be grateful for any guidance.


-- 
Chris Cleveland
312-339-2677 mobile


Re: O_DIRECT on macOS

2021-07-19 Thread Tom Lane
Thomas Munro  writes:
> While I was here again, I couldn't resist trying to extend this to
> Solaris, since it looked so easy.  I don't have access, but I tested
> on Illumos by undefining O_DIRECT.  Thoughts?

I can try that on the gcc farm in a bit.

regards, tom lane




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-19 Thread Andres Freund
Hi,

On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
> BTW is the implementation of the radix tree approach available
> somewhere? If so I'd like to experiment with that too.
>
> >
> > I have toyed with implementing adaptively large radix nodes like
> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
> > gotten it quite working.
>
> That seems promising approach.

I've since implemented some, but not all of the ideas of that paper
(adaptive node sizes, but not the tree compression pieces).
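
(For anyone not familiar with the paper: the adaptive node size idea is
roughly that inner nodes come in several capacities and get promoted to the
next size class as they fill up. A sketch with invented names, not the code
attached here:)

    /* smallest inner node: up to 4 children, key bytes kept sorted */
    typedef struct RTNode4
    {
        uint8    count;
        uint8    chunks[4];        /* one key byte per child */
        void    *children[4];
    } RTNode4;

    /* intermediate size classes (16 and 48 children) elided */

    /* largest inner node: children indexed directly by the key byte */
    typedef struct RTNode256
    {
        uint16   count;
        void    *children[256];
    } RTNode256;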

E.g. for

select prepare(
100, -- max block
20, -- # of dead tuples per page
10, -- dead tuples interval within a page
1 -- page interval
);
          attach     size   shuffled   ordered
array      69 ms   120 MB    84.87 s    8.66 s
intset    173 ms    65 MB    68.82 s   11.75 s
rtbm      201 ms    67 MB    11.54 s    1.35 s
tbm       232 ms   100 MB     8.33 s    1.26 s
vtbm      162 ms    58 MB    10.01 s    1.22 s
radix      88 ms    42 MB    11.49 s    1.67 s

and for
select prepare(
100, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1 -- page interval
);

          attach     size   shuffled   ordered
array      24 ms    60 MB     3.74 s    1.02 s
intset     97 ms    49 MB     3.14 s    0.75 s
rtbm      138 ms    36 MB     0.41 s    0.14 s
tbm       198 ms   101 MB     0.41 s    0.14 s
vtbm      118 ms    27 MB     0.39 s    0.12 s
radix      33 ms    10 MB     0.28 s    0.10 s

(this is an almost unfairly good case for radix)

Running out of time to format the results of the other test cases before
I have to run, unfortunately. radix uses 42MB in both test cases 3 and
4.


The radix tree code isn't good right now. A ridiculous amount of
duplication etc. The naming clearly shows its origins from a buffer
mapping radix tree...


Currently, in a bunch of the cases 20% of the time is spent in
radix_reaped(). If I move that into radix.c and allow bfm_lookup() to be
inlined, I get reduced overhead. rtbm for example essentially already
does that, because it does the splitting of the ItemPointer in rtbm.c.


I've attached my current patches against your tree.

Greetings,

Andres Freund
>From 5dfbe02000aefd3e085bdea0ec809247e1fb71b3 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 19 Jul 2021 16:03:28 -0700
Subject: [PATCH 1/3] Fix build warnings.

---
 bdbench/bdbench.c |  2 +-
 bdbench/rtbm.c|  4 ++--
 bdbench/vtbm.c| 10 +++---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index 800567d..1df5c53 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -655,7 +655,7 @@ _bench(LVTestType *lvtt)
 	fclose(f);
 #endif
 
-	elog(NOTICE, "\"%s\": dead tuples %lu, index tuples %lu, mathed %d, mem %zu",
+	elog(NOTICE, "\"%s\": dead tuples %lu, index tuples %lu, matched %d, mem %zu",
 		 lvtt->name,
 		 lvtt->dtinfo.nitems,
 		 IndexTids_cache->dtinfo.nitems,
diff --git a/bdbench/rtbm.c b/bdbench/rtbm.c
index 025d2a9..eac277a 100644
--- a/bdbench/rtbm.c
+++ b/bdbench/rtbm.c
@@ -449,9 +449,9 @@ dump_entry(RTbm *rtbm, DtEntry *entry)
 		}
 	}
 
-	elog(NOTICE, "%s (offset %d len %d)",
+	elog(NOTICE, "%s (offset %llu len %d)",
 		 str.data,
-		 entry->offset, len);
+		 (long long unsigned) entry->offset, len);
 }
 
 static int
diff --git a/bdbench/vtbm.c b/bdbench/vtbm.c
index c59d6e1..63320f5 100644
--- a/bdbench/vtbm.c
+++ b/bdbench/vtbm.c
@@ -72,7 +72,8 @@ vtbm_add_tuples(VTbm *vtbm, const BlockNumber blkno,
 	DtEntry *entry;
 	bool	found;
 	char	oldstatus;
-	int wordnum, bitnum;
+	int wordnum = 0;
+	int bitnum;
 
 	entry = dttable_insert(vtbm->dttable, blkno, &found);
 	Assert(!found);
@@ -216,8 +217,10 @@ vtbm_dump(VTbm *vtbm)
 		 vtbm->bitmap_size, vtbm->npages);
 	for (int i = 0; i < vtbm->npages; i++)
 	{
+		char *bitmap;
+
 		entry = entries[i];
-		char *bitmap = &(vtbm->bitmap[entry->offset]);
+		bitmap = &(vtbm->bitmap[entry->offset]);
 
 		appendStringInfo(&str, "[%5d] : ", entry->blkno);
 		for (int off = 0; off < entry->len; off++)
@@ -239,6 +242,7 @@ vtbm_dump_blk(VTbm *vtbm, BlockNumber blkno)
 {
 	DtEntry *entry;
 	StringInfoData str;
+	char *bitmap;
 
 	initStringInfo(&str);
 
@@ -252,7 +256,7 @@ vtbm_dump_blk(VTbm *vtbm, BlockNumber blkno)
 		return;
 	}
 
-	char *bitmap = &(vtbm->bitmap[entry->offset]);
+	bitmap = &(vtbm->bitmap[entry->offset]);
 
 	appendStringInfo(&str, "[%5d] : ", entry->blkno);
 	for (int off = 1; off < entry->len; off++)
-- 
2.32.0.rc2

>From 5ba05ffad4a9605a6fb5a24fe625542aee226ec8 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 19 Jul 2021 16:04:55 -0700
Subject: [PATCH 2/3] Add radix tree.

---
 bdbench/radix.c | 3088 +++
 bdbench/radix.h |   76 ++
 2 files changed, 3164 insertions(+)
 create mode 100644 bdbench/radix.c
 create mode 100644 bdbench/radix.h

diff --git a/bdbench/radix.c b/bdbench/radix.c
new file mode 100644
index 000..c7061f0
--- /dev/null
+++ b/bdbench/radix.c
@@ -0,0 

Re: Transactions and indexes

2021-07-19 Thread Peter Geoghegan
On Mon, Jul 19, 2021 at 4:31 PM Chris Cleveland
 wrote:
> I'm confused on how to handle transactions and visibility.

In Postgres, indexes are not considered to be part of the logical
database. They're just data structures that point to TIDs in the
table. To an index, each TID is just another object -- it doesn't
possess any built-in idea about MVCC.

In practice the indexes may be able to surmise certain things about
MVCC and versioning, as an optimization -- but that is all speculative
and relies on cooperation from the table AM side. Also, the
implementation of unique indexes knows more than zero about versions,
since that's just necessary. These two cases may or may not be
considered exceptions to the general rule. I suppose that it's a
matter of perspective.
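
To make the general rule concrete: an index AM handler only wires up
callbacks, and none of them take a transaction id or a snapshot (a sketch
with hypothetical myindex_* functions, not code from any real AM):

    Datum
    myindex_handler(PG_FUNCTION_ARGS)
    {
        IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

        amroutine->ambuild = myindex_build;
        amroutine->aminsert = myindex_insert;          /* receives a heap TID, no xid */
        amroutine->ambulkdelete = myindex_bulkdelete;  /* VACUUM says which TIDs died */
        amroutine->amvacuumcleanup = myindex_vacuumcleanup;
        amroutine->ambeginscan = myindex_beginscan;
        amroutine->amgettuple = myindex_gettuple;      /* returns TIDs; visibility checked above */
        amroutine->amendscan = myindex_endscan;

        PG_RETURN_POINTER(amroutine);
    }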

> So... how do I handle this? Is there some way for me to implement my own 
> storage manager that manages visibility?

This is the responsibility of a table AM, not any one index AM. In
general we assume that each table AM implements something very much
like heapam's VACUUM implementation. Index AMs may also have
opportunistic cleanup of their own, as an optimization (actually this
is what I was referring to).

Theoretically index AMs and table AMs are orthogonal things. How true
that will be in a world with more than one mature table AM remains to
be seen.

-- 
Peter Geoghegan




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-19 Thread Andres Freund
Hi,

On 2021-07-19 16:49:15 -0700, Andres Freund wrote:
> E.g. for
> 
> select prepare(
> 100, -- max block
> 20, -- # of dead tuples per page
> 10, -- dead tuples interval within a page
> 1 -- page interval
> );
>           attach     size   shuffled   ordered
> array      69 ms   120 MB    84.87 s    8.66 s
> intset    173 ms    65 MB    68.82 s   11.75 s
> rtbm      201 ms    67 MB    11.54 s    1.35 s
> tbm       232 ms   100 MB     8.33 s    1.26 s
> vtbm      162 ms    58 MB    10.01 s    1.22 s
> radix      88 ms    42 MB    11.49 s    1.67 s
> 
> and for
> select prepare(
> 100, -- max block
> 10, -- # of dead tuples per page
> 1, -- dead tuples interval within a page
> 1 -- page interval
> );
> 
>           attach     size   shuffled   ordered
> array      24 ms    60 MB     3.74 s    1.02 s
> intset     97 ms    49 MB     3.14 s    0.75 s
> rtbm      138 ms    36 MB     0.41 s    0.14 s
> tbm       198 ms   101 MB     0.41 s    0.14 s
> vtbm      118 ms    27 MB     0.39 s    0.12 s
> radix      33 ms    10 MB     0.28 s    0.10 s

Oh, I forgot: The performance numbers are with the fixes in
https://www.postgresql.org/message-id/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
applied.

Greetings,

Andres Freund




Re: Rename of triggers for partitioned tables

2021-07-19 Thread Alvaro Herrera
I found this coding too convoluted, so I rewrote it in a different way.
Your tests pass with this, but I admit I haven't double-checked them yet;
I'll do that next.

I don't think we need to give a NOTICE when the trigger name does not
match; it doesn't really matter that the trigger was named differently
before the command, does it?

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 6d4b7ee92a..da6320f6a1 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -71,6 +71,10 @@ int			SessionReplicationRole = SESSION_REPLICATION_ROLE_ORIGIN;
 static int	MyTriggerDepth = 0;
 
 /* Local function prototypes */
+static void renametrig_internal(Relation tgrel, Relation targetrel,
+HeapTuple trigtup, const char *newname);
+static void renametrig_partition(Relation tgrel, Oid partitionId,
+ Oid parentTriggerOid, const char *newname);
 static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 			   EPQState *epqstate,
@@ -1441,6 +1445,14 @@ renametrig(RenameStmt *stmt)
 	/* Have lock already, so just need to build relcache entry. */
 	targetrel = relation_open(relid, NoLock);
 
+	/*
+	 * On partitioned tables, this operation recurses to partitions, unless
+	 * caller requested not to.  Lock all tables upfront, if needed.
+	 */
+	if (stmt->relation->inh &&
+		targetrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		(void) find_all_inheritors(relid, AccessExclusiveLock, NULL);
+
 	/*
 	 * Scan pg_trigger twice for existing triggers on relation.  We do this in
 	 * order to ensure a trigger does not exist with newname (The unique index
@@ -1489,27 +1501,25 @@ renametrig(RenameStmt *stmt)
 	{
 		Form_pg_trigger trigform;
 
-		/*
-		 * Update pg_trigger tuple with new tgname.
-		 */
-		tuple = heap_copytuple(tuple);	/* need a modifiable copy */
 		trigform = (Form_pg_trigger) GETSTRUCT(tuple);
 		tgoid = trigform->oid;
 
-		namestrcpy(&trigform->tgname,
-   stmt->newname);
+		/* Rename the trigger on this relation ... */
+		renametrig_internal(tgrel, targetrel, tuple, stmt->newname);
 
-		CatalogTupleUpdate(tgrel, &tuple->t_self, tuple);
+		/* ... and if it is partitioned, recurse to its partitions */
+		if (stmt->relation->inh &&
+			targetrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		{
+			PartitionDesc partdesc = RelationGetPartitionDesc(targetrel, true);
 
-		InvokeObjectPostAlterHook(TriggerRelationId,
-  tgoid, 0);
+			for (int i = 0; i < partdesc->nparts; i++)
+			{
+Oid		partitionId = partdesc->oids[i];
 
-		/*
-		 * Invalidate relation's relcache entry so that other backends (and
-		 * this one too!) are sent SI message to make them rebuild relcache
-		 * entries.  (Ideally this should happen automatically...)
-		 */
-		CacheInvalidateRelcache(targetrel);
+renametrig_partition(tgrel, partitionId, trigform->oid, stmt->newname);
+			}
+		}
 	}
 	else
 	{
@@ -1533,6 +1543,92 @@ renametrig(RenameStmt *stmt)
 	return address;
 }
 
+/*
+ * Subroutine for renametrig -- perform the actual work of renaming one
+ * trigger on one table.
+ */
+static void
+renametrig_internal(Relation tgrel, Relation targetrel, HeapTuple trigtup,
+	const char *newname)
+{
+	HeapTuple	tuple;
+	Form_pg_trigger newtgform;
+
+	/*
+	 * Update pg_trigger tuple with new tgname.
+	 */
+	tuple = heap_copytuple(trigtup);	/* need a modifiable copy */
+	newtgform = (Form_pg_trigger) GETSTRUCT(tuple);
+
+	namestrcpy(&newtgform->tgname, newname);
+
+	CatalogTupleUpdate(tgrel, &tuple->t_self, tuple);
+
+	InvokeObjectPostAlterHook(TriggerRelationId, newtgform->oid, 0);
+
+	/*
+	 * Invalidate relation's relcache entry so that other backends (and
+	 * this one too!) are sent SI message to make them rebuild relcache
+	 * entries.  (Ideally this should happen automatically...)
+	 */
+	CacheInvalidateRelcache(targetrel);
+}
+
+/*
+ * Subroutine for renametrig -- Helper for recursing to partitions when
+ * renaming triggers on a partitioned table.
+ */
+static void
+renametrig_partition(Relation tgrel, Oid partitionId, Oid parentTriggerOid,
+	 const char *newname)
+{
+	SysScanDesc	tgscan;
+	ScanKeyData	key;
+	HeapTuple	tuple;
+	int		found = 0 PG_USED_FOR_ASSERTS_ONLY;
+
+	/*
+	 * Given a relation and the OID of a trigger on parent relation, find the
+	 * corresponding trigger in the child and rename that trigger to the given
+	 * name.
+	 */
+	ScanKeyInit(&key,
+Anum_pg_trigger_tgrelid,
+BTEqualStrategyNumber, F_OIDEQ,
+ObjectIdGetDatum(partitionId));
+	tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
+NULL, 1, &key);
+	while (HeapTupleIsValid(tuple = systable_getnext(tgscan)))
+	{
+		Form_pg_trigger	tgform = (Form_pg_trigger) GETSTRUCT(tuple);
+		Relation		partitionRel;
+
+		if (tgform->tgparentid != parentTriggerOid)
+			continue;	/* not our trigger */
+
+		Assert(f

Re: row filtering for logical replication

2021-07-19 Thread Greg Nancarrow
On Mon, Jul 19, 2021 at 11:32 PM Tomas Vondra
 wrote:
>
> I have a feeling it's getting overly complicated, to the extent that
> it'll be hard to explain to users and reason about. I don't think
> there's a "perfect" solution for cases when the filter expression gives
> different answers for old/new row - it'll always be surprising for some
> users :-(
>
> So maybe the best thing is to stick to the simple approach already used
> e.g. by pglogical, which simply uses the new row when available (insert,
> update) and the old one for deletes.
>
> I think that behaves more or less sensibly and it's easy to explain.
>
> All the other things (e.g. turning UPDATE to INSERT, advanced conflict
> resolution etc.) will require a lot of other stuff, and I see them as
> improvements of this simple approach.
>

+1
My thoughts on this are very similar.


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Rename of triggers for partitioned tables

2021-07-19 Thread Arne Roland
Hi!


From: Alvaro Herrera 
Sent: Tuesday, July 20, 2021 02:03
To: Arne Roland
Cc: vignesh C; Zhihong Yu; Pg Hackers
Subject: Re: Rename of triggers for partitioned tables

I found this coding too convoluted, so I rewrote it in a different way.
You tests pass with this, but I admit I haven't double-checked them yet;
I'll do that next.

Is your patch based on master? It doesn't apply at my end.

I don't think we need to give a NOTICE when the trigger name does not
match; it doesn't really matter that the trigger was named differently
before the command, does it?

I'd expect the command
ALTER TRIGGER name ON table_name RENAME TO new_name;
to rename a trigger named "name". We are referring to the trigger via its name
after all. If a child is named differently, we break that assumption. I
think informing the user about that is very valuable.

Regards
Arne



Re: O_DIRECT on macOS

2021-07-19 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> While I was here again, I couldn't resist trying to extend this to
>> Solaris, since it looked so easy.  I don't have access, but I tested
>> on Illumos by undefining O_DIRECT.  Thoughts?

> I can try that on the gcc farm in a bit.

Hmm, it compiles cleanly, but something seems drastically wrong,
because performance is just awful.  On the other hand, I don't
know what sort of storage is underlying this instance, so maybe
that's to be expected?  If I set fsync = off, the speed seems
comparable to what wrasse reports, but with fsync on it's like

test tablespace   ... ok87990 ms
parallel group (20 tests, in groups of 1):  boolean char name varchar text int2 
int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn 
regproc
 boolean  ... ok 3229 ms
 char ... ok 2758 ms
 name ... ok 2229 ms
 varchar  ... ok 7373 ms
 text ... ok  722 ms
 int2 ... ok  342 ms
 int4 ... ok 1303 ms
 int8 ... ok 1095 ms
 oid  ... ok 1086 ms
 float4   ... ok 6360 ms
 float8   ... ok 5224 ms
 bit  ... ok 6254 ms
 numeric  ... ok44304 ms
 txid ... ok  377 ms
 uuid ... ok 3946 ms
 enum ... ok33189 ms
 money... ok  622 ms
 rangetypes   ... ok17301 ms
 pg_lsn   ... ok  798 ms
 regproc  ... ok  145 ms

(I stopped running it at that point...)

Also, the results of pg_test_fsync seem wrong; it refuses to run
tests for the cases we're interested in:

$ pg_test_fsync 
5 seconds per test
DIRECTIO_ON supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  n/a*
fdatasync 8.324 ops/sec  120139 usecs/op
fsync 0.906 ops/sec  1103936 usecs/op
fsync_writethrough  n/a
open_sync  n/a*
* This file system and its mount options do not support direct
  I/O, e.g. ext4 in journaled mode.

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  n/a*
fdatasync 7.329 ops/sec  136449 usecs/op
fsync 0.788 ops/sec  1269258 usecs/op
fsync_writethrough  n/a
open_sync  n/a*
* This file system and its mount options do not support direct
  I/O, e.g. ext4 in journaled mode.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
 1 * 16kB open_sync write  n/a*
 2 *  8kB open_sync writes n/a*
 4 *  4kB open_sync writes n/a*
 8 *  2kB open_sync writes n/a*
16 *  1kB open_sync writes n/a*

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close  16.388 ops/sec   61020 usecs/op
write, close, fsync   9.084 ops/sec  110082 usecs/op

Non-sync'ed 8kB writes:
write 39855.686 ops/sec  25 usecs/op



regards, tom lane




RE: row filtering for logical replication

2021-07-19 Thread houzj.f...@fujitsu.com
Hi,

I am interested in this feature and took a quick look at the patch.
Here are a few comments.

(1)
+   appendStringInfo(&cmd, "%s", q);

We'd better use appendStringInfoString(&cmd, q);


(2)
+   whereclause = transformWhereClause(pstate,
+  
copyObject(pri->whereClause),
+  
EXPR_KIND_PUBLICATION_WHERE,
+  
"PUBLICATION");
+
+   /* Fix up collation information */
+   assign_expr_collations(pstate, whereclause);

Is it better to invoke eval_const_expressions or canonicalize_qual here to
simplify the expression ?


(3)
+   appendPQExpBuffer(&buf,
+ ", 
pg_get_expr(pr.prqual, c.oid)");
+   else
+   appendPQExpBuffer(&buf,
+ ", NULL");

we'd better use appendPQExpBufferStr instead of appendPQExpBuffer here.

(4)
nodeTag(expr) == T_FuncCall)

It might look clearer to use IsA(expr, FuncCall) here.

Best regards,
Houzj


Re: Rename of triggers for partitioned tables

2021-07-19 Thread Arne Roland
> We are referring to the trigger via its name after all. If a child is named
> differently, we break that assumption. I think informing the user about
> that is very valuable.


To elaborate on that:

While we do similar things for things like SET SCHEMA, here it has functional
relevance.


ALTER TRIGGER a5 ON table_name RENAME TO a9;


suggests that the trigger will be fired later from now on, which obviously might
not be the case if one of the child triggers has had a different name.




I like your naming suggestions. I'm looking forward to testing this patch once it
applies at my end!


Regards
Arne



Re: Skipping logical replication transactions on subscriber side

2021-07-19 Thread Masahiko Sawada
On Mon, Jul 19, 2021 at 5:47 PM Amit Kapila  wrote:
>
> On Mon, Jul 19, 2021 at 12:10 PM Masahiko Sawada  
> wrote:
> >
> > On Sat, Jul 17, 2021 at 12:02 AM Masahiko Sawada  
> > wrote:
> > >
> > > 1. How to clear apply worker errors. IIUC we've discussed that once
> > > the apply worker skipped the transaction we leave the error entry
> > > itself but clear its fields except for some fields such as failure
> > > counts. But given that the stats messages could be lost, how can we
> > > ensure to clear those error details? For table sync workers’ error, we
> > > can have autovacuum workers periodically check entires of
> > > pg_subscription_rel and clear the error entry if the table sync worker
> > > completes table sync (i.g., checking if srsubstate = ‘r’). But there
> > > is no such information for the apply workers and subscriptions. In
> > > addition to sending the message clearing the error details just after
> > > skipping the transaction, I thought that we can have apply workers
> > > periodically send the message clearing the error details but it seems
> > > not good.
> >
> > I think that the motivation behind the idea of leaving error entries
> > and clearing theirs some fields is that users can check if the error
> > is successfully resolved and the worker is working fine. But we can
> > check it also in another way, for example, checking
> > pg_stat_subscription view. So is it worth considering leaving the
> > apply worker errors as they are?
> >
>
> I think so. Basically, we will send the clear message after skipping
> the xact, but I think it is fine if that message is lost. At worst, it
> will be displayed as the last error details. If there is another error
> it will be overwritten or probably we should have a function *_reset()
> which allows the user to reset a particular subscription's error info.

That makes sense. I'll incorporate this idea in the next version patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Skipping logical replication transactions on subscriber side

2021-07-19 Thread Masahiko Sawada
On Mon, Jul 19, 2021 at 8:38 PM houzj.f...@fujitsu.com
 wrote:
>
> On July 19, 2021 2:40 PM Masahiko Sawada  wrote:
> > I've attached the updated version patch that incorporated all comments
> > I got so far except for the clearing error details part I mentioned
> > above. After getting a consensus on those parts, I'll incorporate the
> > idea into the patches.
>
> Hi Sawada-san,
>
> I am interested in this feature.
> After having a look at the patch, I have a few questions about it.

Thank you for having a look at the patches!

>
> 1) In 0002 patch, it introduces a new view called pg_stat_subscription_errors.
>Since it won't be cleaned automatically after we resolve the conflict, do 
> we
>need a reset function to clean the statistics in it ? Maybe something
>similar to pg_stat_reset_replication_slot which clean the
>pg_stat_replication_slots.

Agreed. As Amit also mentioned, providing a reset function to clean
the statistics seems a good idea. If the message clearing the stats
that is sent after skipping the transaction gets lost, the user is
able to reset those stats manually.

>
> 2) For 0003 patch, When I am faced with a conflict, I set skip_xid = xxx, and
>then I resolve the conflict. If I reset skip_xid after resolving the
>conflict, will the change(which cause the conflict before) be applied 
> again ?

The apply worker checks skip_xid when it reads the subscription.
Therefore, if you reset skip_xid before the apply worker restarts and
skips the transaction, the change is applied. But if you reset
skip_xid after the apply worker has skipped the transaction, the change is
already skipped and resetting skip_xid has no effect.

>
> 3) For 0003 patch, if user set skip_xid to a wrong xid which have not been
>assigned, and then will the change be skipped when the xid is assigned in
>the future even if it doesn't cause any conflicts ?

Yes. Currently, setting a correct xid is the user's responsibility. I
think it would be better to disable it, or to emit a WARNING/ERROR when the
user mistakenly sets a wrong xid, if we can find a convenient way to
detect that.

>
> Besides, It might be better to add some description of patch in each patch's
> commit message which will make it easier for new reviewers to follow.

I'll add commit messages in the next version patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Rename of triggers for partitioned tables

2021-07-19 Thread Alvaro Herrera
On 2021-Jul-20, Arne Roland wrote:

> Is your patch based on master? It doesn't apply at my end.

It does ... master being dd498998a3 here,

$ patch -p1 < /tmp/renametrig-8.patch 
patching file src/backend/commands/trigger.c
patching file src/backend/parser/gram.y
patching file src/test/regress/expected/triggers.out
patching file src/test/regress/sql/triggers.sql

applies fine.

>> I don't think we need to give a NOTICE when the trigger name does not
>> match; it doesn't really matter that the trigger was named differently
>> before the command, does it?

> I'd expect the command
> ALTER TRIGGER name ON table_name RENAME TO new_name;
> to rename a trigger named "name". We are referring to the trigger via its name
> after all. If a child is named differently, we break that assumption. I
> think informing the user about that is very valuable.

Well, it does rename a trigger named 'name' on the table 'table_name',
as well as all its descendant triggers.  I guess I am surprised that
anybody would rename a descendant trigger in the first place.  I'm not
wedded to the decision of removing the NOTICE, though  ... are there any
other votes for that, anyone?

Thanks

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: Why ALTER SUBSCRIPTION ... SET (slot_name='none') requires subscription disabled?

2021-07-19 Thread Japin Li


On Mon, 19 Jul 2021 at 17:02, Amit Kapila  wrote:
> On Fri, Jul 16, 2021 at 2:12 PM Japin Li  wrote:
>>
>>
>> On Fri, 16 Jul 2021 at 14:06, Amit Kapila  wrote:
>> > On Fri, Jul 9, 2021 at 8:20 AM Japin Li  wrote:
>> >>
>> >> On Thu, 08 Jul 2021 at 18:17, Amit Kapila  wrote:
>> >> > On Thu, Jul 8, 2021 at 3:43 PM Japin Li  wrote:
>> >>
>> >> Please consider review v3 patch. v3-0001 adds slot_name verification in
>> >> parse_subscription_options() and comments for why we need disable 
>> >> subscription
>> >> where set slot_name to NONE.
>> >>
>> >
>> > I think we back-patch this bug-fix till v10 where it was introduced
>> > and update the comments only in HEAD. So, accordingly, I moved the
>> > changes into two patches and changed the comments a bit. Can you
>> > please test the first patch in back-branches? I'll also do it
>> > separately.
>> >
>>
>> I try to back-patch to v10 stable to v14 stable, and attach two new patches:
>> one for PG10 & PG11 stable, and the other is for PG12 to PG14 stable.
>> v4 patch can be applied on HEAD. This modify looks good to me.
>>
>
> The patch you prepared for v14 was not getting applied cleanly, so I
> did the required modifications and then pushed.
>
>> How do we back-patch to back-branches? I try to use cherry-pick, but it 
>> doesn't
>> always work (without a doubt, it might be some difference between branches).
>>
>
> Yeah, we need to adjust the patch as per the back-branches code.

Thanks!

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: row filtering for logical replication

2021-07-19 Thread Alvaro Herrera
On 2021-Jul-19, Tomas Vondra wrote:

> I have a feeling it's getting overly complicated, to the extent that
> it'll be hard to explain to users and reason about. I don't think
> there's a "perfect" solution for cases when the filter expression gives
> different answers for old/new row - it'll always be surprising for some
> users :-(
> 
> So maybe the best thing is to stick to the simple approach already used
> e.g. by pglogical, which simply uses the new row when available (insert,
> update) and the old one for deletes.

OK, no objection to that plan.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No es bueno caminar con un hombre muerto"




Re: Transactions and indexes

2021-07-19 Thread Peter Geoghegan
On Mon, Jul 19, 2021 at 7:20 PM Chris Cleveland
 wrote:
> Thank you. Does this mean I can implement the index AM and return TIDs 
> without having to worry about transactions at all?

Yes. That's the upside of the design -- it makes it easy to add new
transactional index AMs. Which is one reason why Postgres has so many.

> Also, as far as I can tell, the only way that TIDs are removed from the index 
> is in ambulkdelete(). Is this accurate?

It doesn't have to be the only way, but in practice it can be. Depends
on the index AM. The core code relies on ambulkdelete() to make sure
that all TIDs dead in the table are gone from the index. This allows
VACUUM to finally physically recycle the previously referenced TIDs in
the table structure, without risk of index scans finding the wrong
thing.
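
A sketch of that contract from the index AM's side (the myindex_* helpers are
made up; the point is just that VACUUM hands the AM a callback that says which
TIDs are dead in the table, and the AM deletes those entries):

    IndexBulkDeleteResult *
    myindex_bulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
                       IndexBulkDeleteCallback callback, void *callback_state)
    {
        ItemPointerData tid;

        if (stats == NULL)
            stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));

        /* hypothetical full scan of our own on-disk structure */
        while (myindex_getnext_internal(info->index, &tid))
        {
            if (callback(&tid, callback_state))
            {
                myindex_delete_entry(info->index, &tid);    /* hypothetical */
                stats->tuples_removed += 1;
            }
            else
                stats->num_index_tuples += 1;
        }

        return stats;
    }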

> Does that mean that my index will be returning TIDs for deleted items and I 
> don't have to worry about that either?

If you assume that you're using heapam (the standard table AM), then
yes. Otherwise I don't know -- it's ambiguous.

> Don't TIDs get reused? What happens when my index returns an old TID which is 
> now pointing to a new record?

This can't happen because, as I said, the table cannot recycle
TIDs/line pointers until it's known that this cannot happen (because
VACUUM already cleaned out all the garbage index tuples).

> This is going to make it really hard to implement Top X queries of the type 
> you get from a search engine. A search engine will normally maintain an 
> internal buffer (usually a priority queue) of a fixed size, X, and add tuples 
> to it along with their relevance score. The buffer only remembers the Top X 
> tuples with the highest score. In this way the search engine can iterate over 
> millions of entries and retain only the best ones without having an unbounded 
> buffer. For this to work, though, you need to know how many tuples to keep in 
> the buffer in advance. If my index can't know, in advance, which TIDs are 
> invisible or deleted, then it can't keep them out of the buffer, and this 
> whole scheme fails.
>
> This is not going to work unless the system gives the index a clear picture 
> of transactions, visibility, and deletes as they happen. Is this information 
> available?

Are you implementing a new index AM or a new table AM? Discarding data
based on something like a relevance score doesn't seem like something
that either API provides for. Indexes in Postgres can be lossy, but
that in itself doesn't change the result of queries.

-- 
Peter Geoghegan




Re: Parallel INSERT SELECT take 2

2021-07-19 Thread Greg Nancarrow
On Tue, Jul 20, 2021 at 11:47 AM houzj.f...@fujitsu.com
 wrote:
>
> Attach rebased patches.
>

Just letting you know that CRLFs are in the patch comments for the
0001 and 0003 patches.
(It doesn't affect patch application)

Regards,
Greg Nancarrow
Fujitsu Australia




Kerberos delegation support in libpq and postgres_fdw

2021-07-19 Thread Peifeng Qiu
Hi hackers.

This is the patch to add Kerberos delegation support in libpq, which
enables postgres_fdw to connect to another server and authenticate
as the same user as the current login user. This will obsolete my
previous patch, which requires a keytab file to be present on the fdw
server host.

After the backend accepts the GSSAPI context, it may also get a
proxy credential if permitted by policy. I previously made a hack
to pass the pointer to the proxy credential directly into libpq. It turns
out that the correct way to do this is to store/acquire it using a credential
cache kept within local process memory, to prevent leaks.

Because no password is needed when querying a foreign table via
Kerberos delegation, the "password_required" option in the user
mapping must be set to false by a superuser. Other than this, it
should work with a normal user.

So far I have only tested it manually in a very simple configuration.
I will go on to write TAP tests for this.

How do you feel about this patch? Any feature/security concerns
about this?

Best regards,
Peifeng Qiu

diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 8cc23ef7fb..235c9b32f5 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -918,6 +918,7 @@ pg_GSS_recvauth(Port *port)
 	int			mtype;
 	StringInfoData buf;
 	gss_buffer_desc gbuf;
+	gss_cred_id_t proxy;
 
 	/*
 	 * Use the configured keytab, if there is one.  Unfortunately, Heimdal
@@ -947,6 +948,8 @@ pg_GSS_recvauth(Port *port)
 	 */
 	port->gss->ctx = GSS_C_NO_CONTEXT;
 
+	proxy = NULL;
+
 	/*
 	 * Loop through GSSAPI message exchange. This exchange can consist of
 	 * multiple messages sent in both directions. First message is always from
@@ -997,7 +1000,7 @@ pg_GSS_recvauth(Port *port)
 		  &port->gss->outbuf,
 		  &gflags,
 		  NULL,
-		  NULL);
+		  &proxy);
 
 		/* gbuf no longer used */
 		pfree(buf.data);
@@ -1009,6 +1012,9 @@ pg_GSS_recvauth(Port *port)
 
 		CHECK_FOR_INTERRUPTS();
 
+		if (proxy != NULL)
+			pg_store_proxy_credential(proxy);
+
 		if (port->gss->outbuf.length != 0)
 		{
 			/*
diff --git a/src/backend/libpq/be-gssapi-common.c b/src/backend/libpq/be-gssapi-common.c
index 38f58def25..cd243994c8 100644
--- a/src/backend/libpq/be-gssapi-common.c
+++ b/src/backend/libpq/be-gssapi-common.c
@@ -92,3 +92,40 @@ pg_GSS_error(const char *errmsg,
 			(errmsg_internal("%s", errmsg),
 			 errdetail_internal("%s: %s", msg_major, msg_minor)));
 }
+
+void
+pg_store_proxy_credential(gss_cred_id_t cred)
+{
+	OM_uint32 major, minor;
+	gss_OID_set mech;
+	gss_cred_usage_t usage;
+	gss_key_value_element_desc cc;
+	gss_key_value_set_desc ccset;
+
+	cc.key = "ccache";
+	cc.value = "MEMORY:";
+	ccset.count = 1;
+	ccset.elements = &cc;
+
+	/* Make the proxy credential only available to current process */
+	major = gss_store_cred_into(&minor,
+		cred,
+		GSS_C_INITIATE, /* credential only used for starting libpq connection */
+		GSS_C_NULL_OID, /* store all */
+		true, /* overwrite */
+		true, /* make default */
+		&ccset,
+		&mech,
+		&usage);
+
+
+	if (major != GSS_S_COMPLETE)
+	{
+		pg_GSS_error("gss_store_cred", major, minor);
+	}
+
+	/* quite strange that gss_store_cred doesn't work with "KRB5CCNAME=MEMORY:",
+	 * we have to use gss_store_cred_into instead and set the env for later
+	 * gss_acquire_cred calls. */
+	putenv("KRB5CCNAME=MEMORY:");
+}
diff --git a/src/backend/libpq/be-secure-gssapi.c b/src/backend/libpq/be-secure-gssapi.c
index 316ca65db5..e27d517dea 100644
--- a/src/backend/libpq/be-secure-gssapi.c
+++ b/src/backend/libpq/be-secure-gssapi.c
@@ -497,6 +497,7 @@ secure_open_gssapi(Port *port)
 	bool		complete_next = false;
 	OM_uint32	major,
 minor;
+	gss_cred_id_t	proxy;
 
 	/*
 	 * Allocate subsidiary Port data for GSSAPI operations.
@@ -588,7 +589,8 @@ secure_open_gssapi(Port *port)
 	   GSS_C_NO_CREDENTIAL, &input,
 	   GSS_C_NO_CHANNEL_BINDINGS,
 	   &port->gss->name, NULL, &output, NULL,
-	   NULL, NULL);
+	   NULL, &proxy);
+
 		if (GSS_ERROR(major))
 		{
 			pg_GSS_error(_("could not accept GSSAPI security context"),
@@ -605,6 +607,9 @@ secure_open_gssapi(Port *port)
 			complete_next = true;
 		}
 
+		if (proxy != NULL)
+			pg_store_proxy_credential(proxy);
+
 		/* Done handling the incoming packet, reset our buffer */
 		PqGSSRecvLength = 0;
 
diff --git a/src/include/libpq/be-gssapi-common.h b/src/include/libpq/be-gssapi-common.h
index c07d7e7c5a..62d60ffbd8 100644
--- a/src/include/libpq/be-gssapi-common.h
+++ b/src/include/libpq/be-gssapi-common.h
@@ -18,9 +18,11 @@
 #include 
 #else
 #include 
+#include 
 #endif
 
 extern void pg_GSS_error(const char *errmsg,
 		 OM_uint32 maj_stat, OM_uint32 min_stat);
 
+extern void pg_store_proxy_credential(gss_cred_id_t cred);
 #endif			/* BE_GSSAPI_COMMON_H */
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 3421ed4685..e0d342124e 100644
--- a/src/interfaces/libpq/fe-auth.c
+++ b/src/inter

Re: [HACKERS] logical decoding of two-phase transactions

2021-07-19 Thread Peter Smith
Please find attached the latest patch set v98*

Patches:

v97-0001 --> v98-0001

Differences:

* Rebased to HEAD @ yesterday.

* Code/Docs changes:

1. Fixed the same strcpy problem as reported by Tom Lane [1] for the
previous 2PC patch.

2. Addressed all feedback suggestions given by Greg [2].

3. Added some more documentation as suggested by Vignesh [3].


[1] https://www.postgresql.org/message-id/161029.1626639923%40sss.pgh.pa.us
[2] 
https://www.postgresql.org/message-id/CAJcOf-ckGONzyAj0Y70ju_tfLWF819JYb%3Ddv9p5AnoZxm50j0g%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/CALDaNm0LVY5A98xrgaodynnj6c%3DWQ5%3DZMpauC44aRio7-jWBYQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia


v98-0001-Add-prepare-API-support-for-streaming-transactio.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2021-07-19 Thread Peter Smith
On Mon, Jul 19, 2021 at 3:28 PM Greg Nancarrow  wrote:
>
> On Wed, Jul 14, 2021 at 6:33 PM Peter Smith  wrote:
> >
> > Please find attached the latest patch set v97*
> >
>
> I couldn't spot spot any significant issues in the v97-0001 patch, but
> do have the following trivial feedback comments:
>
> (1) doc/src/sgml/protocol.sgml
> Suggestion:
>
> BEFORE:
> +   contains a Stream Prepare or Stream Commit or Stream Abort message.
> AFTER:
> +   contains a Stream Prepare, Stream Commit or Stream Abort message.
>
>
> (2) src/backend/replication/logical/worker.c
> It seems a bit weird to add a forward declaration here, without a
> comment, like for the one immediately above it
>
> /* Compute GID for two_phase transactions */
> static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char
> *gid, int szgid);
> -
> +static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
>
>
> (3) src/backend/replication/logical/worker.c
> Other DEBUG1 messages don't end with "."
>
> + elog(DEBUG1, "apply_handle_stream_prepare: replayed %d
> (all) changes.", nchanges);
>

Thanks for the feedback. All these are fixed as suggested in v98.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [HACKERS] logical decoding of two-phase transactions

2021-07-19 Thread Peter Smith
On Fri, Jul 16, 2021 at 4:08 PM vignesh C  wrote:
>
[...]
> Thanks for the updated patch, the patch applies cleanly and test passes:
> I had couple of comments:
> 1) Should we include "stream_prepare_cb" here  in
> logicaldecoding-streaming section of logicaldecoding.sgml
> documentation:
> To reduce the apply lag caused by large transactions, an output plugin
> may provide additional callback to support incremental streaming of
> in-progress transactions. There are multiple required streaming
> callbacks (stream_start_cb, stream_stop_cb, stream_abort_cb,
> stream_commit_cb and stream_change_cb) and two optional callbacks
> (stream_message_cb and stream_truncate_cb).
>

Modified in v98. The information about 'stream_prepare_cb' and friends
is given in detail in section 49.10 so I added a link to that page.


> 2) Should we add an example for stream_prepare_cb here  in
> logicaldecoding-streaming section of logicaldecoding.sgml
> documentation:
> One example sequence of streaming callback calls for one transaction
> may look like this:
>
> stream_start_cb(...);   <-- start of first block of changes
>   stream_change_cb(...);
>   stream_change_cb(...);
>   stream_message_cb(...);
>   stream_change_cb(...);
>   ...
>   stream_change_cb(...);
> stream_stop_cb(...);<-- end of first block of changes
>
> stream_start_cb(...);   <-- start of second block of changes
>   stream_change_cb(...);
>   stream_change_cb(...);
>   stream_change_cb(...);
>   ...
>   stream_message_cb(...);
>   stream_change_cb(...);
> stream_stop_cb(...);<-- end of second block of changes
>
> stream_commit_cb(...);  <-- commit of the streamed transaction
>

Modified in v98. I felt it would be too verbose to add another full
example since it would be 90% the same as the current example. So I
have combined the information.
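
For reference, the tail of the sequence for a streamed transaction that
is prepared (assuming the 'stream_prepare_cb' and 'commit_prepared_cb'
callback names; this is only an illustration, not the exact wording
added to the docs) might look like this:

stream_start_cb(...);   <-- start of the last block of changes
  stream_change_cb(...);
  ...
stream_stop_cb(...);    <-- end of the last block of changes

stream_prepare_cb(...); <-- prepare of the streamed transaction

commit_prepared_cb(...);<-- decoded later, when COMMIT PREPARED arrives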

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Failure with 004_logrotate in prairiedog

2021-07-19 Thread Kyotaro Horiguchi
At Mon, 19 Jul 2021 10:23:46 -0400, Tom Lane  wrote in 
> Michael Paquier  writes:
> > On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote:
> >> When rotation happens, the metainfo file is once removed then
> >> created. If slurp_file in the metafile-checking loop hits the gap, the
> >> slurp_file fails with ENOENT.
> 
> > I can read the following code, as of update_metainfo_datafile():
> > if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0)
> 
> Yeah, ignore my previous message.  There is an unlink up at the top
> of the function, which fooled me in my caffeine-deprived state.

Yeah, sorry for the stupidity.

> But that path is only taken when logging was just turned off, so
> we must remove the now-irrelevant metafile.

I'm not sure whether this is relevant, but I found the following article
(as a token of my apology :p):

http://www.weirdnet.nl/apple/rename.html

> There is an easy way to empirically prove that rename() is not
> atomic on Leopard 10.5.2. All you have to do is create a link to a
> directory, replace that link with a

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Have I found an interval arithmetic bug?

2021-07-19 Thread Bruce Momjian
On Wed, Jul 14, 2021 at 09:03:21AM -0700, Zhihong Yu wrote:
> On Thu, Jul 8, 2021 at 10:22 AM Zhihong Yu  wrote:
> On Wed, Jun 30, 2021 at 9:35 AM Bruce Momjian  wrote:
> 
> On Tue, Jun 29, 2021 at 06:49:45PM +0200, Daniel Gustafsson wrote:
> > > On 29 Jun 2021, at 18:50, Zhihong Yu  wrote:
> >
> > > Now that PG 15 is open for commit, do you think the patch can land
> ?
> >
> > Adding it to the commitfest patch tracker is a good way to ensure
> it's not
> > forgotten about:
> >
> >       https://commitfest.postgresql.org/33/
> 
> OK, I have been keeping it in my git tree since I wrote it and will
> apply it in the next few days.
> Thanks, Bruce.
> 
> Hopefully you can get to this soon. 
> 
> Bruce: 
> Please see if the patch can be integrated now.

I found a mistake in my most recent patch.  For example, in master we
see this output:

SELECT INTERVAL '1.8594 months';
 interval
--
 1 mon 25 days 18:46:04.8

Obviously this should return '1 mon 26 days', but with my most recent
patch it returned '1 mon 25 days'.  It turns out I had not properly used
rint() in AdjustFractDays, and in fact that function is no longer needed
because it is just a multiplication and an rint().
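
Just to spell the arithmetic out, here is a minimal standalone sketch of
the intended rounding (assuming the interval code's usual
30-days-per-month convention; this is not the patch itself):

#include <math.h>
#include <stdio.h>

#define DAYS_PER_MONTH 30		/* assumed, as in the interval code */

int
main(void)
{
	double	fract_month = 0.8594;	/* fractional part of '1.8594 months' */
	double	fract_days = fract_month * DAYS_PER_MONTH;	/* 25.782 */
	int		days = (int) rint(fract_days);	/* rounds to 26 */

	printf("%d days\n", days);		/* prints "26 days" */
	return 0;
}

Truncating instead of calling rint() is what produced the '1 mon 25 days'
result from the previous patch.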

Updated patch attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.



interval.diff.gz
Description: application/gzip


Re: row filtering for logical replication

2021-07-19 Thread Amit Kapila
On Mon, Jul 19, 2021 at 4:31 PM Dilip Kumar  wrote:
>
> On Mon, Jul 19, 2021 at 3:12 PM Amit Kapila  wrote:
>
> > > Maybe a second option is to have replication change any UPDATE into
> > > either an INSERT or a DELETE, if the old or the new row do not pass the
> > > filter, respectively.  That way, the databases would remain consistent.
>
> Yeah, I think this is the best way to keep the data consistent.
>

Today, while studying the behavior of this particular operation in
other databases, I found that IBM's InfoSphere Data Replication does
exactly this. See [1]. I think there is merit in following this
idea.

[1] - https://www.ibm.com/docs/en/idr/11.4.0?topic=rows-search-conditions

-- 
With Regards,
Amit Kapila.




Re: Introduce pg_receivewal gzip compression tests

2021-07-19 Thread Michael Paquier
On Mon, Jul 19, 2021 at 04:03:33PM +0900, Michael Paquier wrote:
> Another advantage of this patch is the handling of ".gz" is reduced to
> one code path instead of four.  That makes a bit easier the
> introduction of new compression methods.
> 
> A second thing that was really confusing is that the name of the WAL
> segment generated in this code path completely ignored the type of
> compression.  This led to one confusing error message if failing to
> open a segment for write where we'd mention a .partial file rather
> than a .gz.partial file.  The versions of zlib I used on Windows
> looked buggy so I cannot conclude there, but I am sure that this
> should allow bowerbird to handle the test correctly.

After more testing and more review, I have applied and backpatched
this stuff.  Another thing I did on HEAD was to enable again the ZLIB
portion of the pg_receivewal tests on Windows.  bowerbird should stay
green (I hope), and it is better to have as much coverage as possible
for all that.
--
Michael


signature.asc
Description: PGP signature


Re: Avoid stack frame setup in performance critical routines using tail calls

2021-07-19 Thread David Rowley
On Tue, 20 Jul 2021 at 08:00, Andres Freund  wrote:
> I have *not* carefully benchmarked this, but a quick implementation of this
> does seem to increase readonly pgbench tps at a small scale by 2-3% (both

Interesting.

I've not taken the time to study the patch but I was running some
other benchmarks today on a small scale pgbench readonly test and I
took this patch for a spin to see if I could see the same performance
gains.

This is an AMD 3990x machine that seems to get the most throughput
from pgbench with 132 processes.

I did: pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared
--random-seed=12345 postgres

master = dd498998a

Master: 3816959.53 tps
Patched: 3820723.252 tps

I didn't quite get the same 2-3% as you did, but it did come out
faster than on master.

David
Sec     master          dense hash LockReleaseAll + aset tail call      aset tail call      dense hash LockReleaseAll
10  3758201.2   3741925.6   3737701.5   3713521.5
20  3810125.5   3861572.5   3830863.7   3844142.9
30  3806505.1   3851164.4   3832257.9   3848458
40  3816094.8   3855232.4   3832305.7   3855706.6
50  3820317.2   3846941 3829641.5   3851717.7
60  3827809 3849490.4   3812254.5   3851499.4
70  3828757.9   3844582.8   3829097.8   3849312
80  3824492.1   3843161.8   3821383 3852378.8
90  3816502.1   3851970.8   3825119.2   3854793.8
100 3819124.1   3839695.5   3839286.7   3860418.6
110 3816154.3   3851302.8   3821209.5   3845327.7
120 3817070.5   3852974.2   3833781 3845842.5
130 3815424.7   3854379.1   3830812.5   3847626
140 3823631.1   3852449.1   3825261.9   3846760.6
150 3820963.8   3837493.5   3820703.2   3840196.6
160 3827737 3837809.7   3835278.4   3841149.3
170 3827779.2   3851799.4   3818430.3   3840130.9
180 3829352 3853094 3823286.1   3842814.5
190 3825518.3   3854912.8   3816329.4   3841991
200 3823477.2   3838998.6   3816060.8   3839390.7
210 3809304.3   3845776.7   3814737.2   3836433.5
220 3814328.5   3841394.7   3818894 3842073.7
230 3811399.3   3839360.8   3811939 3843780.7
avg 3816959.53  3843368.809 3820723.252 3840672.478
(vs master)             100.69%     100.10%     100.62%

Re: O_DIRECT on macOS

2021-07-19 Thread Thomas Munro
On Tue, Jul 20, 2021 at 12:26 PM Tom Lane  wrote:
> > I can try that on the gcc farm in a bit.

Thanks!

> Hmm, it compiles cleanly, but something seems drastically wrong,
> because performance is just awful.  On the other hand, I don't
> know what sort of storage is underlying this instance, so maybe
> that's to be expected?

Ouch.  I assume this was without wal_method=minimal (or it'd have
reached the new code and failed completely, based on the pg_test_fsync
result).

> open_datasync  n/a*

I'm waiting for access, but I see from man pages that closed source
ZFS doesn't accept DIRECTIO_ON, so it may not be possible to see it
work on an all-ZFS system that you can't mount a new FS on.  Hmm.
Well, many OSes have file systems that can't do it (ext4 journal=data,
etc).  One problem is that we don't treat all OSes the same when
selecting wal_sync_method, even though O_DIRECT is complicated on many
OSes.  It would also be nice if the choice to use direct I/O were
independently controlled, and ... [trails off].  Alright, I'll leave
this on ice for now.




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2021-07-19 Thread David Rowley
On Mon, 12 Jul 2021 at 19:23, David Rowley  wrote:
> I also adjusted the hash seq scan code so that it performs better when
> faced a non-sparsely populated table.  Previously my benchmark for
> that case didn't do well [2].

I was running some select only pgbench tests today on an AMD 3990x
machine with a large number of processes.

I saw that LockReleaseAll was coming up on the profile a bit at:

Master: 0.77% postgres [.] LockReleaseAll

I wondered if this patch would help, so I tried it and got:

dense hash lockrelease all: 0.67% postgres [.] LockReleaseAll

It's a very small improvement which translated to about a 0.62% gain in
tps. It made me think it might be worth doing something about this,
since LockReleaseAll can show up even when releasing small numbers of
locks.

pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared --random-seed=12345 postgres

Units = tps

Sec     master          dense hash LockReleaseAll
10      3758201.2       3713521.5        98.81%
20      3810125.5       3844142.9       100.89%
30      3806505.1       3848458         101.10%
40      3816094.8       3855706.6       101.04%
50      3820317.2       3851717.7       100.82%
60      3827809         3851499.4       100.62%
70      3828757.9       3849312         100.54%
80      3824492.1       3852378.8       100.73%
90      3816502.1       3854793.8       101.00%
100     3819124.1       3860418.6       101.08%
110     3816154.3       3845327.7       100.76%
120     3817070.5       3845842.5       100.75%
130     3815424.7       3847626         100.84%
140     3823631.1       3846760.6       100.60%
150     3820963.8       3840196.6       100.50%
160     3827737         3841149.3       100.35%
170     3827779.2       3840130.9       100.32%
180     3829352         3842814.5       100.35%
190     3825518.3       3841991         100.43%
200     3823477.2       3839390.7       100.42%
210     3809304.3       3836433.5       100.71%
220     3814328.5       3842073.7       100.73%
230     3811399.3       3843780.7       100.85%
avg     3816959.53      3840672.478     100.62%

David




Re: row filtering for logical replication

2021-07-19 Thread Amit Kapila
On Mon, Jul 19, 2021 at 7:02 PM Tomas Vondra
 wrote:
>
> On 7/19/21 1:00 PM, Dilip Kumar wrote:
> > On Mon, Jul 19, 2021 at 3:12 PM Amit Kapila  wrote:
> >> a. Just log it and move to the next row
> >> b. send to stats collector some info about this which can be displayed
> >> in a view and then move ahead
> >> c. just skip it like any other row that doesn't match the filter clause.
> >>
> >> I am not sure if there is any use of sending a row if one of the
> >> old/new rows doesn't match the filter. Because if the old row doesn't
> >> match but the new one matches the criteria, we will anyway just throw
> >> such a row on the subscriber instead of applying it.
> >
> > But at some time that will be true even if we skip the row based on
> > (a) or (c) right.  Suppose the OLD row was not satisfying the
> > condition but the NEW row is satisfying the condition, now even if we
> > skip this operation then in the next operation on the same row even if
> > both OLD and NEW rows are satisfying the filter the operation will
> > just be dropped by the subscriber right? because we did not send the
> > previous row when it first updated to value which were satisfying the
> > condition.  So basically, any row is inserted which did not satisfy
> > the condition first then post that no matter how many updates we do to
> > that row either it will be skipped by the publisher because the OLD
> > row was not satisfying the condition or it will be skipped by the
> > subscriber as there was no matching row.
> >
>
> I have a feeling it's getting overly complicated, to the extent that
> it'll be hard to explain to users and reason about. I don't think
> there's a "perfect" solution for cases when the filter expression gives
> different answers for old/new row - it'll always be surprising for some
> users :-(
>


It is possible, but OTOH the three replication solutions (Debezium,
Oracle, IBM's InfoSphere Data Replication) which have this feature
seem to filter based on both old and new rows in one way or another.
Also, I am not sure the simple approach of just filtering based on the
new row is very clear, because it can also confuse users: even if all
the new rows match the filter, they don't see anything on the
subscriber, and in fact that can cause a lot of network overhead
without any gain.

> So maybe the best thing is to stick to the simple approach already used
> e.g. by pglogical, which simply uses the new row when available (insert,
> update) and old one for deletes.
>
> I think that behaves more or less sensibly and it's easy to explain.
>

Okay, if nothing better comes up, then we can fall back to this option.

> All the other things (e.g. turning UPDATE to INSERT, advanced conflict
> resolution etc.) will require a lot of other stuff,
>

I have not evaluated this yet, but I think spending some time thinking
about turning an Update into an Insert/Delete (yesterday's suggestion by
Alvaro) might be worthwhile, especially as that seems to be what some
other replication solutions do as well.

>and I see them as
> improvements of this simple approach.
>
> >>> Maybe a second option is to have replication change any UPDATE into
> >>> either an INSERT or a DELETE, if the old or the new row do not pass the
> >>> filter, respectively.  That way, the databases would remain consistent.
> >
> > Yeah, I think this is the best way to keep the data consistent.
> >
>
> It'd also require REPLICA IDENTITY FULL, which seems like it'd add a
> rather significant overhead.
>

Why? I think it would just need restrictions similar to what we are
planning for the Delete operation, such that the filter columns must be
present in either the primary key or the replica identity columns.

-- 
With Regards,
Amit Kapila.




Re: .ready and .done files considered harmful

2021-07-19 Thread Dilip Kumar
On Mon, Jul 19, 2021 at 5:43 PM Dipesh Pandit  wrote:
>
> Hi,
>
> > I agree, I missed this part. The .history file should be given higher 
> > preference.
> > I will take care of it in the next patch.
>
> Archiver does not have access to shared memory and the current timeline ID
> is not available at archiver. In order to keep track of timeline switch we 
> have
> to push a notification from backend to archiver.  Backend can send a signal
> to notify archiver about the timeline change. Archiver can register this
> notification and perform a full directory scan to make sure that archiving
> history files take precedence over archiving WAL files.

Yeah, that makes sense, some comments on v2.

1.
+pgarch_timeline_switch(SIGNAL_ARGS)
+{
+intsave_errno = errno;
+
+/* Set the flag to register a timeline switch */
+timeline_switch = true;
+SetLatch(MyLatch);
+

On a timeline switch, setting the flag should be enough; I don't think
we need to wake up the archiver, because that would just waste a scan
cycle.  Setting the flag should be sufficient: let XLogArchiveNotify()
wake the archiver up when something is ready to be archived, and at that
point we will scan the directory first based on the flag.


2.
+ */
+if (XLogArchivingActive() && ArchiveRecoveryRequested)
+XLogArchiveNotifyTLISwitch();
+
+
.

 /*
+ * Signal archiver to notify timeline switch
+ */
+void
+XLogArchiveNotifyTLISwitch(void)
+{
+if (IsUnderPostmaster)
+PgArchNotifyTLISwitch();
+}

Why do we need multi-level interfaces? I mean, instead of first calling
XLogArchiveNotifyTLISwitch() and then calling PgArchNotifyTLISwitch(),
can't we directly call PgArchNotifyTLISwitch()?

3.
+if (timeline_switch)
+{
+/* Perform a full directory scan in next cycle */
+dirScan = true;
+timeline_switch = false;
+}

I suggest adding some comments atop this check.

4.
+PgArchNotifyTLISwitch(void)
+{
+intarch_pgprocno = PgArch->pgprocno;
+
+if (arch_pgprocno != INVALID_PGPROCNO)
+{
+intarchiver_pid = ProcGlobal->allProcs[arch_pgprocno].pid;
+
+if (kill(archiver_pid, SIGINT) < 0)
+elog(ERROR, "could not notify timeline change to archiver");


I think you should use %m in the error message so that it also prints
the OS error code.

5.
+/* Flag to specify a full directory scan to find next log file */
+static bool dirScan = true;

Why is this a global variable?  I mean, whenever you enter
pgarch_ArchiverCopyLoop() this can be set to true, and after that you
can pass it as an in/out parameter to pgarch_readyXlog(), where it can
conditionally be set to false once we get some segment; whenever the
timeline switches we can set it back to true, as in the rough sketch
below.
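
Something along these lines, as a compile-only sketch of the suggested
flow (all names here, next_ready_segment, copy_loop_sketch and
timeline_switch_pending, are hypothetical stand-ins, not taken from the
patch):

#include <signal.h>
#include <stdbool.h>

static volatile sig_atomic_t timeline_switch_pending = 0;	/* set by the signal handler */

/* stand-in for pgarch_readyXlog(); clears *dir_scan after a full scan */
static bool
next_ready_segment(char *xlog, bool *dir_scan)
{
	if (*dir_scan)
	{
		/* a full directory scan would go here; .history files take precedence */
		*dir_scan = false;
	}
	(void) xlog;
	return false;				/* stub: pretend nothing is ready */
}

static void
copy_loop_sketch(void)
{
	bool		dir_scan = true;	/* always rescan when (re)entering the loop */
	char		xlog[64];

	while (next_ready_segment(xlog, &dir_scan))
	{
		/* ... run archive_command for xlog here ... */

		if (timeline_switch_pending)
		{
			dir_scan = true;	/* force a full scan for the next pick-up */
			timeline_switch_pending = 0;
		}
	}
}

That keeps the flag's lifetime tied to the copy loop instead of a
file-level global.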


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: row filtering for logical replication

2021-07-19 Thread Greg Nancarrow
On Tue, Jul 20, 2021 at 2:25 PM Amit Kapila  wrote:
>
> Today, while studying the behavior of this particular operation in
> other databases, I found that IBM's InfoSphere Data Replication does
> exactly this. See [1]. I think there is merit in following this
> idea.
>

So in this model (after initial sync of rows according to the filter),
for UPDATE, the OLD row is checked against the WHERE clause, to know
if the row had been previously published. If it hadn't, and the NEW
row satisfies the WHERE clause, then it needs to be published as an
INSERT. If it had been previously published, but the NEW row doesn't
satisfy the WHERE condition, then it needs to be published as a
DELETE. Otherwise, if both OLD and NEW rows satisfy the WHERE clause,
it needs to be published as an UPDATE.
At least, that seems to be the model when the WHERE clause refers to
the NEW (updated) values, as used in most of their samples (i.e. in
that database "the current log record", indicated by a ":" prefix on
the column name).
I think that allowing the OLD values ("old log record") to be
referenced in the WHERE clause, as that model does, could be
potentially confusing.
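
To make that model concrete, the decision table would be roughly the
following (a sketch only; the names are invented for illustration and
nothing here comes from an actual patch):

#include <stdbool.h>

typedef enum
{
	ACTION_SKIP,				/* row is never visible on the subscriber */
	ACTION_PUBLISH_INSERT,		/* row enters the filtered set */
	ACTION_PUBLISH_UPDATE,		/* row stays in the filtered set */
	ACTION_PUBLISH_DELETE		/* row leaves the filtered set */
} FilteredUpdateAction;

static FilteredUpdateAction
filtered_update_action(bool old_row_matches, bool new_row_matches)
{
	if (old_row_matches && new_row_matches)
		return ACTION_PUBLISH_UPDATE;
	if (!old_row_matches && new_row_matches)
		return ACTION_PUBLISH_INSERT;
	if (old_row_matches && !new_row_matches)
		return ACTION_PUBLISH_DELETE;
	return ACTION_SKIP;
}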

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Avoid stack frame setup in performance critical routines using tail calls

2021-07-19 Thread Andres Freund
Hi,

On 2021-07-20 16:50:09 +1200, David Rowley wrote:
> I've not taken the time to study the patch but I was running some
> other benchmarks today on a small scale pgbench readonly test and I
> took this patch for a spin to see if I could see the same performance
> gains.

Thanks!


> This is an AMD 3990x machine that seems to get the most throughput
> from pgbench with 132 processes
> 
> I did: pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared
> --random-seed=12345 postgres
> 
> master = dd498998a
> 
> Master: 3816959.53 tps
> Patched: 3820723.252 tps
> 
> I didn't quite get the same 2-3% as you did, but it did come out
> faster than on master.

It would not at all be surprising to me if AMD in recent microarchitectures did
a better job at removing stack management overhead (e.g. by better register
renaming, or by resolving dependencies on %rsp in a smarter way) than Intel
has. This was on a Cascade Lake CPU (Xeon 5215), which, despite being released
in 2019, effectively is a moderately polished (or maybe shoehorned)
microarchitecture from 2015 due to all the Intel troubles, whereas Zen 2 is
from 2019.

It's also possible that my attempts at avoiding the stack management just
didn't work on your compiler. Either due to vendor (I know that gcc is better
at it than clang), version, or compiler flags (e.g. -fno-omit-frame-pointer
could make it harder, -fno-optimize-sibling-calls would disable it).

A third plausible explanation for the difference is that at a client count of
132, the bottlenecks are sufficiently elsewhere to just not show a meaningful
gain from memory management efficiency improvements.


Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
palloc` from a patched run? And perhaps how high their percentages of the
total work are. E.g. using something like
perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'

It'd be interesting to know where the bottlenecks on a zen2 machine are.

Greetings,

Andres Freund




Re: Avoid stack frame setup in performance critical routines using tail calls

2021-07-19 Thread David Rowley
On Tue, 20 Jul 2021 at 18:17, Andres Freund  wrote:
> Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
> palloc` from a patched run? And perhaps how high their percentages of the
> total work are. E.g. using something like
> perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'

Sure. See attached.

David
 Percent |  Source code & Disassembly of postgres for cycles (626 samples, 
percent: local period)
-
 :
 :
 :
 :Disassembly of section .text:
 :
 :0056bd80 :
 :AllocSetAlloc():
 :* is marked, as mcxt.c will set it to UNDEFINED.  In some 
paths we will
 :* return space that is marked NOACCESS - AllocSetRealloc 
has to beware!
 :*/
 :static void *
 :AllocSetAlloc(MemoryContext context, Size size, int flags)
 :{
7.66 :   56bd80: endbr64
 :
 :/*
 :* If requested size exceeds maximum for chunks, allocate 
an entire block
 :* for this request.
 :*/
 :if (unlikely(size > set->allocChunkLimit))
2.68 :   56bd84: cmp%rsi,0xc8(%rdi)
3.34 :   56bd8b: jb 56be10 
 :AllocSetFreeIndex():
 :idx = 0;
0.17 :   56bd91: xor%ecx,%ecx
 :if (size > (1 << ALLOC_MINBITS))
0.00 :   56bd93: cmp$0x8,%rsi
0.16 :   56bd97: jbe56bda9 
 :idx = 31 - __builtin_clz((uint32) size - 1) - 
ALLOC_MINBITS + 1;
0.00 :   56bd99: sub$0x1,%esi
0.47 :   56bd9c: mov$0x1d,%ecx
0.50 :   56bda1: bsr%esi,%esi
6.38 :   56bda4: xor$0x1f,%esi
1.92 :   56bda7: sub%esi,%ecx
 :AllocSetAlloc():
 :* corresponding free list to see if there is a free chunk 
we could reuse.
 :* If one is found, remove it from the free list, make it 
again a member
 :* of the alloc set and return its data address.
 :*/
 :fidx = AllocSetFreeIndex(size);
 :chunk = set->freelist[fidx];
1.44 :   56bda9: movslq %ecx,%rdx
1.88 :   56bdac: add$0xa,%rdx
1.11 :   56bdb0: mov0x8(%rdi,%rdx,8),%rax
 :if (chunk != NULL)
   19.90 :   56bdb5: test   %rax,%rax
0.32 :   56bdb8: je 56bdd0 
 :{
 :Assert(chunk->size >= size);
 :
 :set->freelist[fidx] = (AllocChunk) chunk->aset;
0.79 :   56bdba: mov0x8(%rax),%rcx
 :AllocSetAllocReturnChunk():
 :return AllocChunkGetPointer(chunk);
   13.97 :   56bdbe: add$0x10,%rax
 :AllocSetAlloc():
 :set->freelist[fidx] = (AllocChunk) chunk->aset;
0.00 :   56bdc2: mov%rcx,0x8(%rdi,%rdx,8)
 :AllocSetAllocReturnChunk():
 :chunk->aset = (void *) set;
0.00 :   56bdc7: mov%rdi,-0x8(%rax)
 :AllocSetAlloc():
 :
 :return AllocSetAllocReturnChunk(set, size, chunk, 
chunk->size);
0.16 :   56bdcb: ret
0.00 :   56bdcc: nopl   0x0(%rax)
 :}
 :
 :/*
 :* Choose the actual chunk size to allocate.
 :*/
 :chunk_size = (1 << ALLOC_MINBITS) << fidx;
1.10 :   56bdd0: mov$0x8,%esi
 :
 :/*
 :* If there is enough room in the active allocation block, 
we will put the
 :* chunk into that block.  Else must start a new one.
 :*/
 :if ((block = set->blocks) != NULL)
0.32 :   56bdd5: mov0x50(%rdi),%rdx
 :chunk_size = (1 << ALLOC_MINBITS) << fidx;
0.32 :   56bdd9: shl%cl,%esi
0.00 :   56bddb: movslq %esi,%rsi
 :if ((block = set->blocks) != NULL)
0.00 :   56bdde: test   %rdx,%rdx
0.47 :   56bde1: je 56be18 
 :{
 :Sizeavailspace = block->endptr - 
block->freeptr;
0.00 :   56bde3: mov0x18(%rdx),%rax
4.80 :   56bde7: mov0x20(%rdx),%rcx
 :
 :if (unlikely(availspace < (chunk_size + 
ALLOC_CHUNKHDRSZ)))
1.75 :   56bdeb: lea0x10(%rsi),%r8
 :Sizeavailspace = block->endptr - 
block->freeptr;
0.47 :   56bdef: sub%rax,%rcx
 :if (unlikely(availspace < (chunk_size + 
ALLOC_CHUNKHDRSZ)))
0.16 :   56bdf2: cmp%rcx,%r8
1.90 :   56bdf5: ja 56be20 
 :