from:"'alvhe...@alvh.no\-ip.org'"

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-03-23 Thread alvhe...@alvh.no-ip.org

On 2022-Mar-07, Imseih (AWS), Sami wrote:

> I have gone ahead and backpatched this all the way to 10 as well.

Thanks!  I pushed this now.  I edited the test though: I don't
understand why you went to the trouble of setting stuff in order to call
'pg_ctl promote' (in different ways for older branches), when
$node_standby->promote does the same and is simpler to call.  So I
changed the tests to do that.  (I did verify that without the code fix,
the PANIC indeed is thrown.)

Thank again,

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/

Re: libpq debug log

2021-02-23 Thread alvhe...@alvh.no-ip.org

I'll give this another look tomorrow, but I wanted to pass along that I
prefer libpq-trace.{c,h} instead of libpq-logging.  I also renamed
variable "pin" and pgindented.  I don't have any major reservations
about this patch now, so I'll mark it ready-for-committer in case
somebody else wants to have a look before push.

(Not sure about the use of the word "forcely")

-- 
Álvaro Herrera39°49'30"S 73°17'W
>From da0faf61ceb8e9cbc5894ca1e43b6f7fc477a835 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 23 Feb 2021 22:17:51 -0300
Subject: [PATCH v20] Improve libpq tracing capabilities

Author: Aya Iwata
---
 doc/src/sgml/libpq.sgml |  25 +-
 src/interfaces/libpq/Makefile   |   3 +
 src/interfaces/libpq/exports.txt|   1 +
 src/interfaces/libpq/fe-connect.c   |  25 +-
 src/interfaces/libpq/fe-exec.c  |   4 -
 src/interfaces/libpq/fe-misc.c  |  59 ++-
 src/interfaces/libpq/fe-protocol3.c |  18 +
 src/interfaces/libpq/libpq-fe.h |   2 +
 src/interfaces/libpq/libpq-int.h|   6 +
 src/interfaces/libpq/libpq-trace.c  | 638 
 src/interfaces/libpq/libpq-trace.h  |  95 +
 src/tools/pgindent/typedefs.list|   6 +
 12 files changed, 823 insertions(+), 59 deletions(-)
 create mode 100644 src/interfaces/libpq/libpq-trace.c
 create mode 100644 src/interfaces/libpq/libpq-trace.h

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 5e25f20843..3a1e368e0b 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -5907,7 +5907,8 @@ PGContextVisibility PQsetErrorContextVisibility(PGconn *conn, PGContextVisibilit
 
 
  
-  Enables  tracing of the client/server communication to a debugging file stream.
+  Enables tracing of the client/server communication to a debugging file stream.
+  (Details of tracing contents appear in ).
 
 void PQtrace(PGconn *conn, FILE *stream);
 
@@ -5927,6 +5928,28 @@ void PQtrace(PGconn *conn, FILE *stream);
 

 
+
+   
+PQtraceSetFlagsPQtraceSetFlags
+
+
+ 
+  Controls the tracing behavior of client/server communication.
+
+void PQtraceSetFlags(PGconn *conn, int flags);
+
+ 
+
+ 
+  flags contains flag bits describing the operating mode
+  of tracing.  If flags contains PQTRACE_SUPPRESS_TIMESTAMPS,
+  then timestamp is not printed with each message. If set to 0 (default),tracing will be output with timestamp.
+  This function should be called after calling PQtrace.
+ 
+
+
+   
+

 PQuntracePQuntrace
 
diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index f74677eaf9..cdb65b48af 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -42,6 +42,7 @@ OBJS = \
 	fe-secure.o \
 	legacy-pqsignal.o \
 	libpq-events.o \
+	libpq-trace.o \
 	pqexpbuffer.o \
 	fe-auth.o
 
@@ -114,6 +115,7 @@ install: all installdirs install-lib
 	$(INSTALL_DATA) $(srcdir)/libpq-fe.h '$(DESTDIR)$(includedir)'
 	$(INSTALL_DATA) $(srcdir)/libpq-events.h '$(DESTDIR)$(includedir)'
 	$(INSTALL_DATA) $(srcdir)/libpq-int.h '$(DESTDIR)$(includedir_internal)'
+	$(INSTALL_DATA) $(srcdir)/libpq-trace.h '$(DESTDIR)$(includedir)'
 	$(INSTALL_DATA) $(srcdir)/pqexpbuffer.h '$(DESTDIR)$(includedir_internal)'
 	$(INSTALL_DATA) $(srcdir)/pg_service.conf.sample '$(DESTDIR)$(datadir)/pg_service.conf.sample'
 
@@ -127,6 +129,7 @@ uninstall: uninstall-lib
 	rm -f '$(DESTDIR)$(includedir)/libpq-fe.h'
 	rm -f '$(DESTDIR)$(includedir)/libpq-events.h'
 	rm -f '$(DESTDIR)$(includedir_internal)/libpq-int.h'
+	rm -f '$(DESTDIR)$(includedir)/libpq-trace.h'
 	rm -f '$(DESTDIR)$(includedir_internal)/pqexpbuffer.h'
 	rm -f '$(DESTDIR)$(datadir)/pg_service.conf.sample'
 
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index bbc1f90481..09f4a6 100644
--- a/src/interfaces/libpq/exports.txt
+++ b/src/interfaces/libpq/exports.txt
@@ -179,3 +179,4 @@ PQgetgssctx   176
 PQsetSSLKeyPassHook_OpenSSL 177
 PQgetSSLKeyPassHook_OpenSSL 178
 PQdefaultSSLKeyPassHook_OpenSSL 179
+PQtraceSetFlags   180
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index db71fea169..9bf2f0d0b6 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -4021,6 +4021,10 @@ freePGconn(PGconn *conn)
 	if (conn->connip)
 		free(conn->connip);
 	/* Note that conn->Pfdebug is not ours to close or free */
+	if (conn->be_msg)
+		free(conn->be_msg);
+	if (conn->fe_msg)
+		free(conn->fe_msg);
 	if (conn->last_query)
 		free(conn->last_query);
 	if (conn->write_err_msg)
@@ -6766,27 +6770,6 @@ PQsetErrorContextVisibility(PGconn *conn, PGContextVisibility show_context)
 	return old;
 }
 
-void
-PQtrace(PGconn *conn, FILE *debug_port)
-{
-	if (conn == NULL)
-		return;
-	PQuntrace(conn);
-	conn->Pfdebug = debug_port;
-}
-
-void
-PQuntrace(PGconn *conn)
-{
-	if (conn == NULL)
-		return;
-	if (conn->Pfdebug)

Re: libpq debug log

2021-03-05 Thread alvhe...@alvh.no-ip.org

On 2021-Mar-05, tsunakawa.ta...@fujitsu.com wrote:

> From: Tom Lane 
> > But I think passing the message start address explicitly might be
> > better than having it understand the buffering behavior in enough
> > detail to know where to find the message.  Part of the point here
> > (IMO) is to decouple the tracing logic from the core libpq logic, in
> > hopes of not having common-mode bugs.
> 
> Ouch, you're perfectly right.  Then let's make the signature:
> 
> void pqLogMessage(PGconn *conn, const char *message, bool toServer);

Yeah, looks good!  I agree that going this route will result in more
trustworthy trace output.

-- 
Álvaro Herrera39°49'30"S 73°17'W

Re: libpq debug log

2021-03-10 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-10, iwata@fujitsu.com wrote:

> Hi all,
> 
> Following all reviewer's advice, I have created a new patch.
> 
> In this patch, I add only two tracing entry points; I call 
> pqTraceOutputMsg(PGconn conn, int msgCursor, PGCommSource commsource) in 
> pqParseInput3 () and pqPutMsgEnd () to output log.
> The argument contains message first byte offset called msgCursor because it 
> is simpler to specify the buffer pointer in the caller. 
> 
> In pqTraceOutputMsg(), the content common to all protocol messages (first 
> timestamp, < or >, first 1 byte, and length) are output first and after that 
> each protocol message contents is output. I referred to pqParseInput3 () , 
> fe-exec.c and documentation page for that part.
> 
> This fix almost eliminates if(conn->Pfdebug) that was built into the code for 
> other features.

This certainly looks much better!  Thanks for getting it done so
quickly.

I didn't review the code super closely.  I do see a compiler warning:

/pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c: In function 
'pqTraceOutputNchar':
/pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373:2: warning: 
argument 1 null where non-null expected [-Wnonnull]
  memcpy(v, buf + *cursor, len);
  ^

and then when I try to run the "libpq_pipeline" program from the other
thread, I get a crash with this backtrace:

Program received signal SIGSEGV, Segmentation fault.
__memmove_avx_unaligned_erms () at 
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288
288 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No existe el 
fichero o el directorio.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at 
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288
#1  0x77f9b50b in pqTraceOutputNchar (buf=buf@entry=0x55564660 "P", 
LogEnd=LogEnd@entry=42, cursor=cursor@entry=0x7fffdeb4, 
pfdebug=0x55569320, len=1)
at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373
#2  0x77f9bc25 in pqTraceOutputMsg (conn=conn@entry=0x55560260, 
msgCursor=, 
commsource=commsource@entry=MSGDIR_FROM_FRONTEND)
at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:476
#3  0x77f96280 in pqPutMsgEnd (conn=conn@entry=0x55560260)
at /pgsql/source/pipeline/src/interfaces/libpq/fe-misc.c:533
...

The attached patch fixes it, though I'm not sure that that's the final
form we want it to have (since we'd alloc and free repeatedly, making it
slow -- perhaps better to use a static, or ...? ).

-- 
Álvaro Herrera   Valdivia, Chile
Essentially, you're proposing Kevlar shoes as a solution for the problem
that you want to walk around carrying a loaded gun aimed at your foot.
(Tom Lane)
diff --git a/src/interfaces/libpq/libpq-trace.c b/src/interfaces/libpq/libpq-trace.c
index ef294fa556..5d3b40a1d0 100644
--- a/src/interfaces/libpq/libpq-trace.c
+++ b/src/interfaces/libpq/libpq-trace.c
@@ -368,7 +368,15 @@ pqTraceOutputBinary(char *v, int length, FILE *pfdebug)
 static void
 pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len)
 {
-	char	*v = '\0';
+	char	*v;
+
+	v = malloc(len);
+	if (v == NULL)
+	{
+		fprintf(pfdebug, "'..%d chars..'", len);
+		*cursor += len;
+		return;
+	}
 
 	memcpy(v, buf + *cursor, len);
 	*cursor += len;
@@ -377,6 +385,8 @@ pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len)
 	pqTraceOutputBinary(v, len, pfdebug);
 	fprintf(pfdebug, "\'");
 
+	free(v);
+
 	if (*cursor < LogEnd)
 		fprintf(pfdebug, " ");
 	else

Re: libpq debug log

2021-03-10 Thread &#x27;alvhe...@alvh.no-ip.org'

I also found that DataRow messages are not being printed.  This seems to
fix that in the normal case and singlerow, at least in pipeline mode.
Didn't verify the non-pipeline mode.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre)
diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index 68f0f6a081..e8db5edb71 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -869,6 +869,9 @@ getAnotherTuple(PGconn *conn, int msgLength)
 		goto advance_and_error;
 	}
 
+	if (conn->Pfdebug)
+		pqTraceOutputMsg(conn, conn->inStart, MSGDIR_FROM_BACKEND);
+
 	/* Advance inStart to show that the "D" message has been processed. */
 	conn->inStart = conn->inCursor;

Re: libpq debug log

2021-03-10 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-10, 'alvhe...@alvh.no-ip.org' wrote:

> I also found that DataRow messages are not being printed.  This seems to
> fix that in the normal case and singlerow, at least in pipeline mode.
> Didn't verify the non-pipeline mode.

Hm, and RowDescription ('T') messages are also not being printed ... so
I think there's some larger problem here, and perhaps it needs to be
fixed using a different approach.

After staring at it a couple of times, I think that the places in
pqParseInput3() where there's a comment "... moves inStart itself" and
then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU
those are the places where the message in question would not reach the
"Successfully consumed this message" block that prints each message.

-- 
Álvaro Herrera39°49'30"S 73°17'W
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)

Re: libpq debug log

2021-03-10 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-10, Tom Lane wrote:

> "'alvhe...@alvh.no-ip.org'"  writes:
> > After staring at it a couple of times, I think that the places in
> > pqParseInput3() where there's a comment "... moves inStart itself" and
> > then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU
> > those are the places where the message in question would not reach the
> > "Successfully consumed this message" block that prints each message.
> 
> Yeah, the whole business of just when a message has been "consumed"
> is a stumbling block for libpq tracing.  It was a real mess with the
> existing code and v2 protocol, because we could actually try to parse
> a message more than once, with the first few tries deciding that the
> message wasn't all here yet (but nonetheless emitting partial trace
> output).

Hmm, that makes sense, but the issue I'm reporting is not the same,
unless I misunderstand you.

> Now that v2 is dead, the logic to abort because of the message not
> being all arrived yet is basically useless: only the little bit of
> code concerned with the message length word really needs to cope with
> that.  It's tempting to go through and get rid of all the now-unreachable
> "return"s and such, but it seemed like it would be a lot of code churn for
> not really that much gain.

That sounds like an interesting exercise, and I bet it'd bring a lot of
code readability improvements.

> I didn't look at the new version of the patch yet, so I'm not
> sure whether the issues it still has are related to this.

The issues I noticed are related to separate messages rather than one
message split in pieces -- for example several DataRow messages are
processed internally in a loop, rather than each individually.  The
number of places that need to be adjusted for things to AFAICT work
correctly are few enough; ISTM that the attached patch is sufficient.

(The business with the "logged" boolean is necessary so that we print to
the trace file any of those messages even if they are deviating from the
protocol.)

-- 
Álvaro Herrera   Valdivia, Chile
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)

Re: libpq debug log

2021-03-10 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-10, Tom Lane wrote:

> Or we could rethink the logic.  It's not quite clear to me, after
> all this time, why getRowDescriptions() et al are allowed to
> move inStart themselves rather than letting the main loop in
> pqParseInput3 do it.  It might well be an artifact of having not
> rewritten the v2 logic too much.

I would certainly prefer that the logic stays put for the time being,
while I finalize the pipelining stuff.

-- 
Álvaro Herrera   Valdivia, Chile

Re: libpq debug log

2021-03-26 Thread alvhe...@alvh.no-ip.org

Hello

I added an option to the new libpq_pipeline program that it activates
libpq trace.  It works nicely and I think we can add that to the
regression tests.  However I have two observations.

1. The trace output for the error message won't be very nice, because it
prints line numbers.  So if I want to match the output to an "expected"
file, it would break every time somebody edits the source file where the
error originates and the ereport() line is moved.  For example:

<   70  ErrorResponseS "ERROR" V "ERROR" C "22012" M 
"division by zero" F "numeric.c" L "8366" R "div_var" \x00 "Z"

The line number 8366 in this line would be problematic.  I think if we
want regression testing for this, we should have another trace flag to
suppress output of a few of these fields.  I would have it print only S,
C and M.

(Hey, what the heck is that "Z" at the end of the message?  I thought
the error ended right at the \x00 ...)


2. The < and > characters are not good for visual inspection.  I
replaced them with F and B and I think it's much clearer.  Compare:

>   27  Query"SET lc_messages TO "C""
<   8   CommandComplete  "SET"
<   5   ReadyForQueryI
>   21  Parse"" "SELECT $1" 1 23
>   19  Bind "" "" 0 1 1 '1' 1 0
>   6   Describe P ""
>   9   Execute  "" 0
>   4   Sync
<   4   ParseComplete
<   4   BindComplete
<   33  RowDescription   1 "?column?" 0 0 23 4 -1 0
<   11  DataRow  1 1 '1'
<   13  CommandComplete  "SELECT 1"
<   5   ReadyForQueryI
>   4   Terminate

with

F   27  Query"SET lc_messages TO "C""
B   8   CommandComplete  "SET"
B   5   ReadyForQueryI
F   21  Parse"" "SELECT $1" 1 23
F   19  Bind "" "" 0 1 1 '1' 1 0
F   6   Describe P ""
F   9   Execute  "" 0
F   4   Sync
B   4   ParseComplete
B   4   BindComplete
B   33  RowDescription   1 "?column?" 0 0 23 4 -1 0
B   11  DataRow  1 1 '1'
B   13  CommandComplete  "SELECT 1"
B   5   ReadyForQueryI
F   4   Terminate

I think the second one is much easier on the eye.

(This one is the output from "libpq_pipeline simple_pipeline").

-- 
Álvaro Herrera   Valdivia, Chile
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)

Re: libpq debug log

2021-03-26 Thread alvhe...@alvh.no-ip.org

Proposed changes on top of v29.

-- 
Álvaro Herrera   Valdivia, Chile
>From b32ae3805bb28553c0a1cf308c6ed27f58576f3c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 26 Mar 2021 19:12:12 -0300
Subject: [PATCH 1/5] libpq_pipeline: add -t support for PQtrace

---
 .../modules/libpq_pipeline/libpq_pipeline.c   | 84 +--
 1 file changed, 59 insertions(+), 25 deletions(-)

diff --git a/src/test/modules/libpq_pipeline/libpq_pipeline.c b/src/test/modules/libpq_pipeline/libpq_pipeline.c
index 846ee9f5ab..34d085035b 100644
--- a/src/test/modules/libpq_pipeline/libpq_pipeline.c
+++ b/src/test/modules/libpq_pipeline/libpq_pipeline.c
@@ -23,6 +23,7 @@
 #include "catalog/pg_type_d.h"
 #include "common/fe_memutils.h"
 #include "libpq-fe.h"
+#include "pg_getopt.h"
 #include "portability/instr_time.h"
 
 
@@ -30,6 +31,9 @@ static void exit_nicely(PGconn *conn);
 
 const char *const progname = "libpq_pipeline";
 
+/* Options and defaults */
+char	   *tracefile = NULL;	/* path to PQtrace() file */
+
 
 #define DEBUG
 #ifdef DEBUG
@@ -1209,8 +1213,10 @@ usage(const char *progname)
 {
 	fprintf(stderr, "%s tests libpq's pipeline mode.\n\n", progname);
 	fprintf(stderr, "Usage:\n");
-	fprintf(stderr, "  %s tests", progname);
-	fprintf(stderr, "  %s testname [conninfo [number_of_rows]]\n", progname);
+	fprintf(stderr, "  %s [OPTION] tests\n", progname);
+	fprintf(stderr, "  %s [OPTION] TESTNAME [CONNINFO [NUMBER_OF_ROWS]\n", progname);
+	fprintf(stderr, "\nOptions:\n");
+	fprintf(stderr, "  -t TRACEFILE   generate a libpq trace to TRACEFILE\n");
 }
 
 static void
@@ -1231,37 +1237,54 @@ main(int argc, char **argv)
 {
 	const char *conninfo = "";
 	PGconn	   *conn;
+	FILE	   *trace;
+	char	   *testname;
 	int			numrows = 1;
 	PGresult   *res;
+	int			c;
 
-	if (strcmp(argv[1], "tests") == 0)
+	while ((c = getopt(argc, argv, "t:")) != -1)
 	{
-		print_test_list();
-		exit(0);
+		switch (c)
+		{
+			case 't':			/* trace file */
+tracefile = pg_strdup(optarg);
+break;
+		}
 	}
 
-	/*
-	 * The testname parameter is mandatory; it can be followed by a conninfo
-	 * string and number of rows.
-	 */
-	if (argc < 2 || argc > 4)
+	if (optind < argc)
+	{
+		testname = argv[optind];
+		optind++;
+	}
+	else
 	{
 		usage(argv[0]);
 		exit(1);
 	}
 
-	if (argc >= 3)
-		conninfo = pg_strdup(argv[2]);
+	if (strcmp(testname, "tests") == 0)
+	{
+		print_test_list();
+		exit(0);
+	}
 
-	if (argc >= 4)
+	if (optind < argc)
+	{
+		conninfo = argv[optind];
+		optind++;
+	}
+	if (optind < argc)
 	{
 		errno = 0;
-		numrows = strtol(argv[3], NULL, 10);
+		numrows = strtol(argv[optind], NULL, 10);
 		if (errno != 0 || numrows <= 0)
 		{
-			fprintf(stderr, "couldn't parse \"%s\" as a positive integer\n", argv[3]);
+			fprintf(stderr, "couldn't parse \"%s\" as a positive integer\n", argv[optind]);
 			exit(1);
 		}
+		optind++;
 	}
 
 	/* Make a connection to the database */
@@ -1272,30 +1295,41 @@ main(int argc, char **argv)
 PQerrorMessage(conn));
 		exit_nicely(conn);
 	}
+
+	/* Set the trace file, if requested */
+	if (tracefile != NULL)
+	{
+		trace = fopen(tracefile, "w+");
+
+		if (trace == NULL)
+			pg_fatal("could not open file \"%s\": %m", tracefile);
+		PQtrace(conn, trace);
+		PQtraceSetFlags(conn, PQTRACE_SUPPRESS_TIMESTAMPS);
+	}
+
 	res = PQexec(conn, "SET lc_messages TO \"C\"");
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		pg_fatal("failed to set lc_messages: %s", PQerrorMessage(conn));
 
-	if (strcmp(argv[1], "disallowed_in_pipeline") == 0)
+	if (strcmp(testname, "disallowed_in_pipeline") == 0)
 		test_disallowed_in_pipeline(conn);
-	else if (strcmp(argv[1], "multi_pipelines") == 0)
+	else if (strcmp(testname, "multi_pipelines") == 0)
 		test_multi_pipelines(conn);
-	else if (strcmp(argv[1], "pipeline_abort") == 0)
+	else if (strcmp(testname, "pipeline_abort") == 0)
 		test_pipeline_abort(conn);
-	else if (strcmp(argv[1], "pipelined_insert") == 0)
+	else if (strcmp(testname, "pipelined_insert") == 0)
 		test_pipelined_insert(conn, numrows);
-	else if (strcmp(argv[1], "prepared") == 0)
+	else if (strcmp(testname, "prepared") == 0)
 		test_prepared(conn);
-	else if (strcmp(argv[1], "simple_pipeline") == 0)
+	else if (strcmp(testname, "simple_pipeline") == 0)
 		test_simple_pipeline(conn);
-	else if (strcmp(argv[1], "singlerow") == 0)
+	else if (strcmp(testname, "singlerow") == 0)
 		test_singlerowmode(conn);
-	else if (strcmp(argv[1], "transaction") == 0)
+	else if (strcmp(testname, "transaction") == 0)
 		test_transaction(conn);
 	else
 	{
-		fprintf(stderr, "\"%s\" is not a recognized test name\n", argv[1]);
-		usage(argv[0]);
+		fprintf(stderr, "\"%s\" is not a recognized test name\n", testname);
 		exit(1);
 	}
 
-- 
2.20.1

>From 4b62c3159fe3aa8445317a5d65b7e81d91c7fba6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 26 Mar 2021 19:13:04 -0300
Subject: [PATCH 2/5] put FILE * arg always first

---
 src/interfaces/libpq/libpq-trace.c | 234 ++---
 1 fil

Re: libpq debug log

2021-03-27 Thread alvhe...@alvh.no-ip.org

On 2021-Mar-26, alvhe...@alvh.no-ip.org wrote:

> Proposed changes on top of v29.

This last one uses libpq_pipeline -t and verifies the output against an
expected trace file.  Applies on top of all the previous patches.  I
attach the whole lot, so that the CF bot has a chance to run it.

I did notice another problem for comparison of expected trace files,
which is that RowDescription includes table OIDs for some columns.  I
think we would need to have a flag to suppress that too, somehow,
although the answer to what should we do is not as clear as for the
other two cases.

I dodged the issue by just using -t for the pipeline cases that don't
have OIDs in their output.  This leaves us with no coverage for the
ErrorResponse message, which I think is a shortcoming.  If we fixed the
OID problem and the ErrorResponse problem, we could add an expected
trace for pipeline_abort.  I think we should do that.

Maybe the easiest way is to have a new flag PQTRACE_REGRESS_MODE.

-- 
Álvaro Herrera39°49'30"S 73°17'W
Essentially, you're proposing Kevlar shoes as a solution for the problem
that you want to walk around carrying a loaded gun aimed at your foot.
(Tom Lane)
>From f239da90f069b19e1c2fa6b69bc7c6fd4be826b9 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 26 Mar 2021 11:10:52 -0300
Subject: [PATCH v30 1/7] libpq trace v29

---
 doc/src/sgml/libpq.sgml |  39 +-
 src/interfaces/libpq/Makefile   |   1 +
 src/interfaces/libpq/exports.txt|   1 +
 src/interfaces/libpq/fe-connect.c   |  21 -
 src/interfaces/libpq/fe-exec.c  |   4 -
 src/interfaces/libpq/fe-misc.c  |  66 +--
 src/interfaces/libpq/fe-protocol3.c |   8 +
 src/interfaces/libpq/libpq-fe.h |   2 +
 src/interfaces/libpq/libpq-int.h|   7 +
 src/interfaces/libpq/libpq-trace.c  | 717 
 10 files changed, 789 insertions(+), 77 deletions(-)
 create mode 100644 src/interfaces/libpq/libpq-trace.c

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index be674fbaa9..838e394d54 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -6459,12 +6459,27 @@ PGContextVisibility PQsetErrorContextVisibility(PGconn *conn, PGContextVisibilit
 
 
  
-  Enables  tracing of the client/server communication to a debugging file stream.
+  Enables tracing of the client/server communication to a debugging file
+  stream.
 
 void PQtrace(PGconn *conn, FILE *stream);
 
  
 
+ 
+  Each line consists of: an optional timestamp, a direction indicator
+  (> for messages from client to server
+  or < for messages from server to client),
+  message length, message type, and message contents.
+  Non-message contents fields (timestamp, direction, length and message type)
+  are separated by a tab. Message contents are separated by a space.
+  Protocol strings are enclosed in double quotes, while strings used as data
+  values are enclosed in single quotes.  Non-printable chars are printed as
+  hexadecimal escapes.
+  Further message-type-specific detail can be found in
+  .
+ 
+
  
   
On Windows, if the libpq library and an application are
@@ -6479,6 +6494,28 @@ void PQtrace(PGconn *conn, FILE *stream);
 

 
+   
+PQtraceSetFlagsPQtraceSetFlags
+
+
+ 
+  Controls the tracing behavior of client/server communication.
+
+void PQtraceSetFlags(PGconn *conn, int flags);
+
+ 
+
+ 
+  flags contains flag bits describing the operating mode
+  of tracing.
+  If flags contains PQTRACE_SUPPRESS_TIMESTAMPS,
+  then the timestamp is not included when printing each message.
+  This function must be called after calling PQtrace.
+ 
+
+
+   
+

 PQuntracePQuntrace
 
diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index 2aca882a2b..0424523492 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -41,6 +41,7 @@ OBJS = \
 	fe-secure.o \
 	legacy-pqsignal.o \
 	libpq-events.o \
+	libpq-trace.o \
 	pqexpbuffer.o \
 	fe-auth.o
 
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 5c48c14191..a00701f2c5 100644
--- a/src/interfaces/libpq/exports.txt
+++ b/src/interfaces/libpq/exports.txt
@@ -183,3 +183,4 @@ PQenterPipelineMode   180
 PQexitPipelineMode181
 PQpipelineSync182
 PQpipelineStatus  183
+PQtraceSetFlags   184
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 53b354abb2..a90bdb8ab6 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -6859,27 +6859,6 @@ PQsetErrorContextVisibility(PGconn *conn, PGContextVisibility show_context)
 	return old;
 }
 
-void
-PQtrace(PGconn *conn, FILE *debug_port)
-{
-	if (conn == NULL)
-		return;
-	PQuntrace(conn);
-	conn->Pfdebug = debug_port;
-}
-
-voi

Re: libpq debug log

2021-03-27 Thread alvhe...@alvh.no-ip.org

On 2021-Mar-27, alvhe...@alvh.no-ip.org wrote:

> This last one uses libpq_pipeline -t and verifies the output against an
> expected trace file.  Applies on top of all the previous patches.  I
> attach the whole lot, so that the CF bot has a chance to run it.

All tests pass, but CFbot does not run src/test/modules, so it's not
saying much.

-- 
Álvaro Herrera39°49'30"S 73°17'W

Re: libpq debug log

2021-03-28 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-29, tsunakawa.ta...@fujitsu.com wrote:

> > (Hey, what the heck is that "Z" at the end of the message?  I thought
> > the error ended right at the \x00 ...)
> 
> We'll investigate these issues.

For what it's worth, I did fix this problem in patch 0005 that I
attached.  The problem was that one "continue" should have been "break",
and also a "while ( .. )" needed to be made an infinite loop.  It was
easy to catch these problems once I added (in 0006) the check that the
bytes consumed equal message length, as I had suggested a couple of
weeks ago :-)  I also changed the code for Notice, but I didn't actually
verify that one.

> > 2. The < and > characters are not good for visual inspection.  I
> > replaced them with F and B and I think it's much clearer.  Compare:
> > I think the second one is much easier on the eye.
> 
> Yes, agreed.  I too thought of something like "C->S" and "S->C"
> because client and server should be more familiar for users than
> frontend and backend.

Hmm, yeah, that's a reasonable option too.  What do others think?

-- 
Álvaro Herrera   Valdivia, Chile

Re: libpq debug log

2021-03-30 Thread &#x27;alvhe...@alvh.no-ip.org'

Okay, pushed this patch and the new testing for it based on
libpq_pipeline.  We'll see how the buildfarm likes it.

I made some further changes to the last version; user-visibly, I split
the trace flags in two, keeping the timestamp suppression separate from
the redacting feature for regression testing.

I didn't like the idea of silently skipping the redacted fields, so I
changed the code to print  or  instead.  I also made the
redacting occur in the last mile (pqTraceOutputInt32 / String) rather
that in their callers: it seemed quite odd to advance the cursor in the
"else" branch.

I refactored the duplicate code that appeared for Notice and Error.
In that function, we redact not only the 'L' field (what Iwata-san was
doing) but also 'F' (file) and 'R' (routine) because those things can
move around for reasons that are not interesting to testing this code.

In the libpq_pipeline commit I added 'pipeline_abort' and 'transaction'
to the cases that generate traces, which adds coverage for
NoticeResponse and ErrorResponse.

-- 
Álvaro Herrera39°49'30"S 73°17'W

Re: libpq debug log

2021-03-30 Thread &#x27;alvhe...@alvh.no-ip.org'



So crake failed.  The failure is that it doesn't print the DataRow
messages.  That's quite odd.  We see this in the trace log:

Mar 30 20:05:15 # F 54  Parse"" "SELECT 1.0/g FROM 
generate_series(3, -1, -1) g" 0
Mar 30 20:05:15 # F 12  Bind "" "" 0 0 0
Mar 30 20:05:15 # F 6   Describe P ""
Mar 30 20:05:15 # F 9   Execute  "" 0
Mar 30 20:05:15 # F 4   Sync
Mar 30 20:05:15 # B 4   ParseComplete
Mar 30 20:05:15 # B 4   BindComplete
Mar 30 20:05:15 # B 33  RowDescription   1 "?column?"  0  65535 
-1 0
Mar 30 20:05:15 # B 70  ErrorResponseS "ERROR" V "ERROR" C "22012" 
M "division by zero" F "" L "" R "" \\x00
Mar 30 20:05:15 # B 5   ReadyForQueryI

and the expected is this:

Mar 30 20:05:15 # F 54  Parse"" "SELECT 1.0/g FROM 
generate_series(3, -1, -1) g" 0
Mar 30 20:05:15 # F 12  Bind "" "" 0 0 0
Mar 30 20:05:15 # F 6   Describe P ""
Mar 30 20:05:15 # F 9   Execute  "" 0
Mar 30 20:05:15 # F 4   Sync
Mar 30 20:05:15 # B 4   ParseComplete
Mar 30 20:05:15 # B 4   BindComplete
Mar 30 20:05:15 # B 33  RowDescription   1 "?column?"  0  65535 
-1 0
Mar 30 20:05:15 # B 32  DataRow  1 22 '0.'
Mar 30 20:05:15 # B 32  DataRow  1 22 '0.5000'
Mar 30 20:05:15 # B 32  DataRow  1 22 '1.'
Mar 30 20:05:15 # B 70  ErrorResponseS "ERROR" V "ERROR" C "22012" 
M "division by zero" F "" L "" R "" \\x00
Mar 30 20:05:15 # B 5   ReadyForQueryI


...

I wonder if this is a libpq pipeline problem rather than a PQtrace
problem.  In that test case we're using the libpq singlerow mode, so we
should see three rows before the error is sent, but for some reason
crake is not doing that.

aborted pipeline... NOTICE:  table "pq_pipeline_demo" does not exist, skipping
ok
got expected ERROR:  cannot insert multiple commands into a prepared statement
got expected division-by-zero
ok 5 - libpq_pipeline pipeline_abort
not ok 6 - pipeline_abort trace match


Other hosts seem to get it right:

# Running: libpq_pipeline -t 
/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/libpq_pipeline/tmp_check/log/pipeline_abort.trace
 pipeline_abort port=49797 
host=/var/folders/md/8mp8j5m5169ccgy11plb4w_8gp/T/iA9YP9_IHa 
dbname='postgres' 1
aborted pipeline... NOTICE:  table "pq_pipeline_demo" does not exist, skipping
ok
got expected ERROR:  cannot insert multiple commands into a prepared statement
got row: 0.
got row: 0.5000
got row: 1.
got expected division-by-zero
ok 5 - libpq_pipeline pipeline_abort
ok 6 - pipeline_abort trace match


-- 
Álvaro Herrera   Valdivia, Chile
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php

Re: libpq debug log

2021-03-30 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-30, 'alvhe...@alvh.no-ip.org' wrote:

> got expected ERROR:  cannot insert multiple commands into a prepared statement
> got expected division-by-zero

Eh?  this is not at all expected, of course, but clearly not PQtrace's
fault.  I'll look tomorrow.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"El sentido de las cosas no viene de las cosas, sino de
las inteligencias que las aplican a sus problemas diarios
en busca del progreso." (Ernesto Hernández-Novich)

Re: libpq debug log

2021-03-31 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-31, Tom Lane wrote:

> I think this is a timing problem that's triggered (on some machines)
> by force_parallel_mode = regress.  Looking at spurfowl's latest
> failure of this type, the postmaster log shows
> 
> 2021-03-31 14:34:54.982 EDT [18233:15] 001_libpq_pipeline.pl LOG:  execute 
> : SELECT 1.0/g FROM generate_series(3, -1, -1) g
> 2021-03-31 14:34:54.992 EDT [18234:1] ERROR:  division by zero
> 2021-03-31 14:34:54.992 EDT [18234:2] STATEMENT:  SELECT 1.0/g FROM 
> generate_series(3, -1, -1) g
> 2021-03-31 14:34:54.993 EDT [18233:16] 001_libpq_pipeline.pl ERROR:  division 
> by zero
> 2021-03-31 14:34:54.993 EDT [18233:17] 001_libpq_pipeline.pl STATEMENT:  
> SELECT 1.0/g FROM generate_series(3, -1, -1) g
> 2021-03-31 14:34:54.995 EDT [18216:4] LOG:  background worker "parallel 
> worker" (PID 18234) exited with exit code 1
> 2021-03-31 14:34:54.995 EDT [18233:18] 001_libpq_pipeline.pl LOG:  could not 
> send data to client: Broken pipe
> 2021-03-31 14:34:54.995 EDT [18233:19] 001_libpq_pipeline.pl FATAL:  
> connection to client lost
> 
> We can see that the division by zero occurred in a parallel worker.
> My theory is that in parallel mode, it's uncertain whether the
> error will be reported before or after the "preceding" successful
> output rows.  So you need to disable parallelism to make this
> test case stable.

Ooh, that makes sense, thanks.  Added a SET in the program itself to set
it to off.

This is not the *only* issue though; at least animal drongo shows a
completely different failure, where the last few tests don't even get to
run, and the server log just has this:

2021-03-31 20:20:14.460 UTC [178364:4] 001_libpq_pipeline.pl LOG:  
disconnection: session time: 0:00:00.469 user=pgrunner database=postgres 
host=127.0.0.1 port=52638
2021-03-31 20:20:14.681 UTC [178696:1] [unknown] LOG:  connection received: 
host=127.0.0.1 port=52639
2021-03-31 20:20:14.687 UTC [178696:2] [unknown] LOG:  connection authorized: 
user=pgrunner database=postgres application_name=001_libpq_pipeline.pl
2021-03-31 20:20:15.157 UTC [178696:3] 001_libpq_pipeline.pl LOG:  could not 
receive data from client: An existing connection was forcibly closed by the 
remote host.

I suppose the program is crashing for some reason.

-- 
Álvaro Herrera   Valdivia, Chile
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)

Re: libpq debug log

2021-03-31 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-31, Tom Lane wrote:

> I wrote:
> > That is weird - only test 4 (of 8) runs at all, the rest seem to
> > fail to connect.  What's different about pipelined_insert?
> 
> Oh ... there's a pretty obvious theory.  pipelined_insert is
> the only one that is not asked to write a trace file.
> So for some reason, opening the trace file fails.
> (I wonder why we don't see an error message for this though.)

.. oh, I think we forgot to set conn->Pfdebug = NULL when creating the
connection.  So when we do PQtrace(), the first thing it does is
PQuntrace(), and then that tries to do fflush(conn->Pfdebug) ---> crash.
So this should fix it.

-- 
Álvaro Herrera   Valdivia, Chile
 really, I see PHP as like a strange amalgamation of C, Perl, Shell
 inflex: you know that "amalgam" means "mixture with mercury",
   more or less, right?
 i.e., "deadly poison"
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index a90bdb8ab6..56a8266bc3 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3952,6 +3952,7 @@ makeEmptyPGconn(void)
 	conn->verbosity = PQERRORS_DEFAULT;
 	conn->show_context = PQSHOW_CONTEXT_ERRORS;
 	conn->sock = PGINVALID_SOCKET;
+	conn->Pfdebug = NULL;
 
 	/*
 	 * We try to send at least 8K at a time, which is the usual size of pipe

Re: libpq debug log

2021-03-31 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-31, 'alvhe...@alvh.no-ip.org' wrote:

> .. oh, I think we forgot to set conn->Pfdebug = NULL when creating the
> connection.  So when we do PQtrace(), the first thing it does is
> PQuntrace(), and then that tries to do fflush(conn->Pfdebug) ---> crash.
> So this should fix it.

I tried to use MALLOC_PERTURB_ to prove this theory, but I failed at it.
I'll just push this blind and see what happens.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."  (Brian Kernighan)

Re: libpq debug log

2021-03-31 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-31, Tom Lane wrote:

> While this may have little to do with drongo's issue,
> I'm going to take exception to this bit that I see that
> the patch added to PQtrace():
> 
>   /* Make the trace stream line-buffered */
>   setvbuf(debug_port, NULL, _IOLBF, 0);

Mea culpa.  I added this early on because it made PQtrace's use more
comfortable, but I agree that it's a mistake.  Removed.  I put it in
libpq_pipeline.c instead.

> ... and as for an actual explanation for drongo's issue,
> I'm sort of wondering why libpq_pipeline.c is opening the
> trace file in "w+" mode rather than vanilla "w" mode.
> That seems unnecessary, and maybe it doesn't work so well
> on Windows.

well, at least the MSVC documentation claims that it works, but who
knows.  I removed that too.
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-160

-- 
Álvaro Herrera   Valdivia, Chile

Re: libpq debug log

2021-04-01 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Mar-31, Tom Lane wrote:

> I wrote:
> > That is weird - only test 4 (of 8) runs at all, the rest seem to
> > fail to connect.  What's different about pipelined_insert?
> 
> Oh ... there's a pretty obvious theory.  pipelined_insert is
> the only one that is not asked to write a trace file.
> So for some reason, opening the trace file fails.
> (I wonder why we don't see an error message for this though.)

Eh, so I forgot to strdup(optarg[optind]).  Apparently that works fine
in glibc but other getopt implementations are likely not so friendly.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."   (Fotis)
   (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)

Re: libpq debug log

2021-04-01 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Apr-01, Tom Lane wrote:

> BTW, what in the world is this supposed to accomplish?
> 
> -(long long) rows_to_send);
> +(1L << 62) + (long long) rows_to_send);
> 
> Various buildfarm members are complaining that the shift distance
> is more than the width of "long", which I'd just fix with s/L/LL/
> except that the addition of the constant seems just wrong to
> begin with.

It makes the text representation wider, which means some buffer fills up
faster and the program switches from sending to receiving.

-- 
Álvaro Herrera   Valdivia, Chile
"Ni aún el genio muy grande llegaría muy lejos
si tuviera que sacarlo todo de su propio interior" (Goethe)

Re: libpq debug log

2021-04-01 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Apr-01, Tom Lane wrote:

> "'alvhe...@alvh.no-ip.org'"  writes:
> > On 2021-Apr-01, Tom Lane wrote:
> >> BTW, what in the world is this supposed to accomplish?
> >> -(long long) rows_to_send);
> >> +(1L << 62) + (long long) rows_to_send);
> 
> > It makes the text representation wider, which means some buffer fills up
> > faster and the program switches from sending to receiving.
> 
> Hm.  If the actual field value isn't relevant, why bother including
> rows_to_send in it?  A constant string would be simpler and much
> less confusing as to its purpose.

Hah, yeah, we could as well do that I guess :-)

-- 
Álvaro Herrera   Valdivia, Chile
"I think my standards have lowered enough that now I think 'good design'
is when the page doesn't irritate the living f*ck out of me." (JWZ)

Re: libpq debug log

2021-04-01 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Apr-01, Tom Lane wrote:

> So drongo is still failing, and after a bit of looking around at
> other code I finally got hit with the clue hammer.  Per port.h:
> 
>  * On Windows, setvbuf() does not support _IOLBF mode, and interprets that
>  * as _IOFBF.  To add insult to injury, setvbuf(file, NULL, _IOFBF, 0)
>  * crashes outright if "parameter validation" is enabled.  Therefore, in
>  * places where we'd like to select line-buffered mode, we fall back to
>  * unbuffered mode instead on Windows.  Always use PG_IOLBF not _IOLBF
>  * directly in order to implement this behavior.
> 
> You want to do the honors?  And do something about that shift bug
> while at it.

Ooh, wow ... now that is a silly bug!  Thanks, I'll push the fix in a
minute.

Andrew also noted that the use of command_ok() in the test file eats all
the output and is what is preventing us from seeing it in the first.
I'll try to get rid of that together with the rest.

-- 
Álvaro Herrera   Valdivia, Chile

Re: libpq debug log

2021-04-01 Thread &#x27;alvhe...@alvh.no-ip.org'

On 2021-Apr-01, 'alvhe...@alvh.no-ip.org' wrote:

> Ooh, wow ... now that is a silly bug!  Thanks, I'll push the fix in a
> minute.

It still didn't fix it!  Drongo is now reporting a difference in the
expected trace -- and the differences all seem to be message lengths.
Now that is pretty mysterious, because the messages themselves are
printed identically.  Perl's output is pretty unhelpful, but I wrote them to a
file and diffed manually; it's attached.

So for example when we expect
! # B   123 ErrorResponseS "ERROR" V "ERROR" C "42601" M "cannot insert 
multiple commands into a prepared statement" F "" L "" R "" \\x00

we actually get
! # B   173 ErrorResponseS "ERROR" V "ERROR" C "42601" M "cannot insert 
multiple commands into a prepared statement" F "" L "" R "" \\x00

[slaps forehead] I suppose the difference must be that the message
length includes the redacted string fields.  So the file in the F file
is 50 chars longer, *kaboom*.  (I'm actually very surprised that this
didn't blow up earlier.)

I'm not sure what to do about this. Maybe the message length for
ErrorResponse/NoticeResponse ought to be redacted too.

-- 
Álvaro Herrera39°49'30"S 73°17'W

Re: Preventing abort() and exit() calls in libpq

2021-07-02 Thread alvhe...@alvh.no-ip.org

On 2021-Jul-02, Jacob Champion wrote:

> Only src/common:
> 
> controldata_utils_shlib.o:
>  U close
>  U __errno_location
>  U exit

Actually, I do see these in the .o files as well, but they don't make it
to the .a file.

gcc here is 8.3.0.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-07-28 Thread alvhe...@alvh.no-ip.org

Hello,

On 2022-Jun-29, Imseih (AWS), Sami wrote:

> > Would you mind trying the second attached to abtain detailed log on
> > your testing environment? With the patch, the modified TAP test yields
> > the log lines like below.
> 
> Thanks for this. I will apply this to the testing environment and
> will share the output.

Any luck with this?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)

Re: prevent immature WAL streaming

2021-08-25 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-25, Jakub Wartak wrote:

> In order to get reliable reproducer and get proper the fault injection
> instead of playing with really filling up fs, apparently one could
> substitute fd with fd of /dev/full using e.g. dup2() so that every
> write is going to throw this error too:

Oh, this is a neat trick that I didn't know about.  Thanks.

> After restarting master and inspecting standby - in all of those above
> 3 cases - the standby didn't inhibit the "invalid contrecord length"
> at least here, while without corruption this v02 patch it is
> notorious. So if it passes the worst-case code review assumptions I
> would be wondering if it shouldn't even be committed as it stands
> right now.

Well, Nathan is right that this patch isn't really closing the hole
because we aren't tracking segment boundaries that aren't "earliest" nor
"latest" at the moment of registration.  The simplistic approach seems
okay for the archive case, but not for the replication case.

I also noticed today (facepalm) that the patch doesn't work unless you
have archiving enabled, because the registration code is only invoked
when XLogArchivingActive().  Doh.  This part is easy to solve.  The
other doesn't look easy ...

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-08-25 Thread alvhe...@alvh.no-ip.org

BTW while going about testing this, I noticed that we forbid
pg_walfile_name() while in recovery.  That restriction was added by
commit 370f770c15a4 because ThisTimeLineID was not set correctly during
recovery.  That was supposed to be fixed by commit 1148e22a82ed, so I
thought that it should be possible to remove the restriction.  However,
I did that per the attached patch, but was quickly disappointed because
ThisTimeLineID seems to remain zero in a standby for reasons that I
didn't investigate.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
   (Paul Graham)
>From 1b57e504b9a7eeddf2d99e615f21c5aec042a908 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 25 Aug 2021 19:55:43 -0400
Subject: [PATCH] Don't forbid pg_walfile_name in recovery mode

---
 src/backend/access/transam/xlogfuncs.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b98deb72ec..99b22d0b30 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -503,13 +503,6 @@ pg_walfile_name(PG_FUNCTION_ARGS)
 	XLogRecPtr	locationpoint = PG_GETARG_LSN(0);
 	char		xlogfilename[MAXFNAMELEN];
 
-	if (RecoveryInProgress())
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("recovery is in progress"),
- errhint("%s cannot be executed during recovery.",
-		 "pg_walfile_name()")));
-
 	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
 	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size);
 
-- 
2.30.2

Re: archive status ".ready" files may be created too early

2021-08-30 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-28, Andres Freund wrote:

> While rebasing the aio patchset ontop of HEAD I noticed that this commit added
> another atomic operation to XLogWrite() with archiving enabled. The WAL write
> path is really quite hot, and at least some of the
> NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
> 
> I think we should at least try to make the fast-path where no segment
> boundaries were crossed use no atomic operations.

I think the best way to achieve this is is to rely completely on
walwriter doing the segment notification, so that the WAL write done by
backend would only do a latch set.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-31 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-31, Andres Freund wrote:

> Maybe, but this is getting uglier and uglier.
> 
> I think patch should be reverted. It's not in a state that's appropriate for
> the backbranches.

Yeah, that's becoming my conclusion too -- undo that, and start from
scratch using the other idea.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-31 Thread alvhe...@alvh.no-ip.org

On 2021-Sep-01, Michael Paquier wrote:

> That's about 515e3d8, right?

Yes.

> I have not looked in details at what you have here, but this produces
> a compilation warning on Windows for me with this part of the patch:

This seems a tiny speck in a sea of bogosity.  If you want to silence
the warning, be my guest, but in the long run I am inclined to revert
the whole commit once I have a better picture of a way forward.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-16 Thread alvhe...@alvh.no-ip.org

So why do we call this structure "record boundary map", when the
boundaries it refers to are segment boundaries?  I think we should call
it "segment boundary map" instead ... and also I think we should use
that name in the lock that protects it, so instead of ArchNotifyLock, it
could be SegmentBoundaryLock or perhaps WalSegmentBoundaryLock.

The reason for the latter is that I suspect the segment boundary map
will also have a use to fix the streaming replication issue I mentioned
earlier in the thread.  This also makes me think that we'll want the wal
record *start* address to be in the hash table too, not just its *end*
address.  So we'll use the start-1 as position to send, rather than the
end-of-segment when GetFlushRecPtr() returns that.

As for 0x, I think it would be cleaner to do a
#define MaxXLogSegNo with that value in the line immediately after
typedef XLogSegNo, rather than use the numerical value directly in the
assignment.

Typo in comment atop RemoveRecordBoundariesUpTo: it reads "up to an",
should read "up to and".

I think the API of GetLatestRecordBoundarySegment would be better by
returning the boolean and having the segment as out argument.  Then you
could do the caller more cleanly,

if (GetLatestRecordBoundarySegment(last_notified, flushed, 
&latest_boundary_segment))
{
   SetLastNotified( ... );
   RemoveRecordBoundaries( ... );
   LWLockRelease( ... );
   for (..)
 XLogArchiveNotifySeg(...);
   PgArchWakeup();
}
else
   LWLockRelease(...);

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)

Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-17, Bossart, Nathan wrote:

> On 8/16/21, 5:09 PM, "alvhe...@alvh.no-ip.org"  
> wrote:
> > The reason for the latter is that I suspect the segment boundary map
> > will also have a use to fix the streaming replication issue I mentioned
> > earlier in the thread.  This also makes me think that we'll want the wal
> > record *start* address to be in the hash table too, not just its *end*
> > address.  So we'll use the start-1 as position to send, rather than the
> > end-of-segment when GetFlushRecPtr() returns that.
> 
> I've been thinking about the next steps for this one, too.  ISTM we'll
> need to basically assume that the flush pointer jumps along record
> boundaries except for the cross-segment records.  I don't know if that
> is the safest assumption, but I think the alternative involves
> recording every record boundary in the map.

I'm not sure I understand your idea correctly.  Perhaps another solution
is to assume that the flush pointer jumps along record boundaries
*including* for cross-segment records.  The problem stems precisely from
the fact that we set the flush pointer at segment boundaries, even when
they aren't record boundary.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.  (Don Knuth)

Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-17, Bossart, Nathan wrote:

> I think we are in agreement.  If we assume that the flush pointer
> jumps along record boundaries and segment boundaries, the solution
> would be to avoid using the flush pointer when it points to a segment
> boundary (given that the segment boundary is not also a record
> boundary).  Instead, we'd only send up to the start position of the
> last record in the segment to standbys.

Agreed.

An implementation for that would be to test the flush pointer for it
being a segment boundary, and in that case we (acquire segment boundary
lock and) test for presence in the segment boundary map.  If present,
then retreat the pointer to the record's start address.

This means that we acquire the segment boundary lock rarely.  I was
concerned that we'd need to acquire it every time we read the flush
pointer, which would have been a disaster.

Thanks

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org

The thing I still don't understand about this patch is why we call
RegisterSegmentBoundaryEntry and NotifySegmentsReadyForArchive in
XLogInsertRecord.

My model concept of this would have these routines called only in
XLogFlush / XLogWrite, which are conceptually about transferring data
from in-memory WAL buffers into the on-disk WAL store (pg_xlog files).
As I understand, XLogInsertRecord is about copying bytes from the
high-level operation (heap insert etc) into WAL buffers.  So doing the
register/notify dance in both places should be redundant and
unnecessary.


In the NotifySegmentsReadyForArchive() call at the bottom of XLogWrite,
we use flushed=InvalidXLogRecPtr.  Why is that?  Surely we can use
LogwrtResult.Flush, just like in the other callsite there?  (If we're
covering for somebody advancing FlushRecPtr concurrently, I think we
add a comment to explain that.  But why do we need to do that in the
first place?)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)

Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-17, Bossart, Nathan wrote:

> The main reason for registering the boundaries in XLogInsertRecord()
> is that it has the required information about the WAL record
> boundaries.  I do not think that XLogWrite() has this information.

Doh, of course.  So, why isn't it that we call Register in
XLogInsertRecord, and Notify in XLogWrite?

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)

Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-17, Bossart, Nathan wrote:
> On 8/17/21, 2:13 PM, "alvhe...@alvh.no-ip.org"  
> wrote:
>
> > So, why isn't it that we call Register in XLogInsertRecord, and
> > Notify in XLogWrite?
> 
> We do.  However, we also call NotifySegmentsReadyForArchive() in
> XLogInsertRecord() to handle the probably-unlikely scenario that the
> flush pointer has already advanced past the to-be-registered boundary.
> This ensures that the .ready files are created as soon as possible.

I see.

I have two thoughts on that.  First, why not do it outside the block
that tests for crossing a segment boundary?  If that's a good thing to
do, then we should do it always.

However, why do it in a WAL-producing client-connected backend?  It
strikes me as a bad thing to do, because you are possibly causing delays
for client-connected backends.  I suggest that we should give this task
to the WAL writer process -- say, have XLogBackgroundFlush do it.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
 (George Orwell's The Lord of the Rings)

Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-17, alvhe...@alvh.no-ip.org wrote:

> However, why do it in a WAL-producing client-connected backend?  It
> strikes me as a bad thing to do, because you are possibly causing delays
> for client-connected backends.  I suggest that we should give this task
> to the WAL writer process -- say, have XLogBackgroundFlush do it.

Reading the comments on walwriter.c I am hesitant of having walwriter do
it:

>  * Because the walwriter's cycle is directly linked to the maximum delay
>  * before async-commit transactions are guaranteed committed, it's probably
>  * unwise to load additional functionality onto it.  For instance, if you've
>  * got a yen to create xlog segments further in advance, that'd be better done
>  * in bgwriter than in walwriter.

So that comment suggests that we should give the responsibility to bgwriter.
This seems good enough to me.  I suppose if bgwriter has a long run of
buffers to write it could take a little bit of time (a few hundred
milliseconds?) but I think that should be okay.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No renuncies a nada. No te aferres a nada."

Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-18, Bossart, Nathan wrote:

> On 8/18/21, 10:06 AM, "alvhe...@alvh.no-ip.org"  
> wrote:
> > So that comment suggests that we should give the responsibility to bgwriter.
> > This seems good enough to me.  I suppose if bgwriter has a long run of
> > buffers to write it could take a little bit of time (a few hundred
> > milliseconds?) but I think that should be okay.
> 
> Do you think bgwriter should be the only caller of
> NotifySegmentsReadyForArchive(), or should we still have XLogWrite()
> call it?

I think XLogWrite should absolutely be the primary caller.  The one in
bgwriter should be a backstop for the case you describe where the flush
pointer advanced past the registration point in XLogInsertRecord.

I realize this means there's a contradiction with my previous argument,
in that synchronous transaction commit calls XLogWrite at some point, so
we *are* putting the client-connected backend in charge of creating the
notify files.  However, that only happens on transaction commit, where
we already accept responsibility for the WAL flush, not on each
individual XLOG record insert; also, the WAL writer will take care of it
sometimes, for transactions that are long-enough lived.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it." (ncm, http://lwn.net/Articles/174769/)

Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-18, alvhe...@alvh.no-ip.org wrote:

> I realize this means there's a contradiction with my previous argument,
> in that synchronous transaction commit calls XLogWrite at some point, so
> we *are* putting the client-connected backend in charge of creating the
> notify files.  However, that only happens on transaction commit, where
> we already accept responsibility for the WAL flush, not on each
> individual XLOG record insert; also, the WAL writer will take care of it
> sometimes, for transactions that are long-enough lived.

Eh.  I just said WAL writer will sometimes do it, and that's true
because it'll occur in XLogBackgroundFlush.  But upthread I wimped out
of having WAL writer call NotifySegmentsReadyForArchive() and instead
opined to give responsibility to bgwriter.  However, thinking about it
again, maybe it does make sense to have walwriter do it too directly.
This causes no harm to walwriter's time constraints, since *it will have
to do it via XLogBackgroundFlush anyway*.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-18, Bossart, Nathan wrote:

> I'll add it after XLogBackgroundFlush().

I was wondering which would be better: before or after.
XLogBackgroundFlush would do it anyway, so if you do it after then it's
not clear to me that it'd do anything (I mean we should not do any new
calls of NotifySegmentsReadyForArchive and just rely on the one in
XLogBackgroundFlush -> XLogWrite).

The advantage of doing NotifySegmentsReadyForArchive before
XLogBackgroundFlush is that the files would be created sooner, so the
archiver can be working in parallel while walwriter does its other
thing; then we'd reach the NotifySegmentsReadyForArchive in
XLogBackgroundFlush and it'd find nothing to do most of the time, which
is just fine.

> I think we'll also want to set the WAL writer's latch in case it is
> hibernating.

Yeah.  (That's another advantage of doing it in walwriter rather than
bgwriter: we don't publish bgwriter's latch anywhere AFAICS).

> Another approach could be to keep the NotifySegmentsReadyForArchive()
> call in XLogInsertRecord(), but only call it if the flush pointer is
> beyond the boundary we just registered.  Or we could only set the
> latch in XLogInsertRecord() if we detect that the flush pointer has
> advanced.

Hmm.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-18 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-18, Bossart, Nathan wrote:

> As long as XLogBackgroundFlush() found work to do, it would take care
> of notifying, but I don't think we can depend on that.  However, since
> we're primarily using the WAL writer to take care of the case when the
> record has already been flushed, notifying beforehand seems fine to
> me.  If XLogBackgroundFlush() does end up calling XLogWrite(), it'll
> call it again, anyway.

Agreed.

> In the attached patch, I modified XLogInsertRecord() to simply set the
> latch if we detect that flushRecPtr has advanced.

Right, that's what I was thinking.  I modified that slightly to use
LogwrtResult.Flush (which should be fresh enough) instead of calling
GetFlushRecPtr again, which saves a bit.  I also changed it to > instead
of >=, because if I understand you correctly we only care to notify if
the flush pointer advanced, not in the case it stayed the same.

I made a few other cosmetic tweaks -- added comment to
SegmentBoundaryEntry and renamed the 'pos' to 'endpos'; renamed argument
'notify' of XLogArchiveNotify to 'nudge' (because having two different
"notify" concepts in that function seemed confusing).

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html
>From 39190a4e33a4cb35402ec9451ce6678f6e9aa494 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Aug 2021 03:52:43 +
Subject: [PATCH v10] Avoid creating archive status ".ready" files too early.

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL
page of the segment is written, the archive status file will be
created.  If PostgreSQL crashes before it is able to write the rest
of the record, it will end up reusing segments that have already
been marked as ready-for-archival.  However, the archiver process
may have already processed the old version of the segment, so the
wrong version of the segment may be backed-up.  This backed-up
segment will cause operations such as point-in-time restores to
fail.

To fix this, we keep track of records that span across segments and
ensure that segments are only marked ready-for-archival once such
records have been completely written to disk.
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 291 ++-
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 +
 src/backend/replication/walreceiver.c|   6 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/xlog.h|   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 +
 src/tools/pgindent/typedefs.list |   1 +
 10 files changed, 309 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..d2ccf2a7bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -512,6 +512,16 @@ typedef enum ExclusiveBackupState
  */
 static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
 
+/*
+ * Entries for SegmentBoundaryMap.  Each such entry represents a WAL
+ * record that ends in endpos and crosses a WAL segment boundary.
+ */
+typedef struct SegmentBoundaryEntry
+{
+	XLogSegNo	seg;			/* must be first */
+	XLogRecPtr	endpos;
+} SegmentBoundaryEntry;
+
 /*
  * Shared state data for WAL insertion.
  */
@@ -723,6 +733,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * The last segment we've marked ready for archival.  Protected by
+	 * info_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -736,6 +752,12 @@ static WALInsertLockPadded *WALInsertLocks = NULL;
  */
 static ControlFileData *ControlFile = NULL;
 
+/*
+ * Segment boundary map, used for marking segments as ready for archival.
+ * Protected by SegmentBoundaryLock.
+ */
+static HTAB *SegmentBoundaryMap = NULL;
+
 /*
  * Calculate the amount of space left on the page after 'endptr'. Beware
  * multiple evaluation!
@@ -962,6 +984,13 @@ static XLogRecPtr XLogBytePos

Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org

In v12 I moved the code around a bit and reworded some comments.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From 7d1475578431e265a5e7f8b94a6b0791b68a9a31 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Aug 2021 03:52:43 +
Subject: [PATCH v12] Avoid creating archive status ".ready" files too early

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL page of
the segment is written, the archive status file is created, and the
archiver may process it.  If PostgreSQL crashes before it is able to
write and flush the rest of the record (in the next WAL segment), the
wrong version of the first segment file lingers in the archive, which
causes operations such as point-in-time restores to fail.

To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.

XXX Note that this may add a new LWLock acquisition in the transaction
commit path.  (It doesn't happen in all cases.)

Author: Nathan Bossart 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: Ryo Matsumura 
Reviewed-by: Andrey Borodin 
Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 307 ++-
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 +
 src/backend/replication/walreceiver.c|   6 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/xlog.h|   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 +
 src/tools/pgindent/typedefs.list |   1 +
 10 files changed, 323 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..01da4ab7b8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -512,6 +512,16 @@ typedef enum ExclusiveBackupState
  */
 static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
 
+/*
+ * Entries for SegmentBoundaryMap.  Each such entry represents a WAL
+ * record that ends in endpos and crosses the WAL segment boundary.
+ */
+typedef struct SegmentBoundaryEntry
+{
+	XLogSegNo	seg;			/* must be first */
+	XLogRecPtr	endpos;
+} SegmentBoundaryEntry;
+
 /*
  * Shared state data for WAL insertion.
  */
@@ -723,6 +733,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * The last segment we've marked ready for archival.  Protected by
+	 * info_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -736,6 +752,23 @@ static WALInsertLockPadded *WALInsertLocks = NULL;
  */
 static ControlFileData *ControlFile = NULL;
 
+/*
+ * Segment boundary map.
+ *
+ * This hash table tracks WAL records that cross WAL segment boundaries.
+ * For WAL archiving it is critical not to mark a segment as archivable
+ * until both halves of the record are flushed: otherwise if we crash
+ * after archiving the segment containing the first half and before
+ * flushing the second half of the record, then after recovery the
+ * primary would overwrite the first half with new data, making the
+ * segment archived prior to the crash a bogus copy.
+ *
+ * XXX note that an equivalent problem exists with streaming replication.
+ *
+ * Protected by SegmentBoundaryLock.
+ */
+static HTAB *SegmentBoundaryMap = NULL;
+
 /*
  * Calculate the amount of space left on the page after 'endptr'. Beware
  * multiple evaluation!
@@ -920,6 +953,13 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
+static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
+static XLogSegNo GetLastNotifiedSegment(void);
+static void SetLastNotifiedSegment(XLogSegNo seg);
+static bool GetLatestSegmentBoundary(XLogSegNo last_notified,
+	 XLogRecPtr flushed,
+	 XLogSegNo *latest_boundary_seg);
+static void RemoveSegmentBoundariesUpTo(XLogSegNo seg);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord

Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org

Two things.

1. We use a hash table in shared memory.  That's great.  The part that's
   not so great is that in both places where we read items from it, we
   have to iterate in some way.  This seems a bit silly.  An array would
   serve us better, if only we could expand it as needed.  However, in
   shared memory we can't do that.  (I think the list of elements we
   need to memoize is arbitrary long, if enough processes can be writing
   WAL at the same time.)

   Now that I think about this again, maybe it's limited by
   NUM_XLOGINSERT_LOCKS, since there can only be that many records being
   written down at a time ...

2. There is a new LWLock acquisition that may be a new contention point.
   We acquire the lock in these cases:
   - WAL record insert, when a segment boundary is crosses (rare
 enough).
   - XLogWrite, when a segment needs to be notified.

   Looking again, I think the last point might be a problem actually,
   because XLogWrite is called with WALWriteLock held.  Maybe we should
   take the NotifySegmentsReadyForArchive() call outside the section
   locked by WALWriteLock (so put it in XLogWrite callers instead of
   XLogWrite itself).

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)

Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-20, Bossart, Nathan wrote:

> On 8/20/21, 8:29 AM, "Robert Haas"  wrote:

> > We can't expand the hash table either. It has an initial and maximum
> > size of 16 elements, which means it's basically an expensive array,
> > and which also means that it imposes a new limit of 16 *
> > wal_segment_size on the size of WAL records. If you exceed that limit,
> > I think things just go boom... which I think is not acceptable. I
> > think we can have records in the multi-GB range of wal_level=logical
> > and someone chooses a stupid replica identity setting.
> 
> I was under the impression that shared hash tables could be expanded
> as necessary, but from your note and the following comment, that does
> not seem to be true:

Actually, you were right.  Hash tables in shared memory can be expanded.
There are some limitations (the hash "directory" is fixed size, which
means the hash table get less efficient if it grows too much), but you
can definitely create more hash entries than the initial size.  See for
example element_alloc(), which covers the case of a hash table being
IS_PARTITIONED -- something that only shmem hash tables can be.  Note
that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError
as allocation function, which acquires memory from the shared segment.

This is a minor thing -- it doesn't affect the fact that the hash table
is possibly being misused and inefficient -- but I thought it was worth
pointing out.

As an example, consider the LOCK / PROCLOCK hash tables.  These can
contain more elements than max_backends * max_locks_per_transaction.
Those elements consume shared memory from the "allocation slop" in the
shared memory segment.  It's tough when it happens (as far as I know the
memory is never "returned" once such a hash table grows to use that
space), but it does work.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-20, Bossart, Nathan wrote:

> > On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan  wrote:

> >> This led me to revisit the two-element
> >> approach that was discussed upthread.  What if we only stored the
> >> earliest and latest segment boundaries at any given time?  Once the
> >> earliest boundary is added, it never changes until the segment is
> >> flushed and it is removed.  The latest boundary, however, will be
> >> updated any time we register another segment.  Once the earliest
> >> boundary is removed, we replace it with the latest boundary.  This
> >> strategy could cause us to miss intermediate boundaries, but AFAICT
> >> the worst case scenario is that we hold off creating .ready files a
> >> bit longer than necessary.

> I've attached a patch to demonstrate what I'm thinking.

There is only one thing I didn't like in this new version, which is that
we're holding info_lck too much.  I've seen info_lck contention be a
problem in some workloads and I'd rather not add more stuff to it.  I'd
rather we stick with using a new lock object to protect all the data we
need for this job.

Should this new lock object be a spinlock or an lwlock?  I think a
spinlock would generally be better because it's lower overhead and we
can't use it in shared mode anywhere, which would be the greatest
argument for an lwlock.  However, I think we avoid letting code run with
spinlocks held that's not straight-line code, and we have some function
calls there.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org

Attached is v14 which uses a separate spinlock.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
 no me acuerdo." (Augusto Pinochet a una corte de justicia)
>From 4180334f51a1f343ce5795d0b62f3aa298b472da Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 20 Aug 2021 19:25:31 +
Subject: [PATCH v14] Avoid creating archive status ".ready" files too early

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL page of
the segment is written, the archive status file is created, and the
archiver may process it.  If PostgreSQL crashes before it is able to
write and flush the rest of the record (in the next WAL segment), the
wrong version of the first segment file lingers in the archive, which
causes operations such as point-in-time restores to fail.

To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.

Author: Nathan Bossart 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: Ryo Matsumura 
Reviewed-by: Andrey Borodin 
Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 242 +--
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 +
 src/backend/replication/walreceiver.c|   6 +-
 src/include/access/xlog.h|   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 +
 8 files changed, 256 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..95f03adef8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,6 +724,18 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
+
+	/*
+	 * Variables used to track cross-segment records.  Protected by
+	 * segtrack_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+	XLogSegNo	earliestSegBoundary;
+	XLogRecPtr	earliestSegBoundaryRecPtr;
+	XLogSegNo	latestSegBoundary;
+	XLogRecPtr	latestSegBoundaryRecPtr;
+
+	slock_t		segtrack_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
 static XLogCtlData *XLogCtl = NULL;
@@ -920,6 +932,9 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
+static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
+static bool GetLatestSegmentBoundary(XLogRecPtr flushed, XLogSegNo *latest_boundary_seg);
+static void RemoveSegmentBoundariesUpTo(XLogSegNo seg);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
@@ -1154,23 +1169,56 @@ XLogInsertRecord(XLogRecData *rdata,
 	END_CRIT_SECTION();
 
 	/*
-	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
+	 * If we crossed page boundary, update LogwrtRqst.Write; if we crossed
+	 * segment boundary, register that and wake up walwriter.
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
+		XLogSegNo	StartSeg;
+		XLogSegNo	EndSeg;
+
+		XLByteToSeg(StartPos, StartSeg, wal_segment_size);
+		XLByteToSeg(EndPos, EndSeg, wal_segment_size);
+
+		/*
+		 * Register our crossing the segment boundary if that occurred.
+		 *
+		 * Note that we did not use XLByteToPrevSeg() for determining the
+		 * ending segment.  This is so that a record that fits perfectly into
+		 * the end of the segment is marked ready for archival as soon as the
+		 * flushed pointer jumps to the next segment.
+		 */
+		if (StartSeg != EndSeg && XLogArchivingActive())
+			RegisterSegmentBoundary(EndSeg, EndPos);
+
+		/*
+		 * Advance LogwrtRqst.Write so that it includes new block(s).
+		 *
+		 * We do this after registering the segment boundary so that the
+		 * comparison with the flushed pointer below can use the latest value
+		 * known globally.
+		 */
 		SpinLockAcquire(&XLogCtl->info_lck);
-		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRq

Re: archive status ".ready" files may be created too early

2021-08-20 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-20, Bossart, Nathan wrote:

> I was looking at moving the function calls out of the spinlock region.
> I don't think the functions are doing anything too expensive, and they
> help clean up NotifySegmentsReadyForArchive() quite a bit, but I
> understand why it might be against project policy to do something like
> that.  It would be easy enough to get rid of the helper functions if
> that was concern.

Well, the thing I realized is that these three helper functions have
exactly one caller each.  I think the compiler is going to inline them,
so there isn't going to be a function call in the assembly.  I haven't
verified this, though.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)

Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-21, Bossart, Nathan wrote:

> > Well, the thing I realized is that these three helper functions have
> > exactly one caller each.  I think the compiler is going to inline them,
> > so there isn't going to be a function call in the assembly.  I haven't
> > verified this, though.
> 
> Good point.  It looks like they're getting inlined for me.

I still didn't like it, because it looks like we're creating an API for
which there can be only one caller.  So I expanded the functions in the
caller.  It doesn't look too bad.  However ...

... while reading the resulting code after backpatching to all branches,
I realized that if there are no registrations whatsoever, then archiving
won't do anything, which surely is the wrong thing to do.  The correct
behavior should be "if there are no registrations, then *all* flushed
segments can be notified".

I'll fix that ...

Another thing I didn't like is that you used a name ending in RecPtr for
the LSN, which gives no indication that it really is the *end* LSN, not
the start pointer.  And it won't play nice with the need to add the
*start* LSN which we'll need to implement solving the equivalent problem
for streaming replication.  I'll rename those to
earliestSegBoundaryEndPtr and latestSegBoundaryEndPtr.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.
>From 6e4c3fd6f5687ec5762de121344ce35c1c890812 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 23 Aug 2021 09:06:09 -0400
Subject: [PATCH v15] Avoid creating archive status ".ready" files too early

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL page of
the segment is written, the archive status file is created, and the
archiver may process it.  If PostgreSQL crashes before it is able to
write and flush the rest of the record (in the next WAL segment), the
wrong version of the first segment file lingers in the archive, which
causes operations such as point-in-time restores to fail.

To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.

This has always been wrong, so backpatch all the way back.

Author: Nathan Bossart 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: Ryo Matsumura 
Reviewed-by: Andrey Borodin 
Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 220 +--
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 +
 src/backend/replication/walreceiver.c|   6 +-
 src/include/access/xlog.h|   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 +
 8 files changed, 234 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..8e7c3a364a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,6 +724,18 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;

 	slock_t		info_lck;		/* locks shared variables shown above */
+
+	/*
+	 * Variables used to track cross-segment records.  Protected by
+	 * segtrack_lck.
+	 */
+	XLogSegNo	lastNotifiedSeg;
+	XLogSegNo	earliestSegBoundary;
+	XLogRecPtr	earliestSegBoundaryRecPtr;
+	XLogSegNo	latestSegBoundary;
+	XLogRecPtr	latestSegBoundaryRecPtr;
+
+	slock_t		segtrack_lck;		/* locks shared variables shown above */
 } XLogCtlData;

 static XLogCtlData *XLogCtl = NULL;
@@ -920,6 +932,7 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
+static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
@@ -1154,23 +1167,56 @@ XLogInsertRecord(XLogRecData *rdata,
 	END_CRIT_SECTION();

 	/*
-	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
+	 * If we crossed page boundary, update LogwrtRqst.Write; if we crossed
+	 * se

Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-23, Bossart, Nathan wrote:

> Hm.  My expectation would be that if there are no registrations, we
> cannot create .ready files for the flushed segments.  The scenario
> where I can see that happening is when a record gets flushed to disk
> prior to registration.  In that case, we'll still eventually register
> the record and wake up the WAL writer process, which will take care of
> creating the .ready files that were missed earlier.  Is there another
> case you are thinking of where we could miss registration for a cross-
> segment record altogether?

I'm thinking of the case where no record cross segment boundaries ever.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-23, Bossart, Nathan wrote:

> Sorry, I'm still not following this one.  If we skipped creating
> .ready segments due to a crash, we rely on RemoveOldXlogFiles() to
> create them as needed in the end-of-recovery checkpoint.  If a record
> fits perfectly in the end of a segment, we'll still register it as a
> boundary for the next segment (hence why we use XLByteToSeg() instead
> of XLByteToPrevSeg()).  If database activity stops completely, there
> shouldn't be anything to mark ready.

The only way .ready files are created is that XLogNotifyWrite() is
called.  For regular WAL files during regular operation, that only
happens in XLogNotifyWriteSeg().  That, in turn, only happens in
NotifySegmentsReadyForArchive().  But if the system runs and never
writes WAL records that cross WAL boundaries, that function will see
that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno,
and return without doing anything.  So no segments will be notified.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/

Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-23, alvhe...@alvh.no-ip.org wrote:

> The only way .ready files are created is that XLogNotifyWrite() is
> called.  For regular WAL files during regular operation, that only
> happens in XLogNotifyWriteSeg().  That, in turn, only happens in
> NotifySegmentsReadyForArchive().  But if the system runs and never
> writes WAL records that cross WAL boundaries, that function will see
> that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno,
> and return without doing anything.  So no segments will be notified.

Nevermind -- I realized that all segments get registered, not just those
for which we generate continuation records.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"

Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-23, Bossart, Nathan wrote:

> Ah, okay.  BTW the other changes you mentioned made sense to me.

Thanks.  I've pushed this now to all live branches.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
 It does it in a really, really complicated way
 why does it need to be complicated?
 Because it's MakeMaker.

Re: archive status ".ready" files may be created too early

2021-08-23 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-23, Bossart, Nathan wrote:

> On 8/23/21, 12:55 PM, "alvhe...@alvh.no-ip.org"  
> wrote:
> > Thanks.  I've pushed this now to all live branches.
> 
> Thank you!  I appreciate the thorough reviews.  Should we make a new
> thread for the streaming replication fix?

Yeah, this one is long enough :-)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-08-24 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-24, Bossart, Nathan wrote:

> I wonder if we need to move the call to RegisterSegmentBoundary() to
> somewhere before WALInsertLockRelease() for this to work as intended.
> Right now, boundary registration could take place after the flush
> pointer has been advanced, which means GetSafeFlushRecPtr() could
> still return an unsafe position.

Yeah, you're right -- that's a definite risk.  I didn't try to reproduce
a problem with that, but it is seems pretty obvious that it can happen.

I didn't have a lot of luck with a reliable reproducer script.  I was
able to reproduce the problem starting with Ryo Matsumura's script and
attaching a replica; most of the time the replica would recover by
restarting from a streaming position earlier than where the problem
occurred; but a few times it would just get stuck with a WAL segment
containing a bogus record.  Then, after patch, the problem no longer
occurs.

I attach the patch with the change you suggested.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes...  Fixed.
http://archives.postgresql.org/message-id/482d1632.8010...@sigaev.ru
>From 9f39d6f186c3af55db85ee73427a9ed94750939e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 23 Aug 2021 18:25:48 -0400
Subject: [PATCH v2] Don't stream non-final WAL segments

Avoid setting the physical-stream replication read pointer in the middle
of a WAL record.  This can occur if a record is split in two (or more)
across segment boundaries.  The reason to avoid it is that if we stream
the segment containing the first half, and then we crash before writing
the next segment, the primary will rewrite the tail of the segment with
a new WAL record (having discarded the incomplete record), but the
replica will be stuck trying to replay a broken file (since the next
segment will never contain the now-gone data).

To do this, change streaming replication to retreat the flush pointer
according to registered segment boundaries.
---
 src/backend/access/transam/xlog.c   | 80 +
 src/backend/replication/walsender.c |  2 +-
 src/include/access/xlog.h   |  1 +
 3 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03e..4d010c86dd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -731,8 +731,10 @@ typedef struct XLogCtlData
 	 */
 	XLogSegNo	lastNotifiedSeg;
 	XLogSegNo	earliestSegBoundary;
+	XLogRecPtr	earliestSegBoundaryStartPtr;
 	XLogRecPtr	earliestSegBoundaryEndPtr;
 	XLogSegNo	latestSegBoundary;
+	XLogRecPtr	latestSegBoundaryStartPtr;
 	XLogRecPtr	latestSegBoundaryEndPtr;
 
 	slock_t		segtrack_lck;	/* locks shared variables shown above */
@@ -932,7 +934,7 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
-static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
+static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr startpos, XLogRecPtr endpos);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
@@ -1022,6 +1024,8 @@ XLogInsertRecord(XLogRecData *rdata,
 			   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
+	XLogSegNo	StartSeg;
+	XLogSegNo	EndSeg;
 	bool		prevDoPageWrites = doPageWrites;
 
 	/* we assume that all of the record header is in the first chunk */
@@ -1157,6 +1161,30 @@ XLogInsertRecord(XLogRecData *rdata,
 		 */
 	}
 
+	/*
+	 * Before releasing our WAL insertion lock, register that we crossed the
+	 * segment boundary if that occurred.  We need to do it with the lock held
+	 * for GetSafeFlushRecPtr's sake: otherwise it could see the WAL flush
+	 * point advance but not see the registration, which would lead it to
+	 * wrongly conclude that our flush point is safe to use.
+	 */
+	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
+	{
+		XLByteToSeg(StartPos, StartSeg, wal_segment_size);
+		XLByteToSeg(EndPos, EndSeg, wal_segment_size);
+
+		/*
+		 * Register our crossing the segment boundary if that occurred.
+		 *
+		 * Note that we did not use XLByteToPrevSeg() for determining the
+		 * ending segment.  This is so that a record that fits perfectly into
+		 * the end of the segment causes said segment to get marked ready for
+		 * archival immediately.
+		 */
+		if (StartSeg != EndSeg && XLogArchivingActive())
+			RegisterSegmentBoundary(EndSeg, StartPos, EndPos);
+	}
+
 	/*
 	 * Done! Let others know that we're finished.
 	 */
@@ -1168,27 +1196,10 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	/*
 	 * If we crossed page boundary, update LogwrtRqst.Write; if we crossed
-	 * segment boundary, register that and

Re: prevent immature WAL streaming

2021-08-25 Thread alvhe...@alvh.no-ip.org

On 2021-Aug-24, Bossart, Nathan wrote:

> If moving RegisterSegmentBoundary() is sufficient to prevent the flush
> pointer from advancing before we register the boundary, I bet we could
> also remove the WAL writer nudge.

Can you elaborate on this?  I'm not sure I see the connection.

> Another interesting thing I see is that the boundary stored in
> earliestSegBoundary is not necessarily the earliest one.  It's just
> the first one that has been registered.  I did this for simplicity for
> the .ready file fix, but I can see it causing problems here.

Hmm, is there really a problem here?  Surely the flush point cannot go
past whatever has been written.  If somebody is writing an earlier
section of WAL, then we cannot move the flush pointer to a later
position.  So it doesn't matter if the earliest point we have registered
is the true earliest -- we only care for it to be the earliest that is
past the flush point.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

57 matches

Mail list logo