Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
On 2022-Mar-07, Imseih (AWS), Sami wrote: > I have gone ahead and backpatched this all the way to 10 as well. Thanks! I pushed this now. I edited the test though: I don't understand why you went to the trouble of setting stuff in order to call 'pg_ctl promote' (in different ways for older branches), when $node_standby->promote does the same and is simpler to call. So I changed the tests to do that. (I did verify that without the code fix, the PANIC indeed is thrown.) Thank again, -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
Re: libpq debug log
I'll give this another look tomorrow, but I wanted to pass along that I prefer libpq-trace.{c,h} instead of libpq-logging. I also renamed variable "pin" and pgindented. I don't have any major reservations about this patch now, so I'll mark it ready-for-committer in case somebody else wants to have a look before push. (Not sure about the use of the word "forcely") -- Álvaro Herrera39°49'30"S 73°17'W >From da0faf61ceb8e9cbc5894ca1e43b6f7fc477a835 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 23 Feb 2021 22:17:51 -0300 Subject: [PATCH v20] Improve libpq tracing capabilities Author: Aya Iwata --- doc/src/sgml/libpq.sgml | 25 +- src/interfaces/libpq/Makefile | 3 + src/interfaces/libpq/exports.txt| 1 + src/interfaces/libpq/fe-connect.c | 25 +- src/interfaces/libpq/fe-exec.c | 4 - src/interfaces/libpq/fe-misc.c | 59 ++- src/interfaces/libpq/fe-protocol3.c | 18 + src/interfaces/libpq/libpq-fe.h | 2 + src/interfaces/libpq/libpq-int.h| 6 + src/interfaces/libpq/libpq-trace.c | 638 src/interfaces/libpq/libpq-trace.h | 95 + src/tools/pgindent/typedefs.list| 6 + 12 files changed, 823 insertions(+), 59 deletions(-) create mode 100644 src/interfaces/libpq/libpq-trace.c create mode 100644 src/interfaces/libpq/libpq-trace.h diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 5e25f20843..3a1e368e0b 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -5907,7 +5907,8 @@ PGContextVisibility PQsetErrorContextVisibility(PGconn *conn, PGContextVisibilit - Enables tracing of the client/server communication to a debugging file stream. + Enables tracing of the client/server communication to a debugging file stream. + (Details of tracing contents appear in ). void PQtrace(PGconn *conn, FILE *stream); @@ -5927,6 +5928,28 @@ void PQtrace(PGconn *conn, FILE *stream); + + +PQtraceSetFlagsPQtraceSetFlags + + + + Controls the tracing behavior of client/server communication. + +void PQtraceSetFlags(PGconn *conn, int flags); + + + + + flags contains flag bits describing the operating mode + of tracing. If flags contains PQTRACE_SUPPRESS_TIMESTAMPS, + then timestamp is not printed with each message. If set to 0 (default),tracing will be output with timestamp. + This function should be called after calling PQtrace. + + + + + PQuntracePQuntrace diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile index f74677eaf9..cdb65b48af 100644 --- a/src/interfaces/libpq/Makefile +++ b/src/interfaces/libpq/Makefile @@ -42,6 +42,7 @@ OBJS = \ fe-secure.o \ legacy-pqsignal.o \ libpq-events.o \ + libpq-trace.o \ pqexpbuffer.o \ fe-auth.o @@ -114,6 +115,7 @@ install: all installdirs install-lib $(INSTALL_DATA) $(srcdir)/libpq-fe.h '$(DESTDIR)$(includedir)' $(INSTALL_DATA) $(srcdir)/libpq-events.h '$(DESTDIR)$(includedir)' $(INSTALL_DATA) $(srcdir)/libpq-int.h '$(DESTDIR)$(includedir_internal)' + $(INSTALL_DATA) $(srcdir)/libpq-trace.h '$(DESTDIR)$(includedir)' $(INSTALL_DATA) $(srcdir)/pqexpbuffer.h '$(DESTDIR)$(includedir_internal)' $(INSTALL_DATA) $(srcdir)/pg_service.conf.sample '$(DESTDIR)$(datadir)/pg_service.conf.sample' @@ -127,6 +129,7 @@ uninstall: uninstall-lib rm -f '$(DESTDIR)$(includedir)/libpq-fe.h' rm -f '$(DESTDIR)$(includedir)/libpq-events.h' rm -f '$(DESTDIR)$(includedir_internal)/libpq-int.h' + rm -f '$(DESTDIR)$(includedir)/libpq-trace.h' rm -f '$(DESTDIR)$(includedir_internal)/pqexpbuffer.h' rm -f '$(DESTDIR)$(datadir)/pg_service.conf.sample' diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt index bbc1f90481..09f4a6 100644 --- a/src/interfaces/libpq/exports.txt +++ b/src/interfaces/libpq/exports.txt @@ -179,3 +179,4 @@ PQgetgssctx 176 PQsetSSLKeyPassHook_OpenSSL 177 PQgetSSLKeyPassHook_OpenSSL 178 PQdefaultSSLKeyPassHook_OpenSSL 179 +PQtraceSetFlags 180 diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index db71fea169..9bf2f0d0b6 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -4021,6 +4021,10 @@ freePGconn(PGconn *conn) if (conn->connip) free(conn->connip); /* Note that conn->Pfdebug is not ours to close or free */ + if (conn->be_msg) + free(conn->be_msg); + if (conn->fe_msg) + free(conn->fe_msg); if (conn->last_query) free(conn->last_query); if (conn->write_err_msg) @@ -6766,27 +6770,6 @@ PQsetErrorContextVisibility(PGconn *conn, PGContextVisibility show_context) return old; } -void -PQtrace(PGconn *conn, FILE *debug_port) -{ - if (conn == NULL) - return; - PQuntrace(conn); - conn->Pfdebug = debug_port; -} - -void -PQuntrace(PGconn *conn) -{ - if (conn == NULL) - return; - if (conn->Pfdebug)
Re: libpq debug log
On 2021-Mar-05, tsunakawa.ta...@fujitsu.com wrote: > From: Tom Lane > > But I think passing the message start address explicitly might be > > better than having it understand the buffering behavior in enough > > detail to know where to find the message. Part of the point here > > (IMO) is to decouple the tracing logic from the core libpq logic, in > > hopes of not having common-mode bugs. > > Ouch, you're perfectly right. Then let's make the signature: > > void pqLogMessage(PGconn *conn, const char *message, bool toServer); Yeah, looks good! I agree that going this route will result in more trustworthy trace output. -- Álvaro Herrera39°49'30"S 73°17'W
Re: libpq debug log
On 2021-Mar-10, iwata@fujitsu.com wrote: > Hi all, > > Following all reviewer's advice, I have created a new patch. > > In this patch, I add only two tracing entry points; I call > pqTraceOutputMsg(PGconn conn, int msgCursor, PGCommSource commsource) in > pqParseInput3 () and pqPutMsgEnd () to output log. > The argument contains message first byte offset called msgCursor because it > is simpler to specify the buffer pointer in the caller. > > In pqTraceOutputMsg(), the content common to all protocol messages (first > timestamp, < or >, first 1 byte, and length) are output first and after that > each protocol message contents is output. I referred to pqParseInput3 () , > fe-exec.c and documentation page for that part. > > This fix almost eliminates if(conn->Pfdebug) that was built into the code for > other features. This certainly looks much better! Thanks for getting it done so quickly. I didn't review the code super closely. I do see a compiler warning: /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c: In function 'pqTraceOutputNchar': /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373:2: warning: argument 1 null where non-null expected [-Wnonnull] memcpy(v, buf + *cursor, len); ^ and then when I try to run the "libpq_pipeline" program from the other thread, I get a crash with this backtrace: Program received signal SIGSEGV, Segmentation fault. __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288 288 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No existe el fichero o el directorio. (gdb) bt #0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288 #1 0x77f9b50b in pqTraceOutputNchar (buf=buf@entry=0x55564660 "P", LogEnd=LogEnd@entry=42, cursor=cursor@entry=0x7fffdeb4, pfdebug=0x55569320, len=1) at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373 #2 0x77f9bc25 in pqTraceOutputMsg (conn=conn@entry=0x55560260, msgCursor=, commsource=commsource@entry=MSGDIR_FROM_FRONTEND) at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:476 #3 0x77f96280 in pqPutMsgEnd (conn=conn@entry=0x55560260) at /pgsql/source/pipeline/src/interfaces/libpq/fe-misc.c:533 ... The attached patch fixes it, though I'm not sure that that's the final form we want it to have (since we'd alloc and free repeatedly, making it slow -- perhaps better to use a static, or ...? ). -- Álvaro Herrera Valdivia, Chile Essentially, you're proposing Kevlar shoes as a solution for the problem that you want to walk around carrying a loaded gun aimed at your foot. (Tom Lane) diff --git a/src/interfaces/libpq/libpq-trace.c b/src/interfaces/libpq/libpq-trace.c index ef294fa556..5d3b40a1d0 100644 --- a/src/interfaces/libpq/libpq-trace.c +++ b/src/interfaces/libpq/libpq-trace.c @@ -368,7 +368,15 @@ pqTraceOutputBinary(char *v, int length, FILE *pfdebug) static void pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len) { - char *v = '\0'; + char *v; + + v = malloc(len); + if (v == NULL) + { + fprintf(pfdebug, "'..%d chars..'", len); + *cursor += len; + return; + } memcpy(v, buf + *cursor, len); *cursor += len; @@ -377,6 +385,8 @@ pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len) pqTraceOutputBinary(v, len, pfdebug); fprintf(pfdebug, "\'"); + free(v); + if (*cursor < LogEnd) fprintf(pfdebug, " "); else
Re: libpq debug log
I also found that DataRow messages are not being printed. This seems to fix that in the normal case and singlerow, at least in pipeline mode. Didn't verify the non-pipeline mode. -- Álvaro Herrera39°49'30"S 73°17'W "Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre) diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c index 68f0f6a081..e8db5edb71 100644 --- a/src/interfaces/libpq/fe-protocol3.c +++ b/src/interfaces/libpq/fe-protocol3.c @@ -869,6 +869,9 @@ getAnotherTuple(PGconn *conn, int msgLength) goto advance_and_error; } + if (conn->Pfdebug) + pqTraceOutputMsg(conn, conn->inStart, MSGDIR_FROM_BACKEND); + /* Advance inStart to show that the "D" message has been processed. */ conn->inStart = conn->inCursor;
Re: libpq debug log
On 2021-Mar-10, 'alvhe...@alvh.no-ip.org' wrote: > I also found that DataRow messages are not being printed. This seems to > fix that in the normal case and singlerow, at least in pipeline mode. > Didn't verify the non-pipeline mode. Hm, and RowDescription ('T') messages are also not being printed ... so I think there's some larger problem here, and perhaps it needs to be fixed using a different approach. After staring at it a couple of times, I think that the places in pqParseInput3() where there's a comment "... moves inStart itself" and then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU those are the places where the message in question would not reach the "Successfully consumed this message" block that prints each message. -- Álvaro Herrera39°49'30"S 73°17'W Maybe there's lots of data loss but the records of data loss are also lost. (Lincoln Yeoh)
Re: libpq debug log
On 2021-Mar-10, Tom Lane wrote: > "'alvhe...@alvh.no-ip.org'" writes: > > After staring at it a couple of times, I think that the places in > > pqParseInput3() where there's a comment "... moves inStart itself" and > > then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU > > those are the places where the message in question would not reach the > > "Successfully consumed this message" block that prints each message. > > Yeah, the whole business of just when a message has been "consumed" > is a stumbling block for libpq tracing. It was a real mess with the > existing code and v2 protocol, because we could actually try to parse > a message more than once, with the first few tries deciding that the > message wasn't all here yet (but nonetheless emitting partial trace > output). Hmm, that makes sense, but the issue I'm reporting is not the same, unless I misunderstand you. > Now that v2 is dead, the logic to abort because of the message not > being all arrived yet is basically useless: only the little bit of > code concerned with the message length word really needs to cope with > that. It's tempting to go through and get rid of all the now-unreachable > "return"s and such, but it seemed like it would be a lot of code churn for > not really that much gain. That sounds like an interesting exercise, and I bet it'd bring a lot of code readability improvements. > I didn't look at the new version of the patch yet, so I'm not > sure whether the issues it still has are related to this. The issues I noticed are related to separate messages rather than one message split in pieces -- for example several DataRow messages are processed internally in a loop, rather than each individually. The number of places that need to be adjusted for things to AFAICT work correctly are few enough; ISTM that the attached patch is sufficient. (The business with the "logged" boolean is necessary so that we print to the trace file any of those messages even if they are deviating from the protocol.) -- Álvaro Herrera Valdivia, Chile "La experiencia nos dice que el hombre peló millones de veces las patatas, pero era forzoso admitir la posibilidad de que en un caso entre millones, las patatas pelarían al hombre" (Ijon Tichy)
Re: libpq debug log
On 2021-Mar-10, Tom Lane wrote: > Or we could rethink the logic. It's not quite clear to me, after > all this time, why getRowDescriptions() et al are allowed to > move inStart themselves rather than letting the main loop in > pqParseInput3 do it. It might well be an artifact of having not > rewritten the v2 logic too much. I would certainly prefer that the logic stays put for the time being, while I finalize the pipelining stuff. -- Álvaro Herrera Valdivia, Chile
Re: libpq debug log
Hello I added an option to the new libpq_pipeline program that it activates libpq trace. It works nicely and I think we can add that to the regression tests. However I have two observations. 1. The trace output for the error message won't be very nice, because it prints line numbers. So if I want to match the output to an "expected" file, it would break every time somebody edits the source file where the error originates and the ereport() line is moved. For example: < 70 ErrorResponseS "ERROR" V "ERROR" C "22012" M "division by zero" F "numeric.c" L "8366" R "div_var" \x00 "Z" The line number 8366 in this line would be problematic. I think if we want regression testing for this, we should have another trace flag to suppress output of a few of these fields. I would have it print only S, C and M. (Hey, what the heck is that "Z" at the end of the message? I thought the error ended right at the \x00 ...) 2. The < and > characters are not good for visual inspection. I replaced them with F and B and I think it's much clearer. Compare: > 27 Query"SET lc_messages TO "C"" < 8 CommandComplete "SET" < 5 ReadyForQueryI > 21 Parse"" "SELECT $1" 1 23 > 19 Bind "" "" 0 1 1 '1' 1 0 > 6 Describe P "" > 9 Execute "" 0 > 4 Sync < 4 ParseComplete < 4 BindComplete < 33 RowDescription 1 "?column?" 0 0 23 4 -1 0 < 11 DataRow 1 1 '1' < 13 CommandComplete "SELECT 1" < 5 ReadyForQueryI > 4 Terminate with F 27 Query"SET lc_messages TO "C"" B 8 CommandComplete "SET" B 5 ReadyForQueryI F 21 Parse"" "SELECT $1" 1 23 F 19 Bind "" "" 0 1 1 '1' 1 0 F 6 Describe P "" F 9 Execute "" 0 F 4 Sync B 4 ParseComplete B 4 BindComplete B 33 RowDescription 1 "?column?" 0 0 23 4 -1 0 B 11 DataRow 1 1 '1' B 13 CommandComplete "SELECT 1" B 5 ReadyForQueryI F 4 Terminate I think the second one is much easier on the eye. (This one is the output from "libpq_pipeline simple_pipeline"). -- Álvaro Herrera Valdivia, Chile "Saca el libro que tu religión considere como el indicado para encontrar la oración que traiga paz a tu alma. Luego rebootea el computador y ve si funciona" (Carlos Duclós)
Re: libpq debug log
Proposed changes on top of v29. -- Álvaro Herrera Valdivia, Chile >From b32ae3805bb28553c0a1cf308c6ed27f58576f3c Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 26 Mar 2021 19:12:12 -0300 Subject: [PATCH 1/5] libpq_pipeline: add -t support for PQtrace --- .../modules/libpq_pipeline/libpq_pipeline.c | 84 +-- 1 file changed, 59 insertions(+), 25 deletions(-) diff --git a/src/test/modules/libpq_pipeline/libpq_pipeline.c b/src/test/modules/libpq_pipeline/libpq_pipeline.c index 846ee9f5ab..34d085035b 100644 --- a/src/test/modules/libpq_pipeline/libpq_pipeline.c +++ b/src/test/modules/libpq_pipeline/libpq_pipeline.c @@ -23,6 +23,7 @@ #include "catalog/pg_type_d.h" #include "common/fe_memutils.h" #include "libpq-fe.h" +#include "pg_getopt.h" #include "portability/instr_time.h" @@ -30,6 +31,9 @@ static void exit_nicely(PGconn *conn); const char *const progname = "libpq_pipeline"; +/* Options and defaults */ +char *tracefile = NULL; /* path to PQtrace() file */ + #define DEBUG #ifdef DEBUG @@ -1209,8 +1213,10 @@ usage(const char *progname) { fprintf(stderr, "%s tests libpq's pipeline mode.\n\n", progname); fprintf(stderr, "Usage:\n"); - fprintf(stderr, " %s tests", progname); - fprintf(stderr, " %s testname [conninfo [number_of_rows]]\n", progname); + fprintf(stderr, " %s [OPTION] tests\n", progname); + fprintf(stderr, " %s [OPTION] TESTNAME [CONNINFO [NUMBER_OF_ROWS]\n", progname); + fprintf(stderr, "\nOptions:\n"); + fprintf(stderr, " -t TRACEFILE generate a libpq trace to TRACEFILE\n"); } static void @@ -1231,37 +1237,54 @@ main(int argc, char **argv) { const char *conninfo = ""; PGconn *conn; + FILE *trace; + char *testname; int numrows = 1; PGresult *res; + int c; - if (strcmp(argv[1], "tests") == 0) + while ((c = getopt(argc, argv, "t:")) != -1) { - print_test_list(); - exit(0); + switch (c) + { + case 't': /* trace file */ +tracefile = pg_strdup(optarg); +break; + } } - /* - * The testname parameter is mandatory; it can be followed by a conninfo - * string and number of rows. - */ - if (argc < 2 || argc > 4) + if (optind < argc) + { + testname = argv[optind]; + optind++; + } + else { usage(argv[0]); exit(1); } - if (argc >= 3) - conninfo = pg_strdup(argv[2]); + if (strcmp(testname, "tests") == 0) + { + print_test_list(); + exit(0); + } - if (argc >= 4) + if (optind < argc) + { + conninfo = argv[optind]; + optind++; + } + if (optind < argc) { errno = 0; - numrows = strtol(argv[3], NULL, 10); + numrows = strtol(argv[optind], NULL, 10); if (errno != 0 || numrows <= 0) { - fprintf(stderr, "couldn't parse \"%s\" as a positive integer\n", argv[3]); + fprintf(stderr, "couldn't parse \"%s\" as a positive integer\n", argv[optind]); exit(1); } + optind++; } /* Make a connection to the database */ @@ -1272,30 +1295,41 @@ main(int argc, char **argv) PQerrorMessage(conn)); exit_nicely(conn); } + + /* Set the trace file, if requested */ + if (tracefile != NULL) + { + trace = fopen(tracefile, "w+"); + + if (trace == NULL) + pg_fatal("could not open file \"%s\": %m", tracefile); + PQtrace(conn, trace); + PQtraceSetFlags(conn, PQTRACE_SUPPRESS_TIMESTAMPS); + } + res = PQexec(conn, "SET lc_messages TO \"C\""); if (PQresultStatus(res) != PGRES_COMMAND_OK) pg_fatal("failed to set lc_messages: %s", PQerrorMessage(conn)); - if (strcmp(argv[1], "disallowed_in_pipeline") == 0) + if (strcmp(testname, "disallowed_in_pipeline") == 0) test_disallowed_in_pipeline(conn); - else if (strcmp(argv[1], "multi_pipelines") == 0) + else if (strcmp(testname, "multi_pipelines") == 0) test_multi_pipelines(conn); - else if (strcmp(argv[1], "pipeline_abort") == 0) + else if (strcmp(testname, "pipeline_abort") == 0) test_pipeline_abort(conn); - else if (strcmp(argv[1], "pipelined_insert") == 0) + else if (strcmp(testname, "pipelined_insert") == 0) test_pipelined_insert(conn, numrows); - else if (strcmp(argv[1], "prepared") == 0) + else if (strcmp(testname, "prepared") == 0) test_prepared(conn); - else if (strcmp(argv[1], "simple_pipeline") == 0) + else if (strcmp(testname, "simple_pipeline") == 0) test_simple_pipeline(conn); - else if (strcmp(argv[1], "singlerow") == 0) + else if (strcmp(testname, "singlerow") == 0) test_singlerowmode(conn); - else if (strcmp(argv[1], "transaction") == 0) + else if (strcmp(testname, "transaction") == 0) test_transaction(conn); else { - fprintf(stderr, "\"%s\" is not a recognized test name\n", argv[1]); - usage(argv[0]); + fprintf(stderr, "\"%s\" is not a recognized test name\n", testname); exit(1); } -- 2.20.1 >From 4b62c3159fe3aa8445317a5d65b7e81d91c7fba6 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 26 Mar 2021 19:13:04 -0300 Subject: [PATCH 2/5] put FILE * arg always first --- src/interfaces/libpq/libpq-trace.c | 234 ++--- 1 fil
Re: libpq debug log
On 2021-Mar-26, alvhe...@alvh.no-ip.org wrote: > Proposed changes on top of v29. This last one uses libpq_pipeline -t and verifies the output against an expected trace file. Applies on top of all the previous patches. I attach the whole lot, so that the CF bot has a chance to run it. I did notice another problem for comparison of expected trace files, which is that RowDescription includes table OIDs for some columns. I think we would need to have a flag to suppress that too, somehow, although the answer to what should we do is not as clear as for the other two cases. I dodged the issue by just using -t for the pipeline cases that don't have OIDs in their output. This leaves us with no coverage for the ErrorResponse message, which I think is a shortcoming. If we fixed the OID problem and the ErrorResponse problem, we could add an expected trace for pipeline_abort. I think we should do that. Maybe the easiest way is to have a new flag PQTRACE_REGRESS_MODE. -- Álvaro Herrera39°49'30"S 73°17'W Essentially, you're proposing Kevlar shoes as a solution for the problem that you want to walk around carrying a loaded gun aimed at your foot. (Tom Lane) >From f239da90f069b19e1c2fa6b69bc7c6fd4be826b9 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 26 Mar 2021 11:10:52 -0300 Subject: [PATCH v30 1/7] libpq trace v29 --- doc/src/sgml/libpq.sgml | 39 +- src/interfaces/libpq/Makefile | 1 + src/interfaces/libpq/exports.txt| 1 + src/interfaces/libpq/fe-connect.c | 21 - src/interfaces/libpq/fe-exec.c | 4 - src/interfaces/libpq/fe-misc.c | 66 +-- src/interfaces/libpq/fe-protocol3.c | 8 + src/interfaces/libpq/libpq-fe.h | 2 + src/interfaces/libpq/libpq-int.h| 7 + src/interfaces/libpq/libpq-trace.c | 717 10 files changed, 789 insertions(+), 77 deletions(-) create mode 100644 src/interfaces/libpq/libpq-trace.c diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index be674fbaa9..838e394d54 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -6459,12 +6459,27 @@ PGContextVisibility PQsetErrorContextVisibility(PGconn *conn, PGContextVisibilit - Enables tracing of the client/server communication to a debugging file stream. + Enables tracing of the client/server communication to a debugging file + stream. void PQtrace(PGconn *conn, FILE *stream); + + Each line consists of: an optional timestamp, a direction indicator + (> for messages from client to server + or < for messages from server to client), + message length, message type, and message contents. + Non-message contents fields (timestamp, direction, length and message type) + are separated by a tab. Message contents are separated by a space. + Protocol strings are enclosed in double quotes, while strings used as data + values are enclosed in single quotes. Non-printable chars are printed as + hexadecimal escapes. + Further message-type-specific detail can be found in + . + + On Windows, if the libpq library and an application are @@ -6479,6 +6494,28 @@ void PQtrace(PGconn *conn, FILE *stream); + +PQtraceSetFlagsPQtraceSetFlags + + + + Controls the tracing behavior of client/server communication. + +void PQtraceSetFlags(PGconn *conn, int flags); + + + + + flags contains flag bits describing the operating mode + of tracing. + If flags contains PQTRACE_SUPPRESS_TIMESTAMPS, + then the timestamp is not included when printing each message. + This function must be called after calling PQtrace. + + + + + PQuntracePQuntrace diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile index 2aca882a2b..0424523492 100644 --- a/src/interfaces/libpq/Makefile +++ b/src/interfaces/libpq/Makefile @@ -41,6 +41,7 @@ OBJS = \ fe-secure.o \ legacy-pqsignal.o \ libpq-events.o \ + libpq-trace.o \ pqexpbuffer.o \ fe-auth.o diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt index 5c48c14191..a00701f2c5 100644 --- a/src/interfaces/libpq/exports.txt +++ b/src/interfaces/libpq/exports.txt @@ -183,3 +183,4 @@ PQenterPipelineMode 180 PQexitPipelineMode181 PQpipelineSync182 PQpipelineStatus 183 +PQtraceSetFlags 184 diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 53b354abb2..a90bdb8ab6 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -6859,27 +6859,6 @@ PQsetErrorContextVisibility(PGconn *conn, PGContextVisibility show_context) return old; } -void -PQtrace(PGconn *conn, FILE *debug_port) -{ - if (conn == NULL) - return; - PQuntrace(conn); - conn->Pfdebug = debug_port; -} - -voi
Re: libpq debug log
On 2021-Mar-27, alvhe...@alvh.no-ip.org wrote: > This last one uses libpq_pipeline -t and verifies the output against an > expected trace file. Applies on top of all the previous patches. I > attach the whole lot, so that the CF bot has a chance to run it. All tests pass, but CFbot does not run src/test/modules, so it's not saying much. -- Álvaro Herrera39°49'30"S 73°17'W
Re: libpq debug log
On 2021-Mar-29, tsunakawa.ta...@fujitsu.com wrote: > > (Hey, what the heck is that "Z" at the end of the message? I thought > > the error ended right at the \x00 ...) > > We'll investigate these issues. For what it's worth, I did fix this problem in patch 0005 that I attached. The problem was that one "continue" should have been "break", and also a "while ( .. )" needed to be made an infinite loop. It was easy to catch these problems once I added (in 0006) the check that the bytes consumed equal message length, as I had suggested a couple of weeks ago :-) I also changed the code for Notice, but I didn't actually verify that one. > > 2. The < and > characters are not good for visual inspection. I > > replaced them with F and B and I think it's much clearer. Compare: > > I think the second one is much easier on the eye. > > Yes, agreed. I too thought of something like "C->S" and "S->C" > because client and server should be more familiar for users than > frontend and backend. Hmm, yeah, that's a reasonable option too. What do others think? -- Álvaro Herrera Valdivia, Chile
Re: libpq debug log
Okay, pushed this patch and the new testing for it based on libpq_pipeline. We'll see how the buildfarm likes it. I made some further changes to the last version; user-visibly, I split the trace flags in two, keeping the timestamp suppression separate from the redacting feature for regression testing. I didn't like the idea of silently skipping the redacted fields, so I changed the code to print or instead. I also made the redacting occur in the last mile (pqTraceOutputInt32 / String) rather that in their callers: it seemed quite odd to advance the cursor in the "else" branch. I refactored the duplicate code that appeared for Notice and Error. In that function, we redact not only the 'L' field (what Iwata-san was doing) but also 'F' (file) and 'R' (routine) because those things can move around for reasons that are not interesting to testing this code. In the libpq_pipeline commit I added 'pipeline_abort' and 'transaction' to the cases that generate traces, which adds coverage for NoticeResponse and ErrorResponse. -- Álvaro Herrera39°49'30"S 73°17'W
Re: libpq debug log
So crake failed. The failure is that it doesn't print the DataRow messages. That's quite odd. We see this in the trace log: Mar 30 20:05:15 # F 54 Parse"" "SELECT 1.0/g FROM generate_series(3, -1, -1) g" 0 Mar 30 20:05:15 # F 12 Bind "" "" 0 0 0 Mar 30 20:05:15 # F 6 Describe P "" Mar 30 20:05:15 # F 9 Execute "" 0 Mar 30 20:05:15 # F 4 Sync Mar 30 20:05:15 # B 4 ParseComplete Mar 30 20:05:15 # B 4 BindComplete Mar 30 20:05:15 # B 33 RowDescription 1 "?column?" 0 65535 -1 0 Mar 30 20:05:15 # B 70 ErrorResponseS "ERROR" V "ERROR" C "22012" M "division by zero" F "" L "" R "" \\x00 Mar 30 20:05:15 # B 5 ReadyForQueryI and the expected is this: Mar 30 20:05:15 # F 54 Parse"" "SELECT 1.0/g FROM generate_series(3, -1, -1) g" 0 Mar 30 20:05:15 # F 12 Bind "" "" 0 0 0 Mar 30 20:05:15 # F 6 Describe P "" Mar 30 20:05:15 # F 9 Execute "" 0 Mar 30 20:05:15 # F 4 Sync Mar 30 20:05:15 # B 4 ParseComplete Mar 30 20:05:15 # B 4 BindComplete Mar 30 20:05:15 # B 33 RowDescription 1 "?column?" 0 65535 -1 0 Mar 30 20:05:15 # B 32 DataRow 1 22 '0.' Mar 30 20:05:15 # B 32 DataRow 1 22 '0.5000' Mar 30 20:05:15 # B 32 DataRow 1 22 '1.' Mar 30 20:05:15 # B 70 ErrorResponseS "ERROR" V "ERROR" C "22012" M "division by zero" F "" L "" R "" \\x00 Mar 30 20:05:15 # B 5 ReadyForQueryI ... I wonder if this is a libpq pipeline problem rather than a PQtrace problem. In that test case we're using the libpq singlerow mode, so we should see three rows before the error is sent, but for some reason crake is not doing that. aborted pipeline... NOTICE: table "pq_pipeline_demo" does not exist, skipping ok got expected ERROR: cannot insert multiple commands into a prepared statement got expected division-by-zero ok 5 - libpq_pipeline pipeline_abort not ok 6 - pipeline_abort trace match Other hosts seem to get it right: # Running: libpq_pipeline -t /Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/libpq_pipeline/tmp_check/log/pipeline_abort.trace pipeline_abort port=49797 host=/var/folders/md/8mp8j5m5169ccgy11plb4w_8gp/T/iA9YP9_IHa dbname='postgres' 1 aborted pipeline... NOTICE: table "pq_pipeline_demo" does not exist, skipping ok got expected ERROR: cannot insert multiple commands into a prepared statement got row: 0. got row: 0.5000 got row: 1. got expected division-by-zero ok 5 - libpq_pipeline pipeline_abort ok 6 - pipeline_abort trace match -- Álvaro Herrera Valdivia, Chile "This is what I like so much about PostgreSQL. Most of the surprises are of the "oh wow! That's cool" Not the "oh shit!" kind. :)" Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php
Re: libpq debug log
On 2021-Mar-30, 'alvhe...@alvh.no-ip.org' wrote: > got expected ERROR: cannot insert multiple commands into a prepared statement > got expected division-by-zero Eh? this is not at all expected, of course, but clearly not PQtrace's fault. I'll look tomorrow. -- Álvaro Herrera39°49'30"S 73°17'W "El sentido de las cosas no viene de las cosas, sino de las inteligencias que las aplican a sus problemas diarios en busca del progreso." (Ernesto Hernández-Novich)
Re: libpq debug log
On 2021-Mar-31, Tom Lane wrote: > I think this is a timing problem that's triggered (on some machines) > by force_parallel_mode = regress. Looking at spurfowl's latest > failure of this type, the postmaster log shows > > 2021-03-31 14:34:54.982 EDT [18233:15] 001_libpq_pipeline.pl LOG: execute > : SELECT 1.0/g FROM generate_series(3, -1, -1) g > 2021-03-31 14:34:54.992 EDT [18234:1] ERROR: division by zero > 2021-03-31 14:34:54.992 EDT [18234:2] STATEMENT: SELECT 1.0/g FROM > generate_series(3, -1, -1) g > 2021-03-31 14:34:54.993 EDT [18233:16] 001_libpq_pipeline.pl ERROR: division > by zero > 2021-03-31 14:34:54.993 EDT [18233:17] 001_libpq_pipeline.pl STATEMENT: > SELECT 1.0/g FROM generate_series(3, -1, -1) g > 2021-03-31 14:34:54.995 EDT [18216:4] LOG: background worker "parallel > worker" (PID 18234) exited with exit code 1 > 2021-03-31 14:34:54.995 EDT [18233:18] 001_libpq_pipeline.pl LOG: could not > send data to client: Broken pipe > 2021-03-31 14:34:54.995 EDT [18233:19] 001_libpq_pipeline.pl FATAL: > connection to client lost > > We can see that the division by zero occurred in a parallel worker. > My theory is that in parallel mode, it's uncertain whether the > error will be reported before or after the "preceding" successful > output rows. So you need to disable parallelism to make this > test case stable. Ooh, that makes sense, thanks. Added a SET in the program itself to set it to off. This is not the *only* issue though; at least animal drongo shows a completely different failure, where the last few tests don't even get to run, and the server log just has this: 2021-03-31 20:20:14.460 UTC [178364:4] 001_libpq_pipeline.pl LOG: disconnection: session time: 0:00:00.469 user=pgrunner database=postgres host=127.0.0.1 port=52638 2021-03-31 20:20:14.681 UTC [178696:1] [unknown] LOG: connection received: host=127.0.0.1 port=52639 2021-03-31 20:20:14.687 UTC [178696:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_libpq_pipeline.pl 2021-03-31 20:20:15.157 UTC [178696:3] 001_libpq_pipeline.pl LOG: could not receive data from client: An existing connection was forcibly closed by the remote host. I suppose the program is crashing for some reason. -- Álvaro Herrera Valdivia, Chile "Los dioses no protegen a los insensatos. Éstos reciben protección de otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)
Re: libpq debug log
On 2021-Mar-31, Tom Lane wrote: > I wrote: > > That is weird - only test 4 (of 8) runs at all, the rest seem to > > fail to connect. What's different about pipelined_insert? > > Oh ... there's a pretty obvious theory. pipelined_insert is > the only one that is not asked to write a trace file. > So for some reason, opening the trace file fails. > (I wonder why we don't see an error message for this though.) .. oh, I think we forgot to set conn->Pfdebug = NULL when creating the connection. So when we do PQtrace(), the first thing it does is PQuntrace(), and then that tries to do fflush(conn->Pfdebug) ---> crash. So this should fix it. -- Álvaro Herrera Valdivia, Chile really, I see PHP as like a strange amalgamation of C, Perl, Shell inflex: you know that "amalgam" means "mixture with mercury", more or less, right? i.e., "deadly poison" diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index a90bdb8ab6..56a8266bc3 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -3952,6 +3952,7 @@ makeEmptyPGconn(void) conn->verbosity = PQERRORS_DEFAULT; conn->show_context = PQSHOW_CONTEXT_ERRORS; conn->sock = PGINVALID_SOCKET; + conn->Pfdebug = NULL; /* * We try to send at least 8K at a time, which is the usual size of pipe
Re: libpq debug log
On 2021-Mar-31, 'alvhe...@alvh.no-ip.org' wrote: > .. oh, I think we forgot to set conn->Pfdebug = NULL when creating the > connection. So when we do PQtrace(), the first thing it does is > PQuntrace(), and then that tries to do fflush(conn->Pfdebug) ---> crash. > So this should fix it. I tried to use MALLOC_PERTURB_ to prove this theory, but I failed at it. I'll just push this blind and see what happens. -- Álvaro Herrera39°49'30"S 73°17'W "Someone said that it is at least an order of magnitude more work to do production software than a prototype. I think he is wrong by at least an order of magnitude." (Brian Kernighan)
Re: libpq debug log
On 2021-Mar-31, Tom Lane wrote: > While this may have little to do with drongo's issue, > I'm going to take exception to this bit that I see that > the patch added to PQtrace(): > > /* Make the trace stream line-buffered */ > setvbuf(debug_port, NULL, _IOLBF, 0); Mea culpa. I added this early on because it made PQtrace's use more comfortable, but I agree that it's a mistake. Removed. I put it in libpq_pipeline.c instead. > ... and as for an actual explanation for drongo's issue, > I'm sort of wondering why libpq_pipeline.c is opening the > trace file in "w+" mode rather than vanilla "w" mode. > That seems unnecessary, and maybe it doesn't work so well > on Windows. well, at least the MSVC documentation claims that it works, but who knows. I removed that too. https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-160 -- Álvaro Herrera Valdivia, Chile
Re: libpq debug log
On 2021-Mar-31, Tom Lane wrote: > I wrote: > > That is weird - only test 4 (of 8) runs at all, the rest seem to > > fail to connect. What's different about pipelined_insert? > > Oh ... there's a pretty obvious theory. pipelined_insert is > the only one that is not asked to write a trace file. > So for some reason, opening the trace file fails. > (I wonder why we don't see an error message for this though.) Eh, so I forgot to strdup(optarg[optind]). Apparently that works fine in glibc but other getopt implementations are likely not so friendly. -- Álvaro Herrera39°49'30"S 73°17'W "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and lack of hesitasion in answering a lost soul's question, I just wished the rest of the mailing list could be like this." (Fotis) (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)
Re: libpq debug log
On 2021-Apr-01, Tom Lane wrote: > BTW, what in the world is this supposed to accomplish? > > -(long long) rows_to_send); > +(1L << 62) + (long long) rows_to_send); > > Various buildfarm members are complaining that the shift distance > is more than the width of "long", which I'd just fix with s/L/LL/ > except that the addition of the constant seems just wrong to > begin with. It makes the text representation wider, which means some buffer fills up faster and the program switches from sending to receiving. -- Álvaro Herrera Valdivia, Chile "Ni aún el genio muy grande llegaría muy lejos si tuviera que sacarlo todo de su propio interior" (Goethe)
Re: libpq debug log
On 2021-Apr-01, Tom Lane wrote: > "'alvhe...@alvh.no-ip.org'" writes: > > On 2021-Apr-01, Tom Lane wrote: > >> BTW, what in the world is this supposed to accomplish? > >> -(long long) rows_to_send); > >> +(1L << 62) + (long long) rows_to_send); > > > It makes the text representation wider, which means some buffer fills up > > faster and the program switches from sending to receiving. > > Hm. If the actual field value isn't relevant, why bother including > rows_to_send in it? A constant string would be simpler and much > less confusing as to its purpose. Hah, yeah, we could as well do that I guess :-) -- Álvaro Herrera Valdivia, Chile "I think my standards have lowered enough that now I think 'good design' is when the page doesn't irritate the living f*ck out of me." (JWZ)
Re: libpq debug log
On 2021-Apr-01, Tom Lane wrote: > So drongo is still failing, and after a bit of looking around at > other code I finally got hit with the clue hammer. Per port.h: > > * On Windows, setvbuf() does not support _IOLBF mode, and interprets that > * as _IOFBF. To add insult to injury, setvbuf(file, NULL, _IOFBF, 0) > * crashes outright if "parameter validation" is enabled. Therefore, in > * places where we'd like to select line-buffered mode, we fall back to > * unbuffered mode instead on Windows. Always use PG_IOLBF not _IOLBF > * directly in order to implement this behavior. > > You want to do the honors? And do something about that shift bug > while at it. Ooh, wow ... now that is a silly bug! Thanks, I'll push the fix in a minute. Andrew also noted that the use of command_ok() in the test file eats all the output and is what is preventing us from seeing it in the first. I'll try to get rid of that together with the rest. -- Álvaro Herrera Valdivia, Chile
Re: libpq debug log
On 2021-Apr-01, 'alvhe...@alvh.no-ip.org' wrote: > Ooh, wow ... now that is a silly bug! Thanks, I'll push the fix in a > minute. It still didn't fix it! Drongo is now reporting a difference in the expected trace -- and the differences all seem to be message lengths. Now that is pretty mysterious, because the messages themselves are printed identically. Perl's output is pretty unhelpful, but I wrote them to a file and diffed manually; it's attached. So for example when we expect ! # B 123 ErrorResponseS "ERROR" V "ERROR" C "42601" M "cannot insert multiple commands into a prepared statement" F "" L "" R "" \\x00 we actually get ! # B 173 ErrorResponseS "ERROR" V "ERROR" C "42601" M "cannot insert multiple commands into a prepared statement" F "" L "" R "" \\x00 [slaps forehead] I suppose the difference must be that the message length includes the redacted string fields. So the file in the F file is 50 chars longer, *kaboom*. (I'm actually very surprised that this didn't blow up earlier.) I'm not sure what to do about this. Maybe the message length for ErrorResponse/NoticeResponse ought to be redacted too. -- Álvaro Herrera39°49'30"S 73°17'W
Re: Preventing abort() and exit() calls in libpq
On 2021-Jul-02, Jacob Champion wrote: > Only src/common: > > controldata_utils_shlib.o: > U close > U __errno_location > U exit Actually, I do see these in the .o files as well, but they don't make it to the .a file. gcc here is 8.3.0. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Hello, On 2022-Jun-29, Imseih (AWS), Sami wrote: > > Would you mind trying the second attached to abtain detailed log on > > your testing environment? With the patch, the modified TAP test yields > > the log lines like below. > > Thanks for this. I will apply this to the testing environment and > will share the output. Any luck with this? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Doing what he did amounts to sticking his fingers under the hood of the implementation; if he gets his fingers burnt, it's his problem." (Tom Lane)
Re: prevent immature WAL streaming
On 2021-Aug-25, Jakub Wartak wrote: > In order to get reliable reproducer and get proper the fault injection > instead of playing with really filling up fs, apparently one could > substitute fd with fd of /dev/full using e.g. dup2() so that every > write is going to throw this error too: Oh, this is a neat trick that I didn't know about. Thanks. > After restarting master and inspecting standby - in all of those above > 3 cases - the standby didn't inhibit the "invalid contrecord length" > at least here, while without corruption this v02 patch it is > notorious. So if it passes the worst-case code review assumptions I > would be wondering if it shouldn't even be committed as it stands > right now. Well, Nathan is right that this patch isn't really closing the hole because we aren't tracking segment boundaries that aren't "earliest" nor "latest" at the moment of registration. The simplistic approach seems okay for the archive case, but not for the replication case. I also noticed today (facepalm) that the patch doesn't work unless you have archiving enabled, because the registration code is only invoked when XLogArchivingActive(). Doh. This part is easy to solve. The other doesn't look easy ... -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: prevent immature WAL streaming
BTW while going about testing this, I noticed that we forbid pg_walfile_name() while in recovery. That restriction was added by commit 370f770c15a4 because ThisTimeLineID was not set correctly during recovery. That was supposed to be fixed by commit 1148e22a82ed, so I thought that it should be possible to remove the restriction. However, I did that per the attached patch, but was quickly disappointed because ThisTimeLineID seems to remain zero in a standby for reasons that I didn't investigate. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "The problem with the facetime model is not just that it's demoralizing, but that the people pretending to work interrupt the ones actually working." (Paul Graham) >From 1b57e504b9a7eeddf2d99e615f21c5aec042a908 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 25 Aug 2021 19:55:43 -0400 Subject: [PATCH] Don't forbid pg_walfile_name in recovery mode --- src/backend/access/transam/xlogfuncs.c | 7 --- 1 file changed, 7 deletions(-) diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index b98deb72ec..99b22d0b30 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -503,13 +503,6 @@ pg_walfile_name(PG_FUNCTION_ARGS) XLogRecPtr locationpoint = PG_GETARG_LSN(0); char xlogfilename[MAXFNAMELEN]; - if (RecoveryInProgress()) - ereport(ERROR, -(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), - errmsg("recovery is in progress"), - errhint("%s cannot be executed during recovery.", - "pg_walfile_name()"))); - XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size); XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size); -- 2.30.2
Re: archive status ".ready" files may be created too early
On 2021-Aug-28, Andres Freund wrote: > While rebasing the aio patchset ontop of HEAD I noticed that this commit added > another atomic operation to XLogWrite() with archiving enabled. The WAL write > path is really quite hot, and at least some of the > NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held. > > I think we should at least try to make the fast-path where no segment > boundaries were crossed use no atomic operations. I think the best way to achieve this is is to rely completely on walwriter doing the segment notification, so that the WAL write done by backend would only do a latch set. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 2021-Aug-31, Andres Freund wrote: > Maybe, but this is getting uglier and uglier. > > I think patch should be reverted. It's not in a state that's appropriate for > the backbranches. Yeah, that's becoming my conclusion too -- undo that, and start from scratch using the other idea. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 2021-Sep-01, Michael Paquier wrote: > That's about 515e3d8, right? Yes. > I have not looked in details at what you have here, but this produces > a compilation warning on Windows for me with this part of the patch: This seems a tiny speck in a sea of bogosity. If you want to silence the warning, be my guest, but in the long run I am inclined to revert the whole commit once I have a better picture of a way forward. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
So why do we call this structure "record boundary map", when the boundaries it refers to are segment boundaries? I think we should call it "segment boundary map" instead ... and also I think we should use that name in the lock that protects it, so instead of ArchNotifyLock, it could be SegmentBoundaryLock or perhaps WalSegmentBoundaryLock. The reason for the latter is that I suspect the segment boundary map will also have a use to fix the streaming replication issue I mentioned earlier in the thread. This also makes me think that we'll want the wal record *start* address to be in the hash table too, not just its *end* address. So we'll use the start-1 as position to send, rather than the end-of-segment when GetFlushRecPtr() returns that. As for 0x, I think it would be cleaner to do a #define MaxXLogSegNo with that value in the line immediately after typedef XLogSegNo, rather than use the numerical value directly in the assignment. Typo in comment atop RemoveRecordBoundariesUpTo: it reads "up to an", should read "up to and". I think the API of GetLatestRecordBoundarySegment would be better by returning the boolean and having the segment as out argument. Then you could do the caller more cleanly, if (GetLatestRecordBoundarySegment(last_notified, flushed, &latest_boundary_segment)) { SetLastNotified( ... ); RemoveRecordBoundaries( ... ); LWLockRelease( ... ); for (..) XLogArchiveNotifySeg(...); PgArchWakeup(); } else LWLockRelease(...); -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "La virtud es el justo medio entre dos defectos" (Aristóteles)
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, Bossart, Nathan wrote: > On 8/16/21, 5:09 PM, "alvhe...@alvh.no-ip.org" > wrote: > > The reason for the latter is that I suspect the segment boundary map > > will also have a use to fix the streaming replication issue I mentioned > > earlier in the thread. This also makes me think that we'll want the wal > > record *start* address to be in the hash table too, not just its *end* > > address. So we'll use the start-1 as position to send, rather than the > > end-of-segment when GetFlushRecPtr() returns that. > > I've been thinking about the next steps for this one, too. ISTM we'll > need to basically assume that the flush pointer jumps along record > boundaries except for the cross-segment records. I don't know if that > is the safest assumption, but I think the alternative involves > recording every record boundary in the map. I'm not sure I understand your idea correctly. Perhaps another solution is to assume that the flush pointer jumps along record boundaries *including* for cross-segment records. The problem stems precisely from the fact that we set the flush pointer at segment boundaries, even when they aren't record boundary. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'. After collecting 500 such letters, he mused, a university somewhere in Arizona would probably grant him a degree. (Don Knuth)
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, Bossart, Nathan wrote: > I think we are in agreement. If we assume that the flush pointer > jumps along record boundaries and segment boundaries, the solution > would be to avoid using the flush pointer when it points to a segment > boundary (given that the segment boundary is not also a record > boundary). Instead, we'd only send up to the start position of the > last record in the segment to standbys. Agreed. An implementation for that would be to test the flush pointer for it being a segment boundary, and in that case we (acquire segment boundary lock and) test for presence in the segment boundary map. If present, then retreat the pointer to the record's start address. This means that we acquire the segment boundary lock rarely. I was concerned that we'd need to acquire it every time we read the flush pointer, which would have been a disaster. Thanks -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
The thing I still don't understand about this patch is why we call RegisterSegmentBoundaryEntry and NotifySegmentsReadyForArchive in XLogInsertRecord. My model concept of this would have these routines called only in XLogFlush / XLogWrite, which are conceptually about transferring data from in-memory WAL buffers into the on-disk WAL store (pg_xlog files). As I understand, XLogInsertRecord is about copying bytes from the high-level operation (heap insert etc) into WAL buffers. So doing the register/notify dance in both places should be redundant and unnecessary. In the NotifySegmentsReadyForArchive() call at the bottom of XLogWrite, we use flushed=InvalidXLogRecPtr. Why is that? Surely we can use LogwrtResult.Flush, just like in the other callsite there? (If we're covering for somebody advancing FlushRecPtr concurrently, I think we add a comment to explain that. But why do we need to do that in the first place?) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "The eagle never lost so much time, as when he submitted to learn of the crow." (William Blake)
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, Bossart, Nathan wrote: > The main reason for registering the boundaries in XLogInsertRecord() > is that it has the required information about the WAL record > boundaries. I do not think that XLogWrite() has this information. Doh, of course. So, why isn't it that we call Register in XLogInsertRecord, and Notify in XLogWrite? -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "La fuerza no está en los medios físicos sino que reside en una voluntad indomable" (Gandhi)
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, Bossart, Nathan wrote: > On 8/17/21, 2:13 PM, "alvhe...@alvh.no-ip.org" > wrote: > > > So, why isn't it that we call Register in XLogInsertRecord, and > > Notify in XLogWrite? > > We do. However, we also call NotifySegmentsReadyForArchive() in > XLogInsertRecord() to handle the probably-unlikely scenario that the > flush pointer has already advanced past the to-be-registered boundary. > This ensures that the .ready files are created as soon as possible. I see. I have two thoughts on that. First, why not do it outside the block that tests for crossing a segment boundary? If that's a good thing to do, then we should do it always. However, why do it in a WAL-producing client-connected backend? It strikes me as a bad thing to do, because you are possibly causing delays for client-connected backends. I suggest that we should give this task to the WAL writer process -- say, have XLogBackgroundFlush do it. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
Re: archive status ".ready" files may be created too early
On 2021-Aug-17, alvhe...@alvh.no-ip.org wrote: > However, why do it in a WAL-producing client-connected backend? It > strikes me as a bad thing to do, because you are possibly causing delays > for client-connected backends. I suggest that we should give this task > to the WAL writer process -- say, have XLogBackgroundFlush do it. Reading the comments on walwriter.c I am hesitant of having walwriter do it: > * Because the walwriter's cycle is directly linked to the maximum delay > * before async-commit transactions are guaranteed committed, it's probably > * unwise to load additional functionality onto it. For instance, if you've > * got a yen to create xlog segments further in advance, that'd be better done > * in bgwriter than in walwriter. So that comment suggests that we should give the responsibility to bgwriter. This seems good enough to me. I suppose if bgwriter has a long run of buffers to write it could take a little bit of time (a few hundred milliseconds?) but I think that should be okay. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "No renuncies a nada. No te aferres a nada."
Re: archive status ".ready" files may be created too early
On 2021-Aug-18, Bossart, Nathan wrote: > On 8/18/21, 10:06 AM, "alvhe...@alvh.no-ip.org" > wrote: > > So that comment suggests that we should give the responsibility to bgwriter. > > This seems good enough to me. I suppose if bgwriter has a long run of > > buffers to write it could take a little bit of time (a few hundred > > milliseconds?) but I think that should be okay. > > Do you think bgwriter should be the only caller of > NotifySegmentsReadyForArchive(), or should we still have XLogWrite() > call it? I think XLogWrite should absolutely be the primary caller. The one in bgwriter should be a backstop for the case you describe where the flush pointer advanced past the registration point in XLogInsertRecord. I realize this means there's a contradiction with my previous argument, in that synchronous transaction commit calls XLogWrite at some point, so we *are* putting the client-connected backend in charge of creating the notify files. However, that only happens on transaction commit, where we already accept responsibility for the WAL flush, not on each individual XLOG record insert; also, the WAL writer will take care of it sometimes, for transactions that are long-enough lived. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "How strange it is to find the words "Perl" and "saner" in such close proximity, with no apparent sense of irony. I doubt that Larry himself could have managed it." (ncm, http://lwn.net/Articles/174769/)
Re: archive status ".ready" files may be created too early
On 2021-Aug-18, alvhe...@alvh.no-ip.org wrote: > I realize this means there's a contradiction with my previous argument, > in that synchronous transaction commit calls XLogWrite at some point, so > we *are* putting the client-connected backend in charge of creating the > notify files. However, that only happens on transaction commit, where > we already accept responsibility for the WAL flush, not on each > individual XLOG record insert; also, the WAL writer will take care of it > sometimes, for transactions that are long-enough lived. Eh. I just said WAL writer will sometimes do it, and that's true because it'll occur in XLogBackgroundFlush. But upthread I wimped out of having WAL writer call NotifySegmentsReadyForArchive() and instead opined to give responsibility to bgwriter. However, thinking about it again, maybe it does make sense to have walwriter do it too directly. This causes no harm to walwriter's time constraints, since *it will have to do it via XLogBackgroundFlush anyway*. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 2021-Aug-18, Bossart, Nathan wrote: > I'll add it after XLogBackgroundFlush(). I was wondering which would be better: before or after. XLogBackgroundFlush would do it anyway, so if you do it after then it's not clear to me that it'd do anything (I mean we should not do any new calls of NotifySegmentsReadyForArchive and just rely on the one in XLogBackgroundFlush -> XLogWrite). The advantage of doing NotifySegmentsReadyForArchive before XLogBackgroundFlush is that the files would be created sooner, so the archiver can be working in parallel while walwriter does its other thing; then we'd reach the NotifySegmentsReadyForArchive in XLogBackgroundFlush and it'd find nothing to do most of the time, which is just fine. > I think we'll also want to set the WAL writer's latch in case it is > hibernating. Yeah. (That's another advantage of doing it in walwriter rather than bgwriter: we don't publish bgwriter's latch anywhere AFAICS). > Another approach could be to keep the NotifySegmentsReadyForArchive() > call in XLogInsertRecord(), but only call it if the flush pointer is > beyond the boundary we just registered. Or we could only set the > latch in XLogInsertRecord() if we detect that the flush pointer has > advanced. Hmm. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 2021-Aug-18, Bossart, Nathan wrote: > As long as XLogBackgroundFlush() found work to do, it would take care > of notifying, but I don't think we can depend on that. However, since > we're primarily using the WAL writer to take care of the case when the > record has already been flushed, notifying beforehand seems fine to > me. If XLogBackgroundFlush() does end up calling XLogWrite(), it'll > call it again, anyway. Agreed. > In the attached patch, I modified XLogInsertRecord() to simply set the > latch if we detect that flushRecPtr has advanced. Right, that's what I was thinking. I modified that slightly to use LogwrtResult.Flush (which should be fresh enough) instead of calling GetFlushRecPtr again, which saves a bit. I also changed it to > instead of >=, because if I understand you correctly we only care to notify if the flush pointer advanced, not in the case it stayed the same. I made a few other cosmetic tweaks -- added comment to SegmentBoundaryEntry and renamed the 'pos' to 'endpos'; renamed argument 'notify' of XLogArchiveNotify to 'nudge' (because having two different "notify" concepts in that function seemed confusing). -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html >From 39190a4e33a4cb35402ec9451ce6678f6e9aa494 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 17 Aug 2021 03:52:43 + Subject: [PATCH v10] Avoid creating archive status ".ready" files too early. WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file will be created. If PostgreSQL crashes before it is able to write the rest of the record, it will end up reusing segments that have already been marked as ready-for-archival. However, the archiver process may have already processed the old version of the segment, so the wrong version of the segment may be backed-up. This backed-up segment will cause operations such as point-in-time restores to fail. To fix this, we keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. --- src/backend/access/transam/timeline.c| 2 +- src/backend/access/transam/xlog.c| 291 ++- src/backend/access/transam/xlogarchive.c | 17 +- src/backend/postmaster/walwriter.c | 7 + src/backend/replication/walreceiver.c| 6 +- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/include/access/xlog.h| 1 + src/include/access/xlogarchive.h | 4 +- src/include/access/xlogdefs.h| 1 + src/tools/pgindent/typedefs.list | 1 + 10 files changed, 309 insertions(+), 22 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..acd5c2431d 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, if (XLogArchivingActive()) { TLHistoryFileName(histfname, newTLI); - XLogArchiveNotify(histfname); + XLogArchiveNotify(histfname, true); } } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a749d..d2ccf2a7bb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -512,6 +512,16 @@ typedef enum ExclusiveBackupState */ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE; +/* + * Entries for SegmentBoundaryMap. Each such entry represents a WAL + * record that ends in endpos and crosses a WAL segment boundary. + */ +typedef struct SegmentBoundaryEntry +{ + XLogSegNo seg; /* must be first */ + XLogRecPtr endpos; +} SegmentBoundaryEntry; + /* * Shared state data for WAL insertion. */ @@ -723,6 +733,12 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* + * The last segment we've marked ready for archival. Protected by + * info_lck. + */ + XLogSegNo lastNotifiedSeg; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -736,6 +752,12 @@ static WALInsertLockPadded *WALInsertLocks = NULL; */ static ControlFileData *ControlFile = NULL; +/* + * Segment boundary map, used for marking segments as ready for archival. + * Protected by SegmentBoundaryLock. + */ +static HTAB *SegmentBoundaryMap = NULL; + /* * Calculate the amount of space left on the page after 'endptr'. Beware * multiple evaluation! @@ -962,6 +984,13 @@ static XLogRecPtr XLogBytePos
Re: archive status ".ready" files may be created too early
In v12 I moved the code around a bit and reworded some comments. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ >From 7d1475578431e265a5e7f8b94a6b0791b68a9a31 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 17 Aug 2021 03:52:43 + Subject: [PATCH v12] Avoid creating archive status ".ready" files too early WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file is created, and the archiver may process it. If PostgreSQL crashes before it is able to write and flush the rest of the record (in the next WAL segment), the wrong version of the first segment file lingers in the archive, which causes operations such as point-in-time restores to fail. To fix this, keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. XXX Note that this may add a new LWLock acquisition in the transaction commit path. (It doesn't happen in all cases.) Author: Nathan Bossart Reviewed-by: Kyotaro Horiguchi Reviewed-by: Ryo Matsumura Reviewed-by: Andrey Borodin Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com --- src/backend/access/transam/timeline.c| 2 +- src/backend/access/transam/xlog.c| 307 ++- src/backend/access/transam/xlogarchive.c | 17 +- src/backend/postmaster/walwriter.c | 7 + src/backend/replication/walreceiver.c| 6 +- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/include/access/xlog.h| 1 + src/include/access/xlogarchive.h | 4 +- src/include/access/xlogdefs.h| 1 + src/tools/pgindent/typedefs.list | 1 + 10 files changed, 323 insertions(+), 24 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..acd5c2431d 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, if (XLogArchivingActive()) { TLHistoryFileName(histfname, newTLI); - XLogArchiveNotify(histfname); + XLogArchiveNotify(histfname, true); } } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a749d..01da4ab7b8 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -512,6 +512,16 @@ typedef enum ExclusiveBackupState */ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE; +/* + * Entries for SegmentBoundaryMap. Each such entry represents a WAL + * record that ends in endpos and crosses the WAL segment boundary. + */ +typedef struct SegmentBoundaryEntry +{ + XLogSegNo seg; /* must be first */ + XLogRecPtr endpos; +} SegmentBoundaryEntry; + /* * Shared state data for WAL insertion. */ @@ -723,6 +733,12 @@ typedef struct XLogCtlData */ XLogRecPtr lastFpwDisableRecPtr; + /* + * The last segment we've marked ready for archival. Protected by + * info_lck. + */ + XLogSegNo lastNotifiedSeg; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -736,6 +752,23 @@ static WALInsertLockPadded *WALInsertLocks = NULL; */ static ControlFileData *ControlFile = NULL; +/* + * Segment boundary map. + * + * This hash table tracks WAL records that cross WAL segment boundaries. + * For WAL archiving it is critical not to mark a segment as archivable + * until both halves of the record are flushed: otherwise if we crash + * after archiving the segment containing the first half and before + * flushing the second half of the record, then after recovery the + * primary would overwrite the first half with new data, making the + * segment archived prior to the crash a bogus copy. + * + * XXX note that an equivalent problem exists with streaming replication. + * + * Protected by SegmentBoundaryLock. + */ +static HTAB *SegmentBoundaryMap = NULL; + /* * Calculate the amount of space left on the page after 'endptr'. Beware * multiple evaluation! @@ -920,6 +953,13 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo, XLogSegNo *endlogSegNo); static void UpdateLastRemovedPtr(char *filename); static void ValidateXLOGDirectoryStructure(void); +static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos); +static XLogSegNo GetLastNotifiedSegment(void); +static void SetLastNotifiedSegment(XLogSegNo seg); +static bool GetLatestSegmentBoundary(XLogSegNo last_notified, + XLogRecPtr flushed, + XLogSegNo *latest_boundary_seg); +static void RemoveSegmentBoundariesUpTo(XLogSegNo seg); static void CleanupBackupHistory(void); static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force); static XLogRecord
Re: archive status ".ready" files may be created too early
Two things. 1. We use a hash table in shared memory. That's great. The part that's not so great is that in both places where we read items from it, we have to iterate in some way. This seems a bit silly. An array would serve us better, if only we could expand it as needed. However, in shared memory we can't do that. (I think the list of elements we need to memoize is arbitrary long, if enough processes can be writing WAL at the same time.) Now that I think about this again, maybe it's limited by NUM_XLOGINSERT_LOCKS, since there can only be that many records being written down at a time ... 2. There is a new LWLock acquisition that may be a new contention point. We acquire the lock in these cases: - WAL record insert, when a segment boundary is crosses (rare enough). - XLogWrite, when a segment needs to be notified. Looking again, I think the last point might be a problem actually, because XLogWrite is called with WALWriteLock held. Maybe we should take the NotifySegmentsReadyForArchive() call outside the section locked by WALWriteLock (so put it in XLogWrite callers instead of XLogWrite itself). -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en La Feria de las Tinieblas, R. Bradbury)
Re: archive status ".ready" files may be created too early
On 2021-Aug-20, Bossart, Nathan wrote: > On 8/20/21, 8:29 AM, "Robert Haas" wrote: > > We can't expand the hash table either. It has an initial and maximum > > size of 16 elements, which means it's basically an expensive array, > > and which also means that it imposes a new limit of 16 * > > wal_segment_size on the size of WAL records. If you exceed that limit, > > I think things just go boom... which I think is not acceptable. I > > think we can have records in the multi-GB range of wal_level=logical > > and someone chooses a stupid replica identity setting. > > I was under the impression that shared hash tables could be expanded > as necessary, but from your note and the following comment, that does > not seem to be true: Actually, you were right. Hash tables in shared memory can be expanded. There are some limitations (the hash "directory" is fixed size, which means the hash table get less efficient if it grows too much), but you can definitely create more hash entries than the initial size. See for example element_alloc(), which covers the case of a hash table being IS_PARTITIONED -- something that only shmem hash tables can be. Note that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError as allocation function, which acquires memory from the shared segment. This is a minor thing -- it doesn't affect the fact that the hash table is possibly being misused and inefficient -- but I thought it was worth pointing out. As an example, consider the LOCK / PROCLOCK hash tables. These can contain more elements than max_backends * max_locks_per_transaction. Those elements consume shared memory from the "allocation slop" in the shared memory segment. It's tough when it happens (as far as I know the memory is never "returned" once such a hash table grows to use that space), but it does work. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 2021-Aug-20, Bossart, Nathan wrote: > > On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan wrote: > >> This led me to revisit the two-element > >> approach that was discussed upthread. What if we only stored the > >> earliest and latest segment boundaries at any given time? Once the > >> earliest boundary is added, it never changes until the segment is > >> flushed and it is removed. The latest boundary, however, will be > >> updated any time we register another segment. Once the earliest > >> boundary is removed, we replace it with the latest boundary. This > >> strategy could cause us to miss intermediate boundaries, but AFAICT > >> the worst case scenario is that we hold off creating .ready files a > >> bit longer than necessary. > I've attached a patch to demonstrate what I'm thinking. There is only one thing I didn't like in this new version, which is that we're holding info_lck too much. I've seen info_lck contention be a problem in some workloads and I'd rather not add more stuff to it. I'd rather we stick with using a new lock object to protect all the data we need for this job. Should this new lock object be a spinlock or an lwlock? I think a spinlock would generally be better because it's lower overhead and we can't use it in shared mode anywhere, which would be the greatest argument for an lwlock. However, I think we avoid letting code run with spinlocks held that's not straight-line code, and we have some function calls there. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
Attached is v14 which uses a separate spinlock. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto, no me acuerdo." (Augusto Pinochet a una corte de justicia) >From 4180334f51a1f343ce5795d0b62f3aa298b472da Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 20 Aug 2021 19:25:31 + Subject: [PATCH v14] Avoid creating archive status ".ready" files too early WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file is created, and the archiver may process it. If PostgreSQL crashes before it is able to write and flush the rest of the record (in the next WAL segment), the wrong version of the first segment file lingers in the archive, which causes operations such as point-in-time restores to fail. To fix this, keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. Author: Nathan Bossart Reviewed-by: Kyotaro Horiguchi Reviewed-by: Ryo Matsumura Reviewed-by: Andrey Borodin Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com --- src/backend/access/transam/timeline.c| 2 +- src/backend/access/transam/xlog.c| 242 +-- src/backend/access/transam/xlogarchive.c | 17 +- src/backend/postmaster/walwriter.c | 7 + src/backend/replication/walreceiver.c| 6 +- src/include/access/xlog.h| 1 + src/include/access/xlogarchive.h | 4 +- src/include/access/xlogdefs.h| 1 + 8 files changed, 256 insertions(+), 24 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..acd5c2431d 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, if (XLogArchivingActive()) { TLHistoryFileName(histfname, newTLI); - XLogArchiveNotify(histfname); + XLogArchiveNotify(histfname, true); } } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a749d..95f03adef8 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -724,6 +724,18 @@ typedef struct XLogCtlData XLogRecPtr lastFpwDisableRecPtr; slock_t info_lck; /* locks shared variables shown above */ + + /* + * Variables used to track cross-segment records. Protected by + * segtrack_lck. + */ + XLogSegNo lastNotifiedSeg; + XLogSegNo earliestSegBoundary; + XLogRecPtr earliestSegBoundaryRecPtr; + XLogSegNo latestSegBoundary; + XLogRecPtr latestSegBoundaryRecPtr; + + slock_t segtrack_lck; /* locks shared variables shown above */ } XLogCtlData; static XLogCtlData *XLogCtl = NULL; @@ -920,6 +932,9 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo, XLogSegNo *endlogSegNo); static void UpdateLastRemovedPtr(char *filename); static void ValidateXLOGDirectoryStructure(void); +static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos); +static bool GetLatestSegmentBoundary(XLogRecPtr flushed, XLogSegNo *latest_boundary_seg); +static void RemoveSegmentBoundariesUpTo(XLogSegNo seg); static void CleanupBackupHistory(void); static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force); static XLogRecord *ReadRecord(XLogReaderState *xlogreader, @@ -1154,23 +1169,56 @@ XLogInsertRecord(XLogRecData *rdata, END_CRIT_SECTION(); /* - * Update shared LogwrtRqst.Write, if we crossed page boundary. + * If we crossed page boundary, update LogwrtRqst.Write; if we crossed + * segment boundary, register that and wake up walwriter. */ if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) { + XLogSegNo StartSeg; + XLogSegNo EndSeg; + + XLByteToSeg(StartPos, StartSeg, wal_segment_size); + XLByteToSeg(EndPos, EndSeg, wal_segment_size); + + /* + * Register our crossing the segment boundary if that occurred. + * + * Note that we did not use XLByteToPrevSeg() for determining the + * ending segment. This is so that a record that fits perfectly into + * the end of the segment is marked ready for archival as soon as the + * flushed pointer jumps to the next segment. + */ + if (StartSeg != EndSeg && XLogArchivingActive()) + RegisterSegmentBoundary(EndSeg, EndPos); + + /* + * Advance LogwrtRqst.Write so that it includes new block(s). + * + * We do this after registering the segment boundary so that the + * comparison with the flushed pointer below can use the latest value + * known globally. + */ SpinLockAcquire(&XLogCtl->info_lck); - /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRq
Re: archive status ".ready" files may be created too early
On 2021-Aug-20, Bossart, Nathan wrote: > I was looking at moving the function calls out of the spinlock region. > I don't think the functions are doing anything too expensive, and they > help clean up NotifySegmentsReadyForArchive() quite a bit, but I > understand why it might be against project policy to do something like > that. It would be easy enough to get rid of the helper functions if > that was concern. Well, the thing I realized is that these three helper functions have exactly one caller each. I think the compiler is going to inline them, so there isn't going to be a function call in the assembly. I haven't verified this, though. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Los dioses no protegen a los insensatos. Éstos reciben protección de otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)
Re: archive status ".ready" files may be created too early
On 2021-Aug-21, Bossart, Nathan wrote: > > Well, the thing I realized is that these three helper functions have > > exactly one caller each. I think the compiler is going to inline them, > > so there isn't going to be a function call in the assembly. I haven't > > verified this, though. > > Good point. It looks like they're getting inlined for me. I still didn't like it, because it looks like we're creating an API for which there can be only one caller. So I expanded the functions in the caller. It doesn't look too bad. However ... ... while reading the resulting code after backpatching to all branches, I realized that if there are no registrations whatsoever, then archiving won't do anything, which surely is the wrong thing to do. The correct behavior should be "if there are no registrations, then *all* flushed segments can be notified". I'll fix that ... Another thing I didn't like is that you used a name ending in RecPtr for the LSN, which gives no indication that it really is the *end* LSN, not the start pointer. And it won't play nice with the need to add the *start* LSN which we'll need to implement solving the equivalent problem for streaming replication. I'll rename those to earliestSegBoundaryEndPtr and latestSegBoundaryEndPtr. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ Syntax error: function hell() needs an argument. Please choose what hell you want to involve. >From 6e4c3fd6f5687ec5762de121344ce35c1c890812 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 23 Aug 2021 09:06:09 -0400 Subject: [PATCH v15] Avoid creating archive status ".ready" files too early WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file is created, and the archiver may process it. If PostgreSQL crashes before it is able to write and flush the rest of the record (in the next WAL segment), the wrong version of the first segment file lingers in the archive, which causes operations such as point-in-time restores to fail. To fix this, keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. This has always been wrong, so backpatch all the way back. Author: Nathan Bossart Reviewed-by: Kyotaro Horiguchi Reviewed-by: Ryo Matsumura Reviewed-by: Andrey Borodin Discussion: https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com --- src/backend/access/transam/timeline.c| 2 +- src/backend/access/transam/xlog.c| 220 +-- src/backend/access/transam/xlogarchive.c | 17 +- src/backend/postmaster/walwriter.c | 7 + src/backend/replication/walreceiver.c| 6 +- src/include/access/xlog.h| 1 + src/include/access/xlogarchive.h | 4 +- src/include/access/xlogdefs.h| 1 + 8 files changed, 234 insertions(+), 24 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c175..acd5c2431d 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, if (XLogArchivingActive()) { TLHistoryFileName(histfname, newTLI); - XLogArchiveNotify(histfname); + XLogArchiveNotify(histfname, true); } } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a749d..8e7c3a364a 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -724,6 +724,18 @@ typedef struct XLogCtlData XLogRecPtr lastFpwDisableRecPtr; slock_t info_lck; /* locks shared variables shown above */ + + /* + * Variables used to track cross-segment records. Protected by + * segtrack_lck. + */ + XLogSegNo lastNotifiedSeg; + XLogSegNo earliestSegBoundary; + XLogRecPtr earliestSegBoundaryRecPtr; + XLogSegNo latestSegBoundary; + XLogRecPtr latestSegBoundaryRecPtr; + + slock_t segtrack_lck; /* locks shared variables shown above */ } XLogCtlData; static XLogCtlData *XLogCtl = NULL; @@ -920,6 +932,7 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo, XLogSegNo *endlogSegNo); static void UpdateLastRemovedPtr(char *filename); static void ValidateXLOGDirectoryStructure(void); +static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos); static void CleanupBackupHistory(void); static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force); static XLogRecord *ReadRecord(XLogReaderState *xlogreader, @@ -1154,23 +1167,56 @@ XLogInsertRecord(XLogRecData *rdata, END_CRIT_SECTION(); /* - * Update shared LogwrtRqst.Write, if we crossed page boundary. + * If we crossed page boundary, update LogwrtRqst.Write; if we crossed + * se
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, Bossart, Nathan wrote: > Hm. My expectation would be that if there are no registrations, we > cannot create .ready files for the flushed segments. The scenario > where I can see that happening is when a record gets flushed to disk > prior to registration. In that case, we'll still eventually register > the record and wake up the WAL writer process, which will take care of > creating the .ready files that were missed earlier. Is there another > case you are thinking of where we could miss registration for a cross- > segment record altogether? I'm thinking of the case where no record cross segment boundaries ever. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, Bossart, Nathan wrote: > Sorry, I'm still not following this one. If we skipped creating > .ready segments due to a crash, we rely on RemoveOldXlogFiles() to > create them as needed in the end-of-recovery checkpoint. If a record > fits perfectly in the end of a segment, we'll still register it as a > boundary for the next segment (hence why we use XLByteToSeg() instead > of XLByteToPrevSeg()). If database activity stops completely, there > shouldn't be anything to mark ready. The only way .ready files are created is that XLogNotifyWrite() is called. For regular WAL files during regular operation, that only happens in XLogNotifyWriteSeg(). That, in turn, only happens in NotifySegmentsReadyForArchive(). But if the system runs and never writes WAL records that cross WAL boundaries, that function will see that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno, and return without doing anything. So no segments will be notified. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, alvhe...@alvh.no-ip.org wrote: > The only way .ready files are created is that XLogNotifyWrite() is > called. For regular WAL files during regular operation, that only > happens in XLogNotifyWriteSeg(). That, in turn, only happens in > NotifySegmentsReadyForArchive(). But if the system runs and never > writes WAL records that cross WAL boundaries, that function will see > that both earliestSegBoundary and latestSegBoundary are MaxXLogSegno, > and return without doing anything. So no segments will be notified. Nevermind -- I realized that all segments get registered, not just those for which we generate continuation records. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Use it up, wear it out, make it do, or do without"
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, Bossart, Nathan wrote: > Ah, okay. BTW the other changes you mentioned made sense to me. Thanks. I've pushed this now to all live branches. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ It does it in a really, really complicated way why does it need to be complicated? Because it's MakeMaker.
Re: archive status ".ready" files may be created too early
On 2021-Aug-23, Bossart, Nathan wrote: > On 8/23/21, 12:55 PM, "alvhe...@alvh.no-ip.org" > wrote: > > Thanks. I've pushed this now to all live branches. > > Thank you! I appreciate the thorough reviews. Should we make a new > thread for the streaming replication fix? Yeah, this one is long enough :-) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: prevent immature WAL streaming
On 2021-Aug-24, Bossart, Nathan wrote: > I wonder if we need to move the call to RegisterSegmentBoundary() to > somewhere before WALInsertLockRelease() for this to work as intended. > Right now, boundary registration could take place after the flush > pointer has been advanced, which means GetSafeFlushRecPtr() could > still return an unsafe position. Yeah, you're right -- that's a definite risk. I didn't try to reproduce a problem with that, but it is seems pretty obvious that it can happen. I didn't have a lot of luck with a reliable reproducer script. I was able to reproduce the problem starting with Ryo Matsumura's script and attaching a replica; most of the time the replica would recover by restarting from a streaming position earlier than where the problem occurred; but a few times it would just get stuck with a WAL segment containing a bogus record. Then, after patch, the problem no longer occurs. I attach the patch with the change you suggested. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ Tom: There seems to be something broken here. Teodor: I'm in sackcloth and ashes... Fixed. http://archives.postgresql.org/message-id/482d1632.8010...@sigaev.ru >From 9f39d6f186c3af55db85ee73427a9ed94750939e Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 23 Aug 2021 18:25:48 -0400 Subject: [PATCH v2] Don't stream non-final WAL segments Avoid setting the physical-stream replication read pointer in the middle of a WAL record. This can occur if a record is split in two (or more) across segment boundaries. The reason to avoid it is that if we stream the segment containing the first half, and then we crash before writing the next segment, the primary will rewrite the tail of the segment with a new WAL record (having discarded the incomplete record), but the replica will be stuck trying to replay a broken file (since the next segment will never contain the now-gone data). To do this, change streaming replication to retreat the flush pointer according to registered segment boundaries. --- src/backend/access/transam/xlog.c | 80 + src/backend/replication/walsender.c | 2 +- src/include/access/xlog.h | 1 + 3 files changed, 62 insertions(+), 21 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 24165ab03e..4d010c86dd 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -731,8 +731,10 @@ typedef struct XLogCtlData */ XLogSegNo lastNotifiedSeg; XLogSegNo earliestSegBoundary; + XLogRecPtr earliestSegBoundaryStartPtr; XLogRecPtr earliestSegBoundaryEndPtr; XLogSegNo latestSegBoundary; + XLogRecPtr latestSegBoundaryStartPtr; XLogRecPtr latestSegBoundaryEndPtr; slock_t segtrack_lck; /* locks shared variables shown above */ @@ -932,7 +934,7 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo, XLogSegNo *endlogSegNo); static void UpdateLastRemovedPtr(char *filename); static void ValidateXLOGDirectoryStructure(void); -static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos); +static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr startpos, XLogRecPtr endpos); static void CleanupBackupHistory(void); static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force); static XLogRecord *ReadRecord(XLogReaderState *xlogreader, @@ -1022,6 +1024,8 @@ XLogInsertRecord(XLogRecData *rdata, info == XLOG_SWITCH); XLogRecPtr StartPos; XLogRecPtr EndPos; + XLogSegNo StartSeg; + XLogSegNo EndSeg; bool prevDoPageWrites = doPageWrites; /* we assume that all of the record header is in the first chunk */ @@ -1157,6 +1161,30 @@ XLogInsertRecord(XLogRecData *rdata, */ } + /* + * Before releasing our WAL insertion lock, register that we crossed the + * segment boundary if that occurred. We need to do it with the lock held + * for GetSafeFlushRecPtr's sake: otherwise it could see the WAL flush + * point advance but not see the registration, which would lead it to + * wrongly conclude that our flush point is safe to use. + */ + if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) + { + XLByteToSeg(StartPos, StartSeg, wal_segment_size); + XLByteToSeg(EndPos, EndSeg, wal_segment_size); + + /* + * Register our crossing the segment boundary if that occurred. + * + * Note that we did not use XLByteToPrevSeg() for determining the + * ending segment. This is so that a record that fits perfectly into + * the end of the segment causes said segment to get marked ready for + * archival immediately. + */ + if (StartSeg != EndSeg && XLogArchivingActive()) + RegisterSegmentBoundary(EndSeg, StartPos, EndPos); + } + /* * Done! Let others know that we're finished. */ @@ -1168,27 +1196,10 @@ XLogInsertRecord(XLogRecData *rdata, /* * If we crossed page boundary, update LogwrtRqst.Write; if we crossed - * segment boundary, register that and
Re: prevent immature WAL streaming
On 2021-Aug-24, Bossart, Nathan wrote: > If moving RegisterSegmentBoundary() is sufficient to prevent the flush > pointer from advancing before we register the boundary, I bet we could > also remove the WAL writer nudge. Can you elaborate on this? I'm not sure I see the connection. > Another interesting thing I see is that the boundary stored in > earliestSegBoundary is not necessarily the earliest one. It's just > the first one that has been registered. I did this for simplicity for > the .ready file fix, but I can see it causing problems here. Hmm, is there really a problem here? Surely the flush point cannot go past whatever has been written. If somebody is writing an earlier section of WAL, then we cannot move the flush pointer to a later position. So it doesn't matter if the earliest point we have registered is the true earliest -- we only care for it to be the earliest that is past the flush point. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/