Re: Patch a potential memory leak in describeOneTableDetails()
At Fri, 18 Feb 2022 16:12:34 +0800 (GMT+08:00), wli...@stu.xidian.edu.cn wrote in > I find a potential memory leak in PostgresSQL 14.1, which is in the function > describeOneTableDetails (./src/bin/psql/describe.c). The bug has been > confirmed by an auditor of . > > Specifically, at line 1603, a memory chunk is allocated with pg_strdup and > assigned to tableinfo.reloptions. If the branch takes true at line 1621, the > execution path will eventually jump to error_return at line 3394. > Unfortunately, the memory is not freed when the function > describeOneTableDetail() returns. As a result, the memory is leaked. > > Similarly, the two memory chunks (tableinfo.reloftype and tableinfo.relam) > allocated at line 1605, 1606 and 1612 may also be leaked for the same reason. I think it is not potential leaks but real leaks as it accumulates as repeatedly doing \d . > We believe we can fix the problems by adding corresponding release function > before the function returns, such as. > > if (tableinfo.reloptions) > pg_free (tableinfo.reloptions); > if (tableinfo.reloftype) > pg_free (tableinfo.reloftype); > if (tableinfo.relam) > pg_free (tableinfo. relam); > > > I'm looking forward to your reply. Good catch, and the fix is logically correct. The results from other use of pg_strdup() (and pg_malloc) seems to be released properly. About the patch, we should make a single chunk to do free(). And I think we don't insert whitespace between function name and the following left parenthesis. Since view_def is allocated by pg_strdup(), we might be better use pg_free() for it for symmetry. footers[0] is allocated using the frontend-alternative of "palloc()" so should use pfree() instead? Honestly I'm not sure we need to be so strict on that correspondence... So it would be something like this. diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index 654ef2d7c3..5da2b61a88 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -3412,7 +3412,13 @@ error_return: termPQExpBuffer(&tmpbuf); if (view_def) - free(view_def); + pg_free(view_def); +if (tableinfo.reloptions) +pg_free(tableinfo.reloptions); +if (tableinfo.reloftype) +pg_free(tableinfo.reloftype); +if (tableinfo.relam) +pg_free(tableinfo.relam); if (res) PQclear(res); @@ -3621,8 +3627,8 @@ describeRoles(const char *pattern, bool verbose, bool showSystem) printTableCleanup(&cont); for (i = 0; i < nrows; i++) - free(attr[i]); - free(attr); + pg_free(attr[i]); + pg_free(attr); PQclear(res); return true; (However, I got a mysterious -Wmisleading-indentation warning with this..) > describe.c: In function ‘describeOneTableDetails’: > describe.c:3420:5: warning: this ‘if’ clause does not guard... > [-Wmisleading-indentation] > if (tableinfo.relam) > ^~ > describe.c:3423:2: note: ...this statement, but the latter is misleadingly > indented as if it were guarded by the ‘if’ > if (res) ^~ regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Patch a potential memory leak in describeOneTableDetails()
At Mon, 21 Feb 2022 10:24:23 +0100, Daniel Gustafsson wrote in > I think it's because you've indented your new code differently from the > existing, such that the if (res) clause is indented equally to the previous > pg_free(tableinfo.relam) call, making it look like they are in the same block: Yeah, I (wrongly) thought that they have the same indentation. > > +if (tableinfo.relam) > > +pg_free(tableinfo.relam); > > > > if (res) > > PQclear(res); But actually the lines I added are space-indented but the following existing line is tab-indented. Anyway, don't we use the -ftabstop option flag to silence compiler? The warning is contradicting our convention. The attached adds that flag. By the way, the doc says that: https://www.postgresql.org/docs/14/source-format.html > The text browsing tools more and less can be invoked as: >more -x4 >less -x4 > to make them show tabs appropriately. But AFAICS the "more" command on CentOS doesn't have -x options nor any option to change tab width and I don't find a document that suggests its existence on other platforms. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 4e3b0c2ce6a2dd60d11100deb2db80ff07455d0d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 22 Feb 2022 14:41:10 +0900 Subject: [PATCH] Fit compiler tabstop to our convension Set -ftabstop to 4 so that -Wmisleading-indentation doesn't makes false warnings. --- configure| 92 configure.ac | 3 ++ 2 files changed, 95 insertions(+) diff --git a/configure b/configure index f3cb5c2b51..887ed733eb 100755 --- a/configure +++ b/configure @@ -6219,6 +6219,98 @@ if test x"$pgac_cv_prog_CXX_cxxflags__fexcess_precision_standard" = x"yes"; then fi + # Fit tab width for -Wmisleading-indentation to our convention + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -ftabstop=4, for CFLAGS" >&5 +$as_echo_n "checking whether ${CC} supports -ftabstop=4, for CFLAGS... " >&6; } +if ${pgac_cv_prog_CC_cflags__ftabstop_4+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +pgac_save_CC=$CC +CC=${CC} +CFLAGS="${CFLAGS} -ftabstop=4" +ac_save_c_werror_flag=$ac_c_werror_flag +ac_c_werror_flag=yes +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_compile "$LINENO"; then : + pgac_cv_prog_CC_cflags__ftabstop_4=yes +else + pgac_cv_prog_CC_cflags__ftabstop_4=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +ac_c_werror_flag=$ac_save_c_werror_flag +CFLAGS="$pgac_save_CFLAGS" +CC="$pgac_save_CC" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__ftabstop_4" >&5 +$as_echo "$pgac_cv_prog_CC_cflags__ftabstop_4" >&6; } +if test x"$pgac_cv_prog_CC_cflags__ftabstop_4" = x"yes"; then + CFLAGS="${CFLAGS} -ftabstop=4" +fi + + + { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports -ftabstop=4, for CXXFLAGS" >&5 +$as_echo_n "checking whether ${CXX} supports -ftabstop=4, for CXXFLAGS... " >&6; } +if ${pgac_cv_prog_CXX_cxxflags__ftabstop_4+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CXXFLAGS=$CXXFLAGS +pgac_save_CXX=$CXX +CXX=${CXX} +CXXFLAGS="${CXXFLAGS} -ftabstop=4" +ac_save_cxx_werror_flag=$ac_cxx_werror_flag +ac_cxx_werror_flag=yes +ac_ext=cpp +ac_cpp='$CXXCPP $CPPFLAGS' +ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_cxx_compiler_gnu + +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + pgac_cv_prog_CXX_cxxflags__ftabstop_4=yes +else + pgac_cv_prog_CXX_cxxflags__ftabstop_4=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +ac_ext=c +ac_cpp='$CPP $CPPFLAGS' +ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_c_compiler_gnu + +ac_cxx_werror_flag=$ac_save_cxx_werror_flag +CXXFLAGS="$pgac_save_CXXFLAGS" +CXX="$pgac_save_CXX" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CXX_cxxflags__ftabstop_4" >&5 +$as_echo "$pgac_cv_prog_CXX_cxxflags__ftabstop_4" >&6; } +if test
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Tue, 22 Feb 2022 01:59:45 +0900, Fujii Masao wrote in > > > Of the following, I think we should do (a) and (b) to make future > > backpatchings easier. > > a) Use RedoRecPtr and PriorRedoPtr after they are assigned. > > b) Move assignment to PriorRedoPtr into the ControlFileLock section. > > I failed to understand how (a) and (b) can make the backpatching > easier. How easy to backpatch seems the same whether we apply (a) and > (b) or not... That premises that the patch applied to master contains (a) and (b). So if it doesn't, those are not need by older branches. > > c) Skip udpate of minRecoveryPoint only when the checkpoint gets old. > > Yes. > > > > d) Skip call to UpdateCheckPointDistanceEstimate() when RedoRecPtr <= > >PriorRedoPtr. > > But "RedoRecPtr <= PriorRedoPtr" will never happen, will it? Because a I didn't believe that it happens. (So, it came from my convervativeness, or laziness, or both:p) The code dates from 2009 and StartupXLOG makes a concurrent checkpoint with bgwriter. But as of at least 9.5, StartupXLOG doesn't directly call CreateCheckPoint. So I think that won't happen. So, in short, I agree to remove it or turn it into Assert(). > restartpoint is skipped at the beginning of CreateRestartPoint() in > that case. If this understanding is right, the check of "RedoRecPtr <= > PriorRedoPtr" is not necessary before calling > UpdateCheckPointDistanceEstimate(). > > > + ControlFile->minRecoveryPoint = InvalidXLogRecPtr; > + ControlFile->minRecoveryPointTLI = 0; > > Don't we need to update LocalMinRecoveryPoint and > LocalMinRecoveryPointTLI after this? Maybe it's not necessary, but > ISTM that it's safer and better to always update them whether the > state is DB_IN_ARCHIVE_RECOVERY or not. Agree that it's safer and tidy. > if (flags & CHECKPOINT_IS_SHUTDOWN) > ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY; > > Same as above. IMO it's safer and better to always update the state > (whether the state is DB_IN_ARCHIVE_RECOVERY or not) if > CHECKPOINT_IS_SHUTDOWN flag is passed. That means we may exit recovery mode after ShutdownXLOG called CreateRestartPoint. I don't think that may happen. So I'd rather add Assert ((flags&CHECKPOINT_IS_SHUTDOWN) == 0) there instaed. I'll post the new version later. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: remove more archiving overhead
At Mon, 21 Feb 2022 17:19:48 -0800, Nathan Bossart wrote in > I also spent some time investigating whether durably renaming the archive > status files was even necessary. In theory, it shouldn't be. If a crash > happens before the rename is persisted, we might try to re-archive files, > but your archive_command/archive_library should handle that. If the file I believe we're trying to archive a WAL segment once and only once, even though actually we fail in certain cases. https://www.postgresql.org/docs/14/continuous-archiving.html | The archive command should generally be designed to refuse to | overwrite any pre-existing archive file. This is an important safety | feature to preserve the integrity of your archive in case of | administrator error (such as sending the output of two different | servers to the same archive directory). The recommended behavior of archive_command above is to avoid this kind of corruption. If we officially admit that a WAL segment can be archive twice, we need to revise that part (and the following part) of the doc. > was already recycled or removed, the archiver will skip it (thanks to > 6d8727f). But when digging further, I found that WAL file recyling uses > durable_rename_excl(), which has the following note: > >* Note that a crash in an unfortunate moment can leave you with two > links to >* the target file. > > IIUC this means that in theory, a crash at an unfortunate moment could > leave you with a .ready file, the file to archive, and another link to that > file with a "future" WAL filename. If you re-archive the file after it has > been reused, you could end up with corrupt WAL archives. I think this IMHO the difference would be that the single system call (maybe) cannot be interrupted by sigkill. So I'm not sure that that is worth considering. The sigkill window can be elimianted if we could use renameat(RENAME_NOREPLACE). But finally power-loss can cause that. > might already be possible today. Syncing the directory after every rename > probably reduces the likelihood quite substantially, but if > RemoveOldXlogFiles() quickly picks up the .done file and attempts > recycling before durable_rename() calls fsync() on the directory, > presumably the same problem could occur. If we don't fsync at every rename, we don't even need for RemoveOldXlogFiles to pickup .done very quickly. Isn't it a big difference? > I believe the current recommendation is to fail if the archive file already > exists. The basic_archive contrib module fails if the archive already > exists and has different contents, and it succeeds without re-archiving if > it already exists with identical contents. Since something along these > lines is recommended as a best practice, perhaps this is okay. But I think > we might be able to do better. If we ensure that the archive_status > directory is sync'd prior to recycling/removing a WAL file, I believe we > are safe from the aforementioned problem. > > The drawback of this is that we are essentially offloading more work to the > checkpointer process. In addition to everything else it must do, it will > also need to fsync() the archive_status directory many times. I'm not sure > how big of a problem this is. It should be good to ensure that archiving > keeps up, and there are far more expensive tasks that the checkpointer must > perform that could make the added fsync() calls relatively insignificant. I think we don't want make checkpointer slower in exchange of making archiver faster. Perhaps we need to work using a condensed information rather than repeatedly scanning over the distributed information like archive_status? At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart wrote in > On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote: > > In my testing, I found that when I killed the server just before unlink() > > during WAL recyling, I ended up with links to the same file in pg_wal after > > restarting. My latest test produced links to the same file for the current > > WAL file and the next one. Maybe WAL recyling should use durable_rename() > > instead of durable_rename_excl(). > > Here is an updated patch. Is it intentional that v2 loses the xlogarchive.c part? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Patch a potential memory leak in describeOneTableDetails()
At Tue, 22 Feb 2022 09:59:09 +0100, Daniel Gustafsson wrote in > > On 22 Feb 2022, at 07:14, Kyotaro Horiguchi wrote: > > > Anyway, don't we use the -ftabstop option flag to silence compiler? > > The warning is contradicting our convention. The attached adds that > > flag. > > Isn't this flag contradicting our convention? From the docs section you > reference further down: > > "Source code formatting uses 4 column tab spacing, with tabs preserved > (i.e., tabs are not expanded to spaces)." Ugg. Right. > The -ftabstop option is (to a large extent, not entirely) to warn when tabs > and > spaces are mixed creating misleading indentation that the author didn't even > notice due to tabwidth settings? ISTM we are better of getting these warnings > than suppressing them. At Tue, 22 Feb 2022 10:00:46 -0500, Tom Lane wrote in > No, we use pgindent to not have such inconsistent indentation. Understood. Thaks for clarification. > > By the way, the doc says that: > > > > https://www.postgresql.org/docs/14/source-format.html > > > >> The text browsing tools more and less can be invoked as: > >> more -x4 > >> less -x4 > >> to make them show tabs appropriately. > > > > But AFAICS the "more" command on CentOS doesn't have -x options nor > > any option to change tab width and I don't find a document that > > suggests its existence on other platforms. > > macOS, FreeBSD and NetBSD both show the less(1) manpage for more(1) which > suggests that more is implemented using less there, and thus supports -x (it > does on macOS). OpenBSD and Solaris more(1) does not show a -x option, but > AIX > does have it. Linux in its various flavors doesn't seem to have it. Thanks for the explanation. > The section in question was added to our docs 22 years ago, to make it reflect > the current operating systems in use maybe just not mentioning more(1) is the > easiest?: > > "The text browsing tool less can be invoked as less -x4 to make it show > tabs appropriately." > > Or perhaps remove that section altogether? I think the former is better. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Patch a potential memory leak in describeOneTableDetails()
At Thu, 24 Feb 2022 14:44:56 +0900 (JST), Kyotaro Horiguchi wrote in > At Tue, 22 Feb 2022 09:59:09 +0100, Daniel Gustafsson wrote > in > > The section in question was added to our docs 22 years ago, to make it > > reflect > > the current operating systems in use maybe just not mentioning more(1) is > > the > > easiest?: > > > > "The text browsing tool less can be invoked as less -x4 to make it show > > tabs appropriately." > > > > Or perhaps remove that section altogether? > > I think the former is better. So the attached does that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From fd833aee1756364f669a3d478763299ce6d3a747 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 24 Feb 2022 15:18:52 +0900 Subject: [PATCH] Fix out-of-date description mentioning the command more In most of the operating systems in use, "more" is implemented using "less", or otherwise "more" does not have an -x option. Remove the descriptions related to the more command to make it reflect the current world better. --- doc/src/sgml/sources.sgml | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/sources.sgml b/doc/src/sgml/sources.sgml index 1b77efb087..7a6e8a951f 100644 --- a/doc/src/sgml/sources.sgml +++ b/doc/src/sgml/sources.sgml @@ -76,13 +76,8 @@ -The text browsing tools more and -less can be invoked as: - -more -x4 -less -x4 - -to make them show tabs appropriately. +The text browsing tool less can be invoked +as less -x4 to make it show tabs appropriately. -- 2.27.0
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
At Wed, 23 Feb 2022 02:58:07 +, "Imseih (AWS), Sami" wrote in > >Ooh, nice find and diagnosys. I can confirm that the test fails as you > >described without the code fix, and doesn't fail with it. > > >I attach the same patch, with the test file put in its final place > >rather than as a patch. Due to recent xlog.c changes this need a bit of > >work to apply to back branches; I'll see about getting it in all > >branches soon. > > Thank you for reviewing! Nice catch! However, I'm not sure I like the patch. * made it through and start writing after the portion that persisted. * (It's critical to first write an OVERWRITE_CONTRECORD message, which * we'll do as soon as we're open for writing new WAL.) +* +* If the last wal record is ahead of the missing contrecord, this is a +* recently promoted primary and we should not write an overwrite +* contrecord. Before the part, the comment follows the part shown below. >* Actually, if WAL ended in an incomplete record, skip the parts that So, actually WAL did not ended in an incomplete record. I think FinishWalRecover is the last place to do that. (But it could be earlier.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
At Thu, 24 Feb 2022 16:26:42 +0900 (JST), Kyotaro Horiguchi wrote in > So, actually WAL did not ended in an incomplete record. I think > FinishWalRecover is the last place to do that. (But it could be > earlier.) After some investigation, I finally concluded that we should reset abortedRecPtr and missingContrecPtr at processing XLOG_OVERWRITE_CONTRECORD. --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -1953,6 +1953,11 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI) LSN_FORMAT_ARGS(xlrec.overwritten_lsn), timestamptz_to_str(xlrec.overwrite_time; + /* We have safely skipped the aborted record */ + abortedRecPtr = InvalidXLogRecPtr; + missingContrecPtr = InvalidXLogRecPtr; + /* Verifying the record should only happen once */ record->overwrittenRecPtr = InvalidXLogRecPtr; } The last check in the test against "resetting aborted record" is no longer useful since it is already checked by 026_verwrite_contrecord.pl. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add id's to various elements in protocol.sgml
At Tue, 21 Dec 2021 08:47:27 +0100, Brar Piening wrote in > On 20.12.2021 at 16:09, Robert Haas wrote: > > As a data point, this is something I have also wanted to do, from time > > to time. I am generally of the opinion that any place the +1 from me. When I put an URL in the answer for inquiries, I always look into the html for name/id tags so that the inquirer quickly find the information source (or the backing or reference point) on the page. If not found, I place a snippet instead. > > documentation has a long list of things, which should add ids, so that > > people can link to the particular thing in the list to which they want > > to draw someone's attention. > > > Thank you. > > If there is consensus on generally adding links to long lists I'd take > suggestions for other places where people think that this would make > sense and amend my patch. I don't think there is. I remember sometimes wanted ids on some sections(x.x) and items(x.x.x or lower) (or even clauses, ignoring costs:p) FWIW in that perspecive, there's no requirement from me that it should be human-readable. I'm fine with automatically-generated ids. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Buffer Manager and Contention
(I added Yura, as the author of a related patch) At Thu, 24 Feb 2022 12:58:23 +, Simon Riggs wrote in > Thinking about poor performance in the case where the data fits in > RAM, but the working set is too big for shared_buffers, I notice a > couple of things that seem bad in BufMgr, but don't understand why > they are like that. > > 1. If we need to allocate a buffer to a new block we do this in one > step, while holding both partition locks for the old and the new tag. > Surely it would cause less contention to make the old block/tag > invalid (after flushing), drop the old partition lock and then switch > to the new one? i.e. just hold one mapping partition lock at a time. > Is there a specific reason we do it this way? I'm not sure but I guess the developer wanted to make the operation atomic. Yura Sokolov is proposing a patch to separte the two partition locks. https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru And it seems to me viable for me and a benchmarking in the thread showed a good result. I'd appreciate your input on that thread. > 2. Possibly connected to the above, we issue BufTableInsert() BEFORE > we issue BufTableDelete(). That means we need extra entries in the > buffer mapping hash table to allow us to hold both the old and the new > at the same time, for a short period. The way dynahash.c works, we try Yes. > to allocate an entry from the freelist and if that doesn't work, we > begin searching ALL the freelists for free entries to steal. So if we > get enough people trying to do virtual I/O at the same time, then we > will hit a "freelist storm" where everybody is chasing the last few > entries. It would make more sense if we could do BufTableDelete() To reduce that overhead, Yura proposed a surgically modification on dynahash, but I didn't like that and the latest patch doesn't contain that part. > first, then hold onto the buffer mapping entry rather than add it to > the freelist, so we can use it again when we do BufTableInsert() very > shortly afterwards. That way we wouldn't need to search the freelist > at all. What is the benefit or reason of doing the Delete after the > Insert? Hmm. something like hash_swap_key() or hash_reinsert_entry()? That sounds reasonable. (Yura's proposal was taking out an entry from hash then attach it with a new key again.) > Put that another way, it looks like BufTable functions are used in a > way that mismatches against the way dynahash is designed. > > Thoughts? On the first part, I think Yura's patch works. On the second point, Yura already showed it gives a certain amount of gain if we do that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Buffer Manager and Contention
At Fri, 25 Feb 2022 10:20:25 +0900 (JST), Kyotaro Horiguchi wrote in > (I added Yura, as the author of a related patch) > > At Thu, 24 Feb 2022 12:58:23 +, Simon Riggs > wrote in > > Thinking about poor performance in the case where the data fits in > > RAM, but the working set is too big for shared_buffers, I notice a > > couple of things that seem bad in BufMgr, but don't understand why > > they are like that. > > > > 1. If we need to allocate a buffer to a new block we do this in one > > step, while holding both partition locks for the old and the new tag. > > Surely it would cause less contention to make the old block/tag > > invalid (after flushing), drop the old partition lock and then switch > > to the new one? i.e. just hold one mapping partition lock at a time. > > Is there a specific reason we do it this way? > > I'm not sure but I guess the developer wanted to make the operation > atomic. > > Yura Sokolov is proposing a patch to separte the two partition locks. > > https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru > > And it seems to me viable for me and a benchmarking in the thread > showed a good result. I'd appreciate your input on that thread. > > > 2. Possibly connected to the above, we issue BufTableInsert() BEFORE > > we issue BufTableDelete(). That means we need extra entries in the > > buffer mapping hash table to allow us to hold both the old and the new > > at the same time, for a short period. The way dynahash.c works, we try > > Yes. > > > to allocate an entry from the freelist and if that doesn't work, we > > begin searching ALL the freelists for free entries to steal. So if we > > get enough people trying to do virtual I/O at the same time, then we > > will hit a "freelist storm" where everybody is chasing the last few > > entries. It would make more sense if we could do BufTableDelete() > > To reduce that overhead, Yura proposed a surgically modification on > dynahash, but I didn't like that and the latest patch doesn't contain > that part. > > > first, then hold onto the buffer mapping entry rather than add it to > > the freelist, so we can use it again when we do BufTableInsert() very > > shortly afterwards. That way we wouldn't need to search the freelist > > at all. What is the benefit or reason of doing the Delete after the > > Insert? > > Hmm. something like hash_swap_key() or hash_reinsert_entry()? That > sounds reasonable. (Yura's proposal was taking out an entry from hash > then attach it with a new key again.) > > > Put that another way, it looks like BufTable functions are used in a > > way that mismatches against the way dynahash is designed. > > > > Thoughts? > > On the first part, I think Yura's patch works. On the second point, > Yura already showed it gives a certain amount of gain if we do that. On second thought, even if we have a new dynahash API to atomically replace hash key, we need to hold two partition locks to do that. That is contradicting our objective. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Tue, 22 Feb 2022 17:44:01 +0900 (JST), Kyotaro Horiguchi wrote in > At Tue, 22 Feb 2022 01:59:45 +0900, Fujii Masao > wrote in > > > > > Of the following, I think we should do (a) and (b) to make future > > > backpatchings easier. > > > a) Use RedoRecPtr and PriorRedoPtr after they are assigned. > > > b) Move assignment to PriorRedoPtr into the ControlFileLock section. > > > > I failed to understand how (a) and (b) can make the backpatching > > easier. How easy to backpatch seems the same whether we apply (a) and > > (b) or not... > > That premises that the patch applied to master contains (a) and (b). > So if it doesn't, those are not need by older branches. I was once going to remove them. But according the discussion below, the patch for back-patching is now quite close to that for the master branch. So I left them alone. > > > d) Skip call to UpdateCheckPointDistanceEstimate() when RedoRecPtr <= > > >PriorRedoPtr. > > > > But "RedoRecPtr <= PriorRedoPtr" will never happen, will it? Because a > > I didn't believe that it happens. (So, it came from my > convervativeness, or laziness, or both:p) The code dates from 2009 and > StartupXLOG makes a concurrent checkpoint with bgwriter. But as of at > least 9.5, StartupXLOG doesn't directly call CreateCheckPoint. So I > think that won't happen. > > So, in short, I agree to remove it or turn it into Assert(). It was a bit out of point. If we assume RedoRecPtr is always larger than PriorRedoPtr and then we don't need to check that there, we should also remove the "if (PriorRedoPtr < RedoRecPtr)" branch just above, which means the patch for back-branches gets very close to that for the master. Do we make such a large change on back branches? Anyways this version once takes that way. > > if (flags & CHECKPOINT_IS_SHUTDOWN) > > ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY; > > > > Same as above. IMO it's safer and better to always update the state > > (whether the state is DB_IN_ARCHIVE_RECOVERY or not) if > > CHECKPOINT_IS_SHUTDOWN flag is passed. > > That means we may exit recovery mode after ShutdownXLOG called > CreateRestartPoint. I don't think that may happen. So I'd rather add > Assert ((flags&CHECKPOINT_IS_SHUTDOWN) == 0) there instaed. So this version for v14 gets updated in the following points. Completely removed the code path for the case some other process runs simultaneous checkpoint. Removed the condition (RedoRecPtr > PriorRedoPtr) for UpdateCheckPointDistanceEstimate() call. Added an assertion to the recoery-end path. # Honestly I feel this is a bit too much for back-patching, though. While making patches for v12, I see a test failure of pg_rewind for uncertain reason. I'm investigating that but I post this for discussion. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From e983f3d4c2dbeea742aed0ef1e209e7821f6687f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 14 Feb 2022 13:04:33 +0900 Subject: [PATCH v2] Correctly update contfol file at the end of archive recovery CreateRestartPoint runs WAL file cleanup basing on the checkpoint just have finished in the function. If the database has exited DB_IN_ARCHIVE_RECOVERY state when the function is going to update control file, the function refrains from updating the file at all then proceeds to WAL cleanup having the latest REDO LSN, which is now inconsistent with the control file. As the result, the succeeding cleanup procedure overly removes WAL files against the control file and leaves unrecoverable database until the next checkpoint finishes. Along with that fix, we remove a dead code path for the case some other process ran a simultaneous checkpoint. It seems like just a preventive measure but it's no longer useful because we are sure that checkpoint is performed only by checkpointer except single process mode. --- src/backend/access/transam/xlog.c | 73 --- 1 file changed, 47 insertions(+), 26 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6208e123e5..ff4a90eacc 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9587,6 +9587,9 @@ CreateRestartPoint(int flags) XLogSegNo _logSegNo; TimestampTz xtime; + /* we don't assume concurrent checkpoint/restartpoint to run */ + Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER); + /* Get a local copy of the last safe checkpoint record. */ SpinLockAcquire(&XLogCtl->info_lck); lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr; @@ -9653,7 +9656,7 @@ CreateR
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Fri, 25 Feb 2022 15:31:12 +0900 (JST), Kyotaro Horiguchi wrote in > While making patches for v12, I see a test failure of pg_rewind for > uncertain reason. I'm investigating that but I post this for > discussion. Hmm. Too stupid. Somehow I overly removed the latchet condition for minRecoveryPoint. So the same patch worked for v12. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Fri, 25 Feb 2022 16:06:31 +0900 (JST), Kyotaro Horiguchi wrote in > At Fri, 25 Feb 2022 15:31:12 +0900 (JST), Kyotaro Horiguchi > wrote in > > While making patches for v12, I see a test failure of pg_rewind for > > uncertain reason. I'm investigating that but I post this for > > discussion. > > Hmm. Too stupid. Somehow I overly removed the latchet condition for > minRecoveryPoint. So the same patch worked for v12. So, this is the patches for pg12-10. 11 can share the same patch with 12. 10 has differences in two points. 10 has ControlFile->prevCheckPoint. The DETAILS of the "recovery restart point at" message is not capitalized. But I suppose it is so close to EOL so that we don't want to "fix" it risking existing usecases. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From c89e2b509723b68897f2af49a154af2a69f0747b Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 25 Feb 2022 15:04:00 +0900 Subject: [PATCH v3] Correctly update contfol file at the end of archive recovery CreateRestartPoint runs WAL file cleanup basing on the checkpoint just have finished in the function. If the database has exited DB_IN_ARCHIVE_RECOVERY state when the function is going to update control file, the function refrains from updating the file at all then proceeds to WAL cleanup having the latest REDO LSN, which is now inconsistent with the control file. As the result, the succeeding cleanup procedure overly removes WAL files against the control file and leaves unrecoverable database until the next checkpoint finishes. Along with that fix, we remove a dead code path for the case some other process ran a simultaneous checkpoint. It seems like just a preventive measure but it's no longer useful because we are sure that checkpoint is performed only by checkpointer except single process mode. --- src/backend/access/transam/xlog.c | 71 +++ 1 file changed, 44 insertions(+), 27 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 885558f291..2b2568c475 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9334,7 +9334,7 @@ CreateRestartPoint(int flags) /* Also update the info_lck-protected copy */ SpinLockAcquire(&XLogCtl->info_lck); - XLogCtl->RedoRecPtr = lastCheckPoint.redo; + XLogCtl->RedoRecPtr = RedoRecPtr; SpinLockRelease(&XLogCtl->info_lck); /* @@ -9350,7 +9350,10 @@ CreateRestartPoint(int flags) if (log_checkpoints) LogCheckpointStart(flags, true); - CheckPointGuts(lastCheckPoint.redo, flags); + CheckPointGuts(RedoRecPtr, flags); + + /* Update pg_control */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); /* * Remember the prior checkpoint's redo ptr for @@ -9358,31 +9361,28 @@ CreateRestartPoint(int flags) */ PriorRedoPtr = ControlFile->checkPointCopy.redo; + Assert (PriorRedoPtr < RedoRecPtr); + + ControlFile->checkPoint = lastCheckPointRecPtr; + ControlFile->checkPointCopy = lastCheckPoint; + + /* Update control file using current time */ + ControlFile->time = (pg_time_t) time(NULL); + /* -* Update pg_control, using current time. Check that it still shows -* IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing; -* this is a quick hack to make sure nothing really bad happens if somehow -* we get here after the end-of-recovery checkpoint. +* Ensure minRecoveryPoint is past the checkpoint record while archive +* recovery is still ongoing. Normally, this will have happened already +* while writing out dirty buffers, but not necessarily - e.g. because no +* buffers were dirtied. We do this because a non-exclusive base backup +* uses minRecoveryPoint to determine which WAL files must be included in +* the backup, and the file (or files) containing the checkpoint record +* must be included, at a minimum. Note that for an ordinary restart of +* recovery there's no value in having the minimum recovery point any +* earlier than this anyway, because redo will begin just after the +* checkpoint record. */ - LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) + if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY) { - ControlFile->checkPoint = lastCheckPointRecPtr; - ControlFile->checkPointCopy = lastCheckPoint; - ControlFile->time = (pg_time_t) time(NULL); - - /* -* Ensure minRecoveryPoint is pas
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Fri, 25 Feb 2022 16:47:01 +0900 (JST), Kyotaro Horiguchi wrote in > So, this is the patches for pg12-10. 11 can share the same patch with > 12. 10 has differences in two points. > > 10 has ControlFile->prevCheckPoint. > > The DETAILS of the "recovery restart point at" message is not > capitalized. But I suppose it is so close to EOL so that we don't > want to "fix" it risking existing usecases. Ugh! Wait for a moment. Something's wrong. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Fri, 25 Feb 2022 16:52:30 +0900 (JST), Kyotaro Horiguchi wrote in > Ugh! Wait for a moment. Something's wrong. Sorry, what is wrong was my working directory. It was broken by my bogus operation. All the files apply corresponding versions correctly. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Fri, 25 Feb 2022 00:04:55 -0800, Andres Freund wrote in > Why don't you just use LockBufHdr/UnlockBufHdr? FWIW, v2 looked fine to me in regards to this point. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
At Sat, 26 Feb 2022 12:11:15 +0900, Michael Paquier wrote in > On Fri, Feb 25, 2022 at 01:09:53PM -0800, Nathan Bossart wrote: > > This one has been quiet for a while. Should we mark it as > > returned-with-feedback? > > Yes, that's my feeling and I got cold feet about this change. This > patch would bring some extra visibility for something that's not > incorrect either on HEAD, as end-of-recovery checkpoints are the same > things as shutdown checkpoints. And there is an extra argument where > back-patching would become a bit more tricky in an area that's already > a lot sensitive. That sounds like we should reject the patch as we don't agree to its objective. If someday end-of-recovery checkpoints functionally diverge from shutdown checkpoints but leave (somehow) the transition alone, we may visit this again but it would be another proposal. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: definition of CalculateMaxmumSafeLSN
At Mon, 28 Feb 2022 17:01:10 +0300, Sergei Kornilov wrote in > Hello > I just spotted in src/include/access/xlog.h: > extern XLogRecPtr CalculateMaxmumSafeLSN(void); > > This function doesn't seem to be used anywhere or even defined? "git grep > CalculateMaxmumSafeLSN" shows only this line. > Was added in c6550776394e25c1620bc8258427c8f1d448080d "Allow users to limit > storage reserved by replication slots" Hmm. I think I remember of that name. You're right, it is a function that was once added during development of the commit but eventually gone. So it should be removed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Use "WAL segment" instead of "log segment" consistently in user-facing messages
At Mon, 28 Feb 2022 21:03:07 +0530, Bharath Rupireddy wrote in > Hi, > > It looks like we use "log segment" in various user-facing messages. > The term "log" can mean server logs as well. The "WAL segment" suits > well here and it is consistently used across the other user-facing > messages [1]. > > Here's a small patch attempting to consistently use the "WAL segment". > > Thoughts? I tend to agree to this. I also see "log record(s)" (without prefixed by "write-ahead") in many places especially in the documentation. I'm not sure how we should treat "WAL log", though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: In-placre persistance change of a relation
Rebased on a recent xlog refactoring. No functional changes have been made. - Removed the default case in smgr_desc since it seems to me we don't assume out-of-definition values in xlog records elsewhere. - Simplified some added to storage.c. - Fix copy-pasto'ed comments in extractPageInfo(). - The previous version smgrDoPendingCleanups() assumes that init-fork are not loaded onto shared buffer but it is wrong (SetRelationBuffersPersistence assumes the opposite.). Thus we need to drop buffers before unlink an init fork. But it is already guaranteed by logic so I rewrote the comment for for PCOP_UNLINK_FORK. > * Unlink the fork file. Currently we use this only for > * init forks and we're sure that the init fork is not > * loaded on shared buffers. For RelationDropInitFork > * case, the function dropped that buffers. For > * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true) > * is set and the buffers have been dropped just before. This logic has the same critical window as DropRelFilenodeBuffers. That is, if file deletion fails after successful buffer dropping, theoretically the file content of the init fork may be stale. However, AFAICS init-fork is write-once fork so I don't think that actually matters. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 420a9d9a0dae3bcfb1396c14997624ad67a3e557 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 11 Nov 2020 21:51:11 +0900 Subject: [PATCH v18 1/2] In-place table persistence change Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs heap rewrite which causes large amount of file I/O. This patch makes the command run without heap rewrite. Addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of massive number of HEAP_INSERT's, which should be smaller. Also this allows for the cleanup of files left behind in the crash of the transaction that created it. --- src/backend/access/rmgrdesc/smgrdesc.c| 49 ++ src/backend/access/transam/README | 9 + src/backend/access/transam/xact.c | 7 + src/backend/access/transam/xlogrecovery.c | 18 + src/backend/catalog/storage.c | 548 +- src/backend/commands/tablecmds.c | 266 +-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 86 src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 ++ src/backend/storage/smgr/md.c | 94 +++- src/backend/storage/smgr/smgr.c | 32 ++ src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 22 + src/common/relpath.c | 47 +- src/include/catalog/storage.h | 3 + src/include/catalog/storage_xlog.h| 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h| 17 + 23 files changed, 1459 insertions(+), 182 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7547813254..f8908e2c0a 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: +action = "CREATE"; +break; + case XLOG_SMGR_MARK_UNLINK: +action = "DELETE"; +break; + } + + appendStringInfo(buf, "%s %s", action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +95,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + i
Re: Allow async standbys wait for sync replication
(Now I understand what "async" mean here..) At Mon, 28 Feb 2022 22:05:28 -0800, Nathan Bossart wrote in > On Tue, Mar 01, 2022 at 11:10:09AM +0530, Bharath Rupireddy wrote: > > On Tue, Mar 1, 2022 at 12:27 AM Nathan Bossart > > wrote: > >> My feedback is specifically about this behavior. I don't think we should > >> spin in XLogSend*() waiting for an LSN to be synchronously replicated. I > >> think we should just choose the SendRqstPtr based on what is currently > >> synchronously replicated. > > > > Do you mean something like the following? > > > > /* Main loop of walsender process that streams the WAL over Copy messages. > > */ > > static void > > WalSndLoop(WalSndSendDataCallback send_data) > > { > > /* > > * Loop until we reach the end of this timeline or the client requests > > to > > * stop streaming. > > */ > > for (;;) > > { > > if (am_async_walsender && there_are_sync_standbys) > > { > > XLogRecPtr SendRqstLSN; > > XLogRecPtr SyncFlushLSN; > > > > SendRqstLSN = GetFlushRecPtr(NULL); > > LWLockAcquire(SyncRepLock, LW_SHARED); > > SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH]; > > LWLockRelease(SyncRepLock); > > > > if (SendRqstLSN > SyncFlushLSN) > >continue; > > } The current trend is energy-savings. We never add a "wait for some fixed time then exit if the condition makes, otherwise repeat" loop for this kind of purpose where there's no guarantee that the loop exits quite shortly. Concretely we ought to rely on condition variables to do that. > Not quite. Instead of "continue", I would set SendRqstLSN to SyncFlushLSN > so that the WAL sender only sends up to the current synchronously I'm not sure, but doesn't that makes walsender falsely believes it have caught up to the bleeding edge of WAL? > replicated LSN. TBH there are probably other things that need to be > considered (e.g., how do we ensure that the WAL sender sends the rest once > it is replicated?), but I still think we should avoid spinning in the WAL > sender waiting for WAL to be replicated. It seems to me it would be something similar to SyncRepReleaseWaiters(). Or it could be possible to consolidate this feature into the function, I'm not sure, though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: In-placre persistance change of a relation
At Tue, 01 Mar 2022 14:14:13 +0900 (JST), Kyotaro Horiguchi wrote in > - Removed the default case in smgr_desc since it seems to me we don't > assume out-of-definition values in xlog records elsewhere. Stupid. The complier on the CI environemnt complains for uninitialized variable even though it (presumably) knows that the all paths of the switch statement set the variable. Added default value to try to silence compiler. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From ec75c49ffd939f6db8e0d840ef043c18845d1b9d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 11 Nov 2020 21:51:11 +0900 Subject: [PATCH v19 1/2] In-place table persistence change Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs heap rewrite which causes large amount of file I/O. This patch makes the command run without heap rewrite. Addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of massive number of HEAP_INSERT's, which should be smaller. Also this allows for the cleanup of files left behind in the crash of the transaction that created it. --- src/backend/access/rmgrdesc/smgrdesc.c| 49 ++ src/backend/access/transam/README | 9 + src/backend/access/transam/xact.c | 7 + src/backend/access/transam/xlogrecovery.c | 18 + src/backend/catalog/storage.c | 548 +- src/backend/commands/tablecmds.c | 266 +-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 86 src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 ++ src/backend/storage/smgr/md.c | 94 +++- src/backend/storage/smgr/smgr.c | 32 ++ src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 22 + src/bin/pg_rewind/pg_rewind.c | 1 - src/common/relpath.c | 47 +- src/include/catalog/storage.h | 3 + src/include/catalog/storage_xlog.h| 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h| 17 + 24 files changed, 1459 insertions(+), 183 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7547813254..225ffbafef 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action = ""; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: +action = "CREATE"; +break; + case XLOG_SMGR_MARK_UNLINK: +action = "DELETE"; +break; + } + + appendStringInfo(buf, "%s %s", action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +95,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..2ecd8c8c7c 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -725,6 +725,15 @@ then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files + + +An smgr mark file is created when a new relation file is created to +mark the relfilenode needs to be cleaned up at recovery time. In +contrast to the four
Re: Make mesage at end-of-recovery less scary.
At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma wrote in > The changes looks good. thanks.! Thanks! Some recent core change changed WAL insertion speed during the TAP test and revealed one forgotton case of EndOfWAL. When a record header flows into the next page, XLogReadRecord does separate check from ValidXLogRecordHeader by itself. >* If the whole record header is on this page, validate it immediately. >* Otherwise do just a basic sanity check on xl_tot_len, and validate > the >* rest of the header after reading it from the next page. The > xl_tot_len >* check is necessary here to ensure that we enter the "Need to > reassemble >* record" code path below; otherwise we might fail to apply >* ValidXLogRecordHeader at all. >*/ > if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) > { ... >} > else > { > /* XXX: more validation should be done here */ > if (total_len < SizeOfXLogRecord) > { I could simplly copy-in a part of ValidXLogRecordHeader there but that results in rather large duplicate code. I could have ValidXLogRecordHeader handle the partial header case but it seems to me complex. So in this version I split the xl_tot_len part of ValidXLogRecordHeader into ValidXLogRecordLength. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 01cce076d2b3ad536398cc2b716ef64ed9b2c409 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v15] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlogreader.c | 125 ++ src/backend/access/transam/xlogrecovery.c | 92 src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 ++ 6 files changed, 297 insertions(+), 47 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..ba1c1ece87 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength); static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -380,12 +384,11 @@ restart: * whole header. */ record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; /* * If the whole record header is on this page, validate it immediately. - * Otherwise do just a basic sanity check on xl_tot_len, and validate the - * rest of the header after reading it from the next page. The xl_tot_len + * Otherwise do just a basic sanity check on record length, and validate + * the rest of the header after reading it from the next page. The length * check is necessary here to ensure that we enter the "Need to reassemble * record" code path below; otherwise we might fail to apply * ValidXLogRecordHeader at all. @@ -399,18 +402,13 @@ restart: } else { - /* XXX: more validation should be done here */ - if (total_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, total_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) goto err; - } + gotheader = false; } + total_len = record
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro Horiguchi wrote in > So what do you say if I propose the following? > > LOG: terminating process %d to release replication slot \"%s\" > because its restart_lsn %X/%X exceeds the limit %X/%X > HINT: You might need to increase max_slot_wal_keep_size. This version emits the following message. [35785:checkpointer] LOG: terminating process 36368 to release replication slot "s1" because its restart_lsn 0/1F000148 exceeds the limit 0/2100 [35785:checkpointer] HINT: You might need to increase max_slot_wal_keep_size. [36368:walsender] FATAL: terminating connection due to administrator command [36368:walsender] STATEMENT: START_REPLICATION SLOT "s1" 0/1F00 TIMELINE 1 [35785:checkpointer] LOG: invalidating slot "s1" because its restart_lsn 0/1F000148 exceeds the limit 0/2100 [35785:checkpointer] HINT: You might need to increase max_slot_wal_keep_size. We can omit the HINT line from the termination log for non-persistent slots but I think we don't want to bother that considering its low frequency. The CI was confused by the mixed patches for multiple PG versions. In this version the patchset for master are attached as .patch and that for PG13 as .txt. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 4909db2893f0b33ab6bca1ffc3ad802cd159c577 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 24 Dec 2021 12:52:07 +0900 Subject: [PATCH v4 1/2] Make a message on process termination more dscriptive The message at process termination due to slot limit doesn't provide the reason. In the major scenario the message is followed by another message about slot invalidatation, which shows the detail for the termination. However the second message is missing if the slot is temporary one. Augment the first message with the reason same as the second message. Backpatch through to 13 where the message was introduced. Reported-by: Alex Enachioaie Author: Kyotaro Horiguchi Reviewed-by: Ashutosh Bapat Reviewed-by: Bharath Rupireddy Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org Backpatch-through: 13 --- src/backend/replication/slot.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index caa6b29756..92f19bcb35 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1300,8 +1300,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\"", -active_pid, NameStr(slotname; + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", +active_pid, NameStr(slotname), +LSN_FORMAT_ARGS(restart_lsn; (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0 >From f31014f85c7dec160bd42e1c48f1c28a7dc22af2 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 24 Dec 2021 12:58:25 +0900 Subject: [PATCH v4 2/2] Add detailed information to slot-invalidation messages The messages says just "exceeds max_slot_wal_keep_size" but doesn't tell the concrete LSN of the limit. That information helps operators' understanding on the issue. Author: Kyotaro Horiguchi Reviewed-by: Ashutosh Bapat Reviewed-by: Masahiko Sawada Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org --- src/backend/replication/slot.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 92f19bcb35..be21b62add 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1300,9 +1300,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X", active_pid, NameStr(slotname), -LSN_FORMAT_ARGS(restart_lsn; +LSN_FORMAT_ARGS(restart_lsn), +LSN_FORMAT_ARGS(oldestLSN)), + errhint("You might need to increase max_slot_wal_keep_size."))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; @@ -1345,9 +1347,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ReplicationSlotRelease(); ereport(LOG, - (errmsg("invalidating slot \&qu
Re: Add the replication origin name and commit-LSN to logical replication worker errcontext
At Wed, 2 Mar 2022 14:39:54 +0900, Masahiko Sawada wrote in > On Wed, Mar 2, 2022 at 11:55 AM Amit Kapila wrote: > > > > On Mon, Feb 28, 2022 at 5:46 PM Masahiko Sawada > > wrote: > > > > > > I've attached two patches: the first one changes > > > apply_error_callback() so that it uses complete sentences with if-else > > > blocks in order to have a translation work, > > > > > > > This is an improvement over what we have now but I think this is still > > not completely correct as per message translation rules: > > + else > > + errcontext("processing remote data during \"%s\" in transaction %u at %s", > > +logicalrep_message_type(errarg->command), > > +errarg->remote_xid, > > +(errarg->ts != 0) ? timestamptz_to_str(errarg->ts) : "(not-set)"); > > > > As per guidelines [1][2], we don't prefer to construct messages at > > run-time aka we can do better for the following part: +(errarg->ts > > != 0) ? timestamptz_to_str(errarg->ts) : "(not-set)". I think we need > > to use if-else here to split it further. If you agree, then the same > > needs to be dealt with in other parts of the patch as well. > > I intended to use "(not-set)" as a value rather than a word in the > sentence so I think it doesn't violate the guidelines. We can split it > further as you suggested but we will end up having more if-else > branches. It seems to me exactly our way. In the first place I doubt "(not-set)" fits the place for timestamp even in English. Moreover, if we (I?) translated the message into Japanese, it would look like; CONTEXT: (not-set)にトランザクション 2352314 内で "TRUNCATE" でのリモートデータの処理中 I don't think that looks fine. Translating "(not-set)" makes things even worse. CONTEXT: (非設定)にトランザクション 2352314 内で "TRUNCATE" でのリモートデータの処理中 Yes, I can alleviate that strangeness a bit by modulating it, but it still looks odd. CONTEXT: 時刻(非設定)、トランザクション 2352314 内で "TRUNCATE" でのリモートデータの処理中 Rather, I'd prefer simply to list the attributes. CONTEXT: processing remote data during "MESSAGE". Transaction (unknown). Time (unknown), replication target relation (unknown), column (unknown) CONTEXT: "MESSAGE"でのリモートデータの処理中. トランザクション (不明). 時刻 (不明), レプリケーション対象リレーション (不明), column (不明) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
At Thu, 20 Jan 2022 17:19:04 +0900 (JST), Kyotaro Horiguchi wrote in > At Thu, 20 Jan 2022 15:07:22 +0900 (JST), Kyotaro Horiguchi > wrote in > CI now likes this version for all platforms. An xlog.c refactoring happend recently hit this. Just rebased on the change. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 35958b17c62cd14f81efa26a097c32c273028f77 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 11 Nov 2021 20:42:00 +0900 Subject: [PATCH v16 1/3] Add tablespace support to TAP framework TAP framework doesn't support nodes that have tablespaces. Especially backup and initialization from backups failed if the source node has tablespaces. This commit provides simple way to create tablespace directories and allows backup routines to handle tablespaces. --- src/test/perl/PostgreSQL/Test/Cluster.pm | 264 ++- src/test/perl/PostgreSQL/Test/Utils.pm | 43 2 files changed, 305 insertions(+), 2 deletions(-) diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index be05845248..15d57b9a71 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -298,6 +298,64 @@ sub archive_dir =pod +=item $node->tablespace_storage([, nocreate]) + +Diretory to store tablespace directories. +If nocreate is true, returns undef if not yet created. + +=cut + +sub tablespace_storage +{ + my ($self, $nocreate) = @_; + + if (!defined $self->{_tsproot}) + { + # tablespace is not used, return undef if nocreate is specified. + return undef if ($nocreate); + + # create and remember the tablespae root directotry. + $self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short(); + } + + return $self->{_tsproot}; +} + +=pod + +=item $node->tablespaces() + +Returns a hash from tablespace OID to tablespace directory name. For +example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as +$hash{16384} = "ts1". + +=cut + +sub tablespaces +{ + my ($self) = @_; + my $pg_tblspc = $self->data_dir . '/pg_tblspc'; + my %ret; + + # return undef if no tablespace is used + return undef if (!defined $self->tablespace_storage(1)); + + # collect tablespace entries in pg_tblspc directory + opendir(my $dir, $pg_tblspc); + while (my $oid = readdir($dir)) + { + next if ($oid !~ /^([0-9]+)$/); + my $linkpath = "$pg_tblspc/$oid"; + my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath); + $ret{$oid} = File::Basename::basename($tsppath); + } + closedir($dir); + + return %ret; +} + +=pod + =item $node->backup_dir() The output path for backups taken with $node->backup() @@ -313,6 +371,77 @@ sub backup_dir =pod +=item $node->backup_tablespace_storage_path(backup_name) + +Returns tablespace location path for backup_name. +Retuns the parent directory if backup_name is not given. + +=cut + +sub backup_tablespace_storage_path +{ + my ($self, $backup_name) = @_; + my $dir = $self->backup_dir . '/__tsps'; + + $dir .= "/$backup_name" if (defined $backup_name); + + return $dir; +} + +=pod + +=item $node->backup_create_tablespace_storage(backup_name) + +Create tablespace location directory for backup_name if not yet. +Create the parent tablespace storage that holds all location +directories if backup_name is not supplied. + +=cut + +sub backup_create_tablespace_storage +{ + my ($self, $backup_name) = @_; + my $dir = $self->backup_tablespace_storage_path($backup_name); + + File::Path::make_path $dir if (! -d $dir); +} + +=pod + +=item $node->backup_tablespaces(backup_name) + +Returns a reference to hash from tablespace OID to tablespace +directory name of tablespace directory that the specified backup has. +For example, an oid 16384 pointing to ../tsps/backup1/ts1 is stored as +$hash{16384} = "ts1". + +=cut + +sub backup_tablespaces +{ + my ($self, $backup_name) = @_; + my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc'; + my %ret; + + #return undef if this backup holds no tablespaces + return undef if (! -d $self->backup_tablespace_storage_path($backup_name)); + + # scan pg_tblspc directory of the backup + opendir(my $dir, $pg_tblspc); + while (my $oid = readdir($dir)) + { + next if ($oid !~ /^([0-9]+)$/); + my $linkpath = "$pg_tblspc/$oid"; + my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath); + $ret{$oid} = File::Basename::basename($tsppath); + } + closedir($dir); + + return \%ret; +} + +=pod + =item $node->install_path() The configured install path (if any) for the node. @@ -370,6 +499,7 @@ sub info print $fh "Data directory: " . $self->data_dir . "\n"; print $fh "Backup directory: " . $self->backup_dir . "\n"; print $fh "Archive directory: " . $self->archive_dir . "\n"; + print $fh "Tablespace directory:
Re: pg_stop_backup() v2 incorrectly marked as proretset
At Wed, 2 Mar 2022 16:46:01 +0900, Michael Paquier wrote in > Hi all, > > In my hunt looking for incorrect SRFs, I have noticed a new case of a > system function marked as proretset while it builds and returns only > one record. And this is a popular one: pg_stop_backup(), labelled > v2. > > This leads to a lot of unnecessary work, as the function creates a > tuplestore it has no need for with the usual set of checks related to > SRFs. The logic can be be simplified as of the attached. > > Thoughts? That direction seems find to me. But the patch forgets to remove an useless variable. > /* Initialise attributes information in the tuple descriptor */ > tupdesc = CreateTemplateTupleDesc(PG_STOP_BACKUP_V2_COLS); > TupleDescInitEntry(tupdesc, (AttrNumber) 1, "lsn", > PG_LSNOID, -1, 0); I think we can use get_call_resuilt_type here. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Two noncritical bugs of pg_waldump
At Fri, 25 Feb 2022 10:48:47 -0800, Andres Freund wrote in > Hi, > > On 2022-02-14 18:18:47 +0900, Kyotaro Horiguchi wrote: > > > If I give an empty file to the tool it complains as the follows. > > > > > > > pg_waldump: fatal: could not read file "hoge": No such file or directory > > > > > > No, the file exists. The cause is it reads uninitialized errno to > > > detect errors from the system call. read(2) is defined to set errno > > > always when it returns -1 and doesn't otherwise. Thus it seems to me > > > that it is better to check that the return value is less than zero > > > than to clear errno before the call to read(). > > > > So I post a patch contains only the indisputable part. > > Thanks for the report and fix. Pushed. This was surprisingly painful, all but > one branch had conflicts... Ah, I didn't expect that this is committed so quickly. I should have created patches for all versions. Anyway thanks for committing this! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Wed, 02 Mar 2022 15:37:19 +0900 (JST), Kyotaro Horiguchi wrote in > The CI was confused by the mixed patches for multiple PG versions. In > this version the patchset for master are attached as .patch and that > for PG13 as .txt. Yeah It is of course the relevant check should be fixed. The attached v5 adjusts 019_replslot_limit.pl. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 4909db2893f0b33ab6bca1ffc3ad802cd159c577 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 24 Dec 2021 12:52:07 +0900 Subject: [PATCH v5 1/2] Make a message on process termination more dscriptive The message at process termination due to slot limit doesn't provide the reason. In the major scenario the message is followed by another message about slot invalidatation, which shows the detail for the termination. However the second message is missing if the slot is temporary one. Augment the first message with the reason same as the second message. Backpatch through to 13 where the message was introduced. Reported-by: Alex Enachioaie Author: Kyotaro Horiguchi Reviewed-by: Ashutosh Bapat Reviewed-by: Bharath Rupireddy Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org Backpatch-through: 13 --- src/backend/replication/slot.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index caa6b29756..92f19bcb35 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1300,8 +1300,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\"", -active_pid, NameStr(slotname; + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", +active_pid, NameStr(slotname), +LSN_FORMAT_ARGS(restart_lsn; (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0 >From 7a9e571e8cf4e5ffc13d101733c1b6fc455b1aec Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 24 Dec 2021 12:58:25 +0900 Subject: [PATCH v5 2/2] Add detailed information to slot-invalidation messages The messages says just "exceeds max_slot_wal_keep_size" but doesn't tell the concrete LSN of the limit. That information helps operators' understanding on the issue. Author: Kyotaro Horiguchi Reviewed-by: Ashutosh Bapat Reviewed-by: Masahiko Sawada Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org --- src/backend/replication/slot.c| 12 src/test/recovery/t/019_replslot_limit.pl | 2 +- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 92f19bcb35..be21b62add 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1300,9 +1300,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X", active_pid, NameStr(slotname), -LSN_FORMAT_ARGS(restart_lsn; +LSN_FORMAT_ARGS(restart_lsn), +LSN_FORMAT_ARGS(oldestLSN)), + errhint("You might need to increase max_slot_wal_keep_size."))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; @@ -1345,9 +1347,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ReplicationSlotRelease(); ereport(LOG, - (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", + (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X", NameStr(slotname), - LSN_FORMAT_ARGS(restart_lsn; + LSN_FORMAT_ARGS(restart_lsn), + LSN_FORMAT_ARGS(oldestLSN)), + errhint("You might need to increase max_slot_wal_keep_size."))); /* done with this slot for now */ break; diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl index 9bb71b62c0..bac496a9cf 100644 --- a/src/test/recovery/t/019_replslot_limit.pl +++ b/src/test/recovery/t/019_replslot_limit.pl @@ -188,7 +188,7 @@ for (my $i = 0; $i < 1; $i++) { if (find_in_log( $node_primary, - "invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds max_slot_wal_keep_size", + "invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds the limit [0-9A-F/]+", $logstart)) { $invalidated = 1; -- 2.27.0
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
At Wed, 02 Mar 2022 16:59:09 +0900 (JST), Kyotaro Horiguchi wrote in > At Thu, 20 Jan 2022 17:19:04 +0900 (JST), Kyotaro Horiguchi > wrote in > > At Thu, 20 Jan 2022 15:07:22 +0900 (JST), Kyotaro Horiguchi > > wrote in > > CI now likes this version for all platforms. > > An xlog.c refactoring happend recently hit this. > Just rebased on the change. A function added to Util.pm used perl2host, which has been removed recently. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From bb714659adcde5265974c46b061966e5dfc556be Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 11 Nov 2021 20:42:00 +0900 Subject: [PATCH v17 1/3] Add tablespace support to TAP framework TAP framework doesn't support nodes that have tablespaces. Especially backup and initialization from backups failed if the source node has tablespaces. This commit provides simple way to create tablespace directories and allows backup routines to handle tablespaces. --- src/test/perl/PostgreSQL/Test/Cluster.pm | 264 ++- src/test/perl/PostgreSQL/Test/Utils.pm | 42 2 files changed, 304 insertions(+), 2 deletions(-) diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index be05845248..15d57b9a71 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -298,6 +298,64 @@ sub archive_dir =pod +=item $node->tablespace_storage([, nocreate]) + +Diretory to store tablespace directories. +If nocreate is true, returns undef if not yet created. + +=cut + +sub tablespace_storage +{ + my ($self, $nocreate) = @_; + + if (!defined $self->{_tsproot}) + { + # tablespace is not used, return undef if nocreate is specified. + return undef if ($nocreate); + + # create and remember the tablespae root directotry. + $self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short(); + } + + return $self->{_tsproot}; +} + +=pod + +=item $node->tablespaces() + +Returns a hash from tablespace OID to tablespace directory name. For +example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as +$hash{16384} = "ts1". + +=cut + +sub tablespaces +{ + my ($self) = @_; + my $pg_tblspc = $self->data_dir . '/pg_tblspc'; + my %ret; + + # return undef if no tablespace is used + return undef if (!defined $self->tablespace_storage(1)); + + # collect tablespace entries in pg_tblspc directory + opendir(my $dir, $pg_tblspc); + while (my $oid = readdir($dir)) + { + next if ($oid !~ /^([0-9]+)$/); + my $linkpath = "$pg_tblspc/$oid"; + my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath); + $ret{$oid} = File::Basename::basename($tsppath); + } + closedir($dir); + + return %ret; +} + +=pod + =item $node->backup_dir() The output path for backups taken with $node->backup() @@ -313,6 +371,77 @@ sub backup_dir =pod +=item $node->backup_tablespace_storage_path(backup_name) + +Returns tablespace location path for backup_name. +Retuns the parent directory if backup_name is not given. + +=cut + +sub backup_tablespace_storage_path +{ + my ($self, $backup_name) = @_; + my $dir = $self->backup_dir . '/__tsps'; + + $dir .= "/$backup_name" if (defined $backup_name); + + return $dir; +} + +=pod + +=item $node->backup_create_tablespace_storage(backup_name) + +Create tablespace location directory for backup_name if not yet. +Create the parent tablespace storage that holds all location +directories if backup_name is not supplied. + +=cut + +sub backup_create_tablespace_storage +{ + my ($self, $backup_name) = @_; + my $dir = $self->backup_tablespace_storage_path($backup_name); + + File::Path::make_path $dir if (! -d $dir); +} + +=pod + +=item $node->backup_tablespaces(backup_name) + +Returns a reference to hash from tablespace OID to tablespace +directory name of tablespace directory that the specified backup has. +For example, an oid 16384 pointing to ../tsps/backup1/ts1 is stored as +$hash{16384} = "ts1". + +=cut + +sub backup_tablespaces +{ + my ($self, $backup_name) = @_; + my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc'; + my %ret; + + #return undef if this backup holds no tablespaces + return undef if (! -d $self->backup_tablespace_storage_path($backup_name)); + + # scan pg_tblspc directory of the backup + opendir(my $dir, $pg_tblspc); + while (my $oid = readdir($dir)) + { + next if ($oid !~ /^([0-9]+)$/); + my $linkpath = "$pg_tblspc/$oid"; + my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath); + $ret{$oid} = File::Basename::basename($tsppath); + } + closedir($dir); + + return \%ret; +} + +=pod + =item $node->install_path() The configured install path (if any) for the node. @@ -370,6 +499,7 @@ sub info print $fh "Data directory: " . $self->data_dir . "\n"; print $fh &qu
Re: Add CHECKPOINT_REQUESTED flag to the log message in LogCheckpointStart()
At Wed, 2 Mar 2022 18:18:10 +0530, Bharath Rupireddy wrote in > On Wed, Mar 2, 2022 at 5:41 PM Nitin Jadhav > wrote: > > > > Hi, > > > > I have noticed that the CHECKPOINT_REQUESTED flag information is not > > present in the log message of LogCheckpointStart() function. I would > > like to understand if it was missed or left intentionally. The log > > message describes all the possible checkpoint flags except > > CHECKPOINT_REQUESTED flag. I feel we should support this. Thoughts? > > I don't think that's useful. Being in LogCheckpointStart > (CreateCheckPoint or CreateRestartPoint) itself means that somebody > has requested a checkpoint. Having CHECKPOINT_REQUESTED doesn't add > any value. Agreed. > I would suggest removing the CHECKPOINT_REQUESTED flag as it's not > being used anywhere instead CheckpointerShmem->ckpt_flags is used as > an indication of the checkpoint requested in CheckpointerMain [1]. If Actually no one does but RequestCheckpoint() accepts 0 as flags. Checkpointer would be a bit more complex without CHECKPOINT_REQUESTED. I don't think it does us any good to get rid of the flag value. > others don't agree to remove as it doesn't cause any harm, then, I > would add something like this for more readability: if (((volatile CheckpointerShmemStruct *) - CheckpointerShmem)->ckpt_flags) + CheckpointerShmem)->ckpt_flags) & CHECKPOINT_REQUESTED)) I don't particularly object to this, but I don't think that change makes the code significantly easier to read either. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Changing "Hot Standby" to "hot standby"
At Wed, 2 Mar 2022 15:22:44 +, "Daniel Westermann (DWE)" wrote in > > Pretty sure that for titles we should keep English capitalization rules. > > Done like that. Thanks for taking a look. -Hot Standby feedback propagates upstream, whatever the cascaded arrangement. +hot standby feedback propagates upstream, whatever the cascaded arrangement -Hot Standby is the term used to describe the ability to connect to +hot standby is the term used to describe the ability to connect to They look like decapitalizing the first word in a sentsnce. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
ID_PARAMETER_VALUE), > errmsg("WAL record start LSN must be less than end LSN"))); I don't think that validations are worth doing at least for the first two, as far as that value has a special meaning. I see it useful if pg_get_wal_records_info() means dump the all available records for the moment, or records of the last segment, page or something. > > == > > > > > > + if (loc <= read_upto) > > + break; > > + > > + /* Let's not wait for WAL to be available if > > indicated */ > > + if (loc > read_upto && > > + state->private_data != NULL) > > + { > > > > Why loc > read_upto? The first if condition is (loc <= read_upto) > > followed by the second if condition - (loc > read_upto). Is the first > > if condition (loc <= read_upto) not enough to indicate that loc > > > read_upto? > > Yeah, that's unnecessary, I improved the comment there and removed loc > > read_upto. > > > == > > > > +#define IsEndOfWALReached(state) \ > > + (state->private_data != NULL && \ > > + (((ReadLocalXLOGPage2Private *) > > xlogreader->private_data)->no_wait == true) && \ > > + (((ReadLocalXLOGPage2Private *) > > xlogreader->private_data)->reached_end_of_wal == true)) > > > > I think we should either use state or xlogreader. First line says > > state->private_data and second line xlogreader->private_data. > > I've changed it to use state instead of xlogreader. > > > == > > > > + (((ReadLocalXLOGPage2Private *) > > xlogreader->private_data)->reached_end_of_wal == true)) > > + > > > > There is a new patch coming to make the end of WAL messages less > > scary. It introduces the EOW flag in xlogreaderstate maybe we can use > > that instead of introducing new flags in private area to represent the > > end of WAL. > > Yeah that would be great. But we never know which one gets committed > first. Until then it's not good to have dependencies on two "on-going" > patches. Later, we can change. > > > == > > > > +/* > > + * XLogReaderRoutine->page_read callback for reading local xlog files > > + * > > + * This function is same as read_local_xlog_page except that it works in > > both > > + * wait and no wait mode. The callers can specify about waiting in > > private_data > > + * of XLogReaderState. > > + */ > > +int > > +read_local_xlog_page_2(XLogReaderState *state, XLogRecPtr targetPagePtr, > > + int reqLen, XLogRecPtr > > targetRecPtr, char *cur_page) > > +{ > > + XLogRecPtr read_upto, > > > > Do we really need this function? Can't we make use of an existing WAL > > reader function - read_local_xlog_page()? > > I clearly explained the reasons upthread [2]. Please let me know if > you have more thoughts/doubts here, we can connect offlist. *I* also think the function is not needed, as explained above. Why do we need that function while we know how far we can read WAL records *before* calling the function? > Attaching v6 patch set with above review comments addressed. Please > review it further. > > [1] > https://www.postgresql.org/message-id/CAE9k0P%3D9SReU_613TXytZmpwL3ZRpnC5zrf96UoNCATKpK-UxQ%40mail.gmail.com > +PG_FUNCTION_INFO_V1(pg_get_raw_wal_record); > +PG_FUNCTION_INFO_V1(pg_get_first_valid_wal_record_lsn); > +PG_FUNCTION_INFO_V1(pg_verify_raw_wal_record); > +PG_FUNCTION_INFO_V1(pg_get_wal_record_info); > +PG_FUNCTION_INFO_V1(pg_get_wal_records_info); > > I think we should allow all these functions to be executed in wait and > *nowait* mode. If a user specifies nowait mode, the function should > return if no WAL data is present, rather than waiting for new WAL data > to become available, default behaviour could be anything you like. > > [2] > https://www.postgresql.org/message-id/CALj2ACUtqWX95uAj2VNJED0PnixEeQ%3D0MEzpouLi%2Bzd_iTugRA%40mail.gmail.com > I've added a new function read_local_xlog_page_2 (similar to > read_local_xlog_page but works in wait and no wait mode) and the > callers can specify whether to wait or not wait using private_data. > Actually, I wanted to use the private_data structure of > read_local_xlog_page but the logical decoding already has context as > private_data, that is why I had to have a new function. I know it > creates a bit of duplicate code, but its cleaner than using > backend-local variables or additional flags in XLogReaderState or > adding wait/no-wait boolean to page_read callback. Any other > suggestions are welcome here. > > With this, I'm able to have wait/no wait versions for any functions. > But for now, I'm having wait/no wait for two functions > (pg_get_wal_records_info and pg_get_wal_stats) for which it makes more > sense. > > Regards, > Bharath Rupireddy. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Why do spgbuildempty(), btbuildempty(), and blbuildempty() use smgrwrite()?
At Wed, 2 Mar 2022 20:07:14 -0500, Melanie Plageman wrote in > If you enable the CHECK_WRITE_VS_EXTEND-protected assert in mdwrite(), > you'll trip it anytime you exercise btbuildempty(), blbuildempty(), or > spgbuildempty(). > > In this case, it may not make any meaningful difference if smgrwrite() > or smgrextend() is called (_mdfd_getseg() behavior won't differ even > with the different flags, so really only the FileWrite() wait event will > be different). > However, it seems like it should still be changed to call smgrextend(). > Or, since they only write a few pages, why not use shared buffers like > gistbuildempty() and brinbuildempty() do? > > I've attached a patch to move these three into shared buffers. > > And, speaking of spgbuildempty(), there is no test exercising it in > check-world, so I've attached a patch to do so. I wasn't sure if it > belonged in spgist.sql or create_index_spgist.sql (on name alone, seems > like the latter but based on content, seems like the former). > > - Melanie I didn't dig into your specific isssue, but I'm mildly opposed to moving them onto shared buffers. They are cold images of init-fork, which is actually no-use for active servers. Rather I'd like to move brinbuildempty out of shared buffers considering one of my proposing patches.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add CHECKPOINT_REQUESTED flag to the log message in LogCheckpointStart()
At Thu, 3 Mar 2022 10:27:10 +0900, Michael Paquier wrote in > On Thu, Mar 03, 2022 at 09:39:37AM +0900, Kyotaro Horiguchi wrote: > > At Wed, 2 Mar 2022 18:18:10 +0530, Bharath Rupireddy > > wrote in > >> I don't think that's useful. Being in LogCheckpointStart > >> (CreateCheckPoint or CreateRestartPoint) itself means that somebody > >> has requested a checkpoint. Having CHECKPOINT_REQUESTED doesn't add > >> any value. > > > > Agreed. > > Exactly my impression. This would apply now to the WAL shutdown code > paths, and I'd suspect that the callers of CreateCheckPoint() are not > going to increase soon. The point is: the logs already provide some > contexts for any of those callers so I see no need for this additional > information. > > > Actually no one does but RequestCheckpoint() accepts 0 as flags. > > Checkpointer would be a bit more complex without CHECKPOINT_REQUESTED. > > I don't think it does us any good to get rid of the flag value. > > I'd rather keep this code as-is. I fail to identify the nuance of the phrase, so just for a clarification. In short, I think we should keep the exiting code as-is. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Changing "Hot Standby" to "hot standby"
At Thu, 3 Mar 2022 06:55:43 +, "Daniel Westermann (DWE)" wrote in > Hi Kyotaro, > > >> > >>- Hot Standby is the term used to describe the ability to connect to > >>+ hot standby is the term used to describe the ability to connect to > > >They look like decapitalizing the first word in a sentsnce. > > Thanks for having a look. Are you suggesting to change it like this? > -Hot Standby is the term used to describe the ability to connect to > +Hot standby is the term used to describe the ability to connect to Yes. Isn't it the right form of a sentence? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector
At Wed, 2 Mar 2022 18:16:00 -0800, Andres Freund wrote in > Hi, > > On 2021-07-26 18:27:54 -0700, Andres Freund wrote: > > I had intended to post a rebase by now. But while I did mostly finish > > that (see [1]) I unfortunately encountered a new issue around > > partitioned tables, see [2]. Currently I'm hoping for a few thoughts on > > that thread about the best way to address the issues. > > Now that > https://postgr.es/m/20220125063131.4cmvsxbz2tdg6g65%40alap3.anarazel.de > is resolved, here's a rebased version. With a good bit of further cleanup. > > One "big" thing that I'd like to figure out is a naming policy for the > different types prefixed with PgStat. We have different groups of types: > > - "pending statistics", that are accumulated but not yet submitted to the > shared stats system, like PgStat_TableStatus, PgStat_BackendFunctionEntry > etc > - accumulated statistics like PgStat_StatDBEntry, PgStat_SLRUStats. About > half are > prefixed with PgStat_Stat, the other just with PgStat_ > - random other types like PgStat_Single_Reset_Type, ... > > To me it's very confusing to have these all in an essentially undistinguishing > namespace, particularly the top two items. Profoundly agreed. It was always a pain in the neck. > I think we should at least do s/PgStat_Stat/PgStat_/. Perhaps we should use a > distinct PgStatPending_* for the pending item? I can't quite come up with a > good name for the "accumulated" ones. How about naming "pending stats" as just "Stats" and the "acculumated stats" as "counts" or "counters"? "Counter" doesn't reflect the characteristics so exactly but I think the discriminability of the two is more significant. Specifically; - PgStat_TableStatus + PgStat_TableStats - PgStat_BackendFunctionEntry + PgStat_FunctionStats - PgStat_GlobalStats + PgStat_GlobalCounts - PgStat_ArchiverStats + PgStat_ArchiverCounts - PgStat_BgWriterStats + PgStat_BgWriterCounts Moving to shared stats collector turns them into attributed by "Local" and "Shared". (I don't consider the details at this stage.) PgStatLocal_TableStats PgStatLocal_FunctionStats PgStatLocal_GlobalCounts PgStatLocal_ArchiverCounts PgStatLocal_BgWriterCounts PgStatShared_TableStats PgStatShared_FunctionStats PgStatShared_GlobalCounts PgStatShared_ArchiverCounts PgStatShared_BgWriterCounts PgStatLocal_GlobalCounts somewhat looks odd, but doesn't matter much, maybe. > I'd like that get resolved first because I think that'd allow commiting the > prepatory split and reordering patches. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
At Wed, 02 Mar 2022 19:31:24 +0900 (JST), Kyotaro Horiguchi wrote in > A function added to Util.pm used perl2host, which has been removed > recently. And same function contained a maybe-should-have-been-removed line which makes Windows build unhappy. This should make all platforms in the CI happy. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From ee17f0f4400ce484cdba80c84744ae47d68c6fa4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 11 Nov 2021 20:42:00 +0900 Subject: [PATCH v18 1/3] Add tablespace support to TAP framework TAP framework doesn't support nodes that have tablespaces. Especially backup and initialization from backups failed if the source node has tablespaces. This commit provides simple way to create tablespace directories and allows backup routines to handle tablespaces. --- src/test/perl/PostgreSQL/Test/Cluster.pm | 264 ++- src/test/perl/PostgreSQL/Test/Utils.pm | 42 2 files changed, 304 insertions(+), 2 deletions(-) diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index be05845248..15d57b9a71 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -298,6 +298,64 @@ sub archive_dir =pod +=item $node->tablespace_storage([, nocreate]) + +Diretory to store tablespace directories. +If nocreate is true, returns undef if not yet created. + +=cut + +sub tablespace_storage +{ + my ($self, $nocreate) = @_; + + if (!defined $self->{_tsproot}) + { + # tablespace is not used, return undef if nocreate is specified. + return undef if ($nocreate); + + # create and remember the tablespae root directotry. + $self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short(); + } + + return $self->{_tsproot}; +} + +=pod + +=item $node->tablespaces() + +Returns a hash from tablespace OID to tablespace directory name. For +example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as +$hash{16384} = "ts1". + +=cut + +sub tablespaces +{ + my ($self) = @_; + my $pg_tblspc = $self->data_dir . '/pg_tblspc'; + my %ret; + + # return undef if no tablespace is used + return undef if (!defined $self->tablespace_storage(1)); + + # collect tablespace entries in pg_tblspc directory + opendir(my $dir, $pg_tblspc); + while (my $oid = readdir($dir)) + { + next if ($oid !~ /^([0-9]+)$/); + my $linkpath = "$pg_tblspc/$oid"; + my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath); + $ret{$oid} = File::Basename::basename($tsppath); + } + closedir($dir); + + return %ret; +} + +=pod + =item $node->backup_dir() The output path for backups taken with $node->backup() @@ -313,6 +371,77 @@ sub backup_dir =pod +=item $node->backup_tablespace_storage_path(backup_name) + +Returns tablespace location path for backup_name. +Retuns the parent directory if backup_name is not given. + +=cut + +sub backup_tablespace_storage_path +{ + my ($self, $backup_name) = @_; + my $dir = $self->backup_dir . '/__tsps'; + + $dir .= "/$backup_name" if (defined $backup_name); + + return $dir; +} + +=pod + +=item $node->backup_create_tablespace_storage(backup_name) + +Create tablespace location directory for backup_name if not yet. +Create the parent tablespace storage that holds all location +directories if backup_name is not supplied. + +=cut + +sub backup_create_tablespace_storage +{ + my ($self, $backup_name) = @_; + my $dir = $self->backup_tablespace_storage_path($backup_name); + + File::Path::make_path $dir if (! -d $dir); +} + +=pod + +=item $node->backup_tablespaces(backup_name) + +Returns a reference to hash from tablespace OID to tablespace +directory name of tablespace directory that the specified backup has. +For example, an oid 16384 pointing to ../tsps/backup1/ts1 is stored as +$hash{16384} = "ts1". + +=cut + +sub backup_tablespaces +{ + my ($self, $backup_name) = @_; + my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc'; + my %ret; + + #return undef if this backup holds no tablespaces + return undef if (! -d $self->backup_tablespace_storage_path($backup_name)); + + # scan pg_tblspc directory of the backup + opendir(my $dir, $pg_tblspc); + while (my $oid = readdir($dir)) + { + next if ($oid !~ /^([0-9]+)$/); + my $linkpath = "$pg_tblspc/$oid"; + my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath); + $ret{$oid} = File::Basename::basename($tsppath); + } + closedir($dir); + + return \%ret; +} + +=pod + =item $node->install_path() The configured install path (if any) for the node. @@ -370,6 +499,7 @@ sub info print $fh "Data directory: " . $self->data_dir . "\n"; print $fh "Backup directory: " . $self->backup_dir . "\n"; print $fh "Archive directory: " . $self->archive_dir . "\n"; + print $
Re: Make mesage at end-of-recovery less scary.
At Thu, 3 Mar 2022 15:39:44 +0530, Ashutosh Sharma wrote in > The new changes made in the patch look good. Thanks to the recent > changes to speed WAL insertion that have helped us catch this bug. Thanks for the quick checking. > One small comment: > > record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); > - total_len = record->xl_tot_len; > > Do you think we need to change the position of the comments written > for above code that says: Yeah, I didn't do that since it is about header verification. But as you pointed, the result still doesn't look perfect. On second thought the two seems repeating the same things. Thus I merged the two comments together. In this verion 16 it looks like this. > /* >* Validate the record header. >* >* Even though we use an XLogRecord pointer here, the whole record > header >* might not fit on this page. If the whole record header is on this > page, >* validate it immediately. Even otherwise xl_tot_len must be on this > page >* (it is the first field of MAXALIGNed records), but we still cannot >* access any further fields until we've verified that we got the whole >* header, so do just a basic sanity check on record length, and > validate >* the rest of the header after reading it from the next page. The > length >* check is necessary here to ensure that we enter the "Need to > reassemble >* record" code path below; otherwise we might fail to apply >* ValidXLogRecordHeader at all. >*/ > record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); > > if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 00d848df6bb8b9966dfbd39c98a388fda42a3e3c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v16] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlogreader.c | 144 +- src/backend/access/transam/xlogrecovery.c | 92 ++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 +- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 6 files changed, 305 insertions(+), 58 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..bd0f211a23 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength); static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -371,25 +375,21 @@ restart: Assert(pageHeaderSize <= readOff); /* - * Read the record length. + * Validate the record header. * - * NB: Even though we use an XLogRecord pointer here, the whole record - * header might not fit on this page. xl_tot_len is the first field of the - * struct, so it must be on this page (the records are MAXALIGNed), but we - * cannot access any other fields until we've verified that we got the - * whole header. - */ - record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; - - /* - * If the whole record header is on this page, validate it immediately. - * Otherwise do just a basic sanity check on xl_tot_len, and validate the - * rest of the header
Re: pg_stop_backup() v2 incorrectly marked as proretset
At Fri, 4 Mar 2022 10:09:19 +0900, Michael Paquier wrote in > On Thu, Mar 03, 2022 at 04:40:42PM -0500, Tom Lane wrote: > > The point is to make it clear that the macro isn't intended to affect > > code outside the function. Since C lacks block-scoped macros, > > there's no other way to do that. > > > > I concede that a lot of our code is pretty sloppy about this, but > > that doesn't make it a good practice. > > Well, if we change that, better to do that in all the places where > this would be affected, but I am not sure to see a style appealing > enough on this thread. > > From what I can see, history shows that the style of using a #define > for the number of columns originates from da2c1b8, aka 9.0. Its use > inside a function originates from a755ea3 as of 9.1 and then it has > just spread around without any undefs, so it looks like people like > that way of doing things. I'm one of them. Not unliking #undef, though. It seems to me the name "PG_STOP_BACKUP_V2_COLS" alone is specific enough for the purpose of avoiding misuse. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
false failure of test_docoding regression test
Hello. The CF-CI complained on one of my patch for seemingly a reason unrelated to the patch. https://cirrus-ci.com/task/5544213843542016?logs=test_world#L1666 > diff -U3 > /tmp/cirrus-ci-build/contrib/test_decoding/expected/slot_creation_error.out > /tmp/cirrus-ci-build/contrib/test_decoding/output_iso/results/slot_creation_error.out > --- > /tmp/cirrus-ci-build/contrib/test_decoding/expected/slot_creation_error.out > 2022-03-03 22:45:04.708072000 + > +++ > /tmp/cirrus-ci-build/contrib/test_decoding/output_iso/results/slot_creation_error.out > 2022-03-03 22:54:49.621351000 + > @@ -96,13 +96,13 @@ > t > (1 row) > > +step s1_c: COMMIT; > step s2_init: <... completed> > FATAL: terminating connection due to administrator command > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > > -step s1_c: COMMIT; > step s1_view_slot: > SELECT slot_name, slot_type, active FROM pg_replication_slots WHERE > slot_name = 'slot_creation_error' This comes from the permuattion 'permutation s1_b s1_xid s2_init s1_terminate_s2 s1_c s1_view_slot'. That means the process termination by s1_terminate_s2 is a bit delayed until the next s1_c ends. So it is rare false failure but it is annoying enough on the CI. It seems to me we need to wait for process termination at the time. postgres_fdw does that in regression test. Thoughts? Simliar use is found in temp-schema-cleanup. There's another possible instability between s2_advisory and s2_check_schema but this change alone reduces the chance for false failures. The attached fixes the both points. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 82cdee92c3726a7f248849a310ec3f392ba02384 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 4 Mar 2022 11:03:12 +0900 Subject: [PATCH] Wait for process termination during isolation tests slot_creation_error.spec and temp-schema-cleanup.spec used pg_terminate_backend() without specifying timeout, thus there may be a case of proceeding to the next step before the process actually terminates then false failure. Supply timeout to pg_terminate_backend() so that it waits for process termination to avoid that failure mode. --- contrib/test_decoding/expected/slot_creation_error.out | 2 +- contrib/test_decoding/specs/slot_creation_error.spec | 2 +- src/test/isolation/expected/temp-schema-cleanup.out| 2 +- src/test/isolation/specs/temp-schema-cleanup.spec | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/contrib/test_decoding/expected/slot_creation_error.out b/contrib/test_decoding/expected/slot_creation_error.out index 043bdae0a2..3707482cb8 100644 --- a/contrib/test_decoding/expected/slot_creation_error.out +++ b/contrib/test_decoding/expected/slot_creation_error.out @@ -87,7 +87,7 @@ step s2_init: SELECT 'init' FROM pg_create_logical_replication_slot('slot_creation_error', 'test_decoding'); step s1_terminate_s2: -SELECT pg_terminate_backend(pid) +SELECT pg_terminate_backend(pid, 18) FROM pg_stat_activity WHERE application_name = 'isolation/slot_creation_error/s2'; diff --git a/contrib/test_decoding/specs/slot_creation_error.spec b/contrib/test_decoding/specs/slot_creation_error.spec index 6816696b9d..32161b9e7f 100644 --- a/contrib/test_decoding/specs/slot_creation_error.spec +++ b/contrib/test_decoding/specs/slot_creation_error.spec @@ -13,7 +13,7 @@ step s1_cancel_s2 { } step s1_terminate_s2 { -SELECT pg_terminate_backend(pid) +SELECT pg_terminate_backend(pid, 18) FROM pg_stat_activity WHERE application_name = 'isolation/slot_creation_error/s2'; } diff --git a/src/test/isolation/expected/temp-schema-cleanup.out b/src/test/isolation/expected/temp-schema-cleanup.out index 35b91d9e45..cb4302739a 100644 --- a/src/test/isolation/expected/temp-schema-cleanup.out +++ b/src/test/isolation/expected/temp-schema-cleanup.out @@ -83,7 +83,7 @@ exec (1 row) step s1_exit: -SELECT pg_terminate_backend(pg_backend_pid()); +SELECT pg_terminate_backend(pg_backend_pid(), 18); FATAL: terminating connection due to administrator command server closed the connection unexpectedly diff --git a/src/test/isolation/specs/temp-schema-cleanup.spec b/src/test/isolation/specs/temp-schema-cleanup.spec index a9417b7e90..f0d3928996 100644 --- a/src/test/isolation/specs/temp-schema-cleanup.spec +++ b/src/test/isolation/specs/temp-schema-cleanup.spec @@ -47,7 +47,7 @@ step s1_discard_temp { } step s1_exit { -SELECT pg_terminate_backend(pg_backend_pid()); +SELECT pg_terminate_backend(pg_backend_pid(), 18); } -- 2.27.0
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Mon, 14 Feb 2022 14:52:15 +0900 (JST), Kyotaro Horiguchi wrote in > In this version , 0001 gets one fix and two comment updates. While the disucssion on back-patching of 0001 proceeding on another branch, the main patch iself gets looks like as if rotten on CF-App. So I rebased v10 on the current master. 0001 is replaced by an adjusted patch based on the latest "control file update fix" patch. If someone wants to voice on the message-fix patches (0002-0004), be our guest. 0005 also wants opinions. 0001: Fix possible incorrect controlfile update that leads to unrecoverable database. 0002: Add REDO/Checkpiont LSNs to checkpoinkt-end log message. (The main patch in this thread) 0003: Replace (WAL-)location to LSN in pg_controldata. 0004: Replace (WAL-)location to LSN in user-facing texts. (This doesn't reflect my recent comments.) 0005: Unhyphenate the word archive-recovery and similars. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 0bf164c66ba2c27856c014ed1d452e5e07678541 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 4 Mar 2022 13:18:30 +0900 Subject: [PATCH v11 1/5] Correctly update contfol file at the end of archive recovery CreateRestartPoint runs WAL file cleanup basing on the checkpoint just have finished in the function. If the database has exited DB_IN_ARCHIVE_RECOVERY state when the function is going to update control file, the function refrains from updating the file at all then proceeds to WAL cleanup having the latest REDO LSN, which is now inconsistent with the control file. As the result, the succeeding cleanup procedure overly removes WAL files against the control file and leaves unrecoverable database until the next checkpoint finishes. Along with that fix, we remove a dead code path for the case some other process ran a simultaneous checkpoint. It seems like just a preventive measure but it's no longer useful because we are sure that checkpoint is performed only by checkpointer except single process mode. --- src/backend/access/transam/xlog.c | 72 --- 1 file changed, 47 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 0d2bd7a357..3987aa81de 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6899,6 +6899,9 @@ CreateRestartPoint(int flags) XLogSegNo _logSegNo; TimestampTz xtime; + /* we don't assume concurrent checkpoint/restartpoint to run */ + Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER); + /* Get a local copy of the last safe checkpoint record. */ SpinLockAcquire(&XLogCtl->info_lck); lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr; @@ -6964,7 +6967,7 @@ CreateRestartPoint(int flags) /* Also update the info_lck-protected copy */ SpinLockAcquire(&XLogCtl->info_lck); - XLogCtl->RedoRecPtr = lastCheckPoint.redo; + XLogCtl->RedoRecPtr = RedoRecPtr; SpinLockRelease(&XLogCtl->info_lck); /* @@ -6983,7 +6986,10 @@ CreateRestartPoint(int flags) /* Update the process title */ update_checkpoint_display(flags, true, false); - CheckPointGuts(lastCheckPoint.redo, flags); + CheckPointGuts(RedoRecPtr, flags); + + /* Update pg_control */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); /* * Remember the prior checkpoint's redo ptr for @@ -6991,30 +6997,29 @@ CreateRestartPoint(int flags) */ PriorRedoPtr = ControlFile->checkPointCopy.redo; + Assert (PriorRedoPtr < RedoRecPtr); + + ControlFile->checkPoint = lastCheckPointRecPtr; + ControlFile->checkPointCopy = lastCheckPoint; + + /* Update control file using current time */ + ControlFile->time = (pg_time_t) time(NULL); + /* - * Update pg_control, using current time. Check that it still shows - * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing; - * this is a quick hack to make sure nothing really bad happens if somehow - * we get here after the end-of-recovery checkpoint. + * Ensure minRecoveryPoint is past the checkpoint record while archive + * recovery is still ongoing. Normally, this will have happened already + * while writing out dirty buffers, but not necessarily - e.g. because no + * buffers were dirtied. We do this because a non-exclusive base backup + * uses minRecoveryPoint to determine which WAL files must be included in + * the backup, and the file (or files) containing the checkpoint record + * must be included, at a minimum. Note that for an ordinary restart of + * recovery there's no value in having the minimum recovery point any + * earlier than this anyway, because redo will begin just after the + * checkpoint record. This is a quick hack to make sure nothing really bad + * happens if somehow we get here after the end-of-recovery checkpoint. */ - LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); - if (ControlFile->state ==
Re: Doc about how to set max_wal_senders when setting minimal wal_level
At Fri, 04 Mar 2022 12:18:29 +0800, Japin Li wrote in > > On Thu, 03 Mar 2022 at 12:10, Japin Li wrote: > > Attach v3 patch to fix missing close varname tag. +A precondition for using minimal WAL is to disable WAL archiving and +streaming replication by setting +to 0, and archive_mode +to off. It is a bit odd that the features to stop and the corresponding GUCs are written irrespectively. It would be better they're in the same order. +servers. If setting max_wal_senders to +0 consider also reducing the amount of WAL produced +by changing wal_level to minimal. Those who anively follow this suggestion may bump into failure when arhive_mode is on. Thus archive_mode is also worth referred to. But, anyway, IMHO, it is mere a performance tips that is not necessarily required in this section, or even in this documentaiotn. Addtion to that, if we write this for max_wal_senders, archive_mode will deserve the similar tips but I think it is too verbose. In short, I think I would not add that description at all. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Thanks to look this! At Fri, 4 Mar 2022 13:51:12 +0900, Michael Paquier wrote i n > On Fri, Mar 04, 2022 at 09:10:48AM +0900, Kyotaro Horiguchi wrote: > > And same function contained a maybe-should-have-been-removed line > > which makes Windows build unhappy. > > > > This should make all platforms in the CI happy. > > d6d317d as solved the issue of tablespace paths across multiple nodes > with the new GUC called allow_in_place_tablespaces, and is getting > successfully used in the recovery tests as of 027_stream_regress.pl. The feature allows only one tablespace directory. but that uses (I'm not sure it needs, though) multiple tablespace directories so I think the feature doesn't work for the test. Maybe I'm missing something, but it doesn't use tablespaces. I see that in 002_tablespace.pl but but the test uses only one tablespace location. > Shouldn't we rely on that rather than extending more our test perl > modules? One tricky part is the emulation of readlink for junction > points on Windows (dir_readlink in your patch), and the root of the Yeah, I don't like that as I said before... > problem is that 0003 cares about the path structure of the > tablespaces so we have no need, as far as I can see, for any > dependency with link follow-up in the scope of this patch. I'm not sure how this related to 0001 but maybe I don't follow this. > This means that you should be able to simplify the patch set, as we > could entirely drop 0001 in favor of enforcing the new dev GUC in the > nodes created in the TAP test of 0002. Maybe it's possible by breaking the test into ones that need only one tablespace. I'll give it a try. > Speaking of 0002, perhaps this had better be in its own file rather > than extending more 011_crash_recovery.pl. 0003 looks like a good Ok, no problem. > idea to check after the consistency of the path structures created > during replay, and it touches paths I'd expect it to touch, as of > database and tbspace redos. > > + if (!reachedConsistency) > + XLogForgetMissingDir(xlrec->ts_id, InvalidOid); > + > + XLogFlush(record->EndRecPtr); > Not sure to understand why this is required. A comment may be in > order to explain the hows and the whys. Is it about XLogFlush? As my understanding it is to update minRecoveryPoint to that LSN. I'll add a comment like that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Fri, 4 Mar 2022 15:44:22 +0900, Michael Paquier wrote in > Hi all, > > While playing with tablespaces and recovery in a TAP test, I have > noticed that retrieving the location of a tablespace created with > allow_in_place_tablespaces enabled fails in pg_tablespace_location(), > because readlink() sees a directory in this case. ERROR: could not read symbolic link "pg_tblspc/16407": Invalid argument > The use may be limited to any automated testing and > allow_in_place_tablespaces is a developer GUC, still it seems to me > that there is an argument to allow the case rather than tweak any > tests to hardcode a path with the tablespace OID. And any other code > paths are able to handle such tablespaces, be they in recovery or in > tablespace create/drop. +1 > A junction point is a directory on WIN32 as far as I recall, but > pgreadlink() is here to ensure that we get the correct path on > a source found as pgwin32_is_junction(), so we can rely on that. This > stuff has led me to the attached. > > Thoughts? The function I think is expected to return a absolute path but it returns a relative path for in-place tablespaces. While it is apparently incovenient for general use, there might be a case where we want to know whether the tablespace is in-place or not. So I'm not sure which is better.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Fri, 04 Mar 2022 16:41:03 +0900 (JST), Kyotaro Horiguchi wrote in > At Fri, 4 Mar 2022 15:44:22 +0900, Michael Paquier > wrote in > > The use may be limited to any automated testing and > > allow_in_place_tablespaces is a developer GUC, still it seems to me > > that there is an argument to allow the case rather than tweak any > > tests to hardcode a path with the tablespace OID. And any other code > > paths are able to handle such tablespaces, be they in recovery or in > > tablespace create/drop. > > +1 By the way, regardless of the patch, I got an error from pg_basebackup for an in-place tablespace. pg_do_start_backup calls readlink believing pg_tblspc/* is always a symlink. # Running: pg_basebackup -D /home/horiguti/work/worktrees/tsp_replay_2/src/test/recovery/tmp_check/t_029_replay_tsp_drops_primary1_data/backup/my_backup -h /tmp/X8E4nbF4en -p 51584 --checkpoint fast --no-sync WARNING: could not read symbolic link "pg_tblspc/16384": Invalid argument regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Fri, 04 Mar 2022 16:54:49 +0900 (JST), Kyotaro Horiguchi wrote in > At Fri, 04 Mar 2022 16:41:03 +0900 (JST), Kyotaro Horiguchi > wrote in > > At Fri, 4 Mar 2022 15:44:22 +0900, Michael Paquier > > wrote in > > > The use may be limited to any automated testing and > > > allow_in_place_tablespaces is a developer GUC, still it seems to me > > > that there is an argument to allow the case rather than tweak any > > > tests to hardcode a path with the tablespace OID. And any other code > > > paths are able to handle such tablespaces, be they in recovery or in > > > tablespace create/drop. > > > > +1 > > By the way, regardless of the patch, I got an error from pg_basebackup > for an in-place tablespace. pg_do_start_backup calls readlink > believing pg_tblspc/* is always a symlink. > > # Running: pg_basebackup -D > /home/horiguti/work/worktrees/tsp_replay_2/src/test/recovery/tmp_check/t_029_replay_tsp_drops_primary1_data/backup/my_backup > -h /tmp/X8E4nbF4en -p 51584 --checkpoint fast --no-sync > WARNING: could not read symbolic link "pg_tblspc/16384": Invalid argument So now we know that there are three places that needs the same processing. pg_tablespace_location: this patch tries to fix sendDir: it already supports in-place tsp do_pg_start_backup: not supports in-place tsp yet. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Fri, 04 Mar 2022 17:28:45 +0900 (JST), Kyotaro Horiguchi wrote in > > By the way, regardless of the patch, I got an error from pg_basebackup > > for an in-place tablespace. pg_do_start_backup calls readlink > > believing pg_tblspc/* is always a symlink. > > > > # Running: pg_basebackup -D > > /home/horiguti/work/worktrees/tsp_replay_2/src/test/recovery/tmp_check/t_029_replay_tsp_drops_primary1_data/backup/my_backup > > -h /tmp/X8E4nbF4en -p 51584 --checkpoint fast --no-sync > > WARNING: could not read symbolic link "pg_tblspc/16384": Invalid argument > > So now we know that there are three places that needs the same > processing. > > pg_tablespace_location: this patch tries to fix > sendDir:it already supports in-place tsp > do_pg_start_backup: not supports in-place tsp yet. And I made a quick hack on do_pg_start_backup. And I found that pg_basebackup copies in-place tablespaces under the *current directory*, which is not ok at all:( regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Fri, 4 Mar 2022 23:26:43 +1300, Thomas Munro wrote in > On Fri, Mar 4, 2022 at 10:04 PM Kyotaro Horiguchi > wrote: > > And I made a quick hack on do_pg_start_backup. And I found that > > pg_basebackup copies in-place tablespaces under the *current > > directory*, which is not ok at all:( > > Hmm. Which OS are you on? Looks OK here -- the "in place" tablespace > gets copied as a directory under pg_tblspc, no symlink: > The warning from readlink() while making the mapping file isn't ideal, > and perhaps we should suppress that with something like the attached. > Or does the missing map file entry break something on Windows? Ah.. Ok, somehow I thought that pg_basebackup failed for readlink failure and the tweak I made made things worse. I got to make it work. Thanks! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
So the new framework has been dropped in this version. The second test is removed as it is irrelevant to this bug. In this version the patch is a single file that contains the test. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 43bb3ba8900edd53a1feb0acb1a72bdc22bb1627 Mon Sep 17 00:00:00 2001 From: P Date: Mon, 7 Mar 2022 17:10:07 +0900 Subject: [PATCH v20] Fix replay of create database records on standby Crash recovery on standby may encounter missing directories when replaying create database WAL records. Prior to this patch, the standby would fail to recover in such a case. However, the directories could be legitimately missing. Consider a sequence of WAL records as follows: CREATE DATABASE DROP DATABASE DROP TABLESPACE If, after replaying the last WAL record and removing the tablespace directory, the standby crashes and has to replay the create database record again, the crash recovery must be able to move on. This patch adds mechanism similar to invalid page hash table, to track missing directories during crash recovery. If all the missing directory references are matched with corresponding drop records at the end of crash recovery, the standby can safely enter archive recovery. Bug identified by Paul Guo. Authored by Paul Guo, Kyotaro Horiguchi and Asim R P. --- src/backend/access/transam/xlogrecovery.c | 6 + src/backend/access/transam/xlogutils.c | 145 src/backend/commands/dbcommands.c | 56 src/backend/commands/tablespace.c | 16 +++ src/include/access/xlogutils.h | 4 + src/test/recovery/t/029_replay_tsp_drops.pl | 62 + 6 files changed, 289 insertions(+) create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index f9f212680b..97fed1e04d 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void) */ XLogCheckInvalidPages(); + /* + * Check if the XLOG sequence contained any unresolved references to + * missing directories. + */ + XLogCheckMissingDirs(); + reachedConsistency = true; ereport(LOG, (errmsg("consistent recovery state reached at %X/%X", diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 54d5f20734..3f8f7dadac 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -79,6 +79,151 @@ typedef struct xl_invalid_page static HTAB *invalid_page_tab = NULL; +/* + * If a create database WAL record is being replayed more than once during + * crash recovery on a standby, it is possible that either the tablespace + * directory or the template database directory is missing. This happens when + * the directories are removed by replay of subsequent drop records. Note + * that this problem happens only on standby and not on master. On master, a + * checkpoint is created at the end of create database operation. On standby, + * however, such a strategy (creating restart points during replay) is not + * viable because it will slow down WAL replay. + * + * The alternative is to track references to each missing directory + * encountered when performing crash recovery in the following hash table. + * Similar to invalid page table above, the expectation is that each missing + * directory entry should be matched with a drop database or drop tablespace + * WAL record by the end of crash recovery. + */ +typedef struct xl_missing_dir_key +{ + Oid spcNode; + Oid dbNode; +} xl_missing_dir_key; + +typedef struct xl_missing_dir +{ + xl_missing_dir_key key; + char path[MAXPGPATH]; +} xl_missing_dir; + +static HTAB *missing_dir_tab = NULL; + +void +XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path) +{ + xl_missing_dir_key key; + bool found; + xl_missing_dir *entry; + + /* + * Database OID may be invalid but tablespace OID must be valid. If + * dbNode is InvalidOid, we are logging a missing tablespace directory, + * otherwise we are logging a missing database directory. + */ + Assert(OidIsValid(spcNode)); + + if (missing_dir_tab == NULL) + { + /* create hash table when first needed */ + HASHCTL ctl; + + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(xl_missing_dir_key); + ctl.entrysize = sizeof(xl_missing_dir); + + missing_dir_tab = hash_create("XLOG missing directory table", + 100, + &ctl, + HASH_ELEM | HASH_BLOBS); + } + + key.spcNode = spcNode; + key.dbNode = dbNode; + + entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found); + + if (found) + { + if (dbNode == InvalidOid) + elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s", + path, spcNode, entry->path); + else + elog(DEBUG2, "missing directory %s (t
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Tue, 8 Mar 2022 10:39:06 +1300, Thomas Munro wrote in > On Tue, Mar 8, 2022 at 12:58 AM Michael Paquier wrote: > > On Fri, Mar 04, 2022 at 11:26:43PM +1300, Thomas Munro wrote: > > > + /* Skip in-place tablespaces (testing use only) */ > > > + if (get_dirent_type(fullpath, de, false, ERROR) == PGFILETYPE_DIR) > > > + continue; > > > > I saw the warning when testing base backups with in-place tablespaces > > and it did not annoy me much, but, yes, that can be confusing. > > > > Junction points are directories, no? Are you sure that this works > > correctly on WIN32? It seems to me that we'd better use readlink() > > only for entries in pg_tlbspc/ that are PGFILETYPE_LNK on non-WIN32 > > and pgwin32_is_junction() on WIN32. > > Thanks, you're right. Test on a Win10 VM. Here's a new version. Thanks! It works for me on CentOS8 and Windows11. FYI, on Windows11, pg_basebackup didn't work correctly without the patch. So this looks like fixing an undiscovered bug as well. === > pg_basebackup -D copy WARNING: could not read symbolic link "pg_tblspc/16384": Invalid argument pg_basebackup: error: tar member has empty name > dir copy Volume in drive C has no label. Volume serial number: 10C6-4BA6 Directory of c:\..\copy 2022/03/08 09:53 . 2022/03/08 09:53 .. 2022/03/08 09:53 0 nbase.tar 2022/03/08 09:53 pg_wal 1 File(s) 0 bytes 3 Dir(s) 171,920,613,376 bytes free regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Tue, 8 Mar 2022 10:28:46 +0900, Michael Paquier wrote in > On Tue, Mar 08, 2022 at 10:06:50AM +0900, Kyotaro Horiguchi wrote: > > At Tue, 8 Mar 2022 10:39:06 +1300, Thomas Munro > > wrote in > >> Thanks, you're right. Test on a Win10 VM. Here's a new version. > > Looks fine to me. > > > FYI, on Windows11, pg_basebackup didn't work correctly without the > > patch. So this looks like fixing an undiscovered bug as well. > > Well, that's not really a long-time bug but just a side effect of > in-place tablespaces because we don't use them in many test cases > yet, is it? No, we don't. So just FYI. > >> pg_basebackup -D copy > > WARNING: could not read symbolic link "pg_tblspc/16384": Invalid argument > > pg_basebackup: error: tar member has empty name > > > >1 File(s) 0 bytes > >3 Dir(s) 171,920,613,376 bytes free > > That's a lot of free space. The laptop has a 512GB storage, so 160GB is pretty normal, maybe. 129GB of the storage is used by some VMs.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Naming of the different stats systems / "stats collector"
At Tue, 8 Mar 2022 15:55:04 -0700, "David G. Johnston" wrote in > On Tue, Mar 8, 2022 at 1:54 PM Andres Freund wrote: > > the immediate question for the patch is what to replace "collector" with. > > > > > Not really following the broader context here so this came out of nowhere > for me. What is the argument for changing the status quo here? Collector > seems like good term. The name "stats collector" is tied with the story that "there is a process that only collects stats data that arrive from working proceses.". We have such modules like bgwriter, checkpointer, walwriter and so on. On the other hand we have many features with no dedicate process instead work on shared storage area as a part of working prcesses. table/column statistics, XLOG, heap, SLUR and so on. In the world where every working process writes statitics to shared meomry area by its own, no such process exists. I think we no longer name it "stats collector". > > The patch currently uses "activity statistics" in a number of places, but > > that > > is confusing too, because pg_stat_activity is a different kind of stats. > > > > Any ideas? > > > > If the complaint is that not all of these statistics modules use the > statistics collector then maybe we say each non-collector module defines an > "Event Listener". Or, and without looking at the source code, have the > collector simply forward events like "reset now" to the appropriate module > but keep the collector as the single point of message interchange for all. > And so "tell the collector about" is indeed the correct phrasing of what > happens. So the collector as a process is going to die. We need alternative name for the non-collector. Metrics, as you mentioned below, sounds good to me. The name "activity stat(istics)?s" is an answer to my desire to discriminate it from "table/column statistics" but I have to admit that it is still not great. > > The postgresql.conf.sample section header seems particularly odd - "index > > statistics"? We collect more data about tables etc. > > > > No argument for bringing the header current. > > > > > A more general point: Our naming around different types of stats is > > horribly > > confused. We have stats describing the current state (e.g. > > pg_stat_activity, > > pg_stat_replication, pg_stat_progress_*, ...) and accumulated stats > > (pg_stat_user_tables, pg_stat_database, etc) in the same namespace. > > Should we > > try to move towards something more coherent, at least going forward? > > > > > I'm not sure trying to improve this going forward, and thus having at least > three categories, is particularly desirable. While it is unfortunate that > we don't have separate pg_metric and pg_status namespaces (combining > pg_stat with pg_status or pg_state, the two obvious choices, would be > undesirable being they all have a shared leading character sequence) that > is where we are today. We are probably stuck with just using the pg_stat > namespace and doing a better job of letting users know about the underlying > implementation choice each pg_stat relation took in order to know whether > what is being reported is considered reliable (self-managed shared memory) > or not (leverages the unreliable collector). In short, deal with this > mainly in documentation/comments and implementation details but leave the > public facing naming alone. > > David J. If we could, I like the namings like pg_metrics.process, pg_metrics.replication, pg_progress.vacuum, pg_progress.basebackup, and pg_stats.database, pg_stats.user_tables.. With such eyes, it looks somewhat odd that pg_stat_* views are belonging to the pg_catalog namespace. If we had system table-aliases, people who insist on the good-old names can live with that. Even if there isn't, we can instead provide views with the old names. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
e ID. > + > > > * Can we mark this extension 'trusted'? I'm not 100% clear on the > > standards for that marker, but it seems reasonable for a database owner > > with the right privileges might want to install it. > > 'trusted' extensions concept is added by commit 50fc694 [3]. Since > pg_walinspect deals with WAL, we strictly want to control who creates > and can execute functions exposed by it, so I don't know if 'trusted' > is a good idea here. Also, pageinspect isn't a 'trusted' extension. > > > * pg_get_raw_wal_record() seems too powerful for pg_monitor. Maybe that > > function should require pg_read_server_files? Or at least > > pg_read_all_data? > > pg_read_all_data may not be the right choice, but pg_read_server_files > is. However, does it sound good if some functions are allowed to be > executed by users with a pg_monitor role and others > pg_get_raw_wal_record by users with pg_read_server_files? Since the > extension itself can be created by superusers, isn't the > pg_get_raw_wal_record sort of safe with pg_mointor itself? > > If hackers don't agree, I'm happy to grant execution on > pg_get_raw_wal_record() to the pg_read_server_files role. > > Attaching the v8 patch-set resolving above comments and some tests for > checking function permissions. Please review it further. > > [1] > https://www.postgresql.org/message-id/CALj2ACWtToUQ5hCCBJP%2BmKeVUcN-g7cMb9XvhAcicPxUDsdcKg%40mail.gmail.com > [2] > https://www.postgresql.org/message-id/CA%2BTgmobYrTgMEF0SV%2ByDYyCCh44DAGjZVs7BYGrD8xD3vwNjHA%40mail.gmail.com > [3] commit 50fc694e43742ce3d04a5e9f708432cb022c5f0d > Author: Tom Lane > Date: Wed Jan 29 18:42:43 2020 -0500 > > Invent "trusted" extensions, and remove the pg_pltemplate catalog. I played with this a bit, and would like to share some thoughts on it. It seems to me too rigorous that pg_get_wal_records_info/stats() reject future LSNs as end-LSN and I think WARNING or INFO and stop at the real end-of-WAL is more kind to users. I think the same with the restriction that start and end LSN are required to be different. The definition of end-lsn is fuzzy here. If I fed a future LSN to the functions, they tell me the beginning of the current insertion point in error message. On the other hand they don't accept the same value as end-LSN. I think it is right that they tell the current insertion point and they should take the end-LSN as the LSN to stop reading. I think pg_get_wal_stats() is worth to have but I think it should be implemented in SQL. Currently pg_get_wal_records_info() doesn't tell about FPI since pg_waldump doesn't but it is internally collected (of course!) and easily revealed. If we do that, the pg_get_wal_records_stats() would be reduced to the following SQL statement SELECT resource_manager resmgr, count(*) AS N, (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N", sum(total_length) AS "combined size", (sum(total_length) * 100 / sum(sum(total_length)) OVER tot)::numeric(5,2) AS "%combined size", sum(fpi_len) AS fpilen, (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS "%fpilen" FROM pg_get_wal_records_info('0/100', '0/175DD7f') GROUP by resource_manager WINDOW tot AS () ORDER BY "combined size" desc; The only difference with pg_waldump is the statement above doesn't show lines for the resource managers that don't contained in the result of pg_get_wal_records_info(). But I don't think that matters. Sometimes the field description has very long (28kb long) content. It makes the result output almost unreadable and I had a bit hard time struggling with the output full of '-'s. I would like have a default limit on the length of such fields that can be long but I'm not sure we want that. The difference between pg_get_wal_record_info and _records_ other than the number of argument is the former accepts incorrect LSNs. The following works, pg_get_wal_record_info('0/100'); pg_get_wal_records_info('0/100'); but this doesn't pg_get_wal_records_info('0/100', '0/100'); > ERROR: WAL start LSN must be less than end LSN But the following works pg_get_wal_records_info('0/100', '0/129'); > 0/128 | 0/0 | 0 So I think we can consolidate the two functions as: - pg_get_wal_records_info('0/100'); (current behavior) find the first record and show all records thereafter. - pg_get_wal_records_info('0/100', '0/1000000'); finds the first record since the start lsn and show it. - pg_get_wal_records_info('0/100', '0/130'); finds the first record since the start lsn then show records up to the end-lsn. And about pg_get_raw_wal_record(). I don't see any use-case of the function alone on SQL interface. Even if we need to inspect broken WAL files, it needs profound knowledge of WAL format and tools that doesn't work on SQL interface. However like pageinspect, if we separate the WAL-record fetching and parsing it could be thought as useful. pg_get_wal_records_info woule be like: SELECT * FROM pg_walinspect_parse(raw) FROM (SELECT * FROM pg_walinspect_get_raw(start_lsn, end_lsn)); And pg_get_wal_stats woule be like: SELECT * FROM pg_walinpect_stat(pg_walinspect_parse(raw)) FROM (SELECT * FROM pg_walinspect_get_raw(start_lsn, end_lsn))); Regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
Sorry, some minor non-syntactical corrections. At Fri, 11 Mar 2022 11:38:22 +0900 (JST), Kyotaro Horiguchi wrote in > I played with this a bit, and would like to share some thoughts on it. > > It seems to me too rigorous that pg_get_wal_records_info/stats() > reject future LSNs as end-LSN and I think WARNING or INFO and stop at > the real end-of-WAL is more kind to users. I think the same with the > restriction that start and end LSN are required to be different. > > The definition of end-lsn is fuzzy here. If I fed a future LSN to the > functions, they tell me the beginning of the current insertion point > in error message. On the other hand they don't accept the same > value as end-LSN. I think it is right that they tell the current > insertion point and they should take the end-LSN as the LSN to stop > reading. > > I think pg_get_wal_stats() is worth to have but I think it should be > implemented in SQL. Currently pg_get_wal_records_info() doesn't tell > about FPI since pg_waldump doesn't but it is internally collected (of > course!) and easily revealed. If we do that, the > pg_get_wal_records_stats() would be reduced to the following SQL > statement > > SELECT resource_manager resmgr, >count(*) AS N, > (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N", > sum(total_length) AS "combined size", > (sum(total_length) * 100 / sum(sum(total_length)) OVER > tot)::numeric(5,2) AS "%combined size", > sum(fpi_len) AS fpilen, > (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS > "%fpilen" > FROM pg_get_wal_records_info('0/100', '0/175DD7f') > GROUP by resource_manager > WINDOW tot AS () > ORDER BY "combined size" desc; > > The only difference with pg_waldump is the statement above doesn't > show lines for the resource managers that don't contained in the > result of pg_get_wal_records_info(). But I don't think that matters. > > > Sometimes the field description has very long (28kb long) content. It > makes the result output almost unreadable and I had a bit hard time > struggling with the output full of '-'s. I would like have a default > limit on the length of such fields that can be long but I'm not sure > we want that. > > - The difference between pg_get_wal_record_info and _records_ other than - the number of argument is the former accepts incorrect LSNs. The discussion is somewhat confused after some twists and turns.. It should be something like the following. pg_get_wal_record_info and pg_get_wal_records_info are almost same since the latter can show a single record. However it is a bit annoying to do that. Since, other than it doens't accept same LSNs for start and end, it doesn't show a record when there' no record in the specfied LSN range. But I don't think there's no usefulness of the behavior. The following works, pg_get_wal_record_info('0/100'); pg_get_wal_records_info('0/100'); but this doesn't pg_get_wal_records_info('0/100', '0/100'); > ERROR: WAL start LSN must be less than end LSN And the following shows no records. pg_get_wal_records_info('0/100', '0/101'); pg_get_wal_records_info('0/100', '0/128'); But the following works pg_get_wal_records_info('0/100', '0/129'); > 0/128 | 0/0 | 0 > So I think we can consolidate the two functions as: > > - pg_get_wal_records_info('0/100'); > > (current behavior) find the first record and show all records > thereafter. > > - pg_get_wal_records_info('0/100', '0/100'); > > finds the first record since the start lsn and show it. > > - pg_get_wal_records_info('0/100', '0/130'); > > finds the first record since the start lsn then show records up to > the end-lsn. > > > And about pg_get_raw_wal_record(). I don't see any use-case of the > function alone on SQL interface. Even if we need to inspect broken > WAL files, it needs profound knowledge of WAL format and tools that > doesn't work on SQL interface. > > However like pageinspect, if we separate the WAL-record fetching and > parsing it could be thought as useful. > > pg_get_wal_records_info woule be like: > > SELECT * FROM pg_walinspect_parse(raw) > FROM (SELECT * FROM pg_walinspect_get_raw(start_lsn, end_lsn)); > > And pg_get_wal_stats woule be like: > > SELECT * FROM pg_walinpect_stat(pg_walinspect_parse(raw)) > FROM (SELECT * FROM pg_walinspect_get_raw(start_lsn, end_lsn))); Regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov wrote in > В Вт, 01/03/2022 в 10:24 +0300, Yura Sokolov пишет: > > Ok, here is v4. > > And here is v5. > > First, there was compilation error in Assert in dynahash.c . > Excuse me for not checking before sending previous version. > > Second, I add third commit that reduces HASHHDR allocation > size for non-partitioned dynahash: > - moved freeList to last position > - alloc and memset offset(HASHHDR, freeList[1]) for > non-partitioned hash tables. > I didn't benchmarked it, but I will be surprised if it > matters much in performance sence. > > Third, I put all three commits into single file to not > confuse commitfest application. Thanks! I looked into dynahash part. struct HASHHDR { - /* -* The freelist can become a point of contention in high-concurrency hash Why did you move around the freeList? - longnentries; /* number of entries in associated buckets */ + longnfree; /* number of free entries in the list */ + longnalloced; /* number of entries initially allocated for Why do we need nfree? HASH_ASSING should do the same thing with HASH_REMOVE. Maybe the reason is the code tries to put the detached bucket to different free list, but we can just remember the freelist_idx for the detached bucket as we do for hashp. I think that should largely reduce the footprint of this patch. -static void hdefault(HTAB *hashp); +static void hdefault(HTAB *hashp, bool partitioned); That optimization may work even a bit, but it is not irrelevant to this patch? + case HASH_REUSE: + if (currBucket != NULL) + { + /* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */ + Assert(DynaHashReuse.hashp == NULL); + Assert(DynaHashReuse.element == NULL); I think all cases in the switch(action) other than HASH_ASSIGN needs this assertion and no need for checking both, maybe only for element would be enough. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi wrote in > Thanks! I looked into dynahash part. > > struct HASHHDR > { > - /* > - * The freelist can become a point of contention in high-concurrency > hash > > Why did you move around the freeList? > > > - longnentries; /* number of entries in > associated buckets */ > + longnfree; /* number of free entries in > the list */ > + longnalloced; /* number of entries initially > allocated for > > Why do we need nfree? HASH_ASSING should do the same thing with > HASH_REMOVE. Maybe the reason is the code tries to put the detached > bucket to different free list, but we can just remember the > freelist_idx for the detached bucket as we do for hashp. I think that > should largely reduce the footprint of this patch. > > -static void hdefault(HTAB *hashp); > +static void hdefault(HTAB *hashp, bool partitioned); > > That optimization may work even a bit, but it is not irrelevant to > this patch? > > + case HASH_REUSE: > + if (currBucket != NULL) > + { > + /* check there is no unfinished > HASH_REUSE+HASH_ASSIGN pair */ > + Assert(DynaHashReuse.hashp == NULL); > + Assert(DynaHashReuse.element == NULL); > > I think all cases in the switch(action) other than HASH_ASSIGN needs > this assertion and no need for checking both, maybe only for element > would be enough. While I looked buf_table part, I came up with additional comments. BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id) { hash_search_with_hash_value(SharedBufHash, HASH_ASSIGN, ... BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse) BufTableDelete considers both reuse and !reuse cases but BufTableInsert doesn't and always does HASH_ASSIGN. That looks odd. We should use HASH_ENTER here. Thus I think it is more reasonable that HASH_ENTRY uses the stashed entry if exists and needed, or returns it to freelist if exists but not needed. What do you think about this? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi wrote in > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi > wrote in > > Thanks! I looked into dynahash part. Then I looked into bufmgr part. It looks fine to me but I have some comments on code comments. >* To change the association of a valid buffer, we'll need to > have >* exclusive lock on both the old and new mapping partitions. > if (oldFlags & BM_TAG_VALID) We don't take lock on the new mapping partition here. +* Clear out the buffer's tag and flags. We must do this to ensure that +* linear scans of the buffer array don't think the buffer is valid. We +* also reset the usage_count since any recency of use of the old content +* is no longer relevant. +* +* We are single pinner, we hold buffer header lock and exclusive +* partition lock (if tag is valid). Given these statements it is safe to +* clear tag since no other process can inspect it to the moment. This comment is a merger of the comments from InvalidateBuffer and BufferAlloc. But I think what we need to explain here is why we invalidate the buffer here despite of we are going to reuse it soon. And I think we need to state that the old buffer is now safe to use for the new tag here. I'm not sure the statement is really correct but clearing-out actually looks like safer. > Now it is safe to use victim buffer for new tag. Invalidate the > buffer before releasing header lock to ensure that linear scans of > the buffer array don't think the buffer is valid. It is safe > because it is guaranteed that we're the single pinner of the buffer. > That pin also prevents the buffer from being stolen by others until > we reuse it or return it to freelist. So I want to revise the following comment. -* Now it is safe to use victim buffer for new tag. +* Now reuse victim buffer for new tag. >* Make sure BM_PERMANENT is set for buffers that must be written at > every >* checkpoint. Unlogged buffers only need to be written at shutdown >* checkpoints, except for their "init" forks, which need to be treated >* just like permanent relations. >* >* The usage_count starts out at 1 so that the buffer can survive one >* clock-sweep pass. But if you think the current commet is fine, I don't insist on the comment chagnes. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Fri, 11 Mar 2022 11:30:27 +0300, Yura Sokolov wrote in > В Пт, 11/03/2022 в 15:30 +0900, Kyotaro Horiguchi пишет: > > At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov > > wrote in > > > В Вт, 01/03/2022 в 10:24 +0300, Yura Sokolov пишет: > > > > Ok, here is v4. > > > > > > And here is v5. > > > > > > First, there was compilation error in Assert in dynahash.c . > > > Excuse me for not checking before sending previous version. > > > > > > Second, I add third commit that reduces HASHHDR allocation > > > size for non-partitioned dynahash: > > > - moved freeList to last position > > > - alloc and memset offset(HASHHDR, freeList[1]) for > > > non-partitioned hash tables. > > > I didn't benchmarked it, but I will be surprised if it > > > matters much in performance sence. > > > > > > Third, I put all three commits into single file to not > > > confuse commitfest application. > > > > Thanks! I looked into dynahash part. > > > > struct HASHHDR > > { > > - /* > > -* The freelist can become a point of contention in > > high-concurrency hash > > > > Why did you move around the freeList? > > > > > > - longnentries; /* number of entries in > > associated buckets */ > > + longnfree; /* number of free entries > > in the list */ > > + longnalloced; /* number of entries > > initially allocated for > > > > Why do we need nfree? HASH_ASSING should do the same thing with > > HASH_REMOVE. Maybe the reason is the code tries to put the detached > > bucket to different free list, but we can just remember the > > freelist_idx for the detached bucket as we do for hashp. I think that > > should largely reduce the footprint of this patch. > > If we keep nentries, then we need to fix nentries in both old > "freeList" partition and new one. It is two freeList[partition]->mutex > lock+unlock pairs. > > But count of free elements doesn't change, so if we change nentries > to nfree, then no need to fix freeList[partition]->nfree counters, > no need to lock+unlock. Ah, okay. I missed that bucket reuse chages key in most cases. But still I don't think its good to move entries around partition freelists for another reason. I'm afraid that the freelists get into imbalanced state. get_hash_entry prefers main shmem allocation than other freelist so that could lead to freelist bloat, or worse contension than the traditinal way involving more than two partitions. I'll examine the possibility to resolve this... regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Fri, 11 Mar 2022 12:34:32 +0300, Yura Sokolov wrote in > В Пт, 11/03/2022 в 15:49 +0900, Kyotaro Horiguchi пишет: > > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi > > > BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool > > reuse) > > > > BufTableDelete considers both reuse and !reuse cases but > > BufTableInsert doesn't and always does HASH_ASSIGN. That looks > > odd. We should use HASH_ENTER here. Thus I think it is more > > reasonable that HASH_ENTRY uses the stashed entry if exists and > > needed, or returns it to freelist if exists but not needed. > > > > What do you think about this? > > Well... I don't like it but I don't mind either. > > Code in HASH_ENTER and HASH_ASSIGN cases differs much. > On the other hand, probably it is possible to merge it carefuly. > I'll try. Honestly, I'm not sure it wins on performance basis. It just came from interface consistency (mmm. a bit different, maybe.. convincibility?). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
At Fri, 11 Mar 2022 15:39:13 -0500, Robert Haas wrote in > On Thu, Mar 10, 2022 at 9:38 PM Kyotaro Horiguchi > wrote: > > It seems to me too rigorous that pg_get_wal_records_info/stats() > > reject future LSNs as end-LSN and I think WARNING or INFO and stop at > > the real end-of-WAL is more kind to users. I think the same with the > > restriction that start and end LSN are required to be different. > > In his review just yesterday, Jeff suggested this: "Don't issue > WARNINGs or other messages for ordinary situations, like when > pg_get_wal_records_info() hits the end of WAL." I think he's entirely > right, and I don't think any patch that does otherwise should get It depends on what we think is the "ordinary" here. If we don't expect that specified LSN range is not filled-out, the case above is ordinary and no need for any WARNING nor INFO. I'm fine with that definition here. > committed. It is worth remembering that the results of queries are > often examined by something other than a human being sitting at a psql > terminal. Any tool that uses this is going to want to understand what > happened from the result set, not by parsing strings that may show up > inside warning messages. Right. I don't think it is right that WARNING is required to evaluate the result. And I think that the WARNING like 'reached end-of-wal before end LSN' is a kind that is not required in evaluation of the result. Since each WAL row contains at least start LSN. > I think that the right answer here is to have a function that returns > one row per record parsed, and each row should also include the start > and end LSN of the record. If for some reason the WAL records return > start after the specified start LSN (e.g. because we skip over a page > header) or end before the specified end LSN (e.g. because we reach > end-of-WAL) the user can figure it out from looking at the LSNs in the > output rows and comparing them to the LSNs provided as input. I agree with you here. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Allow async standbys wait for sync replication
At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart wrote in > On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote: > > To me it's architecturally the completely wrong direction. We should move in > > the *other* direction, i.e. allow WAL to be sent to standbys before the > > primary has finished flushing it locally. Which requires similar > > infrastructure to what we're discussing here. > > I think this is a good point. After all, WALRead() has the following > comment: > > * XXX probably this should be improved to suck data directly from the > * WAL buffers when possible. > > Once you have all the infrastructure for that, holding back WAL replay on > async standbys based on synchronous replication might be relatively easy. That is, (as my understanding) async standbys are required to allow overwriting existing unreplayed records after reconnection. But, putting aside how to remember that LSN, if that happens at a segment boundary, the async replica may run into the similar situation with the missing-contrecord case. But standby cannot insert any original record to get out from that situation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Allow async standbys wait for sync replication
At Mon, 14 Mar 2022 11:30:02 +0900 (JST), Kyotaro Horiguchi wrote in > At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart > wrote in > > On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote: > > > To me it's architecturally the completely wrong direction. We should move > > > in > > > the *other* direction, i.e. allow WAL to be sent to standbys before the > > > primary has finished flushing it locally. Which requires similar > > > infrastructure to what we're discussing here. > > > > I think this is a good point. After all, WALRead() has the following > > comment: > > > > * XXX probably this should be improved to suck data directly from the > > * WAL buffers when possible. > > > > Once you have all the infrastructure for that, holding back WAL replay on > > async standbys based on synchronous replication might be relatively easy. Just to make sure and a bit off from the topic, I think the optimization itself is quite promising and want to have. > That is, (as my understanding) async standbys are required to allow > overwriting existing unreplayed records after reconnection. But, > putting aside how to remember that LSN, if that happens at a segment > boundary, the async replica may run into the similar situation with > the missing-contrecord case. But standby cannot insert any original > record to get out from that situation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Mon, 14 Mar 2022 09:39:48 +0900 (JST), Kyotaro Horiguchi wrote in > I'll examine the possibility to resolve this... The existence of nfree and nalloc made me confused and I found the reason. In the case where a parittion collects many REUSE-ASSIGN-REMOVEed elemetns from other paritiotns, nfree gets larger than nalloced. This is a strange point of the two counters. nalloced is only referred to as (sum(nalloced[])). So we don't need nalloced per-partition basis and the formula to calculate the number of used elements would be as follows. sum(nalloced - nfree) = - sum(nfree) We rarely create fresh elements in shared hashes so I don't think there's additional contention on the even if it were a global atomic. So, the remaining issue is the possible imbalancement among partitions. On second thought, by the current way, if there's a bad deviation in partition-usage, a heavily hit partition finally collects elements via get_hash_entry(). By the patch's way, similar thing happens via the REUSE-ASSIGN-REMOVE sequence. But buffers once used for something won't be freed until buffer invalidation. But bulk buffer invalidation won't deviatedly distribute freed buffers among partitions. So I conclude for now that is a non-issue. So my opinion on the counters is: I'd like to ask you to remove nalloced from partitions then add a global atomic for the same use? No need to do something for the possible deviation issue. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH] Accept IP addresses in server certificate SANs
At Thu, 17 Feb 2022 17:29:15 +, Jacob Champion wrote in > On Tue, 2022-02-15 at 15:16 +0900, Kyotaro Horiguchi wrote: > > (This needs rebasing) > > Done in v6, attached. Thanks! > > # I forgot to mention that, the test fails for me even without the > > # change. I didn't checked what is wrong there, though. > > Ah. We should probably figure that out, then -- what failures do you > see? I forgot the detail but v6 still fails for me. I think it is that. t/003_sslinfo.pl ... 1/? # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 29 just after 6. t/003_sslinfo.pl ... Dubious, test returned 29 (wstat 7424, 0x1d00) All 6 subtests passed ... Result: FAIL The script complains like this: ok 6 - ssl_client_cert_present() for connection with cert connection error: 'psql: error: connection to server at "127.0.0.1", port 62656 failed: SSL error: tlsv1 alert unknown ca' while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt sslmode=require dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser host=localhost -f - -v ON_ERROR_STOP=1' at /home/horiguti/work/worktrees/ipsan/src/test/ssl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 1873. So, psql looks like disliking the ca certificate. I also will dig into that. > > Mmm. after the end of the loop, rc is non-zero only when the loop has > > exited by the break and otherwise rc is zero. Why isn't it equivalent > > to setting check_cn to false at the break? > > check_cn can be false if rc is zero, too; it means that we found a SAN > of the correct type but it didn't match. Don't we count unmatched certs as "existed"? In that case I think we don't go to CN. > > Anyway, apart from that detail, I reconfirmed the spec the patch is > > going to implement. > > > > * If connhost contains a DNS name, and the certificate's SANs contain any > > * dNSName entries, then we'll ignore the Subject Common Name entirely; > > * otherwise, we fall back to checking the CN. (This behavior matches the > > * RFC.) > > > > Sure. > > > > * If connhost contains an IP address, and the SANs contain iPAddress > > * entries, we again ignore the CN. Otherwise, we allow the CN to match, > > * EVEN IF there is a dNSName in the SANs. (RFC 6125 prohibits this: "A > > * client MUST NOT seek a match for a reference identifier of CN-ID if the > > * presented identifiers include a DNS-ID, SRV-ID, URI-ID, or any > > * application-specific identifier types supported by the client.") > > > > Actually the patch searches for a match of IP address connhost from > > dNSName SANs even if iPAddress SANs exist. I think we've not > > explicitly defined thebehavior in that case. > > That's a good point; I didn't change the prior behavior. I feel more > comfortable leaving that check, since it is technically possible to > push something that looks like an IP address into a dNSName SAN. We > should probably make an explicit decision on that, as you say. > > But I don't think that contradicts the code comment, does it? The > comment is just talking about CN fallback scenarios. If you find a > match in a dNSName, there's no reason to fall back to the CN. The comment explains the spec correctly. From a practical view, the behavior above doesn't seem to make things insecure. So I don't have a strong opinion on the choice of the behaviors. The only thing I'm concerned here is the possibility that the decision corners us to some uncomfortable state between the RFC and our spec in future. On the other hand, changing the behavior can immediately make someone uncomfortable. So, I'd like to leave it to committers:p > > I supposed that we only > > be deviant in the case "IP address connhost and no SANs of any type > > exists". What do you think about it? > > We fall back in the case of "IP address connhost and dNSName SANs > exist", which is prohibited by that part of RFC 6125. I don't think we > deviate in the case you described; can you explain further? In that case, i.e., connhost is IP address and no SANs exist at all, we go to CN. On the other hand in RFC6125: rfc6125> In some cases, the URI is specified as an IP address rather rfc6125> than a hostname. In this case, the iPAddress subjectAltName rfc6125> must be present in the certificate and must exactly match the rfc6125> IP in the URI. It (seems to me) denies that behavior. Regardless of the existence of other types of SANs, iPAddress is required if connname is an IP address. (That is, it doesn't seem to me that there's any contex
Re: BufferAlloc: don't take two simultaneous locks
At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov wrote in > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет: > > I'd like to ask you to remove nalloced from partitions then add a > > global atomic for the same use? > > I really believe it should be global. I made it per-partition to > not overcomplicate first versions. Glad you tell it. > > I thought to protect it with freeList[0].mutex, but probably atomic > is better idea here. But which atomic to chose: uint64 or uint32? > Based on sizeof(long)? > Ok, I'll do in next version. Current nentries is a long (= int64 on CentOS). And uint32 can support roughly 2^32 * 8192 = 32TB shared buffers, which doesn't seem safe enough. So it would be uint64. > Whole get_hash_entry look strange. > Doesn't it better to cycle through partitions and only then go to > get_hash_entry? > May be there should be bitmap for non-empty free lists? 32bit for > 32 partitions. But wouldn't bitmap became contention point itself? The code puts significance on avoiding contention caused by visiting freelists of other partitions. And perhaps thinks that freelist shortage rarely happen. I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on 128kB shared buffers and I saw that get_hash_entry never takes the !element_alloc() path and always allocate a fresh entry, then saturates at 30 new elements allocated at the medium of a 100 seconds run. Then, I tried the same with the patch, and I am surprized to see that the rise of the number of newly allocated elements didn't stop and went up to 511 elements after the 100 seconds run. So I found that my concern was valid. The change in dynahash actually continuously/repeatedly causes lack of free list entries. I'm not sure how much the impact given on performance if we change get_hash_entry to prefer other freelists, though. By the way, there's the following comment in StrategyInitalize. >* Initialize the shared buffer lookup hashtable. >* >* Since we can't tolerate running out of lookup table entries, we must > be >* sure to specify an adequate table size here. The maximum > steady-state >* usage is of course NBuffers entries, but BufferAlloc() tries to > insert >* a new entry before deleting the old. In principle this could be >* happening in each partition concurrently, so we could need as many as >* NBuffers + NUM_BUFFER_PARTITIONS entries. >*/ > InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS); "but BufferAlloc() tries to insert a new entry before deleting the old." gets false by this patch but still need that additional room for stashed entries. It seems like needing a fix. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Mon, 14 Mar 2022 17:12:48 +0900 (JST), Kyotaro Horiguchi wrote in > Then, I tried the same with the patch, and I am surprized to see that > the rise of the number of newly allocated elements didn't stop and > went up to 511 elements after the 100 seconds run. So I found that my > concern was valid. Which means my last decision was wrong with high odds.. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add 64-bit XIDs into PostgreSQL 15
At Mon, 14 Mar 2022 19:43:40 +0400, Pavel Borisov wrote in > > I'd like to add a few quick notes on what's been done in v17. I have some commens by a quick look-through. Apologize in advance for wrong comments from the lack of the knowledge of the whole patch-set. > > Patches 0001 and 0002 that are planned to be committed to PG15 are almost > > unchanged with the exception of one unnecessary cast in 0002 removed. 0001: The XID_FMT has quite bad impact on the translatability of error messages. 3286065651 has removed INT64_FORMAT from translatable texts for the reason. This re-introduces that in several places. 0001 itself does not harm but 0005 replaces XID_FMT with INT64_FORMAT. Other patches have the same issue, too. > > We've also addressed several issues in patch 0005 (which is planned for > > PG16): > > - The bug with frozen xids after pg_upgrade, reported by Justin [1] > > - Added proper processing of double xmax pages in > > HeapPageSetPruneXidInternal() > > - Fixed xids comparison. Initially in the patch it was changed to simple < > > <= => > for 64 bit values. Now v17 patch has returned this to the way > > similar to what is used in STABLE for 32-bit xids, but using modulus-64 > > numeric ring. The main goal of this change was to fix SRLU tests that were > > mentioned If IIUC, the following part in 0002 doesn't consider wraparound. -asyncQueuePagePrecedes(int p, int q) +asyncQueuePagePrecedes(int64 p, int64 q) { - return asyncQueuePageDiff(p, q) < 0; + return p < q; } > > by Alexander to have been disabled. We've fixed and enabled most of them, > > but some of them are still need to be fixed and enabled. > > > > Also, we've pgindent-ed all the patches. 0005 has "new blank line at EOF". > > As patches that are planned to be delivered to PG15 are almost unchanged, > > I completely agree with Alexander's plan to consider these patches (0001 > > and 0002) as RfC. > > > > All activity, improvement, review, etc. related to the whole patchset is > > also very much appreciated. Big thanks to Alexander for working on the > > patch set! > > > > [1] > > https://www.postgresql.org/message-id/20220115063925.GS14051%40telsasoft.com > > > Also, the patch v17 (0005) returns SLRU_PAGES_PER_SEGMENT to the previous > value of 32. 0002 re-introduces pg_strtouint64, which have been recently removed by 3c6f8c011f. > Simplify the general-purpose 64-bit integer parsing APIs > > pg_strtouint64() is a wrapper around strtoull/strtoul/_strtoui64, but > it seems no longer necessary to have this indirection. .. >type definition int64/uint64. For that, add new macros strtoi64() and >strtou64() in c.h as thin wrappers around strtol()/strtoul() or >strtoll()/stroull(). This makes these functions available everywhere regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
At Mon, 14 Mar 2022 17:37:40 -0400, Robert Haas wrote in > On Mon, Mar 7, 2022 at 3:39 AM Kyotaro Horiguchi > wrote: > > So the new framework has been dropped in this version. > > The second test is removed as it is irrelevant to this bug. > > > > In this version the patch is a single file that contains the test. > > The status of this patch in the CommitFest was set to "Waiting for > Author." Since a new patch has been submitted since that status was > set, I have changed it to "Needs Review." Since this is now in its > 15th CommitFest, we really should get it fixed; that's kind of > ridiculous. (I am as much to blame as anyone.) It does seem to be a > legitimate bug. > > A few questions about the patch: Thanks for looking this! > 1. Why is it OK to just skip the operation without making it up later? Does "it" mean removal of directories? It is not okay, but in the first place it is out-of-scope of this patch to fix that. The patch leaves the existing code alone. This patch just has recovery ignore invalid accesses into eventually removed objects. Maybe, I don't understand you question.. > 2. Why not instead change the code so that the operation can succeed, > by creating the prerequisite parent directories? Do we not have enough > information for that? I'm not saying that we definitely should do it > that way rather than this way, but I think we do take that approach in > some cases. It is proposed first by Paul Guo [1] then changed so that it ignores failed directory creations in the very early stage in this thread. After that, it gets conscious of recovery consistency by managing invalid-access list. [1] https://www.postgresql.org/message-id/flat/20210327142316.GA32517%40alvherre.pgsql#a557bd47207a446ce206879676e0140a I think there was no strong reason for the current shape but I personally rather like the remembering-invalid-access way because it doesn't dirty the data directory and it is consistent with how we treat missing heap pages. I tried a slightly tweaked version (attached) of the first version and confirmed that it works for the current test script. It doesn't check recovery consistency but otherwise that way also seems fine. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index c37e3c9a9a..28aed8d296 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -47,6 +47,7 @@ #include "commands/defrem.h" #include "commands/seclabel.h" #include "commands/tablespace.h" +#include "common/file_perm.h" #include "mb/pg_wchar.h" #include "miscadmin.h" #include "pgstat.h" @@ -2382,6 +2383,7 @@ dbase_redo(XLogReaderState *record) xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record); char *src_path; char *dst_path; + char *parent_path; struct stat st; src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id); @@ -2401,6 +2403,41 @@ dbase_redo(XLogReaderState *record) dst_path))); } + /* +* It is possible that the tablespace was later dropped, but we are +* re-redoing database create before that. In that case, those +* directories are gone, and we do not create symlink. +*/ + if (stat(dst_path, &st) < 0 && errno == ENOENT) + { + parent_path = pstrdup(dst_path); + get_parent_directory(parent_path); + elog(WARNING, "creating missing directory: %s", parent_path); + if (stat(parent_path, &st) != 0 && pg_mkdir_p(parent_path, pg_dir_create_mode) != 0) + { + ereport(WARNING, + (errmsg("can not recursively create directory \"%s\"", + parent_path))); + } + } + + /* +* There's a case where the copy source directory is missing for the +* same reason above. Create the emtpy source directory so that +* copydir below doesn't fail. The directory will be dropped soon by +* recovery. +*/ + if (stat(src_path, &st) < 0 && errno == ENOENT) + { + elog(WARNING, "creating missing copy source directory: %
Re: BufferAlloc: don't take two simultaneous locks
Thanks for the new version. At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov wrote in > В Пн, 14/03/2022 в 14:57 +0300, Yura Sokolov пишет: > > В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет: > > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov > > > wrote in > > > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет: > > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on > > > 128kB shared buffers and I saw that get_hash_entry never takes the > > > !element_alloc() path and always allocate a fresh entry, then > > > saturates at 30 new elements allocated at the medium of a 100 seconds > > > run. > > > > > > Then, I tried the same with the patch, and I am surprized to see that > > > the rise of the number of newly allocated elements didn't stop and > > > went up to 511 elements after the 100 seconds run. So I found that my > > > concern was valid. The change in dynahash actually > > > continuously/repeatedly causes lack of free list entries. I'm not > > > sure how much the impact given on performance if we change > > > get_hash_entry to prefer other freelists, though. > > > > Well, it is quite strange SharedBufHash is not allocated as > > HASH_FIXED_SIZE. Could you check what happens with this flag set? > > I'll try as well. > > > > Other way to reduce observed case is to remember freelist_idx for > > reused entry. I didn't believe it matters much since entries migrated > > netherless, but probably due to some hot buffers there are tention to > > crowd particular freelist. > > Well, I did both. Everything looks ok. Hmm. v8 returns stashed element with original patition index when the element is *not* reused. But what I saw in the previous test runs is the REUSE->ENTER(reuse)(->REMOVE) case. So the new version looks like behaving the same way (or somehow even worse) with the previous version. get_hash_entry continuously suffer lack of freelist entry. (FWIW, attached are the test-output diff for both master and patched) master finally allocated 31 fresh elements for a 100s run. > ALLOCED: 31;; freshly allocated v8 finally borrowed 33620 times from another freelist and 0 freshly allocated (ah, this version changes that..) Finally v8 results in: > RETURNED: 50806;; returned stashed elements > BORROWED: 33620;; borrowed from another freelist > REUSED: 1812664;; stashed > ASSIGNED: 1762377 ;; reused >(ALLOCED: 0);; freshly allocated It contains a huge degradation by frequent elog's so they cannot be naively relied on, but it should show what is happening sufficiently. > I lost access to Xeon 8354H, so returned to old Xeon X5675. ... > Strange thing: both master and patched version has higher > peak tps at X5676 at medium connections (17 or 27 clients) > than in first october version [1]. But lower tps at higher > connections number (>= 191 clients). > I'll try to bisect on master this unfortunate change. The reversing of the preference order between freshly-allocation and borrow-from-another-freelist might affect. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c index dc439940fa..ac651b98e6 100644 --- a/src/backend/storage/buffer/buf_table.c +++ b/src/backend/storage/buffer/buf_table.c @@ -31,7 +31,7 @@ typedef struct int id; /* Associated buffer ID */ } BufferLookupEnt; -static HTAB *SharedBufHash; +HTAB *SharedBufHash; /* diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 3babde8d70..294516ef01 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -195,6 +195,11 @@ struct HASHHDR longssize; /* segment size --- must be power of 2 */ int sshift; /* segment shift = log2(ssize) */ int nelem_alloc;/* number of entries to allocate at once */ + int alloc; + int reuse; + int borrow; + int assign; + int ret; #ifdef HASH_STATISTICS @@ -963,6 +968,7 @@ hash_search(HTAB *hashp, foundPtr); } +extern HTAB *SharedBufHash; void * hash_search_with_hash_value(HTAB *hashp, const void *keyPtr, @@ -1354,6 +1360,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx) hctl->freeList[freelist_idx].nentries++; SpinLockRelease(&hctl->freeL
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Tue, 15 Mar 2022 12:19:47 +0530, Bharath Rupireddy wrote in > On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi > wrote: > 0001 - I don't think you need to do this as UpdateControlFile > (update_controlfile) will anyway update it, no? > + /* Update control file using current time */ > + ControlFile->time = (pg_time_t) time(NULL); Ugh.. Yes. It is a copy-pasto from older versions. They may have the same copy-pasto.. > > 0002: Add REDO/Checkpiont LSNs to checkpoinkt-end log message. > > (The main patch in this thread) > > 0002 - If at all the intention is to say that no ControlFileLock is > required while reading ControlFile->checkPoint and > ControlFile->checkPointCopy.redo, let's say it, no? How about > something like "No ControlFileLock is required while reading > ControlFile->checkPoint and ControlFile->checkPointCopy.redo as there > can't be any other process updating them concurrently."? > > + /* we are the only updator of these variables */ > + LSN_FORMAT_ARGS(ControlFile->checkPoint), > + LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo; I thought the comment explains that. But it would be better to be more specific. It is changed as the follows. > * ControlFileLock is not required as we are the only > * updator of these variables. > > 0003: Replace (WAL-)location to LSN in pg_controldata. > > > > 0004: Replace (WAL-)location to LSN in user-facing texts. > > (This doesn't reflect my recent comments.) > > If you don't mind, can you please put the comments here? Okay. It's the following message. https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com > The old 0003 (attached 0004): > > > > +++ b/src/backend/access/rmgrdesc/xlogdesc.c > - appendStringInfo(buf, "redo %X/%X; " > + appendStringInfo(buf, "redo lsn %X/%X; " > > > > It is shown in the context of a checkpoint record, so I think it is > not needed or rather lengthning the dump line uselessly. > > > > +++ b/src/backend/access/transam/xlog.c > - (errmsg("request to flush past end of generated > WAL; request %X/%X, current position %X/%X", > + (errmsg("request to flush past end of generated > WAL; request lsn %X/%X, current lsn %X/%X", > > > > +++ b/src/backend/replication/walsender.c > - (errmsg("requested starting point %X/%X > is ahead of the WAL flush position of this server %X/%X", > + (errmsg("requested starting point %X/%X > is ahead of the WAL flush LSN of this server %X/%X", > > > > "WAL" is upper-cased. So it seems rather strange that the "lsn" is > lower-cased. In the first place the message doesn't look like a > user-facing error message and I feel we don't need position or lsn > there.. > > > > +++ b/src/bin/pg_rewind/pg_rewind.c > - pg_log_info("servers diverged at WAL location %X/%X on timeline > %u", > + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u", > > > > I feel that we don't need "WAL" there. > > > > +++ b/src/bin/pg_waldump/pg_waldump.c > - printf(_(" -e, --end=RECPTR stop reading at WAL location > RECPTR\n")); > + printf(_(" -e, --end=RECPTR stop reading at WAL LSN RECPTR\n")); > > > > Mmm.. "WAL LSN RECPTR" looks strange to me. In the first place I > don't think "RECPTR" is a user-facing term. Doesn't something like the > follows work? > > > > + printf(_(" -e, --end=WAL-LSN stop reading at WAL-LSN\n")); > > > > In some changes in this patch shorten the main message text of > fprintf-ish functions. That makes the succeeding parameters can be > inlined. > > 0005: Unhyphenate the word archive-recovery and similars. > > 0005 - How about replacing "crash-recovery" to "crash recovery" in > postgres-ref.sgml too? Oh, that's a left-over. Fixed. Thanks! regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 37f7433dce1ea9e41aa54f2c18c0bf3d2aa3c814 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 4 Mar 2022 13:18:30 +0900 Subject: [PATCH v12 1/5] Correctly update contfol file at the end of archive recovery CreateRestartPoint runs WAL file cleanup basing on the checkpoint just have finished in the function. If the database has exited DB_IN_ARCHIVE_RECOVERY state when the f
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Tue, 15 Mar 2022 18:26:26 +0900, Michael Paquier wrote in > On Tue, Mar 15, 2022 at 05:23:40PM +0900, Kyotaro Horiguchi wrote: > > At Tue, 15 Mar 2022 12:19:47 +0530, Bharath Rupireddy > > wrote in > >> On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi > >> wrote: > >> 0001 - I don't think you need to do this as UpdateControlFile > >> (update_controlfile) will anyway update it, no? > >> + /* Update control file using current time */ > >> + ControlFile->time = (pg_time_t) time(NULL); > > > > Ugh.. Yes. It is a copy-pasto from older versions. They may have the > > same copy-pasto.. > > This thread has shifted to an entirely different discussion, > presenting patches that touch code paths unrelated to what was first > stated. Shouldn't you create a new thread with a proper $subject to > attract a more correct audience? I felt the same since some messages ago. I thought Fujii-san thought that he wants to fix the CreateRestartPoint issue before the checkpoint log patch but he looks like busy these days. Since the CreateRestartPoint issue is orthogonal to the main patch, I'll separate that part into anther thread. This thread is discussing one other topic, wordings in user-facing texts. This is also orthogonal (3-dimentionally?) to the two topics. In short, I split out the two topics other than checkpoint log to other threads. Thanks for cueing me to do that! regareds. -- Kyotaro Horiguchi NTT Open Source Software Center
Possible corruption by CreateRestartPoint at promotion
Hello, (Cc:ed Fujii-san) This is a diverged topic from [1], which is summarized as $SUBJECT. To recap: While discussing on additional LSNs in checkpoint log message, Fujii-san pointed out [2] that there is a case where CreateRestartPoint leaves unrecoverable database when concurrent promotion happens. That corruption is "fixed" by the next checkpoint so it is not a severe corruption. AFAICS since 9.5, no check(/restart)pionts won't run concurrently with restartpoint [3]. So I propose to remove the code path as attached. regards. [1] https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com [2] https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com [3] https://www.postgresql.org/message-id/20220222.174401.765586897814316743.horikyota.ntt%40gmail.com -- Kyotaro Horiguchi NTT Open Source Software Center >From 57823c5690469cf404de9fbdf37f55d09abaedd5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 4 Mar 2022 13:18:30 +0900 Subject: [PATCH v4] Correctly update contfol file at the end of archive recovery CreateRestartPoint runs WAL file cleanup basing on the checkpoint just have finished in the function. If the database has exited DB_IN_ARCHIVE_RECOVERY state when the function is going to update control file, the function refrains from updating the file at all then proceeds to WAL cleanup having the latest REDO LSN, which is now inconsistent with the control file. As the result, the succeeding cleanup procedure overly removes WAL files against the control file and leaves unrecoverable database until the next checkpoint finishes. Along with that fix, we remove a dead code path for the case some other process ran a simultaneous checkpoint. It seems like just a preventive measure but it's no longer useful because we are sure that checkpoint is performed only by checkpointer except single process mode. --- src/backend/access/transam/xlog.c | 69 --- 1 file changed, 44 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ed16f279b1..dd9c564988 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6900,6 +6900,9 @@ CreateRestartPoint(int flags) XLogSegNo _logSegNo; TimestampTz xtime; + /* we don't assume concurrent checkpoint/restartpoint to run */ + Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER); + /* Get a local copy of the last safe checkpoint record. */ SpinLockAcquire(&XLogCtl->info_lck); lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr; @@ -6965,7 +6968,7 @@ CreateRestartPoint(int flags) /* Also update the info_lck-protected copy */ SpinLockAcquire(&XLogCtl->info_lck); - XLogCtl->RedoRecPtr = lastCheckPoint.redo; + XLogCtl->RedoRecPtr = RedoRecPtr; SpinLockRelease(&XLogCtl->info_lck); /* @@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags) /* Update the process title */ update_checkpoint_display(flags, true, false); - CheckPointGuts(lastCheckPoint.redo, flags); + CheckPointGuts(RedoRecPtr, flags); + + /* Update pg_control */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); /* * Remember the prior checkpoint's redo ptr for @@ -6992,30 +6998,26 @@ CreateRestartPoint(int flags) */ PriorRedoPtr = ControlFile->checkPointCopy.redo; + Assert (PriorRedoPtr < RedoRecPtr); + + ControlFile->checkPoint = lastCheckPointRecPtr; + ControlFile->checkPointCopy = lastCheckPoint; + /* - * Update pg_control, using current time. Check that it still shows - * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing; - * this is a quick hack to make sure nothing really bad happens if somehow - * we get here after the end-of-recovery checkpoint. + * Ensure minRecoveryPoint is past the checkpoint record while archive + * recovery is still ongoing. Normally, this will have happened already + * while writing out dirty buffers, but not necessarily - e.g. because no + * buffers were dirtied. We do this because a non-exclusive base backup + * uses minRecoveryPoint to determine which WAL files must be included in + * the backup, and the file (or files) containing the checkpoint record + * must be included, at a minimum. Note that for an ordinary restart of + * recovery there's no value in having the minimum recovery point any + * earlier than this anyway, because redo will begin just after the + * checkpoint record. This is a quick hack to make sure nothing really bad + * happens if somehow we get here after the end-of-recovery checkpoint. */ - LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) + if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY) { - ControlFile->checkPoint = lastCheckPoin
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
Hello, this is a follow-up topic of [1] (add LSNs to checkpint logs). Many user-facing texts contains wording like "WAL location" or such like. The attached is WIP patches that fixes such wordings to "WAL LSN" or alikes. The attached patches is v13 but it is not changed at all from v12. My lastest comments on this version are as follows. https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com > The old 0003 (attached 0004): > > > > +++ b/src/backend/access/rmgrdesc/xlogdesc.c > - appendStringInfo(buf, "redo %X/%X; " > + appendStringInfo(buf, "redo lsn %X/%X; " > > > > It is shown in the context of a checkpoint record, so I think it is > not needed or rather lengthning the dump line uselessly. > > > > +++ b/src/backend/access/transam/xlog.c > - (errmsg("request to flush past end of generated > WAL; request %X/%X, current position %X/%X", > + (errmsg("request to flush past end of generated > WAL; request lsn %X/%X, current lsn %X/%X", > > > > +++ b/src/backend/replication/walsender.c > - (errmsg("requested starting point %X/%X > is ahead of the WAL flush position of this server %X/%X", > + (errmsg("requested starting point %X/%X > is ahead of the WAL flush LSN of this server %X/%X", > > > > "WAL" is upper-cased. So it seems rather strange that the "lsn" is > lower-cased. In the first place the message doesn't look like a > user-facing error message and I feel we don't need position or lsn > there.. > > > > +++ b/src/bin/pg_rewind/pg_rewind.c > - pg_log_info("servers diverged at WAL location %X/%X on timeline > %u", > + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u", > > > > I feel that we don't need "WAL" there. > > > > +++ b/src/bin/pg_waldump/pg_waldump.c > - printf(_(" -e, --end=RECPTR stop reading at WAL location > RECPTR\n")); > + printf(_(" -e, --end=RECPTR stop reading at WAL LSN RECPTR\n")); > > > > Mmm.. "WAL LSN RECPTR" looks strange to me. In the first place I > don't think "RECPTR" is a user-facing term. Doesn't something like the > follows work? > > > > + printf(_(" -e, --end=WAL-LSN stop reading at WAL-LSN\n")); > > > > In some changes in this patch shorten the main message text of > fprintf-ish functions. That makes the succeeding parameters can be > inlined. regards. [1] https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com -- Kyotaro Horiguchi NTT Open Source Software Center >From afed063f09b788bc07a0e2665499c3e3c3edc214 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Tue, 1 Feb 2022 03:57:04 + Subject: [PATCH v13 1/2] Change "location" to "lsn" in pg_controldata --- src/bin/pg_controldata/pg_controldata.c| 10 +- src/test/recovery/t/016_min_consistency.pl | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c index f911f98d94..59f39267df 100644 --- a/src/bin/pg_controldata/pg_controldata.c +++ b/src/bin/pg_controldata/pg_controldata.c @@ -235,9 +235,9 @@ main(int argc, char *argv[]) dbState(ControlFile->state)); printf(_("pg_control last modified: %s\n"), pgctime_str); - printf(_("Latest checkpoint location: %X/%X\n"), + printf(_("Latest checkpoint LSN:%X/%X\n"), LSN_FORMAT_ARGS(ControlFile->checkPoint)); - printf(_("Latest checkpoint's REDO location:%X/%X\n"), + printf(_("Latest checkpoint's REDO LSN: %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo)); printf(_("Latest checkpoint's REDO WAL file:%s\n"), xlogfilename); @@ -274,13 +274,13 @@ main(int argc, char *argv[]) ckpttime_str); printf(_("Fake LSN counter for unlogged rels: %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->unloggedLSN)); - printf(_("Minimum recovery ending location: %X/%X\n"), + printf(_("Minimum recovery ending LSN: %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint)); printf(_("Min recovery ending loc's timeline: %u\n"), ControlFile->minRecoveryPointTLI); - printf(_("Backup start location:%X/%X\n"), + printf(_(
Unhyphenation of crash-recovery
Hello, this is a derived topic from [1], summarized as $SUBJECT. This just removes useless hyphens from the words "(crash|emergency)-recovery". We don't have such wordings for "archive recovery" This patch fixes non-user-facing texts as well as user-facing ones. regards. [1] https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com -- Kyotaro Horiguchi NTT Open Source Software Center >From 751acd2510e036a5c484e5768dd9e4e85cd8b2cf Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 4 Mar 2022 13:53:38 +0900 Subject: [PATCH v13] Get rid of uses of some hyphenated words There are use of both "crash-recovery" and "crash recovery" in the tree. All occurances of archive recovery is spelled unhyphenated. Unify the similar uses to unhyphenated. --- doc/src/sgml/ref/pg_ctl-ref.sgml | 2 +- doc/src/sgml/ref/postgres-ref.sgml| 2 +- src/backend/commands/trigger.c| 2 +- src/backend/utils/cache/relcache.c| 2 +- src/test/recovery/t/023_pitr_prepared_xact.pl | 2 +- src/test/regress/parallel_schedule| 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/ref/pg_ctl-ref.sgml b/doc/src/sgml/ref/pg_ctl-ref.sgml index 3946fa52ea..7cbe5c3048 100644 --- a/doc/src/sgml/ref/pg_ctl-ref.sgml +++ b/doc/src/sgml/ref/pg_ctl-ref.sgml @@ -194,7 +194,7 @@ PostgreSQL documentation rolled back and clients are forcibly disconnected, then the server is shut down. Immediate mode will abort all server processes immediately, without a clean shutdown. This choice - will lead to a crash-recovery cycle during the next server start. + will lead to a crash recovery cycle during the next server start. diff --git a/doc/src/sgml/ref/postgres-ref.sgml b/doc/src/sgml/ref/postgres-ref.sgml index 55a3f6c69d..fc22fbc4fe 100644 --- a/doc/src/sgml/ref/postgres-ref.sgml +++ b/doc/src/sgml/ref/postgres-ref.sgml @@ -719,7 +719,7 @@ PostgreSQL documentation is also unwise to send SIGKILL to a server process — the main postgres process will interpret this as a crash and will force all the sibling processes - to quit as part of its standard crash-recovery procedure. + to quit as part of its standard crash recovery procedure. diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c index e08bd9a370..ce170b4c6e 100644 --- a/src/backend/commands/trigger.c +++ b/src/backend/commands/trigger.c @@ -1887,7 +1887,7 @@ RelationBuildTriggers(Relation relation) /* * Note: since we scan the triggers using TriggerRelidNameIndexId, we will * be reading the triggers in name order, except possibly during - * emergency-recovery operations (ie, IgnoreSystemIndexes). This in turn + * emergency recovery operations (ie, IgnoreSystemIndexes). This in turn * ensures that triggers will be fired in name order. */ ScanKeyInit(&skey, diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index fccffce572..2419cf5285 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -6598,7 +6598,7 @@ RelationCacheInitFilePostInvalidate(void) * Remove the init files during postmaster startup. * * We used to keep the init files across restarts, but that is unsafe in PITR - * scenarios, and even in simple crash-recovery cases there are windows for + * scenarios, and even in simple crash recovery cases there are windows for * the init files to become out-of-sync with the database. So now we just * remove them during startup and expect the first backend launch to rebuild * them. Of course, this has to happen in each database of the cluster. diff --git a/src/test/recovery/t/023_pitr_prepared_xact.pl b/src/test/recovery/t/023_pitr_prepared_xact.pl index 39e8a8fa17..0f48c20ed1 100644 --- a/src/test/recovery/t/023_pitr_prepared_xact.pl +++ b/src/test/recovery/t/023_pitr_prepared_xact.pl @@ -1,7 +1,7 @@ # Copyright (c) 2021-2022, PostgreSQL Global Development Group -# Test for point-in-time-recovery (PITR) with prepared transactions +# Test for point-in-time recovery (PITR) with prepared transactions use strict; use warnings; use PostgreSQL::Test::Cluster; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 6d8f524ae9..8251744bbf 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -13,7 +13,7 @@ test: test_setup # run tablespace by itself, and early, because it forces a checkpoint; # we'd prefer not to have checkpoints later in the tests because that -# interferes with crash-recovery testing. +# interferes with crash recovery testing. test: tablespace # -- -- 2.27.0
Reword "WAL location" as "WAL LSN"
Uh.. Sorry, I forgot to change the subject. Resent with the correct subject. = Hello, this is a follow-up topic of [1] (add LSNs to checkpint logs). Many user-facing texts contains wording like "WAL location" or such like. The attached is WIP patches that fixes such wordings to "WAL LSN" or alikes. The attached patches is v13 but it is not changed at all from v12. My lastest comments on this version are as follows. https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com > The old 0003 (attached 0004): > > > > +++ b/src/backend/access/rmgrdesc/xlogdesc.c > - appendStringInfo(buf, "redo %X/%X; " > + appendStringInfo(buf, "redo lsn %X/%X; " > > > > It is shown in the context of a checkpoint record, so I think it is > not needed or rather lengthning the dump line uselessly. > > > > +++ b/src/backend/access/transam/xlog.c > - (errmsg("request to flush past end of generated > WAL; request %X/%X, current position %X/%X", > + (errmsg("request to flush past end of generated > WAL; request lsn %X/%X, current lsn %X/%X", > > > > +++ b/src/backend/replication/walsender.c > - (errmsg("requested starting point %X/%X > is ahead of the WAL flush position of this server %X/%X", > + (errmsg("requested starting point %X/%X > is ahead of the WAL flush LSN of this server %X/%X", > > > > "WAL" is upper-cased. So it seems rather strange that the "lsn" is > lower-cased. In the first place the message doesn't look like a > user-facing error message and I feel we don't need position or lsn > there.. > > > > +++ b/src/bin/pg_rewind/pg_rewind.c > - pg_log_info("servers diverged at WAL location %X/%X on timeline > %u", > + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u", > > > > I feel that we don't need "WAL" there. > > > > +++ b/src/bin/pg_waldump/pg_waldump.c > - printf(_(" -e, --end=RECPTR stop reading at WAL location > RECPTR\n")); > + printf(_(" -e, --end=RECPTR stop reading at WAL LSN RECPTR\n")); > > > > Mmm.. "WAL LSN RECPTR" looks strange to me. In the first place I > don't think "RECPTR" is a user-facing term. Doesn't something like the > follows work? > > > > + printf(_(" -e, --end=WAL-LSN stop reading at WAL-LSN\n")); > > > > In some changes in this patch shorten the main message text of > fprintf-ish functions. That makes the succeeding parameters can be > inlined. regards. [1] https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com -- Kyotaro Horiguchi NTT Open Source Software Center >From afed063f09b788bc07a0e2665499c3e3c3edc214 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Tue, 1 Feb 2022 03:57:04 + Subject: [PATCH v13 1/2] Change "location" to "lsn" in pg_controldata --- src/bin/pg_controldata/pg_controldata.c| 10 +- src/test/recovery/t/016_min_consistency.pl | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c index f911f98d94..59f39267df 100644 --- a/src/bin/pg_controldata/pg_controldata.c +++ b/src/bin/pg_controldata/pg_controldata.c @@ -235,9 +235,9 @@ main(int argc, char *argv[]) dbState(ControlFile->state)); printf(_("pg_control last modified: %s\n"), pgctime_str); - printf(_("Latest checkpoint location: %X/%X\n"), + printf(_("Latest checkpoint LSN:%X/%X\n"), LSN_FORMAT_ARGS(ControlFile->checkPoint)); - printf(_("Latest checkpoint's REDO location:%X/%X\n"), + printf(_("Latest checkpoint's REDO LSN: %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo)); printf(_("Latest checkpoint's REDO WAL file:%s\n"), xlogfilename); @@ -274,13 +274,13 @@ main(int argc, char *argv[]) ckpttime_str); printf(_("Fake LSN counter for unlogged rels: %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->unloggedLSN)); - printf(_("Minimum recovery ending location: %X/%X\n"), + printf(_("Minimum recovery ending LSN: %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint)); printf(_("Min recovery ending loc's timeline: %u\n"), ControlFile->minRecoveryPointTLI
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
At Wed, 16 Mar 2022 09:19:13 +0900 (JST), Kyotaro Horiguchi wrote in > In short, I split out the two topics other than checkpoint log to > other threads. So, this is about the main topic of this thread, adding LSNs to checkpint log. Other topics have moved to other treads [1], [2] , [3]. I think this is no longer controversial alone. So this patch is now really Read-for-Commiter and is waiting to be picked up. regards. [1] https://www.postgresql.org/message-id/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com [2] https://www.postgresql.org/message-id/20220316.102900.2003692961119672246.horikyota.ntt%40gmail.com [3] https://www.postgresql.org/message-id/20220316.102509.785466054344164656.horikyota.ntt%40gmail.com -- Kyotaro Horiguchi NTT Open Source Software Center >From 4c3d40e7aaf29cb905f3561a291d35981d762456 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Fri, 4 Mar 2022 13:22:41 +0900 Subject: [PATCH v13] Add checkpoint and redo LSN to LogCheckpointEnd log message It is useful (for debugging purposes) if the checkpoint end message has the checkpoint LSN(end) and REDO LSN(start). It gives more context while analyzing checkpoint-related issues. The pg_controldata gives the last checkpoint LSN and REDO LSN, but having this info alongside the log message helps analyze issues that happened previously, connect the dots and identify the root cause. --- src/backend/access/transam/xlog.c | 22 ++ 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ed16f279b1..b85c76d8f8 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6121,7 +6121,8 @@ LogCheckpointEnd(bool restartpoint) "%d WAL file(s) added, %d removed, %d recycled; " "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; " "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; " - "distance=%d kB, estimate=%d kB", + "distance=%d kB, estimate=%d kB; " + "lsn=%X/%X, redo lsn=%X/%X", CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, CheckpointStats.ckpt_segs_added, @@ -6134,14 +6135,21 @@ LogCheckpointEnd(bool restartpoint) longest_msecs / 1000, (int) (longest_msecs % 1000), average_msecs / 1000, (int) (average_msecs % 1000), (int) (PrevCheckPointDistance / 1024.0), - (int) (CheckPointDistanceEstimate / 1024.0; + (int) (CheckPointDistanceEstimate / 1024.0), + /* + * ControlFileLock is not required as we are the only + * updator of these variables. + */ + LSN_FORMAT_ARGS(ControlFile->checkPoint), + LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo; else ereport(LOG, (errmsg("checkpoint complete: wrote %d buffers (%.1f%%); " "%d WAL file(s) added, %d removed, %d recycled; " "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; " "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; " - "distance=%d kB, estimate=%d kB", + "distance=%d kB, estimate=%d kB; " + "lsn=%X/%X, redo lsn=%X/%X", CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, CheckpointStats.ckpt_segs_added, @@ -6154,7 +6162,13 @@ LogCheckpointEnd(bool restartpoint) longest_msecs / 1000, (int) (longest_msecs % 1000), average_msecs / 1000, (int) (average_msecs % 1000), (int) (PrevCheckPointDistance / 1024.0), - (int) (CheckPointDistanceEstimate / 1024.0; + (int) (CheckPointDistanceEstimate / 1024.0), + /* + * ControlFileLock is not required as we are the only + * updator of these variables. + */ + LSN_FORMAT_ARGS(ControlFile->checkPoint), + LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo; } /* -- 2.27.0
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Tue, 15 Mar 2022 23:16:52 +1300, Thomas Munro wrote in > On Tue, Mar 15, 2022 at 10:30 PM Michael Paquier wrote: > > So, which one of a relative path or an absolute path do you think > > would be better for the user? My preference tends toward the relative > > path, as we know that all those tablespaces stay in pg_tblspc/ so one > > can make the difference with normal tablespaces more easily. The > > barrier is thin, though :p > > Sounds good to me. +1. Desn't the doc need to mention that? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov wrote in > В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет: > > Hmm. v8 returns stashed element with original patition index when the > > element is *not* reused. But what I saw in the previous test runs is > > the REUSE->ENTER(reuse)(->REMOVE) case. So the new version looks like > > behaving the same way (or somehow even worse) with the previous > > version. > > v8 doesn't differ in REMOVE case neither from master nor from > previous version. It differs in RETURNED case only. > Or I didn't understand what you mean :( In v7, HASH_ENTER returns the element stored in DynaHashReuse using the freelist_idx of the new key. v8 uses that of the old key (at the time of HASH_REUSE). So in the case "REUSE->ENTER(elem exists and returns the stashed)" case the stashed element is returned to its original partition. But it is not what I mentioned. On the other hand, once the stahsed element is reused by HASH_ENTER, it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow from old partition) case. I suspect that ththat the frequent freelist starvation comes from the latter case. > > get_hash_entry continuously suffer lack of freelist > > entry. (FWIW, attached are the test-output diff for both master and > > patched) > > > > master finally allocated 31 fresh elements for a 100s run. > > > > > ALLOCED: 31;; freshly allocated > > > > v8 finally borrowed 33620 times from another freelist and 0 freshly > > allocated (ah, this version changes that..) > > Finally v8 results in: > > > > > RETURNED: 50806;; returned stashed elements > > > BORROWED: 33620;; borrowed from another freelist > > > REUSED: 1812664;; stashed > > > ASSIGNED: 1762377 ;; reused > > >(ALLOCED: 0);; freshly allocated (I misunderstand that v8 modified get_hash_entry's preference between allocation and borrowing.) I re-ran the same check for v7 and it showed different result. RETURNED: 1 ALLOCED: 15 BORROWED: 0 REUSED: 505435 ASSIGNED: 505462 (-27) ## the counters are not locked. > Is there any measurable performance hit cause of borrowing? > Looks like "borrowed" happened in 1.5% of time. And it is on 128kb > shared buffers that is extremely small. (Or it was 128MB?) It is intentional set small to get extremely frequent buffer replacements. The point here was the patch actually can induce frequent freelist starvation. And as you do, I also doubt the significance of the performance hit by that. Just I was not usre. I re-ran the same for v8 and got a result largely different from the previous trial on the same v8. RETURNED: 2 ALLOCED: 0 BORROWED: 435 REUSED: 495444 ASSIGNED: 495467 (-23) Now "BORROWED" happens 0.8% of REUSED. > Well, I think some spare entries could reduce borrowing if there is > a need. I'll test on 128MB with spare entries. If there is profit, > I'll return some, but will keep SharedBufHash fixed. I don't doubt the benefit of this patch. And now convinced by myself that the downside is negligible than the benefit. > Master branch does less freelist manipulations since it tries to > insert first and if there is collision it doesn't delete victim > buffer. > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675. > > ... > > > Strange thing: both master and patched version has higher > > > peak tps at X5676 at medium connections (17 or 27 clients) > > > than in first october version [1]. But lower tps at higher > > > connections number (>= 191 clients). > > > I'll try to bisect on master this unfortunate change. > > > > The reversing of the preference order between freshly-allocation and > > borrow-from-another-freelist might affect. > > `master` changed its behaviour as well. > It is not problem of the patch at all. Agreed. So I think we should go on this direction. There are some last comments on v8. + HASH_FIXED_SIZE); Ah, now I understand that this prevented allocation of new elements. I think this good to do for SharedBufHash. + longnfree; /* number of free entries in the list */ HASHELEMENT *freeList; /* chain of free elements */ } FreeListData; +#if SIZEOF_LONG == 4 +typedef pg_atomic_uint32 nalloced_store_t; +typedef uint32 nalloced_value_t; +#define nalloced_read(a) (long)pg_atomic_read_u32(a) +#define nalloced_add(a, v) pg_atomic_fetch_add_u32((a), (uint32)(v)) I don't think nalloced needs to be the same width to long. For the platforms with 32-bit long, anyway
Re: Add 64-bit XIDs into PostgreSQL 15
At Tue, 15 Mar 2022 18:48:34 +0300, Maxim Orlov wrote in > Hi Kyotaro! > > 0001: > > > > The XID_FMT has quite bad impact on the translatability of error > > messages. 3286065651 has removed INT64_FORMAT from translatable > > texts for the reason. This re-introduces that in several places. > > 0001 itself does not harm but 0005 replaces XID_FMT with > > INT64_FORMAT. Other patches have the same issue, too. > > > I do understand your concern and I wonder how I can do this better? My > first intention was to replace XID_FMT with %llu and INT64_FORMAT with > %lld. This should solve the translatability issue, but I'm not sure about > portability of this. Should this work on Windows, etc? Can you advise me on > the best solution? Doesn't doing "errmsg("blah blah %lld ..", (long long) xid)" work? > We've fixed all the other things mentioned. Thanks! > > Also added two fixes: > - CF bot was unhappy with pg_upgrade test in v17 because I forgot to add a > fix for computation of relminmxid during vacuum on a fresh database. > - Replace frozen or invalid x_min with FrozenTransactionId or > InvalidTransactionId respectively during tuple conversion to 64xid. > > Reviews are welcome as always! Thanks! My pleasure. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Corruption during WAL replay
At Tue, 15 Mar 2022 12:44:49 -0400, Robert Haas wrote in > On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi > wrote: > > The attached is the fixed version and it surely works with the repro. > > Hi, > > I spent the morning working on this patch and came up with the > attached version. I wrote substantial comments in RelationTruncate(), > where I tried to make it more clear exactly what the bug is here, and > also in storage/proc.h, where I tried to clarify both the use of the > DELAY_CHKPT_* flags in general terms. If nobody is too sad about this > version, I plan to commit it. Thanks for taking this and for the time. The additional comments seems describing the flags more clearly. storage.c: +* Make sure that a concurrent checkpoint can't complete while truncation +* is in progress. +* +* The truncation operation might drop buffers that the checkpoint +* otherwise would have flushed. If it does, then it's essential that +* the files actually get truncated on disk before the checkpoint record +* is written. Otherwise, if reply begins from that checkpoint, the +* to-be-truncated buffers might still exist on disk but have older +* contents than expected, which can cause replay to fail. It's OK for +* the buffers to not exist on disk at all, but not for them to have the +* wrong contents. FWIW, this seems like slightly confusing between buffer and its content. I can read it correctly so I don't mind if it is natural enough. Otherwise all the added/revised comments looks fine. Thanks for the labor. > I think it should be back-patched, too, but that looks like a bit of a > pain. I think every back-branch will require different adjustments. I'll try that, if you are already working on it, please inform me. (It may more than likely be too late..) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH] Accept IP addresses in server certificate SANs
At Tue, 15 Mar 2022 21:41:49 +, Jacob Champion wrote in > On Mon, 2022-03-14 at 15:30 +0900, Kyotaro Horiguchi wrote: > > t/003_sslinfo.pl ... 1/? # Tests were run but no plan was declared and > > done_testing() was not seen. > > # Looks like your test exited with 29 just after 6. > > t/003_sslinfo.pl ... Dubious, test returned 29 (wstat 7424, 0x1d00) > > All 6 subtests passed > > ... > > Result: FAIL > > > > The script complains like this: > > > > ok 6 - ssl_client_cert_present() for connection with cert > > connection error: 'psql: error: connection to server at "127.0.0.1", port > > 62656 failed: SSL error: tlsv1 alert unknown ca' > > while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt > > sslmode=require dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser > > host=localhost -f - -v ON_ERROR_STOP=1' at > > /home/horiguti/work/worktrees/ipsan/src/test/ssl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm > > line 1873. > > > > So, psql looks like disliking the ca certificate. I also will dig > > into that. > > Hmm, the sslinfo tests are failing? I wouldn't have expected that based > on the patch changes. Just to confirm -- they pass for you without the > patch? Mmm I'm not sure how come I didn't noticed that, master also fails for me fo the same reason. In the past that may fail when valid clinent-certs exists in the users ~/.postgresql but I believe that has been fixed. > > > > Mmm. after the end of the loop, rc is non-zero only when the loop has > > > > exited by the break and otherwise rc is zero. Why isn't it equivalent > > > > to setting check_cn to false at the break? > > > > > > check_cn can be false if rc is zero, too; it means that we found a SAN > > > of the correct type but it didn't match. I'm not discussing on the meaning. Purely on the logical equivalancy of the two ways to decide whether to visit CN. Concretely, the equivalancy between this: = check_cn = true; rc = 0; for (i < san_len) { if (type matches) check_cn = false; if (some error happens) rc = nonzero; if (rc != 0) break; } !if ((rc == 0) && check_cn) {} = and this. = check_cn = true; rc = 0; for (i < san_len) { if (type matches) check_cn = false; if (some error happens) rc = nonzero; if (rc != 0) { !check_cn = false; break; } } !if (check_cn) {} = The two are equivalant to me. And if it is, the latter form smees simpler and clearner to me. > > > check_cn can be false if rc is zero, too; it means that we found a SAN > > > of the correct type but it didn't match. > > > Don't we count unmatched certs as "existed"? In that case I think we > > don't go to CN. > > Unmatched names, you mean? I'm not sure I understand. Sorry, I was confused here. Please ignore that. I shoudl have said something like the following instead. > check_cn can be false if rc is zero, too; it means that we found a SAN > of the correct type but it didn't match. Yes, in that case we don't visit CN because check_cn is false even if we don't exit by (rc != 0) and that behavior is not changed by my proposal. > > > > I supposed that we only > > > > be deviant in the case "IP address connhost and no SANs of any type > > > > exists". What do you think about it? > > > > > > We fall back in the case of "IP address connhost and dNSName SANs > > > exist", which is prohibited by that part of RFC 6125. I don't think we > > > deviate in the case you described; can you explain further? > > > > In that case, i.e., connhost is IP address and no SANs exist at all, > > we go to CN. On the other hand in RFC6125: > > > > rfc6125> In some cases, the URI is specified as an IP address rather > > rfc6125> than a hostname. In this case, the iPAddress subjectAltName > > rfc6125> must be present in the certificate and must exactly match the > > rfc6125> IP in the URI. > > > > It (seems to me) denies that behavior. Regardless of the existence of > > other types of SANs, iPAddress is required if connname is an IP > > address. (That is, it doesn't seem to me that there's any context > > like "if any SANs exists", but I'm not so sure I read it perfectly.) > > I see what you mean now. Yes, we deviate there as well (and have done > so for a while now). I think breaking compatibility there would > probably not go over well. Agreed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH] Accept IP addresses in server certificate SANs
At Wed, 16 Mar 2022 15:56:02 +0900 (JST), Kyotaro Horiguchi wrote in > Mmm I'm not sure how come I didn't noticed that, master also fails > for me fo the same reason. In the past that may fail when valid > clinent-certs exists in the users ~/.postgresql but I believe that has > been fixed. And finally my fear found to be true.. The test doesn't fail after removing files under ~/.postgresql, which should not happen. I'll fix it apart from this. Sorry for the confusion. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Out-of-tree certificate interferes ssltest
Hello. 003_sslinfo.pl fails for me. ok 6 - ssl_client_cert_present() for connection with cert connection error: 'psql: error: connection to server at "127.0.0.1", port 61688 failed: could not read certificate file "/home/horiguti/.postgresql/postgresql.crt": no start line' while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt sslmode=require dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser host=localhost -f - -v ON_ERR I think we don't want this behavior. The attached fixes that and make-world successfully finished even if I have a cert file in my home direcotory. regareds. -- Kyotaro Horiguchi NTT Open Source Software Center >From 308b55f06907ccaf4ac5669daacf04fea8a18fe1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 16 Mar 2022 16:20:46 +0900 Subject: [PATCH v1] Prevent out-of-tree certificates from interfering ssl tests 003_sslinfo.pl fails when there is a certificate file in ~/.postgresql directory. Prevent that failure by explicitly setting sslcert option in connection string. --- src/test/ssl/t/003_sslinfo.pl | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/test/ssl/t/003_sslinfo.pl b/src/test/ssl/t/003_sslinfo.pl index 95742081f3..81da94f18d 100644 --- a/src/test/ssl/t/003_sslinfo.pl +++ b/src/test/ssl/t/003_sslinfo.pl @@ -93,7 +93,7 @@ $result = $node->safe_psql("certdb", "SELECT ssl_client_cert_present();", is($result, 't', "ssl_client_cert_present() for connection with cert"); $result = $node->safe_psql("trustdb", "SELECT ssl_client_cert_present();", - connstr => "sslrootcert=ssl/root+server_ca.crt sslmode=require " . + connstr => "sslcert=invalid sslrootcert=ssl/root+server_ca.crt sslmode=require " . "dbname=trustdb hostaddr=$SERVERHOSTADDR user=ssltestuser host=localhost"); is($result, 'f', "ssl_client_cert_present() for connection without cert"); @@ -108,7 +108,7 @@ $result = $node->psql("certdb", "SELECT ssl_client_dn_field('invalid');", is($result, '3', "ssl_client_dn_field() for an invalid field"); $result = $node->safe_psql("trustdb", "SELECT ssl_client_dn_field('commonName');", - connstr => "sslrootcert=ssl/root+server_ca.crt sslmode=require " . + connstr => "sslcert=invalid sslrootcert=ssl/root+server_ca.crt sslmode=require " . "dbname=trustdb hostaddr=$SERVERHOSTADDR user=ssltestuser host=localhost"); is($result, '', "ssl_client_dn_field() for connection without cert"); -- 2.27.0
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Wed, 16 Mar 2022 15:42:52 +0900, Michael Paquier wrote in > On Wed, Mar 16, 2022 at 10:34:15AM +0900, Kyotaro Horiguchi wrote: > > +1. Desn't the doc need to mention that? > > Yes, I agree that it makes sense to add a note, even if > allow_in_place_tablespaces is a developer option. I have added the > following paragraph in the docs: > +A full path of the symbolic link in pg_tblspc/ > +is returned. A relative path to the data directory is returned > +for tablespaces created with > + enabled. I'm not sure that the "of the symbolic link in pg_tblspc/" is needed. And allow_in_place_tablespaces alone doesn't create in-place tablesapce. So this might need rethink at least for the second point. > Another thing that was annoying in the first version of the patch is > the useless call to lstat() on Windows, not needed because it is > possible to rely just on pgwin32_is_junction() to check if readlink() > should be called or not. Agreed. And v2 looks cleaner. The test detects the lack of the feature. It successfully builds and runs on Rocky8 and Windows11. > This leads me to the revised version attached. What do you think? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Standby got invalid primary checkpoint after crashed right after promoted.
At Wed, 16 Mar 2022 07:16:16 +, hao harry wrote in > Hi, pgsql-hackers, > > I think I found a case that database is not recoverable, would you please > give a look? > > Here is how it happens: > > - setup primary/standby > - do a lots INSERT at primary > - create a checkpoint at primary > - wait until standby start doing restart point, it take about 3mins syncing > buffers to complete > - before the restart point update ControlFile, promote the standby, that > changed ControlFile > ->state to DB_IN_PRODUCTION, this will skip update to ControlFile, leaving > the ControlFile > ->checkPoint pointing to a removed file Yeah, it seems like exactly the same issue pointed in [1]. A fix is proposed in [1]. Maybe I can remove "possible" from the mail subject:p [1] https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com [2] https://www.postgresql.org/message-id/20220316.102444.2193181487576617583.horikyota@gmail.com > - before the promoted standby request the post-recovery checkpoint (fast > promoted), > one backend crashed, it will kill other server process, so the > post-recovery checkpoint skipped > - the database restart startup process, which report: "could not locate a > valid checkpoint record" > > I attached a test to reproduce it, it does not fail every time, it fails > every 10 times to me. > To increase the chance CreateRestartPoint skip update ControlFile and to > simulate a crash, > the patch 0001 is needed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Standby got invalid primary checkpoint after crashed right after promoted.
(My previous mail hass crossed with this one) At Wed, 16 Mar 2022 08:21:46 +, hao harry wrote in > Found this issue is duplicated to [1], after applied that patch, I cannot > reproduce it anymore. > > [1] > https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com<https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota@gmail.com> Glad to hear that. Thanks for checking it! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Possible corruption by CreateRestartPoint at promotion
Just for the record. An instance of the corruption showed up in this mailing list [1]. [1] https://www.postgresql.org/message-id/flat/9EB4CF63-1107-470E-B5A4-061FB9EF8CC8%40outlook.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: BufferAlloc: don't take two simultaneous locks
At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov wrote in > В Ср, 16/03/2022 в 12:07 +0900, Kyotaro Horiguchi пишет: > > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov > > wrote in > > In v7, HASH_ENTER returns the element stored in DynaHashReuse using > > the freelist_idx of the new key. v8 uses that of the old key (at the > > time of HASH_REUSE). So in the case "REUSE->ENTER(elem exists and > > returns the stashed)" case the stashed element is returned to its > > original partition. But it is not what I mentioned. > > > > On the other hand, once the stahsed element is reused by HASH_ENTER, > > it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow > > from old partition) case. I suspect that ththat the frequent freelist > > starvation comes from the latter case. > > Doubtfully. Due to probabilty theory, single partition doubdfully > will be too overflowed. Therefore, freelist. Yeah. I think so generally. > But! With 128kb shared buffers there is just 32 buffers. 32 entry for > 32 freelist partition - certainly some freelist partition will certainly > have 0 entry even if all entries are in freelists. Anyway, it's an extreme condition and the starvation happens only at a neglegible ratio. > > RETURNED: 2 > > ALLOCED: 0 > > BORROWED: 435 > > REUSED: 495444 > > ASSIGNED: 495467 (-23) > > > > Now "BORROWED" happens 0.8% of REUSED > > 0.08% actually :) Mmm. Doesn't matter:p > > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675. > > > > ... > > > > > Strange thing: both master and patched version has higher > > > > > peak tps at X5676 at medium connections (17 or 27 clients) > > > > > than in first october version [1]. But lower tps at higher > > > > > connections number (>= 191 clients). > > > > > I'll try to bisect on master this unfortunate change. ... > I've checked. Looks like something had changed on the server, since > old master commit behaves now same to new one (and differently to > how it behaved in October). > I remember maintainance downtime of the server in november/december. > Probably, kernel were upgraded or some system settings were changed. One thing I have a little concern is that numbers shows 1-2% of degradation steadily for connection numbers < 17. I think there are two possible cause of the degradation. 1. Additional branch by consolidating HASH_ASSIGN into HASH_ENTER. This might cause degradation for memory-contended use. 2. nallocs operation might cause degradation on non-shared dynahasyes? I believe doesn't but I'm not sure. On a simple benchmarking with pgbench on a laptop, dynahash allocation (including shared and non-shared) happend about at 50 times per second with 10 processes and 200 with 100 processes. > > I don't think nalloced needs to be the same width to long. For the > > platforms with 32-bit long, anyway the possible degradation if any by > > 64-bit atomic there doesn't matter. So don't we always define the > > atomic as 64bit and use the pg_atomic_* functions directly? > > Some 32bit platforms has no native 64bit atomics. Then they are > emulated with locks. > > Well, and for 32bit platform long is just enough. Why spend other > 4 bytes per each dynahash? I don't think additional bytes doesn't matter, but emulated atomic operations can matter. However I'm not sure which platform uses that fallback implementations. (x86 seems to have __sync_fetch_and_add() since P4). My opinion in the previous mail is that if that level of degradation caued by emulated atomic operations matters, we shouldn't use atomic there at all since atomic operations on the modern platforms are not also free. In relation to 2 above, if we observe that the degradation disappears by (tentatively) use non-atomic operations for nalloced, we should go back to the previous per-freelist nalloced. I don't have access to such a musculous machines, though.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
At Wed, 16 Mar 2022 20:49:12 +0530, Ashutosh Sharma wrote in > I can see that the pg_get_wal_records_info function shows the details > of the WAL record whose existence is beyond the user specified > stop/end lsn pointer. See below: > > ashu@postgres=# select * from pg_get_wal_records_info('0/0128', > '0/0129'); > -[ RECORD 1 > ]+-- > start_lsn| 0/128 > end_lsn | 0/19F > prev_lsn | 0/0 ... > record_length| 114 ... > In this case, the end lsn pointer specified by the user is > '0/0129'. There is only one WAL record which starts before this > specified end lsn pointer whose start pointer is at 0128, but that > WAL record ends at 0/19F which is way beyond the specified end > lsn. So, how come we are able to display the complete WAL record info? > AFAIU, end lsn is the lsn pointer where you need to stop reading the > WAL data. If that is true, then there exists no valid WAL record > between the start and end lsn in this particular case. You're right considering how pg_waldump behaves. pg_waldump works almost the way as you described above. The record above actually ends at 199 and pg_waldump shows that record by specifying -s 0/128 -e 0/19a, but not for -e 0/199. # I personally think the current behavior is fine, though.. It still suggests unspecifiable end-LSN.. > select * from pg_get_wal_records_info('4/4B28EB68', '4/4C60'); > ERROR: cannot accept future end LSN > DETAIL: Last known WAL LSN on the database system is 4/4C60. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Unhyphenation of crash-recovery
At Thu, 17 Mar 2022 07:42:42 +0100, Peter Eisentraut wrote in > On 16.03.22 02:25, Kyotaro Horiguchi wrote: > > Hello, this is a derived topic from [1], summarized as $SUBJECT. > > This just removes useless hyphens from the words > > "(crash|emergency)-recovery". We don't have such wordings for "archive > > recovery" This patch fixes non-user-facing texts as well as > > user-facing ones. > > Most changes in this patch are not the correct direction. The hyphens > are used to group compound adjectives before nouns. For example, > > simple crash-recovery cases > > means > > simple (crash recovery) cases > > rather than > > simple crash (recovery cases) > > if it were without hyphens. Really? The latter recognization doesn't seem to make sense. I might be too-trained so that I capture "(crash|archive|blah) recovery" as implicit compound words. But anyway there's no strong reason to be aggressive to unhyphenate compound words. "point-in-time-recovery" and "(during) emergency-recovery operations" seem like better be unhyphnated, but now I'm not sure it is really so. Thanks for the comments. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Out-of-tree certificate interferes ssltest
At Thu, 17 Mar 2022 16:22:14 +0900, Michael Paquier wrote in > On Thu, Mar 17, 2022 at 02:59:26PM +0900, Michael Paquier wrote: > > In both cases, enforcing sslcrl to a value of "invalid" interferes > > with the failure scenario we expect from sslcrldir. It is possible to > > bypass that with something like the attached, but that's a kind of > > ugly hack. Another alternative would be to drop those two tests, and > > I am not sure how much we care about these two negative scenarios. > > Actually, there is a trick I have recalled here: we can enforce sslcrl > to an empty value in the connection string after the default. This > still ensures that the test won't pick up any SSL data from the local > environment and avoids any interferences of OpenSSL's > X509_STORE_load_locations(). This gives a much simpler and cleaner > patch. > > Thoughts? Ah! I didn't have a thought that we can specify the same parameter twice. It looks like clean and robust. $default_ssl_connstr contains all required options in PQconninfoOptions[]. The same method worked for 003_sslinfo.pl, too. (of course). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_tablespace_location() failure with allow_in_place_tablespaces
At Thu, 17 Mar 2022 16:39:52 +0900, Michael Paquier wrote in > On Thu, Mar 17, 2022 at 07:55:30PM +1300, Thomas Munro wrote: > > I don't really want to spill details of this developer-only stuff onto > > more manual sections... It's not really helping users if we confuse > > them with irrelevant details of a feature they shouldn't be using, is > > it? And the existing treatment "Returns the file system path that > > this tablespace is located in" is not invalidated by this special > > case, so maybe we shouldn't mention it? > > Right, I see your point. The existing description is not wrong > either. Fine by me to just drop all that. +1. Sorry for my otiose comment.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: XID formatting and SLRU refactorings
At Thu, 17 Mar 2022 19:25:00 +0300, Maxim Orlov wrote in > Hi! > > Here is the v20 patch. 0001 and 0002 are proposed into PG15 as > discussed above. > The whole set of patches is added into [1] to be committed into PG16. > > In this version we've made a major revision related to printf/elog format > compatible with localization > as was proposed above. > > We also think of adding 0003 patch related to 64 bit GUC's into this > thread. Suppose it also may be delivered > into PG15. > > Aleksander Alekseev, we've done this major revision mentioned above and you > are free to continue working on this patch set. > > Reviews and proposals are very welcome! (I'm afraid that this thread is not for the discussion of the patch?:) > [1] > https://www.postgresql.org/message-id/flat/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com +/* printf/elog format compatible with 32 and 64 bit xid. */ +typedef unsigned long long XID_TYPE; ... +errmsg_internal("found multixact %llu from before relminmxid %llu", +(XID_TYPE) multi, (XID_TYPE) relminmxid))); "(XID_TYPE) x" is actually equivalent to "(long long) x" here, but the point here is "%llu in format string accepts (long long)" so we should use literally (or bare) (long long) and the maybe-all precedents does that. And.. You looks like applying the change to some inappropriate places? - elog(DEBUG2, "deleted page from block %u has safexid %u", - blkno, opaque->btpo_level); + elog(DEBUG2, "deleted page from block %u has safexid %llu", + blkno, (XID_TYPE) opaque->btpo_level); regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Corruption during WAL replay
At Wed, 16 Mar 2022 10:14:56 -0400, Robert Haas wrote in > Hmm. I think the last two instances of "buffers" in this comment > should actually say "blocks". Ok. I replaced them with "blocks" and it looks nicer. Thanks! > > I'll try that, if you are already working on it, please inform me. (It > > may more than likely be too late..) > > If you want to take a crack at that, I'd be delighted. Finally, no two of from 10 to 14 doesn't accept the same patch. As a cross-version check, I compared all combinations of the patches for two adjacent versions and confirmed that no hunks are lost. All versions pass check world. The differences between each two adjacent versions are as follows. master->14: A hunk fails due to the change in how to access rel->rd_smgr. 14->13: Several hunks fail due to simple context differences. 13->12: Many hunks fail due to the migration of delayChkpt from PGPROC to PGXACT and the context difference due to change of FSM trancation logic in RelationTruncate. 12->11: Several hunks fail due to the removal of volatile qalifier from pointers to PGPROC/PGXACT. 11-10: A hunk fails due to the context difference due to an additional member tempNamespaceId of PGPROC. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From c88858d3e5681005ba0396b7e7ebcde4322b3308 Mon Sep 17 00:00:00 2001 From: Robert Haas Date: Tue, 15 Mar 2022 12:29:14 -0400 Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a checkpoint. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit If TRUNCATE causes some buffers to be invalidated and thus the checkpoint does not flush them, TRUNCATE must also ensure that the corresponding files are truncated on disk. Otherwise, a replay from the checkpoint might find that the buffers exist but have the wrong contents, which may cause replay to fail. Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design suggestion from Heikki Linnakangas, with some changes to the comments by me. Review of this and a prior patch that approached the issue differently by Heikki Linnakangas, Andres Freund, Álvaro Herrera, Masahiko Sawada, and Tom Lane. Back-patch to all supported versions. Discussion: http://postgr.es/m/byapr06mb6373bf50b469ca393c614257ab...@byapr06mb6373.namprd06.prod.outlook.com --- src/backend/access/transam/multixact.c | 6 ++-- src/backend/access/transam/twophase.c | 12 src/backend/access/transam/xact.c | 5 ++-- src/backend/access/transam/xlog.c | 16 +-- src/backend/access/transam/xloginsert.c | 2 +- src/backend/catalog/storage.c | 29 ++- src/backend/storage/buffer/bufmgr.c | 6 ++-- src/backend/storage/ipc/procarray.c | 26 - src/backend/storage/lmgr/proc.c | 4 +-- src/include/storage/proc.h | 37 - src/include/storage/procarray.h | 5 ++-- 11 files changed, 120 insertions(+), 28 deletions(-) diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 6a70d49738..9f65c600d0 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -3088,8 +3088,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB) * crash/basebackup, even though the state of the data directory would * require it. */ - Assert(!MyProc->delayChkpt); - MyProc->delayChkpt = true; + Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0); + MyProc->delayChkpt |= DELAY_CHKPT_START; /* WAL log truncation */ WriteMTruncateXlogRec(newOldestMultiDB, @@ -3115,7 +3115,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB) /* Then offsets */ PerformOffsetsTruncation(oldestMulti, newOldestMulti); - MyProc->delayChkpt = false; + MyProc->delayChkpt &= ~DELAY_CHKPT_START; END_CRIT_SECTION(); LWLockRelease(MultiXactTruncationLock); diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 874c8ed125..4dc8ccc12b 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -475,7 +475,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid, } proc->xid = xid; Assert(proc->xmin == InvalidTransactionId); - proc->delayChkpt = false; + proc->delayChkpt = 0; proc->statusFlags = 0; proc->pid = 0; proc->databaseId = databaseid; @@ -1164,7 +1164,8 @@ EndPrepare(GlobalTransaction gxact) START_CRIT_SECTION(); - MyProc->delayChkpt = true; + Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0); + MyProc->delayChkpt |= DELAY_CHKPT_START; XLogBeginInsert(); for (record = records.head; record != NULL; record = record->next) @@ -1207,7 +1208,7 @@ EndPrepare(GlobalTransaction gxa
Re: XID formatting and SLRU refactorings
At Fri, 18 Mar 2022 10:20:08 +0900 (JST), Kyotaro Horiguchi wrote in > "(XID_TYPE) x" is actually equivalent to "(long long) x" here, but the > point here is "%llu in format string accepts (long long)" so we should Of course it's the typo of "%llu in format string accepts (*unsigned* long long)". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH] Accept IP addresses in server certificate SANs
At Thu, 17 Mar 2022 21:55:07 +, Jacob Champion wrote in > On Wed, 2022-03-16 at 23:49 +, Jacob Champion wrote: > > Thank you for the explanation -- the misunderstanding was all on my > > end. I thought you were asking me to move the check_cn assignment > > instead of copying it to the end. I agree that your suggestion is much > > clearer, and I'll make that change tomorrow. > > Done in v8. Thanks again for your suggestions (and for your > perseverance when I didn't get it)! Thanks! .. and some nitpicks..(Sorry) fe-secure-common.c doesn't need netinet/in.h. +++ b/src/include/utils/inet.h .. +#include "common/inet-common.h" I'm not sure about the project policy on #include practice, but I think it is the common practice not to include headers that are not required by the file itself. In this case, fe-secure-common.h itself doesn't need the include. Instead, fe-secure-openssl.c and fe-secure-common.c needs the include. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Replication slot drop message is sent after pgstats shutdown.
At Fri, 18 Mar 2022 00:28:37 -0700, Noah Misch wrote in > === > diff -U3 > /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out > > /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out > --- > /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out > Tue Feb 15 06:58:14 2022 > +++ > /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out > Tue Feb 15 11:38:14 2022 > I plan to fix this as attached, similar to how commit c04c767 fixed the same > challenge in detach-partition-concurrently-[34]. It looks correct and I confirmed that it works. It looks like a similar issue with [1] but this is cleaner and stable. [1] https://www.postgresql.org/message-id/20220304.113347.2105652521035137491.horikyota@gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Make mesage at end-of-recovery less scary.
At Mon, 21 Mar 2022 17:01:19 -0700, Andres Freund wrote in > Patch currently fails to apply, needs a rebase: > http://cfbot.cputube.org/patch_37_2490.log Thanks for noticing me of that. Rebased to the current HEAD. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From a7c9f36e631eaba5078398598dae5d459e79add9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v17] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlogreader.c | 145 +- src/backend/access/transam/xlogrecovery.c | 92 ++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 +- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 6 files changed, 306 insertions(+), 58 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index e437c42992..0942265408 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -46,6 +46,8 @@ static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -147,6 +149,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -552,6 +555,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking) /* reset error state */ state->errormsg_buf[0] = '\0'; decoded = NULL; + state->EndOfWAL = false; state->abortedRecPtr = InvalidXLogRecPtr; state->missingContrecPtr = InvalidXLogRecPtr; @@ -633,25 +637,21 @@ restart: Assert(pageHeaderSize <= readOff); /* - * Read the record length. + * Validate the record header. * - * NB: Even though we use an XLogRecord pointer here, the whole record - * header might not fit on this page. xl_tot_len is the first field of the - * struct, so it must be on this page (the records are MAXALIGNed), but we - * cannot access any other fields until we've verified that we got the - * whole header. - */ - record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; - - /* - * If the whole record header is on this page, validate it immediately. - * Otherwise do just a basic sanity check on xl_tot_len, and validate the - * rest of the header after reading it from the next page. The xl_tot_len + * Even though we use an XLogRecord pointer here, the whole record header + * might not fit on this page. If the whole record header is on this page, + * validate it immediately. Even otherwise xl_tot_len must be on this page + * (it is the first field of MAXALIGNed records), but we still cannot + * access any further fields until we've verified that we got the whole + * header, so do just a basic sanity check on record length, and validate + * the rest of the header after reading it from the next page. The length * check is necessary here to ensure that we enter the "Need to reassemble * record" code path below; otherwise we might fail to apply * ValidXLogRecordHeader at all. */ + record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); + if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) { if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record, @@ -661,18 +661,14 @@ restart: } else { - /* XXX: more validation should be done here */ - if (total_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, total_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) goto err; - } + gotheader = false; } + total_len = record->xl_tot_len; + /* * Find space to decode this record. Don't allow oversized allocation if * the caller requested nonblocking.