Re: Patch a potential memory leak in describeOneTableDetails()

2022-02-21 Thread Kyotaro Horiguchi
At Fri, 18 Feb 2022 16:12:34 +0800 (GMT+08:00), wli...@stu.xidian.edu.cn wrote 
in 
> I found a potential memory leak in PostgreSQL 14.1, in the function 
> describeOneTableDetails() (./src/bin/psql/describe.c). The bug has been 
> confirmed by an auditor of .
> 
> Specifically, at line 1603, a memory chunk is allocated with pg_strdup and 
> assigned to tableinfo.reloptions. If the branch at line 1621 is taken, the 
> execution path eventually jumps to error_return at line 3394. 
> Unfortunately, the memory is not freed when the function 
> describeOneTableDetails() returns. As a result, the memory is leaked.
> 
> Similarly, the two memory chunks (tableinfo.reloftype and tableinfo.relam) 
> allocated at lines 1605, 1606 and 1612 may also be leaked for the same reason.

I think these are not potential leaks but real leaks, since the lost
memory accumulates as \d is run repeatedly.

> We believe we can fix the problem by adding the corresponding release 
> calls before the function returns, such as:
> 
> if (tableinfo.reloptions)
> pg_free (tableinfo.reloptions);
> if (tableinfo.reloftype)
> pg_free (tableinfo.reloftype);
> if (tableinfo.relam)
> pg_free (tableinfo.relam);
> 
> 
> I'm looking forward to your reply.

Good catch, and the fix is logically correct. The results from other
uses of pg_strdup() (and pg_malloc()) seem to be released properly.

About the patch: we should consolidate the free() calls into a single
chunk. Also, we don't insert whitespace between a function name and the
following left parenthesis.

Since view_def is allocated by pg_strdup(), we might better use
pg_free() for it, for symmetry. footers[0] is allocated using the
frontend alternative of palloc(), so should it use pfree() instead?

Honestly, I'm not sure we need to be so strict about that
correspondence...
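In isolation, the leak-and-fix pattern under discussion looks like the sketch below. It is self-contained: dup_str() and plain free() stand in for pg_strdup()/pg_free(), and the struct mirrors only the three fields mentioned above; everything else in describe.c is omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for pg_strdup(): duplicate a string with malloc. */
static char *
dup_str(const char *s)
{
    char *r = malloc(strlen(s) + 1);

    strcpy(r, s);
    return r;
}

/* Only the three leaked fields from psql's describe.c. */
struct tableinfo_sketch
{
    char *reloptions;
    char *reloftype;
    char *relam;
};

/* Returns 1 on success, 0 on the early-error path. */
static int
describe_sketch(int fail_early, struct tableinfo_sketch *tableinfo)
{
    int retval = 0;

    tableinfo->reloptions = dup_str("fillfactor=70");
    tableinfo->reloftype = dup_str("mytype");
    tableinfo->relam = dup_str("heap");

    if (fail_early)
        goto error_return;      /* before the fix, all three strings
                                 * leaked on this path */

    retval = 1;

error_return:
    /* the fix: release the strings on every exit path */
    free(tableinfo->reloptions);
    free(tableinfo->reloftype);
    free(tableinfo->relam);
    tableinfo->reloptions = tableinfo->reloftype = tableinfo->relam = NULL;

    return retval;
}
```

Note that free(NULL) is a no-op, so the `if (x) pg_free(x)` guards in the proposed fix are defensive rather than strictly necessary.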

So it would be something like this.

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 654ef2d7c3..5da2b61a88 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3412,7 +3412,13 @@ error_return:
termPQExpBuffer(&tmpbuf);
 
if (view_def)
-   free(view_def);
+   pg_free(view_def);
+if (tableinfo.reloptions)
+pg_free(tableinfo.reloptions);
+if (tableinfo.reloftype)
+pg_free(tableinfo.reloftype);
+if (tableinfo.relam)
+pg_free(tableinfo.relam);
 
if (res)
PQclear(res);
@@ -3621,8 +3627,8 @@ describeRoles(const char *pattern, bool verbose, bool 
showSystem)
printTableCleanup(&cont);
 
for (i = 0; i < nrows; i++)
-   free(attr[i]);
-   free(attr);
+   pg_free(attr[i]);
+   pg_free(attr);
 
PQclear(res);
return true;

(However, I got a mysterious -Wmisleading-indentation warning with this..)

> describe.c: In function ‘describeOneTableDetails’:
> describe.c:3420:5: warning: this ‘if’ clause does not guard... 
> [-Wmisleading-indentation]
>  if (tableinfo.relam)
>  ^~
> describe.c:3423:2: note: ...this statement, but the latter is misleadingly 
> indented as if it were guarded by the ‘if’
>   if (res)
>   ^~

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Patch a potential memory leak in describeOneTableDetails()

2022-02-21 Thread Kyotaro Horiguchi
At Mon, 21 Feb 2022 10:24:23 +0100, Daniel Gustafsson  wrote 
in 
> I think it's because you've indented your new code differently from the
> existing, such that the if (res) clause is indented equally to the previous
> pg_free(tableinfo.relam) call, making it look like they are in the same block:

Yeah, I (wrongly) thought that they had the same indentation.

> > +if (tableinfo.relam)
> > +pg_free(tableinfo.relam);
> > 
> > if (res)
> > PQclear(res);

But actually the lines I added are space-indented, while the following
existing line is tab-indented.

Anyway, don't we use the -ftabstop option to silence the compiler?
The warning contradicts our convention. The attached patch adds that
flag.


By the way, the doc says that:

https://www.postgresql.org/docs/14/source-format.html

> The text browsing tools more and less can be invoked as:
>more -x4
>less -x4
> to make them show tabs appropriately.

But AFAICS the "more" command on CentOS has neither a -x option nor
any other option to change the tab width, and I can't find documentation
suggesting that such an option exists on other platforms.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 4e3b0c2ce6a2dd60d11100deb2db80ff07455d0d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 22 Feb 2022 14:41:10 +0900
Subject: [PATCH] Fit compiler tabstop to our convention

Set -ftabstop to 4 so that -Wmisleading-indentation doesn't emit
false warnings.
---
 configure| 92 
 configure.ac |  3 ++
 2 files changed, 95 insertions(+)

diff --git a/configure b/configure
index f3cb5c2b51..887ed733eb 100755
--- a/configure
+++ b/configure
@@ -6219,6 +6219,98 @@ if test 
x"$pgac_cv_prog_CXX_cxxflags__fexcess_precision_standard" = x"yes"; then
 fi
 
 
+  # Fit tab width for -Wmisleading-indentation to our convention
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports 
-ftabstop=4, for CFLAGS" >&5
+$as_echo_n "checking whether ${CC} supports -ftabstop=4, for CFLAGS... " >&6; }
+if ${pgac_cv_prog_CC_cflags__ftabstop_4+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+pgac_save_CC=$CC
+CC=${CC}
+CFLAGS="${CFLAGS} -ftabstop=4"
+ac_save_c_werror_flag=$ac_c_werror_flag
+ac_c_werror_flag=yes
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_prog_CC_cflags__ftabstop_4=yes
+else
+  pgac_cv_prog_CC_cflags__ftabstop_4=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_c_werror_flag=$ac_save_c_werror_flag
+CFLAGS="$pgac_save_CFLAGS"
+CC="$pgac_save_CC"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: 
$pgac_cv_prog_CC_cflags__ftabstop_4" >&5
+$as_echo "$pgac_cv_prog_CC_cflags__ftabstop_4" >&6; }
+if test x"$pgac_cv_prog_CC_cflags__ftabstop_4" = x"yes"; then
+  CFLAGS="${CFLAGS} -ftabstop=4"
+fi
+
+
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports 
-ftabstop=4, for CXXFLAGS" >&5
+$as_echo_n "checking whether ${CXX} supports -ftabstop=4, for CXXFLAGS... " 
>&6; }
+if ${pgac_cv_prog_CXX_cxxflags__ftabstop_4+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CXXFLAGS=$CXXFLAGS
+pgac_save_CXX=$CXX
+CXX=${CXX}
+CXXFLAGS="${CXXFLAGS} -ftabstop=4"
+ac_save_cxx_werror_flag=$ac_cxx_werror_flag
+ac_cxx_werror_flag=yes
+ac_ext=cpp
+ac_cpp='$CXXCPP $CPPFLAGS'
+ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5'
+ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS 
conftest.$ac_ext $LIBS >&5'
+ac_compiler_gnu=$ac_cv_cxx_compiler_gnu
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+  pgac_cv_prog_CXX_cxxflags__ftabstop_4=yes
+else
+  pgac_cv_prog_CXX_cxxflags__ftabstop_4=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_ext=c
+ac_cpp='$CPP $CPPFLAGS'
+ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
+ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext 
$LIBS >&5'
+ac_compiler_gnu=$ac_cv_c_compiler_gnu
+
+ac_cxx_werror_flag=$ac_save_cxx_werror_flag
+CXXFLAGS="$pgac_save_CXXFLAGS"
+CXX="$pgac_save_CXX"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: 
$pgac_cv_prog_CXX_cxxflags__ftabstop_4" >&5
+$as_echo "$pgac_cv_prog_CXX_cxxflags__ftabstop_4" >&6; }
+if test 

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-22 Thread Kyotaro Horiguchi
At Tue, 22 Feb 2022 01:59:45 +0900, Fujii Masao  
wrote in 
> 
> > Of the following, I think we should do (a) and (b) to make future
> > backpatchings easier.
> > a) Use RedoRecPtr and PriorRedoPtr after they are assigned.
> > b) Move assignment to PriorRedoPtr into the ControlFileLock section.
> 
> I failed to understand how (a) and (b) can make the backpatching
> easier. How easy to backpatch seems the same whether we apply (a) and
> (b) or not...

That presumes that the patch applied to master contains (a) and (b).
If it doesn't, those changes are not needed on the older branches.


> > c) Skip update of minRecoveryPoint only when the checkpoint gets old.
> 
> Yes.
> 
> 
> > d) Skip call to UpdateCheckPointDistanceEstimate() when RedoRecPtr <=
> >PriorRedoPtr.
> 
> But "RedoRecPtr <= PriorRedoPtr" will never happen, will it? Because a

I didn't believe that it happens. (So it came from my
conservativeness, or laziness, or both :p)  The code dates from 2009,
when StartupXLOG ran a checkpoint concurrently with the bgwriter. But
as of at least 9.5, StartupXLOG doesn't directly call CreateCheckPoint,
so I think that can't happen.

So, in short, I agree to remove it or turn it into Assert().

> restartpoint is skipped at the beginning of CreateRestartPoint() in
> that case. If this understanding is right, the check of "RedoRecPtr <=
> PriorRedoPtr" is not necessary before calling
> UpdateCheckPointDistanceEstimate().
> 
> 
> + ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
> + ControlFile->minRecoveryPointTLI = 0;
> 
> Don't we need to update LocalMinRecoveryPoint and
> LocalMinRecoveryPointTLI after this? Maybe it's not necessary, but
> ISTM that it's safer and better to always update them whether the
> state is DB_IN_ARCHIVE_RECOVERY or not.

Agree that it's safer and tidy.

>   if (flags & CHECKPOINT_IS_SHUTDOWN)
>   ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
> 
> Same as above. IMO it's safer and better to always update the state
> (whether the state is DB_IN_ARCHIVE_RECOVERY or not) if
> CHECKPOINT_IS_SHUTDOWN flag is passed.

That would mean we might exit recovery mode after ShutdownXLOG has
called CreateRestartPoint, which I don't think can happen.  So I'd
rather add Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0) there instead.

I'll post the new version later.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: remove more archiving overhead

2022-02-23 Thread Kyotaro Horiguchi
At Mon, 21 Feb 2022 17:19:48 -0800, Nathan Bossart  
wrote in 
> I also spent some time investigating whether durably renaming the archive
> status files was even necessary.  In theory, it shouldn't be.  If a crash
> happens before the rename is persisted, we might try to re-archive files,
> but your archive_command/archive_library should handle that.  If the file

I believe we try to archive each WAL segment once and only once,
even though we actually fail to in certain cases.

https://www.postgresql.org/docs/14/continuous-archiving.html
| The archive command should generally be designed to refuse to
| overwrite any pre-existing archive file. This is an important safety
| feature to preserve the integrity of your archive in case of
| administrator error (such as sending the output of two different
| servers to the same archive directory).

The recommended behavior of archive_command above is meant to avoid that
kind of corruption.  If we officially admit that a WAL segment can be
archived twice, we need to revise that part (and the following part) of
the docs.

> was already recycled or removed, the archiver will skip it (thanks to
> 6d8727f).  But when digging further, I found that WAL file recyling uses
> durable_rename_excl(), which has the following note:
>
>* Note that a crash in an unfortunate moment can leave you with two 
> links to
>* the target file.
> 
> IIUC this means that in theory, a crash at an unfortunate moment could
> leave you with a .ready file, the file to archive, and another link to that
> file with a "future" WAL filename.  If you re-archive the file after it has
> been reused, you could end up with corrupt WAL archives.  I think this

IMHO the difference would be that the single system call (maybe)
cannot be interrupted by SIGKILL.  So I'm not sure that is worth
considering.  The SIGKILL window could be eliminated if we could use
renameat2(RENAME_NOREPLACE).  But ultimately a power loss can still
cause that.

> might already be possible today.  Syncing the directory after every rename
> probably reduces the likelihood quite substantially, but if
> RemoveOldXlogFiles() quickly picks up the .done file and attempts
> recycling before durable_rename() calls fsync() on the directory,
> presumably the same problem could occur.

If we don't fsync at every rename, we don't even need
RemoveOldXlogFiles to pick up the .done files very quickly.  Isn't
that a big difference?
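For context, a "durable rename" in this discussion is a rename() followed by an fsync() of the parent directory, which persists the new directory entry. The sketch below is a minimal, simplified illustration of that idea (no retry logic, and unlike PostgreSQL's durable_rename() it does not also fsync the file itself), not the actual fd.c implementation:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal sketch of a durable rename: rename() is atomic on POSIX
 * filesystems, and fsync() on the parent directory persists the new
 * directory entry across a crash. */
static int
durable_rename_sketch(const char *oldpath, const char *newpath,
                      const char *dirpath)
{
    int dirfd;

    if (rename(oldpath, newpath) < 0)
        return -1;

    /* Persist the directory entry change itself. */
    dirfd = open(dirpath, O_RDONLY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) < 0)
    {
        close(dirfd);
        return -1;
    }
    close(dirfd);
    return 0;
}
```

Without the directory fsync, a crash after rename() can roll the directory entry back to its old name, which is exactly the window being discussed for the .ready/.done status files.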

> I believe the current recommendation is to fail if the archive file already
> exists.  The basic_archive contrib module fails if the archive already
> exists and has different contents, and it succeeds without re-archiving if
> it already exists with identical contents.  Since something along these
> lines is recommended as a best practice, perhaps this is okay.  But I think
> we might be able to do better.  If we ensure that the archive_status
> directory is sync'd prior to recycling/removing a WAL file, I believe we
> are safe from the aforementioned problem.
>
> The drawback of this is that we are essentially offloading more work to the
> checkpointer process.  In addition to everything else it must do, it will
> also need to fsync() the archive_status directory many times.  I'm not sure
> how big of a problem this is.  It should be good to ensure that archiving
> keeps up, and there are far more expensive tasks that the checkpointer must
> perform that could make the added fsync() calls relatively insignificant.

I don't think we want to make the checkpointer slower in exchange for
making the archiver faster.  Perhaps we should work from condensed
information rather than repeatedly scanning over distributed
information like archive_status?

At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart  
wrote in 
> On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
> > In my testing, I found that when I killed the server just before unlink()
> > during WAL recyling, I ended up with links to the same file in pg_wal after
> > restarting.  My latest test produced links to the same file for the current
> > WAL file and the next one.  Maybe WAL recyling should use durable_rename()
> > instead of durable_rename_excl().
> 
> Here is an updated patch.

Is it intentional that v2 loses the xlogarchive.c part?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Patch a potential memory leak in describeOneTableDetails()

2022-02-23 Thread Kyotaro Horiguchi
At Tue, 22 Feb 2022 09:59:09 +0100, Daniel Gustafsson  wrote 
in 
> > On 22 Feb 2022, at 07:14, Kyotaro Horiguchi  wrote:
> 
> > Anyway, don't we use the -ftabstop option flag to silence compiler?
> > The warning is contradicting our convention. The attached adds that
> > flag.
> 
> Isn't this flag contradicting our convention?  From the docs section you
> reference further down:
> 
> "Source code formatting uses 4 column tab spacing, with tabs preserved
> (i.e., tabs are not expanded to spaces)."

Ugh.  Right.

> The -ftabstop option is (to a large extent, not entirely) to warn when tabs 
> and
> spaces are mixed creating misleading indentation that the author didn't even
> notice due to tabwidth settings?  ISTM we are better off getting these warnings
> than suppressing them.

At Tue, 22 Feb 2022 10:00:46 -0500, Tom Lane  wrote in 
> No, we use pgindent to not have such inconsistent indentation.

Understood.  Thanks for the clarification.

> > By the way, the doc says that:
> > 
> > https://www.postgresql.org/docs/14/source-format.html
> > 
> >> The text browsing tools more and less can be invoked as:
> >>   more -x4
> >>   less -x4
> >> to make them show tabs appropriately.
> > 
> > But AFAICS the "more" command on CentOS doesn't have -x options nor
> > any option to change tab width and I don't find a document that
> > suggests its existence on other platforms.
> 
> macOS, FreeBSD and NetBSD both show the less(1) manpage for more(1) which
> suggests that more is implemented using less there, and thus supports -x (it
> does on macOS).  OpenBSD and Solaris more(1) does not show a -x option, but 
> AIX
> does have it.  Linux in its various flavors doesn't seem to have it.

Thanks for the explanation.

> The section in question was added to our docs 22 years ago, to make it reflect
> the current operating systems in use maybe just not mentioning more(1) is the
> easiest?:
> 
> "The text browsing tool less can be invoked as less -x4 to make it show
> tabs appropriately."
> 
> Or perhaps remove that section altogether?

I think the former is better.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Patch a potential memory leak in describeOneTableDetails()

2022-02-23 Thread Kyotaro Horiguchi
At Thu, 24 Feb 2022 14:44:56 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 22 Feb 2022 09:59:09 +0100, Daniel Gustafsson  wrote 
> in 
> > The section in question was added to our docs 22 years ago, to make it 
> > reflect
> > the current operating systems in use maybe just not mentioning more(1) is 
> > the
> > easiest?:
> > 
> > "The text browsing tool less can be invoked as less -x4 to make it show
> > tabs appropriately."
> > 
> > Or perhaps remove that section altogether?
> 
> I think the former is better.

So the attached does that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From fd833aee1756364f669a3d478763299ce6d3a747 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 24 Feb 2022 15:18:52 +0900
Subject: [PATCH] Fix out-of-date description mentioning the command more

On most of the operating systems in use, "more" is implemented using
"less"; otherwise "more" does not have an -x option. Remove the
descriptions related to the more command so the docs better reflect
the current world.
---
 doc/src/sgml/sources.sgml | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/sources.sgml b/doc/src/sgml/sources.sgml
index 1b77efb087..7a6e8a951f 100644
--- a/doc/src/sgml/sources.sgml
+++ b/doc/src/sgml/sources.sgml
@@ -76,13 +76,8 @@

 

-The text browsing tools more and
-less can be invoked as:
-
-more -x4
-less -x4
-
-to make them show tabs appropriately.
+The text browsing tool less can be invoked
+as less -x4 to make it show tabs appropriately.

   
 
-- 
2.27.0



Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-23 Thread Kyotaro Horiguchi
At Wed, 23 Feb 2022 02:58:07 +, "Imseih (AWS), Sami"  
wrote in 
> >Ooh, nice find and diagnosys.  I can confirm that the test fails as you
> >described without the code fix, and doesn't fail with it.
> 
> >I attach the same patch, with the test file put in its final place
> >rather than as a patch.  Due to recent xlog.c changes this need a bit of
> >work to apply to back branches; I'll see about getting it in all
> >branches soon.
> 
> Thank you for reviewing!

Nice catch!  However, I'm not sure I like the patch.

 * made it through and start writing after the portion that persisted.
 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
 * we'll do as soon as we're open for writing new WAL.)
+*
+* If the last wal record is ahead of the missing contrecord, this is a
+* recently promoted primary and we should not write an overwrite
+* contrecord.

Just before the added part, the existing comment reads as follows:

>* Actually, if WAL ended in an incomplete record, skip the parts that

So, actually, the WAL did not end in an incomplete record.  I think
FinishWalRecovery() is the last place to detect that. (But it could be
earlier.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-24 Thread Kyotaro Horiguchi
At Thu, 24 Feb 2022 16:26:42 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> So, actually WAL did not ended in an incomplete record.  I think
> FinishWalRecover is the last place to do that. (But it could be
> earlier.)

After some investigation, I finally concluded that we should reset
abortedRecPtr and missingContrecPtr when replaying an
XLOG_OVERWRITE_CONTRECORD record.


--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1953,6 +1953,11 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID 
replayTLI)

LSN_FORMAT_ARGS(xlrec.overwritten_lsn),

timestamptz_to_str(xlrec.overwrite_time;
 
+   /* We have safely skipped the aborted record */
+   abortedRecPtr = InvalidXLogRecPtr;
+   missingContrecPtr = InvalidXLogRecPtr;
+
/* Verifying the record should only happen once */
record->overwrittenRecPtr = InvalidXLogRecPtr;
}

The last check in the test, against "resetting aborted record", is no
longer useful since it is already covered by
026_overwrite_contrecord.pl.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add id's to various elements in protocol.sgml

2022-02-24 Thread Kyotaro Horiguchi
At Tue, 21 Dec 2021 08:47:27 +0100, Brar Piening  wrote in 
> On 20.12.2021 at 16:09, Robert Haas wrote:
> > As a data point, this is something I have also wanted to do, from time
> > to time. I am generally of the opinion that any place the

+1 from me.  When I put a URL in an answer to an inquiry, I always
look in the HTML for name/id tags so that the inquirer can quickly find
the information source (or the backing or reference point) on the
page.  If none is found, I paste a snippet instead.

> > documentation has a long list of things, which should add ids, so that
> > people can link to the particular thing in the list to which they want
> > to draw someone's attention.
> >
> Thank you.
> 
> If there is consensus on generally adding links to long lists I'd take
> suggestions for other places where people think that this would make
> sense and amend my patch.

I don't think there is.

I remember sometimes wanting ids on some sections (x.x) and
items (x.x.x or lower) (or even clauses, ignoring the cost :p).

FWIW, from that perspective, I have no requirement that they be
human-readable.  I'm fine with automatically generated ids.
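For illustration, an id in the DocBook SGML source looks roughly like the fragment below (the id value here is hypothetical; the actual ids chosen for protocol.sgml may differ):

```sgml
<varlistentry id="protocol-message-authenticationok">
 <term>AuthenticationOk (B)</term>
 <listitem>
  <para>...</para>
 </listitem>
</varlistentry>
```

The generated HTML then carries a stable anchor for that entry, so a URL fragment can point a reader directly at the specific message description rather than at the top of a long page.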

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Buffer Manager and Contention

2022-02-24 Thread Kyotaro Horiguchi
(I added Yura, as the author of a related patch)

At Thu, 24 Feb 2022 12:58:23 +, Simon Riggs  
wrote in 
> Thinking about poor performance in the case where the data fits in
> RAM, but the working set is too big for shared_buffers, I notice a
> couple of things that seem bad in BufMgr, but don't understand why
> they are like that.
> 
> 1. If we need to allocate a buffer to a new block we do this in one
> step, while holding both partition locks for the old and the new tag.
> Surely it would cause less contention to make the old block/tag
> invalid (after flushing), drop the old partition lock and then switch
> to the new one? i.e. just hold one mapping partition lock at a time.
> Is there a specific reason we do it this way?

I'm not sure, but I guess the developer wanted to make the operation
atomic.

Yura Sokolov is proposing a patch to separate the two partition locks.

https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

It seems viable to me, and benchmarking in that thread showed a good
result.  I'd appreciate your input on that thread.

> 2. Possibly connected to the above, we issue BufTableInsert() BEFORE
> we issue BufTableDelete(). That means we need extra entries in the
> buffer mapping hash table to allow us to hold both the old and the new
> at the same time, for a short period. The way dynahash.c works, we try

Yes.

> to allocate an entry from the freelist and if that doesn't work, we
> begin searching ALL the freelists for free entries to steal. So if we
> get enough people trying to do virtual I/O at the same time, then we
> will hit a "freelist storm" where everybody is chasing the last few
> entries. It would make more sense if we could do BufTableDelete()

To reduce that overhead, Yura proposed a surgical modification to
dynahash, but I didn't like that part and the latest patch no longer
contains it.

> first, then hold onto the buffer mapping entry rather than add it to
> the freelist, so we can use it again when we do BufTableInsert() very
> shortly afterwards. That way we wouldn't need to search the freelist
> at all. What is the benefit or reason of doing the Delete after the
> Insert?

Hmm, something like hash_swap_key() or hash_reinsert_entry()?  That
sounds reasonable. (Yura's proposal was to take an entry out of the
hash and then attach it again with a new key.)

> Put that another way, it looks like BufTable functions are used in a
> way that mismatches against the way dynahash is designed.
> 
> Thoughts?

On the first point, I think Yura's patch works.  On the second point,
Yura already showed that it gives a certain amount of gain.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Buffer Manager and Contention

2022-02-24 Thread Kyotaro Horiguchi
At Fri, 25 Feb 2022 10:20:25 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> (I added Yura, as the author of a related patch)
> 
> At Thu, 24 Feb 2022 12:58:23 +, Simon Riggs 
>  wrote in 
> > Thinking about poor performance in the case where the data fits in
> > RAM, but the working set is too big for shared_buffers, I notice a
> > couple of things that seem bad in BufMgr, but don't understand why
> > they are like that.
> > 
> > 1. If we need to allocate a buffer to a new block we do this in one
> > step, while holding both partition locks for the old and the new tag.
> > Surely it would cause less contention to make the old block/tag
> > invalid (after flushing), drop the old partition lock and then switch
> > to the new one? i.e. just hold one mapping partition lock at a time.
> > Is there a specific reason we do it this way?
> 
> I'm not sure but I guess the developer wanted to make the operation
> atomic.
> 
> Yura Sokolov is proposing a patch to separte the two partition locks.
> 
> https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru
> 
> And it seems to me viable for me and a benchmarking in the thread
> showed a good result.  I'd appreciate your input on that thread.
> 
> > 2. Possibly connected to the above, we issue BufTableInsert() BEFORE
> > we issue BufTableDelete(). That means we need extra entries in the
> > buffer mapping hash table to allow us to hold both the old and the new
> > at the same time, for a short period. The way dynahash.c works, we try
> 
> Yes.
> 
> > to allocate an entry from the freelist and if that doesn't work, we
> > begin searching ALL the freelists for free entries to steal. So if we
> > get enough people trying to do virtual I/O at the same time, then we
> > will hit a "freelist storm" where everybody is chasing the last few
> > entries. It would make more sense if we could do BufTableDelete()
> 
> To reduce that overhead, Yura proposed a surgically modification on
> dynahash, but I didn't like that and the latest patch doesn't contain
> that part.
> 
> > first, then hold onto the buffer mapping entry rather than add it to
> > the freelist, so we can use it again when we do BufTableInsert() very
> > shortly afterwards. That way we wouldn't need to search the freelist
> > at all. What is the benefit or reason of doing the Delete after the
> > Insert?
> 
> Hmm. something like hash_swap_key() or hash_reinsert_entry()?  That
> sounds reasonable. (Yura's proposal was taking out an entry from hash
> then attach it with a new key again.)
> 
> > Put that another way, it looks like BufTable functions are used in a
> > way that mismatches against the way dynahash is designed.
> > 
> > Thoughts?
> 
> On the first part, I think Yura's patch works.  On the second point,
> Yura already showed it gives a certain amount of gain if we do that.

On second thought, even if we had a new dynahash API to atomically
replace a hash key, we would need to hold two partition locks to do it.
That contradicts our objective.
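To make the two-lock requirement concrete: moving a key from one hash partition to another must hold both partition locks so there is no window where the buffer is mapped twice (or not at all), and the two locks must be taken in a fixed order to avoid deadlock against a backend doing the reverse move. The sketch below is hypothetical and deliberately simple (per-partition spinlocks on atomic_bool, not PostgreSQL's LWLock machinery); it only illustrates the ordering rule:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_PARTITIONS 128

/* One spinlock per mapping partition; zero-initialized = unlocked. */
static atomic_bool partition_lock[NUM_PARTITIONS];

static void
lock_partition(unsigned p)
{
    while (atomic_exchange(&partition_lock[p], true))
        ;                       /* spin until we own the lock */
}

static void
unlock_partition(unsigned p)
{
    atomic_store(&partition_lock[p], false);
}

/* Acquire both partitions' locks in index order (lowest first), which
 * prevents deadlock between two backends moving keys in opposite
 * directions at the same time. */
static void
lock_two_partitions(unsigned old_part, unsigned new_part)
{
    if (old_part == new_part)
        lock_partition(old_part);
    else if (old_part < new_part)
    {
        lock_partition(old_part);
        lock_partition(new_part);
    }
    else
    {
        lock_partition(new_part);
        lock_partition(old_part);
    }
}

static void
unlock_two_partitions(unsigned old_part, unsigned new_part)
{
    unlock_partition(old_part);
    if (old_part != new_part)
        unlock_partition(new_part);
}
```

This is exactly why an atomic key-replacement API would not reduce contention: it still ends up holding the same two locks at once.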

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-24 Thread Kyotaro Horiguchi
At Tue, 22 Feb 2022 17:44:01 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 22 Feb 2022 01:59:45 +0900, Fujii Masao  
> wrote in 
> > 
> > > Of the following, I think we should do (a) and (b) to make future
> > > backpatchings easier.
> > > a) Use RedoRecPtr and PriorRedoPtr after they are assigned.
> > > b) Move assignment to PriorRedoPtr into the ControlFileLock section.
> > 
> > I failed to understand how (a) and (b) can make the backpatching
> > easier. How easy to backpatch seems the same whether we apply (a) and
> > (b) or not...
> 
> That premises that the patch applied to master contains (a) and (b).
> So if it doesn't, those are not need by older branches.

I was once going to remove them.  But according to the discussion
below, the patch for back-patching is now quite close to the one for
the master branch.  So I left them alone.

> > > d) Skip call to UpdateCheckPointDistanceEstimate() when RedoRecPtr <=
> > >PriorRedoPtr.
> > 
> > But "RedoRecPtr <= PriorRedoPtr" will never happen, will it? Because a
> 
> I didn't believe that it happens. (So, it came from my
> convervativeness, or laziness, or both:p) The code dates from 2009 and
> StartupXLOG makes a concurrent checkpoint with bgwriter. But as of at
> least 9.5, StartupXLOG doesn't directly call CreateCheckPoint. So I
> think that won't happen.
> 
> So, in short, I agree to remove it or turn it into Assert().

That was a bit beside the point.  If we assume RedoRecPtr is always
larger than PriorRedoPtr, so that we don't need to check it there, we
should also remove the "if (PriorRedoPtr < RedoRecPtr)" branch just
above, which brings the patch for the back branches very close to the
one for master.  Do we want to make such a large change on back
branches?  Anyway, this version takes that approach for now.

> > if (flags & CHECKPOINT_IS_SHUTDOWN)
> > ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
> > 
> > Same as above. IMO it's safer and better to always update the state
> > (whether the state is DB_IN_ARCHIVE_RECOVERY or not) if
> > CHECKPOINT_IS_SHUTDOWN flag is passed.
> 
> That means we may exit recovery mode after ShutdownXLOG called
> CreateRestartPoint. I don't think that may happen.  So I'd rather add
> Assert ((flags&CHECKPOINT_IS_SHUTDOWN) == 0) there instaed.

So this version for v14 gets updated in the following points.

Completely removed the code path for the case where some other process
runs a simultaneous checkpoint.

Removed the condition (RedoRecPtr > PriorRedoPtr) guarding the
UpdateCheckPointDistanceEstimate() call.

Added an assertion to the recovery-end path.

# Honestly, I feel this is a bit too much for back-patching, though.

While making the patches for v12, I saw a pg_rewind test failure for
an unclear reason.  I'm still investigating, but I'm posting this for
discussion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From e983f3d4c2dbeea742aed0ef1e209e7821f6687f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 14 Feb 2022 13:04:33 +0900
Subject: [PATCH v2] Correctly update control file at the end of archive
 recovery

CreateRestartPoint performs WAL file cleanup based on the checkpoint
that has just finished in the function.  If the database has left the
DB_IN_ARCHIVE_RECOVERY state by the time the function is about to
update the control file, it refrains from updating the file at all,
then proceeds to WAL cleanup using the latest REDO LSN, which is now
inconsistent with the control file.  As a result, the subsequent
cleanup removes WAL files the control file still requires and leaves
the database unrecoverable until the next checkpoint finishes.

Along with that fix, remove a dead code path for the case where some
other process runs a simultaneous checkpoint.  It looks like a
preventive measure, but it is no longer useful because we are sure
that checkpoints are performed only by the checkpointer, except in
single-process mode.
---
 src/backend/access/transam/xlog.c | 73 ---
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 6208e123e5..ff4a90eacc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9587,6 +9587,9 @@ CreateRestartPoint(int flags)
XLogSegNo   _logSegNo;
TimestampTz xtime;
 
+   /* we don't assume concurrent checkpoint/restartpoint to run */
+   Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
/* Get a local copy of the last safe checkpoint record. */
SpinLockAcquire(&XLogCtl->info_lck);
lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -9653,7 +9656,7 @@ CreateR

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-24 Thread Kyotaro Horiguchi
At Fri, 25 Feb 2022 15:31:12 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> While making patches for v12, I see a test failure of pg_rewind for
> uncertain reason. I'm investigating that but I post this for
> discussion.

Hmm. Too stupid.  Somehow I had mistakenly removed the ratchet
condition for minRecoveryPoint.  With that restored, the same patch
works for v12.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-24 Thread Kyotaro Horiguchi
At Fri, 25 Feb 2022 16:06:31 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 25 Feb 2022 15:31:12 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > While making patches for v12, I see a test failure of pg_rewind for
> > uncertain reason. I'm investigating that but I post this for
> > discussion.
> 
> Hmm. Too stupid.  Somehow I had mistakenly removed the ratchet
> condition for minRecoveryPoint.  With that restored, the same patch
> works for v12.

So, these are the patches for pg12 through pg10.  pg11 can share the
same patch as pg12.  pg10 differs in two points.

10 has ControlFile->prevCheckPoint.

The DETAILS of the "recovery restart point at" message is not
capitalized.  But I suppose that branch is so close to EOL that we
don't want to "fix" it at the risk of breaking existing use cases.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c89e2b509723b68897f2af49a154af2a69f0747b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 25 Feb 2022 15:04:00 +0900
Subject: [PATCH v3] Correctly update control file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup based on the checkpoint that
has just finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state by the time the function is about to
update the control file, the function refrains from updating the file
at all, then proceeds to WAL cleanup using the latest REDO LSN, which
is now inconsistent with the control file.  As a result, the
succeeding cleanup procedure removes WAL files that the control file
still requires and leaves the database unrecoverable until the next
checkpoint finishes.

Along with that fix, we remove a dead code path for the case where
some other process ran a simultaneous checkpoint.  It seems to be just
a preventive measure, and it's no longer useful because we are sure
that checkpoints are performed only by the checkpointer, except in
single-process mode.
---
 src/backend/access/transam/xlog.c | 71 +++
 1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 885558f291..2b2568c475 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9334,7 +9334,7 @@ CreateRestartPoint(int flags)
 
/* Also update the info_lck-protected copy */
SpinLockAcquire(&XLogCtl->info_lck);
-   XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+   XLogCtl->RedoRecPtr = RedoRecPtr;
SpinLockRelease(&XLogCtl->info_lck);
 
/*
@@ -9350,7 +9350,10 @@ CreateRestartPoint(int flags)
if (log_checkpoints)
LogCheckpointStart(flags, true);
 
-   CheckPointGuts(lastCheckPoint.redo, flags);
+   CheckPointGuts(RedoRecPtr, flags);
+
+   /* Update pg_control */
+   LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
/*
 * Remember the prior checkpoint's redo ptr for
@@ -9358,31 +9361,28 @@ CreateRestartPoint(int flags)
 */
PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+   Assert (PriorRedoPtr < RedoRecPtr);
+
+   ControlFile->checkPoint = lastCheckPointRecPtr;
+   ControlFile->checkPointCopy = lastCheckPoint;
+
+   /* Update control file using current time */
+   ControlFile->time = (pg_time_t) time(NULL);
+
/*
-* Update pg_control, using current time.  Check that it still shows
-* IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-* this is a quick hack to make sure nothing really bad happens if 
somehow
-* we get here after the end-of-recovery checkpoint.
+* Ensure minRecoveryPoint is past the checkpoint record while archive
+* recovery is still ongoing.  Normally, this will have happened already
+* while writing out dirty buffers, but not necessarily - e.g. because 
no
+* buffers were dirtied.  We do this because a non-exclusive base backup
+* uses minRecoveryPoint to determine which WAL files must be included 
in
+* the backup, and the file (or files) containing the checkpoint record
+* must be included, at a minimum. Note that for an ordinary restart of
+* recovery there's no value in having the minimum recovery point any
+* earlier than this anyway, because redo will begin just after the
+* checkpoint record.
 */
-   LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-   ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
{
-   ControlFile->checkPoint = lastCheckPointRecPtr;
-   ControlFile->checkPointCopy = lastCheckPoint;
-   ControlFile->time = (pg_time_t) time(NULL);
-
-   /*
-* Ensure minRecoveryPoint is pas

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-24 Thread Kyotaro Horiguchi
At Fri, 25 Feb 2022 16:47:01 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> So, these are the patches for pg12 through pg10.  pg11 can share the
> same patch as pg12.  pg10 differs in two points.
> 
> 10 has ControlFile->prevCheckPoint.
> 
> The DETAILS of the "recovery restart point at" message is not
> capitalized.  But I suppose that branch is so close to EOL that we
> don't want to "fix" it at the risk of breaking existing use cases.

Ugh! Wait for a moment. Something's wrong.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-25 Thread Kyotaro Horiguchi
At Fri, 25 Feb 2022 16:52:30 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Ugh! Wait for a moment. Something's wrong.

Sorry, what was wrong was my working directory; it had been broken by
a bogus operation of mine.  All the files apply to their corresponding
versions correctly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Kyotaro Horiguchi
At Fri, 25 Feb 2022 00:04:55 -0800, Andres Freund  wrote in 
> Why don't you just use LockBufHdr/UnlockBufHdr?

FWIW, v2 looked fine to me in regards to this point.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2022-02-27 Thread Kyotaro Horiguchi
At Sat, 26 Feb 2022 12:11:15 +0900, Michael Paquier  wrote 
in 
> On Fri, Feb 25, 2022 at 01:09:53PM -0800, Nathan Bossart wrote:
> > This one has been quiet for a while.  Should we mark it as
> > returned-with-feedback?
> 
> Yes, that's my feeling and I got cold feet about this change.  This
> patch would bring some extra visibility for something that's not
> incorrect either on HEAD, as end-of-recovery checkpoints are the same
> things as shutdown checkpoints.  And there is an extra argument that
> back-patching would become a bit more tricky in an area that's already
> quite sensitive.

That sounds like we should reject the patch, as we don't agree with
its objective.  If someday end-of-recovery checkpoints functionally
diverge from shutdown checkpoints but (somehow) leave the transition
alone, we may revisit this, but that would be another proposal.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: definition of CalculateMaxmumSafeLSN

2022-02-28 Thread Kyotaro Horiguchi
At Mon, 28 Feb 2022 17:01:10 +0300, Sergei Kornilov  wrote in 
> Hello
> I just spotted in src/include/access/xlog.h:
> extern XLogRecPtr CalculateMaxmumSafeLSN(void);
> 
> This function doesn't seem to be used anywhere or even defined? "git grep 
> CalculateMaxmumSafeLSN" shows only this line.
> Was added in c6550776394e25c1620bc8258427c8f1d448080d "Allow users to limit 
> storage reserved by replication slots"

Hmm.  I think I remember that name.  You're right; it is a function
that was once added during development of that commit but was
eventually removed.

So it should be removed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Use "WAL segment" instead of "log segment" consistently in user-facing messages

2022-02-28 Thread Kyotaro Horiguchi
At Mon, 28 Feb 2022 21:03:07 +0530, Bharath Rupireddy 
 wrote in 
> Hi,
> 
> It looks like we use "log segment" in various user-facing messages.
> The term "log" can mean server logs as well. The "WAL segment" suits
> well here and it is consistently used across the other user-facing
> messages [1].
> 
> Here's a small patch attempting to consistently use the "WAL segment".
> 
> Thoughts?

I tend to agree with this.  I also see "log record(s)" (without the
"write-ahead" prefix) in many places, especially in the documentation.
I'm not sure how we should treat "WAL log", though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-place persistence change of a relation

2022-02-28 Thread Kyotaro Horiguchi
Rebased on a recent xlog refactoring.

No functional changes have been made.

- Removed the default case in smgr_desc since it seems to me we don't
 assume out-of-definition values in xlog records elsewhere.

- Simplified some code added to storage.c.

- Fixed copy-paste errors in comments in extractPageInfo().

- The previous version of smgrDoPendingCleanups() assumed that init
  forks are not loaded into shared buffers, but that is wrong
  (SetRelationBuffersPersistence assumes the opposite).  Thus we need
  to drop buffers before unlinking an init fork.  But that is already
  guaranteed by the logic, so I rewrote the comment for PCOP_UNLINK_FORK.

  > * Unlink the fork file. Currently we use this only for
  > * init forks and we're sure that the init fork is not
  > * loaded on shared buffers.  For RelationDropInitFork
  > * case, the function dropped that buffers. For
  > * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
  > * is set and the buffers have been dropped just before.
  
  This logic has the same critical window as
  DropRelFilenodeBuffers.  That is, if file deletion fails after the
  buffers have been successfully dropped, the file content of the init
  fork may theoretically be stale.  However, AFAICS the init fork is a
  write-once fork, so I don't think that actually matters.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 420a9d9a0dae3bcfb1396c14997624ad67a3e557 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v18 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large
amount of file I/O.  This patch makes the command run without a heap
rewrite.  In addition, SET LOGGED while wal_level > minimal emits WAL
using XLOG_FPI instead of a massive number of HEAP_INSERTs, which
should be smaller.

This also allows for the cleanup of files left behind by the crash of
the transaction that created them.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |   9 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/catalog/storage.c | 548 +-
 src/backend/commands/tablecmds.c  | 266 +--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  86 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 ++
 src/backend/storage/smgr/md.c |  94 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 23 files changed, 1459 insertions(+), 182 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..f8908e2c0a 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			i

Re: Allow async standbys wait for sync replication

2022-02-28 Thread Kyotaro Horiguchi
(Now I understand what "async" means here...)

At Mon, 28 Feb 2022 22:05:28 -0800, Nathan Bossart  
wrote in 
> On Tue, Mar 01, 2022 at 11:10:09AM +0530, Bharath Rupireddy wrote:
> > On Tue, Mar 1, 2022 at 12:27 AM Nathan Bossart  
> > wrote:
> >> My feedback is specifically about this behavior.  I don't think we should
> >> spin in XLogSend*() waiting for an LSN to be synchronously replicated.  I
> >> think we should just choose the SendRqstPtr based on what is currently
> >> synchronously replicated.
> > 
> > Do you mean something like the following?
> > 
> > /* Main loop of walsender process that streams the WAL over Copy messages. 
> > */
> > static void
> > WalSndLoop(WalSndSendDataCallback send_data)
> > {
> > /*
> >  * Loop until we reach the end of this timeline or the client requests 
> > to
> >  * stop streaming.
> >  */
> > for (;;)
> > {
> > if (am_async_walsender && there_are_sync_standbys)
> > {
> >  XLogRecPtr SendRqstLSN;
> >  XLogRecPtr SyncFlushLSN;
> > 
> > SendRqstLSN = GetFlushRecPtr(NULL);
> > LWLockAcquire(SyncRepLock, LW_SHARED);
> > SyncFlushLSN = walsndctl->lsn[SYNC_REP_WAIT_FLUSH];
> > LWLockRelease(SyncRepLock);
> > 
> > if (SendRqstLSN > SyncFlushLSN)
> >continue;
> > }

The current trend is energy saving.  We never add a "wait for some
fixed time, exit if the condition is met, otherwise repeat" loop for
this kind of purpose where there's no guarantee that the loop exits
shortly.  Concretely, we ought to rely on condition variables to do
that.

> Not quite.  Instead of "continue", I would set SendRqstLSN to SyncFlushLSN
> so that the WAL sender only sends up to the current synchronously

I'm not sure, but doesn't that make walsender falsely believe it has
caught up to the bleeding edge of WAL?

> replicated LSN.  TBH there are probably other things that need to be
> considered (e.g., how do we ensure that the WAL sender sends the rest once
> it is replicated?), but I still think we should avoid spinning in the WAL
> sender waiting for WAL to be replicated.

It seems to me it would be something similar to
SyncRepReleaseWaiters().  Or it might be possible to consolidate this
feature into that function; I'm not sure, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-place persistence change of a relation

2022-03-01 Thread Kyotaro Horiguchi
At Tue, 01 Mar 2022 14:14:13 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> - Removed the default case in smgr_desc since it seems to me we don't
>  assume out-of-definition values in xlog records elsewhere.

Stupid.  The compiler in the CI environment complains about an
uninitialized variable even though it (presumably) knows that all
paths of the switch statement set the variable.  Added a default value
to try to silence the compiler.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From ec75c49ffd939f6db8e0d840ef043c18845d1b9d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v19 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large
amount of file I/O.  This patch makes the command run without a heap
rewrite.  In addition, SET LOGGED while wal_level > minimal emits WAL
using XLOG_FPI instead of a massive number of HEAP_INSERTs, which
should be smaller.

This also allows for the cleanup of files left behind by the crash of
the transaction that created them.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |   9 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/catalog/storage.c | 548 +-
 src/backend/commands/tablecmds.c  | 266 +--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  86 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 ++
 src/backend/storage/smgr/md.c |  94 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/bin/pg_rewind/pg_rewind.c |   1 -
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 24 files changed, 1459 insertions(+), 183 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..225ffbafef 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..2ecd8c8c7c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -725,6 +725,15 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+The Smgr MARK files
+
+
+An smgr mark file is created when a new relation file is created to
+mark the relfilenode needs to be cleaned up at recovery time.  In
+contrast to the four

Re: Make message at end-of-recovery less scary.

2022-03-01 Thread Kyotaro Horiguchi
At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma  
wrote in 
> The changes looks good. thanks.!

Thanks!

A recent core change altered WAL insertion speed during the TAP test
and revealed one forgotten case of EndOfWAL.  When a record header
flows into the next page, XLogReadRecord does a separate check by
itself instead of using ValidXLogRecordHeader.

>* If the whole record header is on this page, validate it immediately.
>* Otherwise do just a basic sanity check on xl_tot_len, and validate 
> the
>* rest of the header after reading it from the next page.  The 
> xl_tot_len
>* check is necessary here to ensure that we enter the "Need to 
> reassemble
>* record" code path below; otherwise we might fail to apply
>* ValidXLogRecordHeader at all.
>*/
>   if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
>   {
...
>}
>   else
>   {
>   /* XXX: more validation should be done here */
>   if (total_len < SizeOfXLogRecord)
>   {

I could simply copy in a part of ValidXLogRecordHeader there, but that
results in a rather large amount of duplicated code.  I could have
ValidXLogRecordHeader handle the partial-header case, but that seems
complex to me.

So in this version I split the xl_tot_len part of
ValidXLogRecordHeader into ValidXLogRecordLength.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 01cce076d2b3ad536398cc2b716ef64ed9b2c409 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v15] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a slightly scary error message
like "invalid record length" that suggests something serious is
happening.  Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary, as "reached end of WAL".  Instead, raise
the error level for other kinds of WAL failures to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 125 ++
 src/backend/access/transam/xlogrecovery.c |  92 
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 ++-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++
 6 files changed, 297 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..ba1c1ece87 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 			 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -380,12 +384,11 @@ restart:
 	 * whole header.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Otherwise do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
@@ -399,18 +402,13 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-  "invalid record length at %X/%X: wanted %u, got %u",
-  LSN_FORMAT_ARGS(RecPtr),
-  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record

Re: more descriptive message for process termination due to max_slot_wal_keep_size

2022-03-01 Thread Kyotaro Horiguchi
At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> So what do you say if I propose the following?
> 
> LOG:  terminating process %d to release replication slot \"%s\"
> because its restart_lsn %X/%X exceeds the limit %X/%X
> HINT: You might need to increase max_slot_wal_keep_size.

This version emits the following message.

[35785:checkpointer] LOG:  terminating process 36368 to release replication 
slot "s1" because its restart_lsn 0/1F000148 exceeds the limit 0/2100
[35785:checkpointer] HINT:  You might need to increase max_slot_wal_keep_size.
[36368:walsender] FATAL:  terminating connection due to administrator command
[36368:walsender] STATEMENT:  START_REPLICATION SLOT "s1" 0/1F00 TIMELINE 1
[35785:checkpointer] LOG:  invalidating slot "s1" because its restart_lsn 
0/1F000148 exceeds the limit 0/2100
[35785:checkpointer] HINT:  You might need to increase max_slot_wal_keep_size.

We can omit the HINT line from the termination log for non-persistent
slots, but I don't think we want to bother with that considering its
low frequency.

The CI was confused by the mixed patches for multiple PG versions.  In
this version, the patchset for master is attached as .patch and the
one for PG13 as .txt.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 4909db2893f0b33ab6bca1ffc3ad802cd159c577 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 24 Dec 2021 12:52:07 +0900
Subject: [PATCH v4 1/2] Make a message on process termination more descriptive

The message at process termination due to the slot limit doesn't
provide the reason.  In the major scenario the message is followed by
another message about slot invalidation, which shows the detail of the
termination.  However, the second message is missing if the slot is a
temporary one.

Augment the first message with the same reason as the second message.

Backpatch through to 13, where the message was introduced.

Reported-by: Alex Enachioaie 
Author: Kyotaro Horiguchi 
Reviewed-by: Ashutosh Bapat 
Reviewed-by: Bharath Rupireddy 
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
Backpatch-through: 13
---
 src/backend/replication/slot.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index caa6b29756..92f19bcb35 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,8 +1300,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 			if (last_signaled_pid != active_pid)
 			{
 ereport(LOG,
-		(errmsg("terminating process %d to release replication slot \"%s\"",
-active_pid, NameStr(slotname;
+		(errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+active_pid, NameStr(slotname),
+LSN_FORMAT_ARGS(restart_lsn;
 
 (void) kill(active_pid, SIGTERM);
 last_signaled_pid = active_pid;
-- 
2.27.0

>From f31014f85c7dec160bd42e1c48f1c28a7dc22af2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 24 Dec 2021 12:58:25 +0900
Subject: [PATCH v4 2/2] Add detailed information to slot-invalidation messages

The message says just "exceeds max_slot_wal_keep_size" but doesn't
tell the concrete LSN of the limit.  That information helps operators
understand the issue.

Author: Kyotaro Horiguchi 
Reviewed-by: Ashutosh Bapat 
Reviewed-by: Masahiko Sawada 
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
---
 src/backend/replication/slot.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 92f19bcb35..be21b62add 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,9 +1300,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 			if (last_signaled_pid != active_pid)
 			{
 ereport(LOG,
-		(errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+		(errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X",
 active_pid, NameStr(slotname),
-LSN_FORMAT_ARGS(restart_lsn;
+LSN_FORMAT_ARGS(restart_lsn),
+LSN_FORMAT_ARGS(oldestLSN)),
+		 errhint("You might need to increase max_slot_wal_keep_size.")));
 
 (void) kill(active_pid, SIGTERM);
 last_signaled_pid = active_pid;
@@ -1345,9 +1347,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 			ReplicationSlotRelease();
 
 			ereport(LOG,
-	(errmsg("invalidating slot \&qu

Re: Add the replication origin name and commit-LSN to logical replication worker errcontext

2022-03-01 Thread Kyotaro Horiguchi
At Wed, 2 Mar 2022 14:39:54 +0900, Masahiko Sawada  
wrote in 
> On Wed, Mar 2, 2022 at 11:55 AM Amit Kapila  wrote:
> >
> > On Mon, Feb 28, 2022 at 5:46 PM Masahiko Sawada  
> > wrote:
> > >
> > > I've attached two patches: the first one changes
> > > apply_error_callback() so that it uses complete sentences with if-else
> > > blocks in order to have a translation work,
> > >
> >
> > This is an improvement over what we have now but I think this is still
> > not completely correct as per message translation rules:
> > + else
> > + errcontext("processing remote data during \"%s\" in transaction %u at %s",
> > +logicalrep_message_type(errarg->command),
> > +errarg->remote_xid,
> > +(errarg->ts != 0) ? timestamptz_to_str(errarg->ts) : "(not-set)");
> >
> > As per guidelines [1][2], we don't prefer to construct messages at
> > run-time aka we can do better for the following part: +(errarg->ts
> > != 0) ? timestamptz_to_str(errarg->ts) : "(not-set)". I think we need
> > to use if-else here to split it further. If you agree, then the same
> > needs to be dealt with in other parts of the patch as well.
> 
> I intended to use "(not-set)" as a value rather than a word in the
> sentence so I think it doesn't violate the guidelines. We can split it
> further as you suggested but we will end up having more if-else
> branches.

Splitting it further seems to me to be exactly our usual way.  In the
first place, I doubt "(not-set)" fits in the place of a timestamp even
in English.

Moreover, if we (I?) translated the message into Japanese, it would
look like:

CONTEXT: (not-set)にトランザクション 2352314 内で "TRUNCATE" でのリモートデータの処理中

I don't think that looks fine.  Translating "(not-set)" makes things
even worse.

CONTEXT: (非設定)にトランザクション 2352314 内で "TRUNCATE" でのリモートデータの処理中

Yes, I can alleviate that strangeness a bit by rewording it, but it
still looks odd.

CONTEXT: 時刻(非設定)、トランザクション 2352314 内で "TRUNCATE" でのリモートデータの処理中

Rather, I'd prefer simply to list the attributes.

CONTEXT: processing remote data during "MESSAGE". Transaction (unknown). Time 
(unknown), replication target relation (unknown), column (unknown)
CONTEXT: "MESSAGE"でのリモートデータの処理中. トランザクション (不明). 時刻 (不明), レプリケーション対象リレーション (不明), 
column (不明)


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-03-01 Thread Kyotaro Horiguchi
At Thu, 20 Jan 2022 17:19:04 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 20 Jan 2022 15:07:22 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> CI now likes this version for all platforms.

An xlog.c refactoring that happened recently hit this.
Just rebased onto the change.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 35958b17c62cd14f81efa26a097c32c273028f77 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Nov 2021 20:42:00 +0900
Subject: [PATCH v16 1/3] Add tablespace support to TAP framework

The TAP framework doesn't support nodes that have tablespaces.  In
particular, backup and initialization from backups fail if the source
node has tablespaces.  This commit provides a simple way to create
tablespace directories and allows backup routines to handle
tablespaces.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 264 ++-
 src/test/perl/PostgreSQL/Test/Utils.pm   |  43 
 2 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index be05845248..15d57b9a71 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -298,6 +298,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([, nocreate])
+
+Directory to store tablespace directories.
+If nocreate is true, returns undef if not yet created.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return undef if no tablespace is used
+	return undef if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -313,6 +371,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns the tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Create the tablespace location directory for backup_name if it does not exist yet.
+Create the parent tablespace storage that holds all location
+directories if backup_name is not supplied.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path $dir if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a reference to a hash from tablespace OID to the tablespace
+directory name that the specified backup has.
+For example, an oid 16384 pointing to ../tsps/backup1/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	#return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return \%ret;
+}
+
+=pod
+
 =item $node->install_path()
 
 The configured install path (if any) for the node.
@@ -370,6 +499,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
 	print $fh "Backup directory: " . $self->backup_dir . "\n";
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
+	print $fh "Tablespace directory:

Re: pg_stop_backup() v2 incorrectly marked as proretset

2022-03-02 Thread Kyotaro Horiguchi
At Wed, 2 Mar 2022 16:46:01 +0900, Michael Paquier  wrote 
in 
> Hi all,
> 
> In my hunt looking for incorrect SRFs, I have noticed a new case of a
> system function marked as proretset while it builds and returns only
> one record.  And this is a popular one: pg_stop_backup(), labelled
> v2.
> 
> This leads to a lot of unnecessary work, as the function creates a
> tuplestore it has no need for with the usual set of checks related to
> SRFs.  The logic can be be simplified as of the attached.
> 
> Thoughts?

That direction seems fine to me.

But the patch forgets to remove a now-useless variable.

>   /* Initialise attributes information in the tuple descriptor */
>   tupdesc = CreateTemplateTupleDesc(PG_STOP_BACKUP_V2_COLS);
>   TupleDescInitEntry(tupdesc, (AttrNumber) 1, "lsn",
>  PG_LSNOID, -1, 0);

I think we can use get_call_result_type() here.
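
A sketch of what that replacement might look like, for reference
(backend fragment, not compilable standalone; get_call_result_type()
derives the descriptor from the function's pg_proc entry rather than
building it by hand):

```c
TupleDesc	tupdesc;

/* Build the result descriptor from the catalogs instead of by hand */
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
	elog(ERROR, "return type must be a row type");
```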

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Two noncritical bugs of pg_waldump

2022-03-02 Thread Kyotaro Horiguchi
At Fri, 25 Feb 2022 10:48:47 -0800, Andres Freund  wrote in 
> Hi,
> 
> On 2022-02-14 18:18:47 +0900, Kyotaro Horiguchi wrote:
> > > If I give an empty file to the tool it complains as the follows.
> > > 
> > > > pg_waldump: fatal: could not read file "hoge": No such file or directory
> > > 
> > > No, the file exists.  The cause is it reads uninitialized errno to
> > > detect errors from the system call.  read(2) is defined to set errno
> > > always when it returns -1 and doesn't otherwise. Thus it seems to me
> > > that it is better to check that the return value is less than zero
> > > than to clear errno before the call to read().
> > 
> > So I post a patch contains only the indisputable part.
> 
> Thanks for the report and fix. Pushed. This was surprisingly painful, all but
> one branch had conflicts...

Ah, I didn't expect that this would be committed so quickly.  I should
have created patches for all versions.  Anyway, thanks for committing
this!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: more descriptive message for process termination due to max_slot_wal_keep_size

2022-03-02 Thread Kyotaro Horiguchi
At Wed, 02 Mar 2022 15:37:19 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> The CI was confused by the mixed patches for multiple PG versions. In
> this version the patchset for master are attached as .patch and that
> for PG13 as .txt.

Yeah, of course the relevant check should be fixed.  The attached v5
adjusts 019_replslot_limit.pl.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 4909db2893f0b33ab6bca1ffc3ad802cd159c577 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 24 Dec 2021 12:52:07 +0900
Subject: [PATCH v5 1/2] Make a message on process termination more descriptive

The message at process termination due to the slot limit doesn't
provide the reason.  In the major scenario the message is followed by
another message about slot invalidation, which shows the detail of the
termination.  However, the second message is missing if the slot is a
temporary one.

Augment the first message with the same reason as the second message.

Backpatch through to 13 where the message was introduced.

Reported-by: Alex Enachioaie 
Author: Kyotaro Horiguchi 
Reviewed-by: Ashutosh Bapat 
Reviewed-by: Bharath Rupireddy 
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
Backpatch-through: 13
---
 src/backend/replication/slot.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index caa6b29756..92f19bcb35 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,8 +1300,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 			if (last_signaled_pid != active_pid)
 			{
 ereport(LOG,
-		(errmsg("terminating process %d to release replication slot \"%s\"",
-active_pid, NameStr(slotname;
+		(errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+active_pid, NameStr(slotname),
+LSN_FORMAT_ARGS(restart_lsn;
 
 (void) kill(active_pid, SIGTERM);
 last_signaled_pid = active_pid;
-- 
2.27.0

>From 7a9e571e8cf4e5ffc13d101733c1b6fc455b1aec Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 24 Dec 2021 12:58:25 +0900
Subject: [PATCH v5 2/2] Add detailed information to slot-invalidation messages

The message says just "exceeds max_slot_wal_keep_size" but doesn't
tell the concrete LSN of the limit.  That information helps operators
understand the issue.

Author: Kyotaro Horiguchi 
Reviewed-by: Ashutosh Bapat 
Reviewed-by: Masahiko Sawada 
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
---
 src/backend/replication/slot.c| 12 
 src/test/recovery/t/019_replslot_limit.pl |  2 +-
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 92f19bcb35..be21b62add 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,9 +1300,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 			if (last_signaled_pid != active_pid)
 			{
 ereport(LOG,
-		(errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+		(errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X",
 active_pid, NameStr(slotname),
-LSN_FORMAT_ARGS(restart_lsn;
+LSN_FORMAT_ARGS(restart_lsn),
+LSN_FORMAT_ARGS(oldestLSN)),
+		 errhint("You might need to increase max_slot_wal_keep_size.")));
 
 (void) kill(active_pid, SIGTERM);
 last_signaled_pid = active_pid;
@@ -1345,9 +1347,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 			ReplicationSlotRelease();
 
 			ereport(LOG,
-	(errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+	(errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X",
 			NameStr(slotname),
-			LSN_FORMAT_ARGS(restart_lsn;
+			LSN_FORMAT_ARGS(restart_lsn),
+			LSN_FORMAT_ARGS(oldestLSN)),
+	 errhint("You might need to increase max_slot_wal_keep_size.")));
 
 			/* done with this slot for now */
 			break;
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 9bb71b62c0..bac496a9cf 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -188,7 +188,7 @@ for (my $i = 0; $i < 1; $i++)
 {
 	if (find_in_log(
 			$node_primary,
-			"invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds max_slot_wal_keep_size",
+			"invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds the limit [0-9A-F/]+",
 			$logstart))
 	{
 		$invalidated = 1;
-- 
2.27.0



Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-03-02 Thread Kyotaro Horiguchi
At Wed, 02 Mar 2022 16:59:09 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 20 Jan 2022 17:19:04 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Thu, 20 Jan 2022 15:07:22 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > CI now likes this version for all platforms.
> 
> An xlog.c refactoring happend recently hit this.
> Just rebased on the change.

A function added to Util.pm used perl2host, which has been removed
recently.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From bb714659adcde5265974c46b061966e5dfc556be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Nov 2021 20:42:00 +0900
Subject: [PATCH v17 1/3] Add tablespace support to TAP framework

The TAP framework doesn't support nodes that have tablespaces.  In
particular, backup and initialization from backups fail if the source
node has tablespaces.  This commit provides a simple way to create
tablespace directories and allows backup routines to handle
tablespaces.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 264 ++-
 src/test/perl/PostgreSQL/Test/Utils.pm   |  42 
 2 files changed, 304 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index be05845248..15d57b9a71 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -298,6 +298,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([, nocreate])
+
+Directory to store tablespace directories.
+If nocreate is true, returns undef if not yet created.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return undef if no tablespace is used
+	return undef if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -313,6 +371,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns the tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Create the tablespace location directory for backup_name if it does not exist yet.
+Create the parent tablespace storage that holds all location
+directories if backup_name is not supplied.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path $dir if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a reference to a hash from tablespace OID to the tablespace
+directory name that the specified backup has.
+For example, an oid 16384 pointing to ../tsps/backup1/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	#return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return \%ret;
+}
+
+=pod
+
 =item $node->install_path()
 
 The configured install path (if any) for the node.
@@ -370,6 +499,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
	print $fh "Backup directory: " . $self->backup_dir . "\n";

Re: Add CHECKPOINT_REQUESTED flag to the log message in LogCheckpointStart()

2022-03-02 Thread Kyotaro Horiguchi
At Wed, 2 Mar 2022 18:18:10 +0530, Bharath Rupireddy 
 wrote in 
> On Wed, Mar 2, 2022 at 5:41 PM Nitin Jadhav
>  wrote:
> >
> > Hi,
> >
> > I have noticed that the CHECKPOINT_REQUESTED flag information is not
> > present in the log message of LogCheckpointStart() function. I would
> > like to understand if it was missed or left intentionally. The log
> > message describes all the possible checkpoint flags except
> > CHECKPOINT_REQUESTED flag. I feel we should support this. Thoughts?
> 
> I don't think that's useful. Being in LogCheckpointStart
> (CreateCheckPoint or CreateRestartPoint) itself means that somebody
> has requested a checkpoint. Having CHECKPOINT_REQUESTED doesn't add
> any value.

Agreed.

> I would suggest removing the CHECKPOINT_REQUESTED flag as it's not
> being used anywhere instead CheckpointerShmem->ckpt_flags is used as
> an indication of the checkpoint requested in CheckpointerMain [1]. If

Actually no caller does, but RequestCheckpoint() accepts 0 as flags.
The checkpointer would be a bit more complex without
CHECKPOINT_REQUESTED.  I don't think getting rid of the flag value
does us any good.

> others don't agree to remove as it doesn't cause any harm, then, I
> would  add something like this for more readability:

 if (((volatile CheckpointerShmemStruct *)
-  CheckpointerShmem)->ckpt_flags)
+  CheckpointerShmem)->ckpt_flags & CHECKPOINT_REQUESTED)

I don't particularly object to this, but I don't think that change
makes the code significantly easier to read either.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Changing "Hot Standby" to "hot standby"

2022-03-02 Thread Kyotaro Horiguchi
At Wed, 2 Mar 2022 15:22:44 +, "Daniel Westermann (DWE)" 
 wrote in 
> > Pretty sure that for titles we should keep English capitalization rules.
> 
> Done like that. Thanks for taking a look.


-Hot Standby feedback propagates upstream, whatever the cascaded 
arrangement.
+hot standby feedback propagates upstream, whatever the cascaded arrangement


-Hot Standby is the term used to describe the ability to connect to
+hot standby is the term used to describe the ability to connect to


They look like they decapitalize the first word of a sentence.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-02 Thread Kyotaro Horiguchi
ID_PARAMETER_VALUE),
>  errmsg("WAL record start LSN must be less than end LSN")));

I don't think those validations are worth doing, at least for the
first two, as long as that value has a special meaning.  I would see
it as useful if pg_get_wal_records_info() meant dumping all the
available records at the moment, or the records of the last segment
or page, or something similar.

> > ==
> >
> >
> > +   if (loc <= read_upto)
> > +   break;
> > +
> > +   /* Let's not wait for WAL to be available if
> > indicated */
> > +   if (loc > read_upto &&
> > +   state->private_data != NULL)
> > +   {
> >
> > Why loc > read_upto? The first if condition is (loc <= read_upto)
> > followed by the second if condition - (loc > read_upto). Is the first
> > if condition (loc <= read_upto) not enough to indicate that loc >
> > read_upto?
> 
> Yeah, that's unnecessary, I improved the comment there and removed loc
> > read_upto.
> 
> > ==
> >
> > +#define IsEndOfWALReached(state) \
> > +   (state->private_data != NULL && \
> > +   (((ReadLocalXLOGPage2Private *)
> > xlogreader->private_data)->no_wait == true) && \
> > +   (((ReadLocalXLOGPage2Private *)
> > xlogreader->private_data)->reached_end_of_wal == true))
> >
> > I think we should either use state or xlogreader. First line says
> > state->private_data and second line xlogreader->private_data.
> 
> I've changed it to use state instead of xlogreader.
> 
> > ==
> >
> > +   (((ReadLocalXLOGPage2Private *)
> > xlogreader->private_data)->reached_end_of_wal == true))
> > +
> >
> > There is a new patch coming to make the end of WAL messages less
> > scary. It introduces the EOW flag in xlogreaderstate maybe we can use
> > that instead of introducing new flags in private area to represent the
> > end of WAL.
> 
> Yeah that would be great. But we never know which one gets committed
> first. Until then it's not good to have dependencies on two "on-going"
> patches. Later, we can change.
> 
> > ==
> >
> > +/*
> > + * XLogReaderRoutine->page_read callback for reading local xlog files
> > + *
> > + * This function is same as read_local_xlog_page except that it works in 
> > both
> > + * wait and no wait mode. The callers can specify about waiting in 
> > private_data
> > + * of XLogReaderState.
> > + */
> > +int
> > +read_local_xlog_page_2(XLogReaderState *state, XLogRecPtr targetPagePtr,
> > +  int reqLen, XLogRecPtr
> > targetRecPtr, char *cur_page)
> > +{
> > +   XLogRecPtr  read_upto,
> >
> > Do we really need this function? Can't we make use of an existing WAL
> > reader function - read_local_xlog_page()?
> 
> I clearly explained the reasons upthread [2]. Please let me know if
> you have more thoughts/doubts here, we can connect offlist.

*I* also think the function is not needed, as explained above.  Why do
we need that function when we already know how far we can read WAL
records *before* calling it?

> Attaching v6 patch set with above review comments addressed. Please
> review it further.
> 
> [1] 
> https://www.postgresql.org/message-id/CAE9k0P%3D9SReU_613TXytZmpwL3ZRpnC5zrf96UoNCATKpK-UxQ%40mail.gmail.com
> +PG_FUNCTION_INFO_V1(pg_get_raw_wal_record);
> +PG_FUNCTION_INFO_V1(pg_get_first_valid_wal_record_lsn);
> +PG_FUNCTION_INFO_V1(pg_verify_raw_wal_record);
> +PG_FUNCTION_INFO_V1(pg_get_wal_record_info);
> +PG_FUNCTION_INFO_V1(pg_get_wal_records_info);
> 
> I think we should allow all these functions to be executed in wait and
> *nowait* mode. If a user specifies nowait mode, the function should
> return if no WAL data is present, rather than waiting for new WAL data
> to become available, default behaviour could be anything you like.
> 
> [2] 
> https://www.postgresql.org/message-id/CALj2ACUtqWX95uAj2VNJED0PnixEeQ%3D0MEzpouLi%2Bzd_iTugRA%40mail.gmail.com
> I've added a new function read_local_xlog_page_2 (similar to
> read_local_xlog_page but works in wait and no wait mode) and the
> callers can specify whether to wait or not wait using private_data.
> Actually, I wanted to use the private_data structure of
> read_local_xlog_page but the logical decoding already has context as
> private_data, that is why I had to have a new function. I know it
> creates a bit of duplicate code, but its cleaner than using
> backend-local variables or additional flags in XLogReaderState or
> adding wait/no-wait boolean to page_read callback. Any other
> suggestions are welcome here.
> 
> With this, I'm able to have wait/no wait versions for any functions.
> But for now, I'm having wait/no wait for two functions
> (pg_get_wal_records_info and pg_get_wal_stats) for which it makes more
> sense.
> 
> Regards,
> Bharath Rupireddy.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Why do spgbuildempty(), btbuildempty(), and blbuildempty() use smgrwrite()?

2022-03-02 Thread Kyotaro Horiguchi
At Wed, 2 Mar 2022 20:07:14 -0500, Melanie Plageman  
wrote in 
> If you enable the CHECK_WRITE_VS_EXTEND-protected assert in mdwrite(),
> you'll trip it anytime you exercise btbuildempty(), blbuildempty(), or
> spgbuildempty().
> 
> In this case, it may not make any meaningful difference if smgrwrite()
> or smgrextend() is called (_mdfd_getseg() behavior won't differ even
> with the different flags, so really only the FileWrite() wait event will
> be different).
> However, it seems like it should still be changed to call smgrextend().
> Or, since they only write a few pages, why not use shared buffers like
> gistbuildempty() and brinbuildempty() do?
> 
> I've attached a patch to move these three into shared buffers.
> 
> And, speaking of spgbuildempty(), there is no test exercising it in
> check-world, so I've attached a patch to do so. I wasn't sure if it
> belonged in spgist.sql or create_index_spgist.sql (on name alone, seems
> like the latter but based on content, seems like the former).
> 
> - Melanie

I didn't dig into your specific issue, but I'm mildly opposed to
moving them into shared buffers.  They are cold images of the init
fork, which is of no use to active servers.  Rather, I'd like to move
brinbuildempty out of shared buffers, considering one of the patches
I'm proposing.
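
For context, the shared-buffers route that brinbuildempty() already
takes looks roughly like the following (backend fragment, abridged from
memory and not compilable standalone); the question in this thread is
whether btree/bloom/spgist should adopt it or keep writing the init
fork directly via smgr:

```c
Buffer		metabuf;

/* initialize the metapage of the init fork through shared buffers */
metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);

START_CRIT_SECTION();
brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
				   BRIN_CURRENT_VERSION);
MarkBufferDirty(metabuf);
log_newpage_buffer(metabuf, true);
END_CRIT_SECTION();

UnlockReleaseBuffer(metabuf);
```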

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add CHECKPOINT_REQUESTED flag to the log message in LogCheckpointStart()

2022-03-02 Thread Kyotaro Horiguchi
At Thu, 3 Mar 2022 10:27:10 +0900, Michael Paquier  wrote 
in 
> On Thu, Mar 03, 2022 at 09:39:37AM +0900, Kyotaro Horiguchi wrote:
> > At Wed, 2 Mar 2022 18:18:10 +0530, Bharath Rupireddy 
> >  wrote in 
> >> I don't think that's useful. Being in LogCheckpointStart
> >> (CreateCheckPoint or CreateRestartPoint) itself means that somebody
> >> has requested a checkpoint. Having CHECKPOINT_REQUESTED doesn't add
> >> any value.
> > 
> > Agreed.
> 
> Exactly my impression.  This would apply now to the WAL shutdown code
> paths, and I'd suspect that the callers of CreateCheckPoint() are not
> going to increase soon.  The point is: the logs already provide some
> contexts for any of those callers so I see no need for this additional
> information.
> 
> > Actually no one does but RequestCheckpoint() accepts 0 as flags.
> > Checkpointer would be a bit more complex without CHECKPOINT_REQUESTED.
> > I don't think it does us any good to get rid of the flag value.
> 
> I'd rather keep this code as-is.

I fail to catch the nuance of that phrase, so just for clarification:
in short, I think we should keep the existing code as-is.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Changing "Hot Standby" to "hot standby"

2022-03-02 Thread Kyotaro Horiguchi
At Thu, 3 Mar 2022 06:55:43 +, "Daniel Westermann (DWE)" 
 wrote in 
> Hi Kyotaro,
> 
> >>    
> >>-    Hot Standby is the term used to describe the ability to connect to
> >>+    hot standby is the term used to describe the ability to connect to
> 
> >They look like decapitalizing the first word in a sentsnce.
> 
> Thanks for having a look. Are you suggesting to change it like this?
> -Hot Standby is the term used to describe the ability to connect to
> +Hot standby is the term used to describe the ability to connect to

Yes.  Isn't that the right form for a sentence?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector

2022-03-03 Thread Kyotaro Horiguchi
At Wed, 2 Mar 2022 18:16:00 -0800, Andres Freund  wrote in 
> Hi,
> 
> On 2021-07-26 18:27:54 -0700, Andres Freund wrote:
> > I had intended to post a rebase by now. But while I did mostly finish
> > that (see [1]) I unfortunately encountered a new issue around
> > partitioned tables, see [2]. Currently I'm hoping for a few thoughts on
> > that thread about the best way to address the issues.
> 
> Now that 
> https://postgr.es/m/20220125063131.4cmvsxbz2tdg6g65%40alap3.anarazel.de
> is resolved, here's a rebased version. With a good bit of further cleanup.
> 
> One "big" thing that I'd like to figure out is a naming policy for the
> different types prefixed with PgStat. We have different groups of types:
> 
> - "pending statistics", that are accumulated but not yet submitted to the
>   shared stats system, like PgStat_TableStatus, PgStat_BackendFunctionEntry
>   etc
> - accumulated statistics like PgStat_StatDBEntry, PgStat_SLRUStats. About 
> half are
>   prefixed with PgStat_Stat, the other just with PgStat_
> - random other types like PgStat_Single_Reset_Type, ...
> 
> To me it's very confusing to have these all in an essentially undistinguishing
> namespace, particularly the top two items.

Profoundly agreed.  It was always a pain in the neck.

> I think we should at least do s/PgStat_Stat/PgStat_/. Perhaps we should use a
> distinct PgStatPending_* for the pending item? I can't quite come up with a
> good name for the "accumulated" ones.

How about naming the "pending stats" just "Stats" and the
"accumulated stats" "Counts" or "Counters"?  "Counter" doesn't reflect
the characteristics exactly, but I think the distinguishability of the
two is more significant.  Specifically:

- PgStat_TableStatus
+ PgStat_TableStats
- PgStat_BackendFunctionEntry
+ PgStat_FunctionStats

- PgStat_GlobalStats
+ PgStat_GlobalCounts
- PgStat_ArchiverStats
+ PgStat_ArchiverCounts
- PgStat_BgWriterStats
+ PgStat_BgWriterCounts

Moving to the shared stats collector splits them into variants
attributed with "Local" and "Shared".  (I'm not considering the
details at this stage.)

PgStatLocal_TableStats
PgStatLocal_FunctionStats
PgStatLocal_GlobalCounts
PgStatLocal_ArchiverCounts
PgStatLocal_BgWriterCounts

PgStatShared_TableStats
PgStatShared_FunctionStats
PgStatShared_GlobalCounts
PgStatShared_ArchiverCounts
PgStatShared_BgWriterCounts

PgStatLocal_GlobalCounts looks somewhat odd, but maybe that doesn't matter much.

> I'd like that get resolved first because I think that'd allow commiting the
> prepatory split and reordering patches.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-03-03 Thread Kyotaro Horiguchi
At Wed, 02 Mar 2022 19:31:24 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> A function added to Util.pm used perl2host, which has been removed
> recently.

And the same function contained a line that probably should have been
removed, which made the Windows build unhappy.

This should make all platforms in the CI happy.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From ee17f0f4400ce484cdba80c84744ae47d68c6fa4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Nov 2021 20:42:00 +0900
Subject: [PATCH v18 1/3] Add tablespace support to TAP framework

The TAP framework doesn't support nodes that have tablespaces.  In
particular, backup and initialization from backups fail if the source
node has tablespaces.  This commit provides a simple way to create
tablespace directories and allows backup routines to handle tablespaces.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 264 ++-
 src/test/perl/PostgreSQL/Test/Utils.pm   |  42 
 2 files changed, 304 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index be05845248..15d57b9a71 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -298,6 +298,64 @@ sub archive_dir
 
 =pod
 
+=item $node->tablespace_storage([nocreate])
+
+Directory in which to store tablespace directories.
+If nocreate is true, returns undef if it has not yet been created.
+
+=cut
+
+sub tablespace_storage
+{
+	my ($self, $nocreate) = @_;
+
+	if (!defined $self->{_tsproot})
+	{
+		# tablespace is not used, return undef if nocreate is specified.
+		return undef if ($nocreate);
+
+		# create and remember the tablespace root directory.
+		$self->{_tsproot} = PostgreSQL::Test::Utils::tempdir_short();
+	}
+
+	return $self->{_tsproot};
+}
+
+=pod
+
+=item $node->tablespaces()
+
+Returns a hash from tablespace OID to tablespace directory name.  For
+example, an oid 16384 pointing to /tmp/jWAhkT_fs0/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub tablespaces
+{
+	my ($self) = @_;
+	my $pg_tblspc = $self->data_dir . '/pg_tblspc';
+	my %ret;
+
+	# return undef if no tablespace is used
+	return undef if (!defined $self->tablespace_storage(1));
+
+	# collect tablespace entries in pg_tblspc directory
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return %ret;
+}
+
+=pod
+
 =item $node->backup_dir()
 
 The output path for backups taken with $node->backup()
@@ -313,6 +371,77 @@ sub backup_dir
 
 =pod
 
+=item $node->backup_tablespace_storage_path(backup_name)
+
+Returns tablespace location path for backup_name.
+Returns the parent directory if backup_name is not given.
+
+=cut
+
+sub backup_tablespace_storage_path
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_dir . '/__tsps';
+
+	$dir .= "/$backup_name" if (defined $backup_name);
+
+	return $dir;
+}
+
+=pod
+
+=item $node->backup_create_tablespace_storage(backup_name)
+
+Create the tablespace location directory for backup_name if it does
+not exist yet.  Create the parent tablespace storage that holds all
+location directories if backup_name is not supplied.
+
+=cut
+
+sub backup_create_tablespace_storage
+{
+	my ($self, $backup_name) = @_;
+	my $dir = $self->backup_tablespace_storage_path($backup_name);
+
+	File::Path::make_path($dir) if (! -d $dir);
+}
+
+=pod
+
+=item $node->backup_tablespaces(backup_name)
+
+Returns a reference to a hash mapping tablespace OIDs to tablespace
+directory names in the specified backup.  For example, an oid 16384
+pointing to ../tsps/backup1/ts1 is stored as
+$hash{16384} = "ts1".
+
+=cut
+
+sub backup_tablespaces
+{
+	my ($self, $backup_name) = @_;
+	my $pg_tblspc = $self->backup_dir . '/' . $backup_name . '/pg_tblspc';
+	my %ret;
+
+	#return undef if this backup holds no tablespaces
+	return undef if (! -d $self->backup_tablespace_storage_path($backup_name));
+
+	# scan pg_tblspc directory of the backup
+	opendir(my $dir, $pg_tblspc);
+	while (my $oid = readdir($dir))
+	{
+		next if ($oid !~ /^([0-9]+)$/);
+		my $linkpath = "$pg_tblspc/$oid";
+		my $tsppath = PostgreSQL::Test::Utils::dir_readlink($linkpath);
+		$ret{$oid} = File::Basename::basename($tsppath);
+	}
+	closedir($dir);
+
+	return \%ret;
+}
+
+=pod
+
 =item $node->install_path()
 
 The configured install path (if any) for the node.
@@ -370,6 +499,7 @@ sub info
 	print $fh "Data directory: " . $self->data_dir . "\n";
 	print $fh "Backup directory: " . $self->backup_dir . "\n";
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
+	print $

Re: Make mesage at end-of-recovery less scary.

2022-03-03 Thread Kyotaro Horiguchi
At Thu, 3 Mar 2022 15:39:44 +0530, Ashutosh Sharma  
wrote in 
> The new changes made in the patch look good. Thanks to the recent
> changes to speed WAL insertion that have helped us catch this bug.

Thanks for the quick checking.

> One small comment:
> 
> record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
> -   total_len = record->xl_tot_len;
> 
> Do you think we need to change the position of the comments written
> for above code that says:

Yeah, I didn't do that since it is about header verification.  But as
you pointed out, the result still doesn't look perfect.

On second thought, the two comments seem to repeat the same thing, so
I merged them together.  In this version 16 it looks like this.

>   /*
>* Validate the record header.
>*
>* Even though we use an XLogRecord pointer here, the whole record 
> header
>* might not fit on this page.  If the whole record header is on this 
> page,
>* validate it immediately.  Even otherwise xl_tot_len must be on this 
> page
>* (it is the first field of MAXALIGNed records), but we still cannot
>* access any further fields until we've verified that we got the whole
>* header, so do just a basic sanity check on record length, and 
> validate
>* the rest of the header after reading it from the next page.  The 
> length
>* check is necessary here to ensure that we enter the "Need to 
> reassemble
>* record" code path below; otherwise we might fail to apply
>* ValidXLogRecordHeader at all.
>*/
>   record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
>
>   if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 00d848df6bb8b9966dfbd39c98a388fda42a3e3c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v16] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious has
happened.  Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary: "reached end of WAL".  Instead, raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 144 +-
 src/backend/access/transam/xlogrecovery.c |  92 ++
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 +-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 
 6 files changed, 305 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..bd0f211a23 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 			 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -371,25 +375,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header 

Re: pg_stop_backup() v2 incorrectly marked as proretset

2022-03-03 Thread Kyotaro Horiguchi
At Fri, 4 Mar 2022 10:09:19 +0900, Michael Paquier  wrote 
in 
> On Thu, Mar 03, 2022 at 04:40:42PM -0500, Tom Lane wrote:
> > The point is to make it clear that the macro isn't intended to affect
> > code outside the function.  Since C lacks block-scoped macros,
> > there's no other way to do that.
> > 
> > I concede that a lot of our code is pretty sloppy about this, but
> > that doesn't make it a good practice.
> 
> Well, if we change that, better to do that in all the places where
> this would be affected, but I am not sure to see a style appealing
> enough on this thread.
> 
> From what I can see, history shows that the style of using a #define
> for the number of columns originates from da2c1b8, aka 9.0.  Its use
> inside a function originates from a755ea3 as of 9.1 and then it has
> just spread around without any undefs, so it looks like people like
> that way of doing things.

I'm one of them, though I don't dislike #undef either.

It seems to me the name "PG_STOP_BACKUP_V2_COLS" alone is specific
enough for the purpose of avoiding misuse.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




false failure of test_decoding regression test

2022-03-03 Thread Kyotaro Horiguchi
Hello.

The CF-CI complained about one of my patches for a reason seemingly
unrelated to the patch.

https://cirrus-ci.com/task/5544213843542016?logs=test_world#L1666

> diff -U3 
> /tmp/cirrus-ci-build/contrib/test_decoding/expected/slot_creation_error.out 
> /tmp/cirrus-ci-build/contrib/test_decoding/output_iso/results/slot_creation_error.out
> --- 
> /tmp/cirrus-ci-build/contrib/test_decoding/expected/slot_creation_error.out   
> 2022-03-03 22:45:04.708072000 +
> +++ 
> /tmp/cirrus-ci-build/contrib/test_decoding/output_iso/results/slot_creation_error.out
>  2022-03-03 22:54:49.621351000 +
> @@ -96,13 +96,13 @@
>  t   
>  (1 row)
>  
> +step s1_c: COMMIT;
>  step s2_init: <... completed>
>  FATAL:  terminating connection due to administrator command
>  server closed the connection unexpectedly
>   This probably means the server terminated abnormally
>   before or while processing the request.
>  
> -step s1_c: COMMIT;
>  step s1_view_slot: 
>  SELECT slot_name, slot_type, active FROM pg_replication_slots WHERE 
> slot_name = 'slot_creation_error'

This comes from the permutation 'permutation s1_b s1_xid s2_init
s1_terminate_s2 s1_c s1_view_slot'.  That means the process
termination requested by s1_terminate_s2 is sometimes delayed until
after the next s1_c ends.  So it is a rare false failure, but annoying
enough on the CI.  It seems to me we need to wait for process
termination at that point.  postgres_fdw already does that in its
regression test.

Thoughts?

A similar use is found in temp-schema-cleanup.  There's another
possible instability between s2_advisory and s2_check_schema, but this
change alone reduces the chance of false failures.

The attached fixes both points.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 82cdee92c3726a7f248849a310ec3f392ba02384 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Mar 2022 11:03:12 +0900
Subject: [PATCH] Wait for process termination during isolation tests

slot_creation_error.spec and temp-schema-cleanup.spec used
pg_terminate_backend() without specifying a timeout, so a test run may
proceed to the next step before the process actually terminates,
resulting in a false failure.

Supply a timeout to pg_terminate_backend() so that it waits for
process termination, avoiding that failure mode.
---
 contrib/test_decoding/expected/slot_creation_error.out | 2 +-
 contrib/test_decoding/specs/slot_creation_error.spec   | 2 +-
 src/test/isolation/expected/temp-schema-cleanup.out| 2 +-
 src/test/isolation/specs/temp-schema-cleanup.spec  | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/contrib/test_decoding/expected/slot_creation_error.out b/contrib/test_decoding/expected/slot_creation_error.out
index 043bdae0a2..3707482cb8 100644
--- a/contrib/test_decoding/expected/slot_creation_error.out
+++ b/contrib/test_decoding/expected/slot_creation_error.out
@@ -87,7 +87,7 @@ step s2_init:
 SELECT 'init' FROM pg_create_logical_replication_slot('slot_creation_error', 'test_decoding');
  
 step s1_terminate_s2: 
-SELECT pg_terminate_backend(pid)
+SELECT pg_terminate_backend(pid, 18)
 FROM pg_stat_activity
 WHERE application_name = 'isolation/slot_creation_error/s2';
 
diff --git a/contrib/test_decoding/specs/slot_creation_error.spec b/contrib/test_decoding/specs/slot_creation_error.spec
index 6816696b9d..32161b9e7f 100644
--- a/contrib/test_decoding/specs/slot_creation_error.spec
+++ b/contrib/test_decoding/specs/slot_creation_error.spec
@@ -13,7 +13,7 @@ step s1_cancel_s2 {
 }
 
 step s1_terminate_s2 {
-SELECT pg_terminate_backend(pid)
+SELECT pg_terminate_backend(pid, 18)
 FROM pg_stat_activity
 WHERE application_name = 'isolation/slot_creation_error/s2';
 }
diff --git a/src/test/isolation/expected/temp-schema-cleanup.out b/src/test/isolation/expected/temp-schema-cleanup.out
index 35b91d9e45..cb4302739a 100644
--- a/src/test/isolation/expected/temp-schema-cleanup.out
+++ b/src/test/isolation/expected/temp-schema-cleanup.out
@@ -83,7 +83,7 @@ exec
 (1 row)
 
 step s1_exit: 
-SELECT pg_terminate_backend(pg_backend_pid());
+SELECT pg_terminate_backend(pg_backend_pid(), 18);
 
 FATAL:  terminating connection due to administrator command
 server closed the connection unexpectedly
diff --git a/src/test/isolation/specs/temp-schema-cleanup.spec b/src/test/isolation/specs/temp-schema-cleanup.spec
index a9417b7e90..f0d3928996 100644
--- a/src/test/isolation/specs/temp-schema-cleanup.spec
+++ b/src/test/isolation/specs/temp-schema-cleanup.spec
@@ -47,7 +47,7 @@ step s1_discard_temp {
 }
 
 step s1_exit {
-SELECT pg_terminate_backend(pg_backend_pid());
+SELECT pg_terminate_backend(pg_backend_pid(), 18);
 }
 
 
-- 
2.27.0



Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-03 Thread Kyotaro Horiguchi
At Mon, 14 Feb 2022 14:52:15 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> In this version , 0001 gets one fix and two comment updates.

While the discussion on back-patching 0001 proceeds on another branch,
the main patch itself looks as if it has rotted on the CF app.  So I
rebased v10 on the current master.  0001 is replaced by an adjusted
patch based on the latest "control file update fix" patch.

If anyone wants to weigh in on the message-fix patches (0002-0004), be
our guest.  0005 also wants opinions.


0001: Fix possible incorrect controlfile update that leads to
  unrecoverable database.

0002: Add REDO/Checkpoint LSNs to the checkpoint-end log message.
  (The main patch in this thread)

0003: Replace (WAL-)location with LSN in pg_controldata.

0004: Replace (WAL-)location with LSN in user-facing texts.
  (This doesn't reflect my recent comments.)

0005: Unhyphenate the word archive-recovery and similar words.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 0bf164c66ba2c27856c014ed1d452e5e07678541 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Mar 2022 13:18:30 +0900
Subject: [PATCH v11 1/5] Correctly update control file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup based on the checkpoint that
has just finished in the function.  If the database has exited the
DB_IN_ARCHIVE_RECOVERY state by the time the function is about to
update the control file, the function refrains from updating the file
at all, then proceeds to WAL cleanup using the latest REDO LSN, which
is now inconsistent with the control file.  As a result, the
succeeding cleanup procedure removes too many WAL files relative to
the control file and leaves the database unrecoverable until the next
checkpoint finishes.

Along with that fix, we remove a dead code path for the case where
some other process ran a simultaneous checkpoint.  It seems like just
a preventive measure, but it is no longer useful because we are sure
that checkpoints are performed only by the checkpointer, except in
single-process mode.
---
 src/backend/access/transam/xlog.c | 72 ---
 1 file changed, 47 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..3987aa81de 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6899,6 +6899,9 @@ CreateRestartPoint(int flags)
 	XLogSegNo	_logSegNo;
 	TimestampTz xtime;
 
+	/* we don't expect concurrent checkpoints/restartpoints to run */
+	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
 	/* Get a local copy of the last safe checkpoint record. */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -6964,7 +6967,7 @@ CreateRestartPoint(int flags)
 
 	/* Also update the info_lck-protected copy */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+	XLogCtl->RedoRecPtr = RedoRecPtr;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
@@ -6983,7 +6986,10 @@ CreateRestartPoint(int flags)
 	/* Update the process title */
 	update_checkpoint_display(flags, true, false);
 
-	CheckPointGuts(lastCheckPoint.redo, flags);
+	CheckPointGuts(RedoRecPtr, flags);
+
+	/* Update pg_control */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -6991,30 +6997,29 @@ CreateRestartPoint(int flags)
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+	Assert(PriorRedoPtr < RedoRecPtr);
+
+	ControlFile->checkPoint = lastCheckPointRecPtr;
+	ControlFile->checkPointCopy = lastCheckPoint;
+
+	/* Update control file using current time */
+	ControlFile->time = (pg_time_t) time(NULL);
+
 	/*
-	 * Update pg_control, using current time.  Check that it still shows
-	 * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-	 * this is a quick hack to make sure nothing really bad happens if somehow
-	 * we get here after the end-of-recovery checkpoint.
+	 * Ensure minRecoveryPoint is past the checkpoint record while archive
+	 * recovery is still ongoing.  Normally, this will have happened already
+	 * while writing out dirty buffers, but not necessarily - e.g. because no
+	 * buffers were dirtied.  We do this because a non-exclusive base backup
+	 * uses minRecoveryPoint to determine which WAL files must be included in
+	 * the backup, and the file (or files) containing the checkpoint record
+	 * must be included, at a minimum. Note that for an ordinary restart of
+	 * recovery there's no value in having the minimum recovery point any
+	 * earlier than this anyway, because redo will begin just after the
+	 * checkpoint record.  This is a quick hack to make sure nothing really bad
+	 * happens if somehow we get here after the end-of-recovery checkpoint.
 	 */
-	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-	if (ControlFile->state == 

Re: Doc about how to set max_wal_senders when setting minimal wal_level

2022-03-03 Thread Kyotaro Horiguchi
At Fri, 04 Mar 2022 12:18:29 +0800, Japin Li  wrote in 
> 
> On Thu, 03 Mar 2022 at 12:10, Japin Li  wrote:
>
> Attach v3 patch to fix missing close varname tag.

+A precondition for using minimal WAL is to disable WAL archiving and
+streaming replication by setting 
+to 0, and archive_mode
+to off.

It is a bit odd that the features to disable and the corresponding
GUCs are listed in unrelated orders.  It would be better if they were
in the same order.
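For reference, the combination the quoted documentation text describes would look like this in postgresql.conf (an illustrative fragment only, not text proposed for the patch):

```
# minimal WAL requires both streaming replication and WAL archiving
# to be disabled
wal_level = minimal
max_wal_senders = 0
archive_mode = off
```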

+servers.  If setting max_wal_senders to
+0 consider also reducing the amount of WAL produced
+by changing wal_level to minimal.

Those who naively follow this suggestion may run into a failure when
archive_mode is on, so archive_mode is also worth mentioning.  But
anyway, IMHO, this is merely a performance tip that is not necessarily
required in this section, or even in this documentation.  In addition,
if we write this for max_wal_senders, archive_mode will deserve a
similar tip, but I think that is too verbose.  In short, I would not
add that description at all.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-03-03 Thread Kyotaro Horiguchi
Thanks for looking at this!

At Fri, 4 Mar 2022 13:51:12 +0900, Michael Paquier  wrote i
n 
> On Fri, Mar 04, 2022 at 09:10:48AM +0900, Kyotaro Horiguchi wrote:
> > And same function contained a maybe-should-have-been-removed line
> > which makes Windows build unhappy.
> > 
> > This should make all platforms in the CI happy.
> 
> d6d317d as solved the issue of tablespace paths across multiple nodes
> with the new GUC called allow_in_place_tablespaces, and is getting
> successfully used in the recovery tests as of 027_stream_regress.pl.

The feature allows only one tablespace directory, but this test uses
(though I'm not sure it needs to) multiple tablespace directories, so
I think the feature doesn't work for the test.

Maybe I'm missing something, but it doesn't use tablespaces.  I see
them in 002_tablespace.pl, but that test uses only one tablespace
location.

> Shouldn't we rely on that rather than extending more our test perl
> modules?  One tricky part is the emulation of readlink for junction
> points on Windows (dir_readlink in your patch), and the root of the

Yeah, I don't like that as I said before...

> problem is that 0003 cares about the path structure of the
> tablespaces so we have no need, as far as I can see, for any
> dependency with link follow-up in the scope of this patch.

I'm not sure how this relates to 0001; maybe I'm not following.

> This means that you should be able to simplify the patch set, as we
> could entirely drop 0001 in favor of enforcing the new dev GUC in the
> nodes created in the TAP test of 0002.

Maybe it's possible by breaking the test into ones that each need only
one tablespace.  I'll give it a try.

> Speaking of 0002, perhaps this had better be in its own file rather
> than extending more 011_crash_recovery.pl.  0003 looks like a good

Ok, no problem.

> idea to check after the consistency of the path structures created
> during replay, and it touches paths I'd expect it to touch, as of
> database and tbspace redos.
> 
> +   if (!reachedConsistency)
> +   XLogForgetMissingDir(xlrec->ts_id, InvalidOid);
> +
> +   XLogFlush(record->EndRecPtr);
> Not sure to understand why this is required.  A comment may be in
> order to explain the hows and the whys.

Is it about XLogFlush?  My understanding is that it updates
minRecoveryPoint to that LSN.  I'll add a comment to that effect.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-03 Thread Kyotaro Horiguchi
At Fri, 4 Mar 2022 15:44:22 +0900, Michael Paquier  wrote 
in 
> Hi all,
> 
> While playing with tablespaces and recovery in a TAP test, I have
> noticed that retrieving the location of a tablespace created with
> allow_in_place_tablespaces enabled fails in pg_tablespace_location(),
> because readlink() sees a directory in this case.

ERROR:  could not read symbolic link "pg_tblspc/16407": Invalid argument

> The use may be limited to any automated testing and
> allow_in_place_tablespaces is a developer GUC, still it seems to me
> that there is an argument to allow the case rather than tweak any
> tests to hardcode a path with the tablespace OID.  And any other code
> paths are able to handle such tablespaces, be they in recovery or in
> tablespace create/drop.

+1

> A junction point is a directory on WIN32 as far as I recall, but
> pgreadlink() is here to ensure that we get the correct path on
> a source found as pgwin32_is_junction(), so we can rely on that.  This
> stuff has led me to the attached.
> 
> Thoughts?

I think the function is expected to return an absolute path, but it
returns a relative path for in-place tablespaces.  While that is
apparently inconvenient for general use, there might be cases where we
want to know whether the tablespace is in-place or not.  So I'm not
sure which is better.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-03 Thread Kyotaro Horiguchi
At Fri, 04 Mar 2022 16:41:03 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 4 Mar 2022 15:44:22 +0900, Michael Paquier  
> wrote in 
> > The use may be limited to any automated testing and
> > allow_in_place_tablespaces is a developer GUC, still it seems to me
> > that there is an argument to allow the case rather than tweak any
> > tests to hardcode a path with the tablespace OID.  And any other code
> > paths are able to handle such tablespaces, be they in recovery or in
> > tablespace create/drop.
> 
> +1

By the way, regardless of the patch, I got an error from pg_basebackup
for an in-place tablespace.  do_pg_start_backup calls readlink
believing that pg_tblspc/* is always a symlink.

# Running: pg_basebackup -D 
/home/horiguti/work/worktrees/tsp_replay_2/src/test/recovery/tmp_check/t_029_replay_tsp_drops_primary1_data/backup/my_backup
 -h /tmp/X8E4nbF4en -p 51584 --checkpoint fast --no-sync
WARNING:  could not read symbolic link "pg_tblspc/16384": Invalid argument

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-04 Thread Kyotaro Horiguchi
At Fri, 04 Mar 2022 16:54:49 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 04 Mar 2022 16:41:03 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Fri, 4 Mar 2022 15:44:22 +0900, Michael Paquier  
> > wrote in 
> > > The use may be limited to any automated testing and
> > > allow_in_place_tablespaces is a developer GUC, still it seems to me
> > > that there is an argument to allow the case rather than tweak any
> > > tests to hardcode a path with the tablespace OID.  And any other code
> > > paths are able to handle such tablespaces, be they in recovery or in
> > > tablespace create/drop.
> > 
> > +1
> 
> By the way, regardless of the patch, I got an error from pg_basebackup
> for an in-place tablespace.  pg_do_start_backup calls readlink
> believing pg_tblspc/* is always a symlink.
> 
> # Running: pg_basebackup -D 
> /home/horiguti/work/worktrees/tsp_replay_2/src/test/recovery/tmp_check/t_029_replay_tsp_drops_primary1_data/backup/my_backup
>  -h /tmp/X8E4nbF4en -p 51584 --checkpoint fast --no-sync
> WARNING:  could not read symbolic link "pg_tblspc/16384": Invalid argument

So now we know that there are three places that need the same
processing:

pg_tablespace_location: this patch tries to fix it
sendDir:                already supports in-place tsp
do_pg_start_backup:     does not support in-place tsp yet

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-04 Thread Kyotaro Horiguchi
At Fri, 04 Mar 2022 17:28:45 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> > By the way, regardless of the patch, I got an error from pg_basebackup
> > for an in-place tablespace.  pg_do_start_backup calls readlink
> > believing pg_tblspc/* is always a symlink.
> > 
> > # Running: pg_basebackup -D 
> > /home/horiguti/work/worktrees/tsp_replay_2/src/test/recovery/tmp_check/t_029_replay_tsp_drops_primary1_data/backup/my_backup
> >  -h /tmp/X8E4nbF4en -p 51584 --checkpoint fast --no-sync
> > WARNING:  could not read symbolic link "pg_tblspc/16384": Invalid argument
> 
> So now we know that there are three places that needs the same
> processing.
> 
> pg_tablespace_location: this patch tries to fix
> sendDir:it already supports  in-place tsp
> do_pg_start_backup: not supports in-place tsp yet.

And I made a quick hack on do_pg_start_backup, and found that
pg_basebackup copies in-place tablespaces under the *current
directory*, which is not OK at all :(

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-06 Thread Kyotaro Horiguchi
At Fri, 4 Mar 2022 23:26:43 +1300, Thomas Munro  wrote 
in 
> On Fri, Mar 4, 2022 at 10:04 PM Kyotaro Horiguchi
>  wrote:
> > And I made a quick hack on do_pg_start_backup.  And I found that
> > pg_basebackup copies in-place tablespaces under the *current
> > directory*, which is not ok at all:(
> 
> Hmm.  Which OS are you on?  Looks OK here -- the "in place" tablespace
> gets copied as a directory under pg_tblspc, no symlink:

> The warning from readlink() while making the mapping file isn't ideal,
> and perhaps we should suppress that with something like the attached.
> Or does the missing map file entry break something on Windows?

Ah, OK.  Somehow I thought that pg_basebackup was failing because of
the readlink failure, and that the tweak I made had made things worse.
I'll get it to work.

Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-03-07 Thread Kyotaro Horiguchi
So the new framework has been dropped in this version.
The second test is removed as it is irrelevant to this bug.

In this version the patch is a single file that contains the test.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 43bb3ba8900edd53a1feb0acb1a72bdc22bb1627 Mon Sep 17 00:00:00 2001
From: P 
Date: Mon, 7 Mar 2022 17:10:07 +0900
Subject: [PATCH v20] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

CREATE DATABASE
DROP DATABASE
DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid-page hash table,
to track missing directories during crash recovery.  If all the
missing directory references are matched with corresponding drop
records by the end of crash recovery, the standby can safely enter
archive recovery.

Bug identified by Paul Guo.

Authored by Paul Guo, Kyotaro Horiguchi and Asim R P.
---
 src/backend/access/transam/xlogrecovery.c   |   6 +
 src/backend/access/transam/xlogutils.c  | 145 
 src/backend/commands/dbcommands.c   |  56 
 src/backend/commands/tablespace.c   |  16 +++
 src/include/access/xlogutils.h  |   4 +
 src/test/recovery/t/029_replay_tsp_drops.pl |  62 +
 6 files changed, 289 insertions(+)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..97fed1e04d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 (errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20734..3f8f7dadac 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -79,6 +79,151 @@ typedef struct xl_invalid_page
 
 static HTAB *invalid_page_tab = NULL;
 
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * directory entry should be matched with a drop database or drop tablespace
+ * WAL record by the end of crash recovery.
+ */
+typedef struct xl_missing_dir_key
+{
+	Oid spcNode;
+	Oid dbNode;
+} xl_missing_dir_key;
+
+typedef struct xl_missing_dir
+{
+	xl_missing_dir_key key;
+	char path[MAXPGPATH];
+} xl_missing_dir;
+
+static HTAB *missing_dir_tab = NULL;
+
+void
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
+{
+	xl_missing_dir_key key;
+	bool found;
+	xl_missing_dir *entry;
+
+	/*
+	 * Database OID may be invalid but tablespace OID must be valid.  If
+	 * dbNode is InvalidOid, we are logging a missing tablespace directory,
+	 * otherwise we are logging a missing database directory.
+	 */
+	Assert(OidIsValid(spcNode));
+
+	if (missing_dir_tab == NULL)
+	{
+		/* create hash table when first needed */
+		HASHCTL		ctl;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(xl_missing_dir_key);
+		ctl.entrysize = sizeof(xl_missing_dir);
+
+		missing_dir_tab = hash_create("XLOG missing directory table",
+	   100,
+	   &ctl,
+	   HASH_ELEM | HASH_BLOBS);
+	}
+
+	key.spcNode = spcNode;
+	key.dbNode = dbNode;
+
+	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
+
+	if (found)
+	{
+		if (dbNode == InvalidOid)
+			elog(DEBUG2, "missing directory %s (tablespace %d) already exists: %s",
+ path, spcNode, entry->path);
+		else
+			elog(DEBUG2, "missing directory %s (t

Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-07 Thread Kyotaro Horiguchi
At Tue, 8 Mar 2022 10:39:06 +1300, Thomas Munro  wrote 
in 
> On Tue, Mar 8, 2022 at 12:58 AM Michael Paquier  wrote:
> > On Fri, Mar 04, 2022 at 11:26:43PM +1300, Thomas Munro wrote:
> > > + /* Skip in-place tablespaces (testing use only) */
> > > + if (get_dirent_type(fullpath, de, false, ERROR) == PGFILETYPE_DIR)
> > > + continue;
> >
> > I saw the warning when testing base backups with in-place tablespaces
> > and it did not annoy me much, but, yes, that can be confusing.
> >
> > Junction points are directories, no?  Are you sure that this works
> > correctly on WIN32?  It seems to me that we'd better use readlink()
> > only for entries in pg_tlbspc/ that are PGFILETYPE_LNK on non-WIN32
> > and pgwin32_is_junction() on WIN32.
> 
> Thanks, you're right.  Test on a Win10 VM.  Here's a new version.

Thanks!  It works for me on CentOS8 and Windows11.

FYI, on Windows11, pg_basebackup didn't work correctly without the
patch.  So this looks like fixing an undiscovered bug as well.

===
> pg_basebackup -D copy
WARNING:  could not read symbolic link "pg_tblspc/16384": Invalid argument
pg_basebackup: error: tar member has empty name

> dir copy
 Volume in drive C has no label.
 Volume serial number: 10C6-4BA6

 Directory of c:\..\copy

2022/03/08  09:53  .
2022/03/08  09:53  ..
2022/03/08  09:53 0 nbase.tar
2022/03/08  09:53      pg_wal
   1 File(s) 0 bytes
   3 Dir(s)  171,920,613,376 bytes free

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-07 Thread Kyotaro Horiguchi
At Tue, 8 Mar 2022 10:28:46 +0900, Michael Paquier  wrote 
in 
> On Tue, Mar 08, 2022 at 10:06:50AM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 8 Mar 2022 10:39:06 +1300, Thomas Munro  
> > wrote in 
> >> Thanks, you're right.  Test on a Win10 VM.  Here's a new version.
> 
> Looks fine to me.
> 
> > FYI, on Windows11, pg_basebackup didn't work correctly without the
> > patch.  So this looks like fixing an undiscovered bug as well.
> 
> Well, that's not really a long-time bug but just a side effect of
> in-place tablespaces because we don't use them in many test cases 
> yet, is it?

No, we don't. So just FYI.

> >> pg_basebackup -D copy
> > WARNING:  could not read symbolic link "pg_tblspc/16384": Invalid argument
> > pg_basebackup: error: tar member has empty name
> > 
> >1 File(s) 0 bytes
> >3 Dir(s)  171,920,613,376 bytes free
> 
> That's a lot of free space.

The laptop has 512GB of storage, so 160GB free is pretty normal, maybe.
129GB of the storage is used by some VMs.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Naming of the different stats systems / "stats collector"

2022-03-08 Thread Kyotaro Horiguchi
At Tue, 8 Mar 2022 15:55:04 -0700, "David G. Johnston" 
 wrote in 
> On Tue, Mar 8, 2022 at 1:54 PM Andres Freund  wrote:
> > the immediate question for the patch is what to replace "collector" with.
> >
> >
> Not really following the broader context here so this came out of nowhere
> for me.  What is the argument for changing the status quo here?  Collector
> seems like good term.

The name "stats collector" is tied to the story that "there is a
process that only collects stats data arriving from working
processes".  We have modules like that: bgwriter, checkpointer,
walwriter and so on.  On the other hand, we have many features with no
dedicated process that instead work on a shared storage area as part
of the working processes: table/column statistics, XLOG, heap, SLRU
and so on.

In a world where every working process writes statistics to a shared
memory area on its own, no such collector process exists.  I think we
can no longer call it the "stats collector".

> > The patch currently uses "activity statistics" in a number of places, but
> > that
> > is confusing too, because pg_stat_activity is a different kind of stats.
> >
> > Any ideas?
> >
> 
> If the complaint is that not all of these statistics modules use the
> statistics collector then maybe we say each non-collector module defines an
> "Event Listener".  Or, and without looking at the source code, have the
> collector simply forward events like "reset now" to the appropriate module
> but keep the collector as the single point of message interchange for all.
> And so "tell the collector about" is indeed the correct phrasing of what
> happens.

So the collector as a process is going to die. We need an alternative
name for the non-collector.  "Metrics", as you mentioned below, sounds
good to me.  The name "activity stat(istics)?s" is an answer to my
desire to distinguish it from "table/column statistics", but I have to
admit that it is still not great.

> > The postgresql.conf.sample section header seems particularly odd - "index
> > statistics"? We collect more data about tables etc.
> >
> 
> No argument for bringing the header current.
> 
> >
> > A more general point: Our naming around different types of stats is
> > horribly
> > confused. We have stats describing the current state (e.g.
> > pg_stat_activity,
> > pg_stat_replication, pg_stat_progress_*, ...) and accumulated stats
> > (pg_stat_user_tables, pg_stat_database, etc) in the same namespace.
> > Should we
> > try to move towards something more coherent, at least going forward?
> >
> >
> I'm not sure trying to improve this going forward, and thus having at least
> three categories, is particularly desirable.  While it is unfortunate that
> we don't have separate pg_metric and pg_status namespaces (combining
> pg_stat with pg_status or pg_state, the two obvious choices, would be
> undesirable being they all have a shared leading character sequence) that
> is where we are today.  We are probably stuck with just using the pg_stat
> namespace and doing a better job of letting users know about the underlying
> implementation choice each pg_stat relation took in order to know whether
> what is being reported is considered reliable (self-managed shared memory)
> or not (leverages the unreliable collector).  In short, deal with this
> mainly in documentation/comments and implementation details but leave the
> public facing naming alone.
> 
> David J.

If we could, I would like names like pg_metrics.process,
pg_metrics.replication, pg_progress.vacuum, pg_progress.basebackup,
and pg_stats.database, pg_stats.user_tables.  Seen that way, it looks
somewhat odd that the pg_stat_* views belong to the pg_catalog
namespace.

If we had system table aliases, people who insist on the good old
names could live with that.  Even if there aren't, we can instead
provide views with the old names.
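A compatibility view along those lines could be sketched roughly as
follows; pg_metrics.process is a hypothetical name from this
discussion, not an existing catalog object, and a real system view
would of course live in pg_catalog rather than a user schema:

```sql
-- Hypothetical sketch: keep an old name working after a relocation.
-- Assumes a relocated view pg_metrics.process with a compatible column set.
CREATE VIEW pg_stat_activity AS
    SELECT * FROM pg_metrics.process;
```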

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-10 Thread Kyotaro Horiguchi
e ID.
> + 
> 
> > * Can we mark this extension 'trusted'? I'm not 100% clear on the
> > standards for that marker, but it seems reasonable for a database owner
> > with the right privileges might want to install it.
> 
> 'trusted' extensions concept is added by commit 50fc694 [3]. Since
> pg_walinspect deals with WAL, we strictly want to control who creates
> and can execute functions exposed by it, so I don't know if 'trusted'
> is a good idea here. Also, pageinspect isn't a 'trusted' extension.
> 
> > * pg_get_raw_wal_record() seems too powerful for pg_monitor. Maybe that
> > function should require pg_read_server_files? Or at least
> > pg_read_all_data?
> 
> pg_read_all_data may not be the right choice, but pg_read_server_files
> is. However, does it sound good if some functions are allowed to be
> executed by users with a pg_monitor role and others
> pg_get_raw_wal_record by users with pg_read_server_files? Since the
> extension itself can be created by superusers, isn't the
> pg_get_raw_wal_record sort of safe with pg_mointor itself?
> 
>  If hackers don't agree, I'm happy to grant execution on
> pg_get_raw_wal_record() to the pg_read_server_files role.
> 
> Attaching the v8 patch-set resolving above comments and some tests for
> checking function permissions. Please review it further.
> 
> [1] 
> https://www.postgresql.org/message-id/CALj2ACWtToUQ5hCCBJP%2BmKeVUcN-g7cMb9XvhAcicPxUDsdcKg%40mail.gmail.com
> [2] 
> https://www.postgresql.org/message-id/CA%2BTgmobYrTgMEF0SV%2ByDYyCCh44DAGjZVs7BYGrD8xD3vwNjHA%40mail.gmail.com
> [3] commit 50fc694e43742ce3d04a5e9f708432cb022c5f0d
> Author: Tom Lane 
> Date:   Wed Jan 29 18:42:43 2020 -0500
> 
> Invent "trusted" extensions, and remove the pg_pltemplate catalog.

I played with this a bit, and would like to share some thoughts on it.

It seems too rigorous to me that pg_get_wal_records_info/stats()
reject future LSNs as the end-LSN; a WARNING or INFO that stops at
the real end-of-WAL would be kinder to users.  I think the same about
the restriction that the start and end LSN are required to differ.

The definition of end-LSN is fuzzy here.  If I feed a future LSN to the
functions, they tell me the beginning of the current insertion point
in the error message.  On the other hand, they don't accept that same
value as the end-LSN.  I think it is right that they report the current
insertion point, and they should take the end-LSN as the LSN at which
to stop reading.

I think pg_get_wal_stats() is worth having, but I think it should be
implemented in SQL.  Currently pg_get_wal_records_info() doesn't report
FPI data since pg_waldump doesn't, but it is collected internally (of
course!) and easily exposed.  If we do that, pg_get_wal_stats()
would reduce to the following SQL statement:

SELECT resource_manager AS resmgr,
       count(*) AS N,
       (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N",
       sum(total_length) AS "combined size",
       (sum(total_length) * 100 / sum(sum(total_length)) OVER tot)::numeric(5,2) AS "%combined size",
       sum(fpi_len) AS fpilen,
       (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS "%fpilen"
  FROM pg_get_wal_records_info('0/100', '0/175DD7f')
 GROUP BY resource_manager
WINDOW tot AS ()
 ORDER BY "combined size" DESC;

The only difference from pg_waldump is that the statement above doesn't
show lines for resource managers that are not contained in the
result of pg_get_wal_records_info().  But I don't think that matters.


Sometimes the description field has very long content (28kB in one
case).  It makes the result output almost unreadable, and I had a
hard time struggling with output full of '-'s.  I would like to have a
default limit on the length of such fields, but I'm not sure we want
that.
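In the meantime, a caller can clamp the field on the SQL side.  This
sketch assumes the output column is named description, as in the patch
under review:

```sql
-- Truncate long description values client-side for readability.
SELECT start_lsn,
       resource_manager,
       left(description, 80) AS description  -- keep only the first 80 chars
  FROM pg_get_wal_records_info('0/100', '0/175DD7f');
```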


The difference between pg_get_wal_record_info and _records_, other than
the number of arguments, is that the former accepts incorrect LSNs.

The following works,
 pg_get_wal_record_info('0/100');
 pg_get_wal_records_info('0/100');

but this doesn't
 pg_get_wal_records_info('0/100', '0/100');
> ERROR:  WAL start LSN must be less than end LSN

But the following works
 pg_get_wal_records_info('0/100', '0/129');
> 0/128 | 0/0  |   0

So I think we can consolidate the two functions as:

- pg_get_wal_records_info('0/100');

  (current behavior) find the first record and show all records
  thereafter.

- pg_get_wal_records_info('0/100', '0/1000000');

  finds the first record since the start lsn and show it.

- pg_get_wal_records_info('0/100', '0/130');

  finds the first record since the start lsn then show records up to
  the end-lsn.


And about pg_get_raw_wal_record(): I don't see any use case for the
function alone on the SQL interface.  Even if we need to inspect broken
WAL files, that needs profound knowledge of the WAL format and tools
that don't work on the SQL interface.

However, like pageinspect, if we separate WAL-record fetching and
parsing, it could be considered useful.

pg_get_wal_records_info would then look like:

SELECT p.*
  FROM pg_walinspect_get_raw(start_lsn, end_lsn) r,
       LATERAL pg_walinspect_parse(r.raw) p;

And pg_get_wal_stats would look like:

SELECT s.*
  FROM pg_walinspect_get_raw(start_lsn, end_lsn) r,
       LATERAL pg_walinspect_parse(r.raw) p,
       LATERAL pg_walinspect_stat(p) s;

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-10 Thread Kyotaro Horiguchi
Sorry, some minor non-syntactical corrections.

At Fri, 11 Mar 2022 11:38:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I played with this a bit, and would like to share some thoughts on it.
> 
> It seems to me too rigorous that pg_get_wal_records_info/stats()
> reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> the real end-of-WAL is more kind to users.  I think the same with the
> restriction that start and end LSN are required to be different.
> 
> The definition of end-lsn is fuzzy here.  If I fed a future LSN to the
> functions, they tell me the beginning of the current insertion point
> in error message.  On the other hand they don't accept the same
> value as end-LSN.  I think it is right that they tell the current
> insertion point and they should take the end-LSN as the LSN to stop
> reading.
> 
> I think pg_get_wal_stats() is worth to have but I think it should be
> implemented in SQL.  Currently pg_get_wal_records_info() doesn't tell
> about FPI since pg_waldump doesn't but it is internally collected (of
> course!) and easily revealed. If we do that, the
> pg_get_wal_records_stats() would be reduced to the following SQL
> statement
> 
> SELECT resource_manager resmgr,
>count(*) AS N,
>  (count(*) * 100 / sum(count(*)) OVER tot)::numeric(5,2) AS "%N",
>  sum(total_length) AS "combined size",
>  (sum(total_length) * 100 / sum(sum(total_length)) OVER 
> tot)::numeric(5,2) AS "%combined size",
>  sum(fpi_len) AS fpilen,
>  (sum(fpi_len) * 100 / sum(sum(fpi_len)) OVER tot)::numeric(5,2) AS 
> "%fpilen"
>  FROM pg_get_wal_records_info('0/100', '0/175DD7f')
>  GROUP by resource_manager
>  WINDOW tot AS ()
>  ORDER BY "combined size" desc;
> 
> The only difference with pg_waldump is the statement above doesn't
> show lines for the resource managers that don't contained in the
> result of pg_get_wal_records_info(). But I don't think that matters.
> 
> 
> Sometimes the field description has very long (28kb long) content. It
> makes the result output almost unreadable and I had a bit hard time
> struggling with the output full of '-'s.  I would like have a default
> limit on the length of such fields that can be long but I'm not sure
> we want that.
> 
> 
- The difference between pg_get_wal_record_info and _records_ other than
- the number of argument is the former accepts incorrect LSNs.

The discussion is somewhat confused after some twists and turns.  It
should be something like the following.

pg_get_wal_record_info and pg_get_wal_records_info are almost the same,
since the latter can also show a single record.  However, it is a bit
annoying to use it that way: besides not accepting the same LSN for
start and end, it shows no record when there is no record in the
specified LSN range.  That said, I don't mean the behavior is useless.

The following works,
 pg_get_wal_record_info('0/100');
 pg_get_wal_records_info('0/100');

but this doesn't
 pg_get_wal_records_info('0/100', '0/100');
> ERROR:  WAL start LSN must be less than end LSN

And the following shows no records.
 pg_get_wal_records_info('0/100', '0/101');
 pg_get_wal_records_info('0/100', '0/128');

But the following works
  pg_get_wal_records_info('0/100', '0/129');
> 0/128 | 0/0  |   0



> So I think we can consolidate the two functions as:
> 
> - pg_get_wal_records_info('0/100');
> 
>   (current behavior) find the first record and show all records
>   thereafter.
> 
> - pg_get_wal_records_info('0/100', '0/100');
> 
>   finds the first record since the start lsn and show it.
> 
> - pg_get_wal_records_info('0/100', '0/130');
> 
>   finds the first record since the start lsn then show records up to
>   the end-lsn.
> 
> 
> And about pg_get_raw_wal_record(): I don't see any use case for the
> function alone on the SQL interface.  Even if we need to inspect broken
> WAL files, that needs profound knowledge of the WAL format and tools
> that don't work on the SQL interface.
> 
> However, like pageinspect, if we separate WAL-record fetching and
> parsing, it could be considered useful.
> 
> pg_get_wal_records_info would then look like:
> 
> SELECT p.*
>   FROM pg_walinspect_get_raw(start_lsn, end_lsn) r,
>        LATERAL pg_walinspect_parse(r.raw) p;
> 
> And pg_get_wal_stats would look like:
> 
> SELECT s.*
>   FROM pg_walinspect_get_raw(start_lsn, end_lsn) r,
>        LATERAL pg_walinspect_parse(r.raw) p,
>        LATERAL pg_walinspect_stat(p) s;


Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-10 Thread Kyotaro Horiguchi
At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov  
wrote in 
> On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> > Ok, here is v4.
> 
> And here is v5.
> 
> First, there was compilation error in Assert in dynahash.c .
> Excuse me for not checking before sending previous version.
> 
> Second, I add third commit that reduces HASHHDR allocation
> size for non-partitioned dynahash:
> - moved freeList to last position
> - alloc and memset offset(HASHHDR, freeList[1]) for
>   non-partitioned hash tables.
> I didn't benchmarked it, but I will be surprised if it
> matters much in performance sence.
> 
> Third, I put all three commits into single file to not
> confuse commitfest application.

Thanks!  I looked into the dynahash part.

 struct HASHHDR
 {
-   /*
-* The freelist can become a point of contention in high-concurrency 
hash

Why did you move around the freeList?


-   longnentries;   /* number of entries in 
associated buckets */
+   longnfree;  /* number of free entries in 
the list */
+   longnalloced;   /* number of entries initially 
allocated for

Why do we need nfree?  HASH_ASSIGN should do the same thing as
HASH_REMOVE.  Maybe the reason is that the code tries to put the
detached bucket on a different free list, but we can just remember the
freelist_idx for the detached bucket as we do for hashp.  I think that
should largely reduce the footprint of this patch.

-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);

That optimization may help a bit, but isn't it irrelevant to
this patch?

+   case HASH_REUSE:
+   if (currBucket != NULL)
+   {
+   /* check there is no unfinished 
HASH_REUSE+HASH_ASSIGN pair */
+   Assert(DynaHashReuse.hashp == NULL);
+   Assert(DynaHashReuse.element == NULL);

I think all cases in the switch(action) other than HASH_ASSIGN need
this assertion, and there is no need to check both; maybe checking only
element would be enough.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-10 Thread Kyotaro Horiguchi
At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Thanks!  I looked into dynahash part.
> 
>  struct HASHHDR
>  {
> - /*
> -  * The freelist can become a point of contention in high-concurrency 
> hash
> 
> Why did you move around the freeList?
> 
> 
> - longnentries;   /* number of entries in 
> associated buckets */
> + longnfree;  /* number of free entries in 
> the list */
> + longnalloced;   /* number of entries initially 
> allocated for
> 
> Why do we need nfree?  HASH_ASSING should do the same thing with
> HASH_REMOVE.  Maybe the reason is the code tries to put the detached
> bucket to different free list, but we can just remember the
> freelist_idx for the detached bucket as we do for hashp.  I think that
> should largely reduce the footprint of this patch.
> 
> -static void hdefault(HTAB *hashp);
> +static void hdefault(HTAB *hashp, bool partitioned);
> 
> That optimization may work even a bit, but it is not irrelevant to
> this patch?
> 
> + case HASH_REUSE:
> + if (currBucket != NULL)
> + {
> + /* check there is no unfinished 
> HASH_REUSE+HASH_ASSIGN pair */
> + Assert(DynaHashReuse.hashp == NULL);
> + Assert(DynaHashReuse.element == NULL);
> 
> I think all cases in the switch(action) other than HASH_ASSIGN needs
> this assertion and no need for checking both, maybe only for element
> would be enough.

While I looked at the buf_table part, I came up with additional comments.

BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
{
hash_search_with_hash_value(SharedBufHash,

HASH_ASSIGN,
...
BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

BufTableDelete considers both the reuse and !reuse cases, but
BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
odd.  We should use HASH_ENTER here.  Thus I think it is more
reasonable that HASH_ENTER uses the stashed entry if it exists and is
needed, or returns it to the freelist if it exists but is not needed.

What do you think about this?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-11 Thread Kyotaro Horiguchi
At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > Thanks!  I looked into dynahash part.

Then I looked into the bufmgr part.  It looks fine to me, but I have
some comments on the code comments.

>* To change the association of a valid buffer, we'll need to 
> have
>* exclusive lock on both the old and new mapping partitions.
>   if (oldFlags & BM_TAG_VALID)

We don't take a lock on the new mapping partition here.


+* Clear out the buffer's tag and flags.  We must do this to ensure that
+* linear scans of the buffer array don't think the buffer is valid. We
+* also reset the usage_count since any recency of use of the old 
content
+* is no longer relevant.
+*
+* We are single pinner, we hold buffer header lock and exclusive
+* partition lock (if tag is valid). Given these statements it is safe 
to
+* clear tag since no other process can inspect it to the moment.

This comment is a merger of the comments from InvalidateBuffer and
BufferAlloc.  But I think what we need to explain here is why we
invalidate the buffer despite the fact that we are going to reuse it
soon.  And I think we need to state here that the old buffer is now
safe to use for the new tag.  I'm not sure the statement is really
correct, but clearing it out actually looks safer.

> Now it is safe to use victim buffer for new tag.  Invalidate the
> buffer before releasing header lock to ensure that linear scans of
> the buffer array don't think the buffer is valid.  It is safe
> because it is guaranteed that we're the single pinner of the buffer.
> That pin also prevents the buffer from being stolen by others until
> we reuse it or return it to freelist.

So I want to revise the following comment.

-* Now it is safe to use victim buffer for new tag.
+* Now reuse victim buffer for new tag.
>* Make sure BM_PERMANENT is set for buffers that must be written at 
> every
>* checkpoint.  Unlogged buffers only need to be written at shutdown
>* checkpoints, except for their "init" forks, which need to be treated
>* just like permanent relations.
>*
>* The usage_count starts out at 1 so that the buffer can survive one
>* clock-sweep pass.

But if you think the current comment is fine, I don't insist on the
comment changes.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Kyotaro Horiguchi
At Fri, 11 Mar 2022 11:30:27 +0300, Yura Sokolov  
wrote in 
> On Fri, 11/03/2022 at 15:30 +0900, Kyotaro Horiguchi wrote:
> > At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov  
> > wrote in 
> > > On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> > > > Ok, here is v4.
> > > 
> > > And here is v5.
> > > 
> > > First, there was compilation error in Assert in dynahash.c .
> > > Excuse me for not checking before sending previous version.
> > > 
> > > Second, I add third commit that reduces HASHHDR allocation
> > > size for non-partitioned dynahash:
> > > - moved freeList to last position
> > > - alloc and memset offset(HASHHDR, freeList[1]) for
> > >   non-partitioned hash tables.
> > > I didn't benchmarked it, but I will be surprised if it
> > > matters much in performance sence.
> > > 
> > > Third, I put all three commits into single file to not
> > > confuse commitfest application.
> > 
> > Thanks!  I looked into dynahash part.
> > 
> >  struct HASHHDR
> >  {
> > -   /*
> > -* The freelist can become a point of contention in 
> > high-concurrency hash
> > 
> > Why did you move around the freeList?
> > 
> > 
> > -   longnentries;   /* number of entries in 
> > associated buckets */
> > +   longnfree;  /* number of free entries 
> > in the list */
> > +   longnalloced;   /* number of entries 
> > initially allocated for
> > 
> > Why do we need nfree?  HASH_ASSING should do the same thing with
> > HASH_REMOVE.  Maybe the reason is the code tries to put the detached
> > bucket to different free list, but we can just remember the
> > freelist_idx for the detached bucket as we do for hashp.  I think that
> > should largely reduce the footprint of this patch.
> 
> If we keep nentries, then we need to fix nentries in both old
> "freeList" partition and new one. It is two freeList[partition]->mutex
> lock+unlock pairs.
>
> But count of free elements doesn't change, so if we change nentries
> to nfree, then no need to fix freeList[partition]->nfree counters,
> no need to lock+unlock. 

Ah, okay.  I missed that bucket reuse changes the key in most cases.

But I still don't think it's good to move entries between partition
freelists, for another reason.  I'm afraid that the freelists could get
into an imbalanced state.  get_hash_entry prefers main shmem allocation
over other freelists, so that could lead to freelist bloat, or worse
contention than the traditional way involving more than two partitions.

I'll examine the possibility of resolving this...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Kyotaro Horiguchi
At Fri, 11 Mar 2022 12:34:32 +0300, Yura Sokolov  
wrote in 
> On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:
> > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi 
> >  > BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool 
> > reuse)
> > 
> > BufTableDelete considers both reuse and !reuse cases but
> > BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> > odd. We should use HASH_ENTER here.  Thus I think it is more
> > reasonable that HASH_ENTRY uses the stashed entry if exists and
> > needed, or returns it to freelist if exists but not needed.
> > 
> > What do you think about this?
> 
> Well... I don't like it but I don't mind either.
> 
> Code in HASH_ENTER and HASH_ASSIGN cases differs much.
> On the other hand, probably it is possible to merge it carefuly.
> I'll try.

Honestly, I'm not sure it wins on a performance basis.  It just came
from interface consistency (mmm, a bit different, maybe... convincingness?).

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-13 Thread Kyotaro Horiguchi
At Fri, 11 Mar 2022 15:39:13 -0500, Robert Haas  wrote 
in 
> On Thu, Mar 10, 2022 at 9:38 PM Kyotaro Horiguchi
>  wrote:
> > It seems to me too rigorous that pg_get_wal_records_info/stats()
> > reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> > the real end-of-WAL is more kind to users.  I think the same with the
> > restriction that start and end LSN are required to be different.
> 
> In his review just yesterday, Jeff suggested this: "Don't issue
> WARNINGs or other messages for ordinary situations, like when
> pg_get_wal_records_info() hits the end of WAL." I think he's entirely
> right, and I don't think any patch that does otherwise should get

It depends on what we consider "ordinary" here.  If we accept that the
specified LSN range may not be fully filled, the case above is ordinary
and needs neither a WARNING nor an INFO.  I'm fine with that definition
here.

> committed. It is worth remembering that the results of queries are
> often examined by something other than a human being sitting at a psql
> terminal. Any tool that uses this is going to want to understand what
> happened from the result set, not by parsing strings that may show up
> inside warning messages.

Right.  I don't think it is right that a WARNING should be required to
evaluate the result.  And a WARNING like 'reached end-of-WAL before end
LSN' is the kind that is not needed to evaluate the result, since each
WAL row contains at least the start LSN.

> I think that the right answer here is to have a function that returns
> one row per record parsed, and each row should also include the start
> and end LSN of the record. If for some reason the WAL records return
> start after the specified start LSN (e.g. because we skip over a page
> header) or end before the specified end LSN (e.g. because we reach
> end-of-WAL) the user can figure it out from looking at the LSNs in the
> output rows and comparing them to the LSNs provided as input.

I agree with you here.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Allow async standbys wait for sync replication

2022-03-13 Thread Kyotaro Horiguchi
At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart  
wrote in 
> On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
> > To me it's architecturally the completely wrong direction. We should move in
> > the *other* direction, i.e. allow WAL to be sent to standbys before the
> > primary has finished flushing it locally. Which requires similar
> > infrastructure to what we're discussing here.
> 
> I think this is a good point.  After all, WALRead() has the following
> comment:
> 
>  * XXX probably this should be improved to suck data directly from the
>  * WAL buffers when possible.
> 
> Once you have all the infrastructure for that, holding back WAL replay on
> async standbys based on synchronous replication might be relatively easy.

That is, as I understand it, async standbys would be required to allow
overwriting existing unreplayed records after reconnection.  Putting
aside how to remember that LSN, if that happens at a segment boundary,
the async replica may run into a situation similar to the
missing-contrecord case.  But a standby cannot insert any record of
its own to get out of that situation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Allow async standbys wait for sync replication

2022-03-13 Thread Kyotaro Horiguchi
At Mon, 14 Mar 2022 11:30:02 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart  
> wrote in 
> > On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
> > > To me it's architecturally the completely wrong direction. We should move 
> > > in
> > > the *other* direction, i.e. allow WAL to be sent to standbys before the
> > > primary has finished flushing it locally. Which requires similar
> > > infrastructure to what we're discussing here.
> > 
> > I think this is a good point.  After all, WALRead() has the following
> > comment:
> > 
> >  * XXX probably this should be improved to suck data directly from the
> >  * WAL buffers when possible.
> > 
> > Once you have all the infrastructure for that, holding back WAL replay on
> > async standbys based on synchronous replication might be relatively easy.

Just to be clear, and a bit off-topic: I think the optimization itself
is quite promising and worth having.

> That is, (as my understanding) async standbys are required to allow
> overwriting existing unreplayed records after reconnection.  But,
> putting aside how to remember that LSN, if that happens at a segment
> boundary, the async replica may run into the similar situation with
> the missing-contrecord case.  But standby cannot insert any original
> record to get out from that situation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Kyotaro Horiguchi
At Mon, 14 Mar 2022 09:39:48 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I'll examine the possibility to resolve this...

The coexistence of nfree and nalloced confused me, and I found the
reason.

When a partition collects many REUSE-ASSIGN-REMOVEd elements from
other partitions, its nfree gets larger than its nalloced.  This is
the odd point of the two counters.  nalloced is only ever referred to
as sum(nalloced[]), so we don't need nalloced on a per-partition
basis, and the formula for the number of used elements would be as
follows.

 sum(nalloced[]) - sum(nfree[])
 = (total nalloced) - sum(nfree[])

We rarely create fresh elements in shared hashes, so I don't think
there would be additional contention on the total-nalloced counter
even if it were a global atomic.

So the remaining issue is the possible imbalance among partitions.  On
second thought, with the current code, if partition usage is badly
skewed, a heavily hit partition eventually collects elements via
get_hash_entry().  With the patch, a similar thing happens via the
REUSE-ASSIGN-REMOVE sequence.  Buffers once used for something won't
be freed until buffer invalidation, but bulk buffer invalidation won't
distribute the freed buffers among partitions in a skewed way.  So I
conclude for now that this is a non-issue.

So my opinion on the counters is:

I'd like to ask you to remove nalloced from partitions then add a
global atomic for the same use?

No need to do something for the possible deviation issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Accept IP addresses in server certificate SANs

2022-03-13 Thread Kyotaro Horiguchi
At Thu, 17 Feb 2022 17:29:15 +, Jacob Champion  wrote 
in 
> On Tue, 2022-02-15 at 15:16 +0900, Kyotaro Horiguchi wrote:
> > (This needs rebasing)
> 
> Done in v6, attached.

Thanks!

> > # I forgot to mention that, the test fails for me even without the
> > # change.  I didn't check what is wrong there, though.
> 
> Ah. We should probably figure that out, then -- what failures do you
> see?

I forget the details, but v6 still fails for me.  I think it is the same failure.

t/003_sslinfo.pl ... 1/? # Tests were run but no plan was declared and 
done_testing() was not seen.
# Looks like your test exited with 29 just after 6.
t/003_sslinfo.pl ... Dubious, test returned 29 (wstat 7424, 0x1d00)
All 6 subtests passed 
...
Result: FAIL

The script complains like this:

ok 6 - ssl_client_cert_present() for connection with cert
connection error: 'psql: error: connection to server at "127.0.0.1", port 62656 
failed: SSL error: tlsv1 alert unknown ca'
while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt sslmode=require 
dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser host=localhost -f - -v 
ON_ERROR_STOP=1' at 
/home/horiguti/work/worktrees/ipsan/src/test/ssl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
 line 1873.

So psql seems to dislike the CA certificate.  I will also dig into
that.

> > Mmm. after the end of the loop, rc is non-zero only when the loop has
> > exited by the break and otherwise rc is zero. Why isn't it equivalent
> > to setting check_cn to false at the break?
> 
> check_cn can be false if rc is zero, too; it means that we found a SAN
> of the correct type but it didn't match.

Don't we count unmatched SANs of the correct type as "present"?  In
that case I think we don't fall back to the CN.

> > Anyway, apart from that detail, I reconfirmed the spec the patch is
> > going to implement.
> > 
> >   * If connhost contains a DNS name, and the certificate's SANs contain any
> >   * dNSName entries, then we'll ignore the Subject Common Name entirely;
> >   * otherwise, we fall back to checking the CN. (This behavior matches the
> >   * RFC.)
> > 
> > Sure.
> > 
> >   * If connhost contains an IP address, and the SANs contain iPAddress
> >   * entries, we again ignore the CN. Otherwise, we allow the CN to match,
> >   * EVEN IF there is a dNSName in the SANs. (RFC 6125 prohibits this: "A
> >   * client MUST NOT seek a match for a reference identifier of CN-ID if the
> >   * presented identifiers include a DNS-ID, SRV-ID, URI-ID, or any
> >   * application-specific identifier types supported by the client.")
> > 
> > Actually the patch searches for a match of IP address connhost from
> > dNSName SANs even if iPAddress SANs exist.  I think we've not
> > explicitly defined the behavior in that case.
> 
> That's a good point; I didn't change the prior behavior. I feel more
> comfortable leaving that check, since it is technically possible to
> push something that looks like an IP address into a dNSName SAN. We
> should probably make an explicit decision on that, as you say.
> 
> But I don't think that contradicts the code comment, does it? The
> comment is just talking about CN fallback scenarios. If you find a
> match in a dNSName, there's no reason to fall back to the CN.

The comment explains the spec correctly. From a practical view, the
behavior above doesn't seem to make things insecure.  So I don't have
a strong opinion on the choice of the behaviors.

The only thing I'm concerned about here is the possibility that the
decision corners us into some uncomfortable spot between the RFC and
our spec in the future.  On the other hand, changing the behavior
could immediately inconvenience someone.

So, I'd like to leave it to committers:p

> > I supposed that we only
> > be deviant in the case "IP address connhost and no SANs of any type
> > exists". What do you think about it?
> 
> We fall back in the case of "IP address connhost and dNSName SANs
> exist", which is prohibited by that part of RFC 6125. I don't think we
> deviate in the case you described; can you explain further?

In that case, i.e., when connhost is an IP address and no SANs exist
at all, we fall back to the CN.  On the other hand, in RFC 6125:

rfc6125> In some cases, the URI is specified as an IP address rather
rfc6125> than a hostname. In this case, the iPAddress subjectAltName
rfc6125> must be present in the certificate and must exactly match the
rfc6125> IP in the URI.

It seems to me that this denies that behavior.  Regardless of the
existence of other types of SANs, an iPAddress SAN is required if
connhost is an IP address.  (That is, it doesn't seem to me that there's any contex

Re: BufferAlloc: don't take two simultaneous locks

2022-03-14 Thread Kyotaro Horiguchi
At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov  
wrote in 
> В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > I'd like to ask you to remove nalloced from partitions then add a
> > global atomic for the same use?
> 
> I really believe it should be global. I made it per-partition to
> not overcomplicate first versions. Glad you tell it.
> 
> I thought to protect it with freeList[0].mutex, but probably atomic
> is better idea here. But which atomic to chose: uint64 or uint32?
> Based on sizeof(long)?
> Ok, I'll do in next version.

The current nentries is a long (= int64 on CentOS).  And uint32 can
support only roughly 2^32 * 8192 bytes = 32TB of shared buffers, which
doesn't seem safe enough.  So it would be uint64.

> Whole get_hash_entry look strange.
> Doesn't it better to cycle through partitions and only then go to
> get_hash_entry?
> May be there should be bitmap for non-empty free lists? 32bit for
> 32 partitions. But wouldn't bitmap became contention point itself?

The code puts significance on avoiding contention caused by visiting
the freelists of other partitions, and perhaps assumes that freelist
shortages rarely happen.

I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
128kB shared buffers and saw that get_hash_entry never takes the
!element_alloc() path and always allocates a fresh entry, saturating
at 30 newly allocated elements around the middle of a 100-second run.

Then I tried the same with the patch, and I was surprised to see that
the rise in the number of newly allocated elements didn't stop,
reaching 511 elements after the 100-second run.  So my concern was
valid: the change in dynahash continuously/repeatedly causes a
shortage of free list entries.  I'm not sure how much impact it would
have on performance if we changed get_hash_entry to prefer other
freelists, though.


By the way, there's the following comment in StrategyInitalize.

>* Initialize the shared buffer lookup hashtable.
>*
>* Since we can't tolerate running out of lookup table entries, we must 
> be
>* sure to specify an adequate table size here.  The maximum 
> steady-state
>* usage is of course NBuffers entries, but BufferAlloc() tries to 
> insert
>* a new entry before deleting the old.  In principle this could be
>* happening in each partition concurrently, so we could need as many as
>* NBuffers + NUM_BUFFER_PARTITIONS entries.
>*/
>   InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

"but BufferAlloc() tries to insert a new entry before deleting the
old." gets false by this patch but still need that additional room for
stashed entries.  It seems like needing a fix.



regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-14 Thread Kyotaro Horiguchi
At Mon, 14 Mar 2022 17:12:48 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Then, I tried the same with the patch, and I am surprized to see that
> the rise of the number of newly allocated elements didn't stop and
> went up to 511 elements after the 100 seconds run.  So I found that my
> concern was valid.

Which means my last conclusion was most likely wrong..

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add 64-bit XIDs into PostgreSQL 15

2022-03-14 Thread Kyotaro Horiguchi
At Mon, 14 Mar 2022 19:43:40 +0400, Pavel Borisov  
wrote in 
> > I'd like to add a few quick notes on what's been done in v17.

I have some comments from a quick look-through.  Apologies in advance
for any wrong comments due to my lack of knowledge of the whole patch
set.

> > Patches 0001 and 0002 that are planned to be committed to PG15 are almost
> > unchanged with the exception of one unnecessary cast in 0002 removed.

0001:

 XID_FMT has quite a bad impact on the translatability of error
 messages.  Commit 3286065651 removed INT64_FORMAT from translatable
 texts for that reason, and this re-introduces it in several places.
 0001 itself does no harm, but 0005 replaces XID_FMT with
 INT64_FORMAT.  Other patches have the same issue, too.

> > We've also addressed several issues in patch 0005 (which is planned for
> > PG16):
> > - The bug with frozen xids after pg_upgrade, reported by Justin [1]
> > - Added proper processing of double xmax pages in
> > HeapPageSetPruneXidInternal()
> > - Fixed xids comparison. Initially in the patch it was changed to simple <
> > <= => > for 64 bit values. Now v17 patch has returned this to the way
> > similar to what is used in STABLE for 32-bit xids, but using modulus-64
> > numeric ring. The main goal of this change was to fix SRLU tests that were 
> > mentioned

IIUC, the following part in 0002 doesn't consider wraparound.

-asyncQueuePagePrecedes(int p, int q)
+asyncQueuePagePrecedes(int64 p, int64 q)
 {
-   return asyncQueuePageDiff(p, q) < 0;
+   return p < q;
 }

> > by Alexander to have been disabled. We've fixed and enabled most of them,
> > but some of them are still need to be fixed and enabled.
> >
> > Also, we've pgindent-ed all the patches.

0005 has "new blank line at EOF".

> > As patches that are planned to be delivered to PG15 are almost unchanged,
> > I completely agree with Alexander's plan to consider these patches (0001
> > and 0002) as RfC.
> >
> > All activity, improvement, review, etc. related to the whole patchset is
> > also very much appreciated. Big thanks to Alexander for working on the
> > patch set!
> >
> > [1]
> > https://www.postgresql.org/message-id/20220115063925.GS14051%40telsasoft.com
> >
> Also, the patch v17 (0005) returns SLRU_PAGES_PER_SEGMENT to the previous
> value of 32.

0002 re-introduces pg_strtouint64, which have been recently removed by
3c6f8c011f.
> Simplify the general-purpose 64-bit integer parsing APIs
> 
> pg_strtouint64() is a wrapper around strtoull/strtoul/_strtoui64, but
> it seems no longer necessary to have this indirection.
..
>type definition int64/uint64.  For that, add new macros strtoi64() and
>strtou64() in c.h as thin wrappers around strtol()/strtoul() or
>strtoll()/stroull().  This makes these functions available everywhere

regards.


-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-03-14 Thread Kyotaro Horiguchi
At Mon, 14 Mar 2022 17:37:40 -0400, Robert Haas  wrote 
in 
> On Mon, Mar 7, 2022 at 3:39 AM Kyotaro Horiguchi
>  wrote:
> > So the new framework has been dropped in this version.
> > The second test is removed as it is irrelevant to this bug.
> >
> > In this version the patch is a single file that contains the test.
> 
> The status of this patch in the CommitFest was set to "Waiting for
> Author." Since a new patch has been submitted since that status was
> set, I have changed it to "Needs Review." Since this is now in its
> 15th CommitFest, we really should get it fixed; that's kind of
> ridiculous. (I am as much to blame as anyone.) It does seem to be a
> legitimate bug.
> 
> A few questions about the patch:

Thanks for looking this!

> 1. Why is it OK to just skip the operation without making it up later?

Does "it" mean removal of directories?  It is not okay, but in the
first place it is out-of-scope of this patch to fix that.  The patch
leaves the existing code alone.  This patch just has recovery ignore
invalid accesses into eventually removed objects.

Maybe, I don't understand you question..

> 2. Why not instead change the code so that the operation can succeed,
> by creating the prerequisite parent directories? Do we not have enough
> information for that? I'm not saying that we definitely should do it
> that way rather than this way, but I think we do take that approach in
> some cases.

That was first proposed by Paul Guo [1], then changed very early in
this thread so that it ignores failed directory creations.  After
that, it became conscious of recovery consistency by managing an
invalid-access list.

[1] 
https://www.postgresql.org/message-id/flat/20210327142316.GA32517%40alvherre.pgsql#a557bd47207a446ce206879676e0140a

I think there was no strong reason for the current shape, but I
personally rather like the remember-invalid-accesses way, because it
doesn't dirty the data directory and is consistent with how we treat
missing heap pages.

I tried a slightly tweaked version (attached) of the first proposal
and confirmed that it works with the current test script.  It doesn't
check recovery consistency, but otherwise that way also seems fine.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index c37e3c9a9a..28aed8d296 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,6 +47,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "common/file_perm.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -2382,6 +2383,7 @@ dbase_redo(XLogReaderState *record)
 		xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
 		char	   *src_path;
 		char	   *dst_path;
+		char	   *parent_path;
 		struct stat st;
 
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
@@ -2401,6 +2403,41 @@ dbase_redo(XLogReaderState *record)
 								dst_path)));
 		}
 
+		/*
+		 * It is possible that the tablespace was later dropped, but we are
+		 * re-redoing database create before that. In that case, those
+		 * directories are gone, and we do not create symlink.
+		 */
+		if (stat(dst_path, &st) < 0 && errno == ENOENT)
+		{
+			parent_path = pstrdup(dst_path);
+			get_parent_directory(parent_path);
+			elog(WARNING, "creating missing directory: %s", parent_path);
+			if (stat(parent_path, &st) != 0 &&
+				pg_mkdir_p(parent_path, pg_dir_create_mode) != 0)
+			{
+				ereport(WARNING,
+						(errmsg("can not recursively create directory \"%s\"",
+								parent_path)));
+			}
+		}
+
+		/*
+		 * There's a case where the copy source directory is missing for the
+		 * same reason above.  Create the empty source directory so that
+		 * copydir below doesn't fail.  The directory will be dropped soon by
+		 * recovery.
+		 */
+		if (stat(src_path, &st) < 0 && errno == ENOENT)
+		{
+			elog(WARNING, "creating missing copy source directory: %

Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Kyotaro Horiguchi
Thanks for the new version.

At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov  
wrote in 
> В Пн, 14/03/2022 в 14:57 +0300, Yura Sokolov пишет:
> > В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет:
> > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov 
> > >  wrote in 
> > > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > > 128kB shared buffers and I saw that get_hash_entry never takes the
> > > !element_alloc() path and always allocate a fresh entry, then
> > > saturates at 30 new elements allocated at the medium of a 100 seconds
> > > run.
> > > 
> > > Then, I tried the same with the patch, and I am surprized to see that
> > > the rise of the number of newly allocated elements didn't stop and
> > > went up to 511 elements after the 100 seconds run.  So I found that my
> > > concern was valid.  The change in dynahash actually
> > > continuously/repeatedly causes lack of free list entries.  I'm not
> > > sure how much the impact given on performance if we change
> > > get_hash_entry to prefer other freelists, though.
> > 
> > Well, it is quite strange SharedBufHash is not allocated as
> > HASH_FIXED_SIZE. Could you check what happens with this flag set?
> > I'll try as well.
> > 
> > Other way to reduce observed case is to remember freelist_idx for
> > reused entry. I didn't believe it matters much since entries migrated
> > netherless, but probably due to some hot buffers there are tention to
> > crowd particular freelist.
> 
> Well, I did both. Everything looks ok.

Hmm. v8 returns a stashed element with its original partition index
when the element is *not* reused.  But what I saw in the previous test
runs is the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version
looks like it behaves the same way as (or somehow even worse than) the
previous version: get_hash_entry continuously suffers from a lack of
freelist entries.  (FWIW, attached are the test-output diffs for both
master and patched.)

master finally allocated 31 fresh elements for a 100s run.

> ALLOCED: 31;; freshly allocated

v8 finally borrowed 33620 times from another freelist and 0 freshly
allocated (ah, this version changes that..)
Finally v8 results in:

> RETURNED: 50806;; returned stashed elements
> BORROWED: 33620;; borrowed from another freelist
> REUSED: 1812664;; stashed
> ASSIGNED: 1762377  ;; reused
>(ALLOCED: 0);; freshly allocated

The numbers include a huge degradation from the frequent elog calls,
so they cannot be relied on naively, but they should show well enough
what is happening.

> I lost access to Xeon 8354H, so returned to old Xeon X5675.
...
> Strange thing: both master and patched version has higher
> peak tps at X5676 at medium connections (17 or 27 clients)
> than in first october version [1]. But lower tps at higher
> connections number (>= 191 clients).
> I'll try to bisect on master this unfortunate change.

The reversed preference order between fresh allocation and borrowing
from another freelist might be a factor.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940fa..ac651b98e6 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -31,7 +31,7 @@ typedef struct
 	int			id;				/* Associated buffer ID */
 } BufferLookupEnt;
 
-static HTAB *SharedBufHash;
+HTAB *SharedBufHash;
 
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d70..294516ef01 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -195,6 +195,11 @@ struct HASHHDR
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
 	int			nelem_alloc;	/* number of entries to allocate at once */
+	int			alloc;
+	int			reuse;
+	int			borrow;
+	int			assign;
+	int			ret;
 
 #ifdef HASH_STATISTICS
 
@@ -963,6 +968,7 @@ hash_search(HTAB *hashp,
 											foundPtr);
 }
 
+extern HTAB *SharedBufHash;
 void *
 hash_search_with_hash_value(HTAB *hashp,
 							const void *keyPtr,
@@ -1354,6 +1360,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		hctl->freeList[freelist_idx].nentries++;
 		SpinLockRelease(&hctl->freeL

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 12:19:47 +0530, Bharath Rupireddy 
 wrote in 
> On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi
>  wrote:
> 0001 - I don't think you need to do this as UpdateControlFile
> (update_controlfile) will anyway update it, no?
> + /* Update control file using current time */
> + ControlFile->time = (pg_time_t) time(NULL);

Ugh.. Yes. It is a copy-pasto from older versions.  They may have the
same copy-pasto..



> > 0002: Add REDO/Checkpiont LSNs to checkpoinkt-end log message.
> >   (The main patch in this thread)
> 
> 0002 - If at all the intention is to say that no ControlFileLock is
> required while reading ControlFile->checkPoint and
> ControlFile->checkPointCopy.redo, let's say it, no? How about
> something like "No ControlFileLock is required while reading
> ControlFile->checkPoint and ControlFile->checkPointCopy.redo as there
> can't be any other process updating them concurrently."?
> 
> + /* we are the only updator of these variables */
> + LSN_FORMAT_ARGS(ControlFile->checkPoint),
> + LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;

I thought the comment explained that, but it would be better to be
more specific.  It is changed as follows.

> * ControlFileLock is not required as we are the only
> * updator of these variables.


> > 0003: Replace (WAL-)location to LSN in pg_controldata.
> >
> > 0004: Replace (WAL-)location to LSN in user-facing texts.
> >   (This doesn't reflect my recent comments.)
> 
> If you don't mind, can you please put the comments here?

Okay. It's the following message.

https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com

> The old 0003 (attached 0004):
> 
> 
> 
> +++ b/src/backend/access/rmgrdesc/xlogdesc.c
> - appendStringInfo(buf, "redo %X/%X; "
> + appendStringInfo(buf, "redo lsn %X/%X; "
> 
> 
> 
> It is shown in the context of a checkpoint record, so I think it is
> not needed and rather lengthens the dump line uselessly. 
> 
> 
> 
> +++ b/src/backend/access/transam/xlog.c
> - (errmsg("request to flush past end of generated 
> WAL; request %X/%X, current position %X/%X",
> + (errmsg("request to flush past end of generated 
> WAL; request lsn %X/%X, current lsn %X/%X",
> 
> 
> 
> +++ b/src/backend/replication/walsender.c
> - (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush position of this server %X/%X",
> + (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush LSN of this server %X/%X",
> 
> 
> 
> "WAL" is upper-cased. So it seems rather strange that the "lsn" is
> lower-cased.  In the first place the message doesn't look like a
> user-facing error message and I feel we don't need position or lsn
> there..
> 
> 
> 
> +++ b/src/bin/pg_rewind/pg_rewind.c
> - pg_log_info("servers diverged at WAL location %X/%X on timeline 
> %u",
> + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u",
> 
> 
> 
> I feel that we don't need "WAL" there.
> 
> 
> 
> +++ b/src/bin/pg_waldump/pg_waldump.c
> - printf(_("  -e, --end=RECPTR   stop reading at WAL location 
> RECPTR\n"));
> + printf(_("  -e, --end=RECPTR   stop reading at WAL LSN RECPTR\n"));
> 
> 
> 
> Mmm.. "WAL LSN RECPTR" looks strange to me.  In the first place I
> don't think "RECPTR" is a user-facing term. Doesn't something like the
> follows work?
> 
> 
> 
> + printf(_("  -e, --end=WAL-LSN   stop reading at WAL-LSN\n"));
> 
> 
> 
> Some changes in this patch shorten the main message text of
> fprintf-ish functions, so that the succeeding parameters can be
> inlined.


> > 0005: Unhyphenate the word archive-recovery and similars.
> 
> 0005 - How about replacing "crash-recovery" to "crash recovery" in
> postgres-ref.sgml too?

Oh, that's a left-over. Fixed. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 37f7433dce1ea9e41aa54f2c18c0bf3d2aa3c814 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Mar 2022 13:18:30 +0900
Subject: [PATCH v12 1/5] Correctly update control file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup based on the checkpoint that
has just finished in the function.  If the database has exited the
DB_IN_ARCHIVE_RECOVERY state when the f

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 18:26:26 +0900, Michael Paquier  wrote 
in 
> On Tue, Mar 15, 2022 at 05:23:40PM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 15 Mar 2022 12:19:47 +0530, Bharath Rupireddy 
> >  wrote in 
> >> On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi
> >>  wrote:
> >> 0001 - I don't think you need to do this as UpdateControlFile
> >> (update_controlfile) will anyway update it, no?
> >> + /* Update control file using current time */
> >> + ControlFile->time = (pg_time_t) time(NULL);
> > 
> > Ugh.. Yes. It is a copy-pasto from older versions.  They may have the
> > same copy-pasto..
> 
> This thread has shifted to an entirely different discussion,
> presenting patches that touch code paths unrelated to what was first
> stated.  Shouldn't you create a new thread with a proper $subject to
> attract a more correct audience?  

I have felt the same for several messages now.  I believe Fujii-san
wanted to fix the CreateRestartPoint issue before the checkpoint log
patch, but he seems busy these days.

Since the CreateRestartPoint issue is orthogonal to the main patch,
I'll separate that part into another thread.

This thread is also discussing one other topic, the wording of
user-facing texts.  That is likewise orthogonal (3-dimensionally?) to
the other two topics.

In short, I will split the two topics other than the checkpoint log
out into other threads.

Thanks for cueing me to do that!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Possible corruption by CreateRestartPoint at promotion

2022-03-15 Thread Kyotaro Horiguchi
Hello,  (Cc:ed Fujii-san)

This is a diverged topic from [1], which is summarized as $SUBJECT.

To recap:

While discussing additional LSNs in the checkpoint log message,
Fujii-san pointed out [2] that there is a case where
CreateRestartPoint leaves the database unrecoverable when a concurrent
promotion happens.  That corruption is "fixed" by the next checkpoint,
so it is not severe.

AFAICS, since 9.5 no checkpoint/restartpoint can run concurrently with
a restartpoint [3].  So I propose to remove the code path, as attached.
regards.


[1] 
https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

[2] 
https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com

[3] 
https://www.postgresql.org/message-id/20220222.174401.765586897814316743.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 57823c5690469cf404de9fbdf37f55d09abaedd5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Mar 2022 13:18:30 +0900
Subject: [PATCH v4] Correctly update control file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup based on the checkpoint that
has just finished in the function.  If the database has exited the
DB_IN_ARCHIVE_RECOVERY state by the time the function is about to
update the control file, the function refrains from updating the file
at all, then proceeds to WAL cleanup using the latest REDO LSN, which
is now inconsistent with the control file.  As a result, the
succeeding cleanup procedure removes more WAL files than the control
file allows and leaves the database unrecoverable until the next
checkpoint finishes.

Along with that fix, we remove a dead code path for the case where
some other process ran a simultaneous checkpoint.  It seems to be just
a preventive measure, but it is no longer useful because we are sure
that checkpoints are performed only by the checkpointer except in
single-process mode.
---
 src/backend/access/transam/xlog.c | 69 ---
 1 file changed, 44 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ed16f279b1..dd9c564988 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6900,6 +6900,9 @@ CreateRestartPoint(int flags)
 	XLogSegNo	_logSegNo;
 	TimestampTz xtime;
 
+	/* we don't assume concurrent checkpoint/restartpoint to run */
+	Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
 	/* Get a local copy of the last safe checkpoint record. */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -6965,7 +6968,7 @@ CreateRestartPoint(int flags)
 
 	/* Also update the info_lck-protected copy */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+	XLogCtl->RedoRecPtr = RedoRecPtr;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
@@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags)
 	/* Update the process title */
 	update_checkpoint_display(flags, true, false);
 
-	CheckPointGuts(lastCheckPoint.redo, flags);
+	CheckPointGuts(RedoRecPtr, flags);
+
+	/* Update pg_control */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -6992,30 +6998,26 @@ CreateRestartPoint(int flags)
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+	Assert (PriorRedoPtr < RedoRecPtr);
+
+	ControlFile->checkPoint = lastCheckPointRecPtr;
+	ControlFile->checkPointCopy = lastCheckPoint;
+
 	/*
-	 * Update pg_control, using current time.  Check that it still shows
-	 * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-	 * this is a quick hack to make sure nothing really bad happens if somehow
-	 * we get here after the end-of-recovery checkpoint.
+	 * Ensure minRecoveryPoint is past the checkpoint record while archive
+	 * recovery is still ongoing.  Normally, this will have happened already
+	 * while writing out dirty buffers, but not necessarily - e.g. because no
+	 * buffers were dirtied.  We do this because a non-exclusive base backup
+	 * uses minRecoveryPoint to determine which WAL files must be included in
+	 * the backup, and the file (or files) containing the checkpoint record
+	 * must be included, at a minimum. Note that for an ordinary restart of
+	 * recovery there's no value in having the minimum recovery point any
+	 * earlier than this anyway, because redo will begin just after the
+	 * checkpoint record.  This is a quick hack to make sure nothing really bad
+	 * happens if somehow we get here after the end-of-recovery checkpoint.
 	 */
-	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-		ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
 	{
-		ControlFile->checkPoint = lastCheckPoin

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi
Hello, this is a follow-up topic of [1] (add LSNs to checkpoint logs).

Many user-facing texts contain wording like "WAL location" or the
like. The attached WIP patches fix such wordings to "WAL LSN" or
similar.

The attached patches are v13, but nothing has changed from v12.
My latest comments on this version are as follows.

https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com

> The old 0003 (attached 0004):
> 
> 
> 
> +++ b/src/backend/access/rmgrdesc/xlogdesc.c
> - appendStringInfo(buf, "redo %X/%X; "
> + appendStringInfo(buf, "redo lsn %X/%X; "
> 
> 
> 
> It is shown in the context of a checkpoint record, so I think it is
> not needed or rather lengthning the dump line uselessly. 
> 
> 
> 
> +++ b/src/backend/access/transam/xlog.c
> - (errmsg("request to flush past end of generated 
> WAL; request %X/%X, current position %X/%X",
> + (errmsg("request to flush past end of generated 
> WAL; request lsn %X/%X, current lsn %X/%X",
> 
> 
> 
> +++ b/src/backend/replication/walsender.c
> - (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush position of this server %X/%X",
> + (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush LSN of this server %X/%X",
> 
> 
> 
> "WAL" is upper-cased. So it seems rather strange that the "lsn" is
> lower-cased.  In the first place the message doesn't look like a
> user-facing error message and I feel we don't need position or lsn
> there..
> 
> 
> 
> +++ b/src/bin/pg_rewind/pg_rewind.c
> - pg_log_info("servers diverged at WAL location %X/%X on timeline 
> %u",
> + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u",
> 
> 
> 
> I feel that we don't need "WAL" there.
> 
> 
> 
> +++ b/src/bin/pg_waldump/pg_waldump.c
> - printf(_("  -e, --end=RECPTR   stop reading at WAL location 
> RECPTR\n"));
> + printf(_("  -e, --end=RECPTR   stop reading at WAL LSN RECPTR\n"));
> 
> 
> 
> Mmm.. "WAL LSN RECPTR" looks strange to me.  In the first place I
> don't think "RECPTR" is a user-facing term. Doesn't something like the
> follows work?
> 
> 
> 
> + printf(_("  -e, --end=WAL-LSN   stop reading at WAL-LSN\n"));
> 
> 
> 
> In some changes in this patch shorten the main message text of
> fprintf-ish functions.  That makes the succeeding parameters can be
> inlined.
regards.

[1] 
https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From afed063f09b788bc07a0e2665499c3e3c3edc214 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 1 Feb 2022 03:57:04 +
Subject: [PATCH v13 1/2] Change "location" to "lsn" in pg_controldata

---
 src/bin/pg_controldata/pg_controldata.c| 10 +-
 src/test/recovery/t/016_min_consistency.pl |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d94..59f39267df 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -235,9 +235,9 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified: %s\n"),
 		   pgctime_str);
-	printf(_("Latest checkpoint location:   %X/%X\n"),
+	printf(_("Latest checkpoint LSN:%X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPoint));
-	printf(_("Latest checkpoint's REDO location:%X/%X\n"),
+	printf(_("Latest checkpoint's REDO LSN: %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo));
 	printf(_("Latest checkpoint's REDO WAL file:%s\n"),
 		   xlogfilename);
@@ -274,13 +274,13 @@ main(int argc, char *argv[])
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->unloggedLSN));
-	printf(_("Minimum recovery ending location: %X/%X\n"),
+	printf(_("Minimum recovery ending LSN:  %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint));
 	printf(_("Min recovery ending loc's timeline:   %u\n"),
 		   ControlFile->minRecoveryPointTLI);
-	printf(_("Backup start location:%X/%X\n"),
+	printf(_(

Unhyphenation of crash-recovery

2022-03-15 Thread Kyotaro Horiguchi
Hello, this is a derived topic from [1], summarized as $SUBJECT.

This just removes useless hyphens from the wordings
"(crash|emergency)-recovery".  We don't have such hyphenation for
"archive recovery".  This patch fixes non-user-facing texts as well as
user-facing ones.

regards.

[1] 
https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 751acd2510e036a5c484e5768dd9e4e85cd8b2cf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 4 Mar 2022 13:53:38 +0900
Subject: [PATCH v13] Get rid of uses of some hyphenated words

Both "crash-recovery" and "crash recovery" are in use in the
tree.  All occurrences of "archive recovery" are spelled
unhyphenated.  Unify the similar uses to the unhyphenated form.
---
 doc/src/sgml/ref/pg_ctl-ref.sgml  | 2 +-
 doc/src/sgml/ref/postgres-ref.sgml| 2 +-
 src/backend/commands/trigger.c| 2 +-
 src/backend/utils/cache/relcache.c| 2 +-
 src/test/recovery/t/023_pitr_prepared_xact.pl | 2 +-
 src/test/regress/parallel_schedule| 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/pg_ctl-ref.sgml b/doc/src/sgml/ref/pg_ctl-ref.sgml
index 3946fa52ea..7cbe5c3048 100644
--- a/doc/src/sgml/ref/pg_ctl-ref.sgml
+++ b/doc/src/sgml/ref/pg_ctl-ref.sgml
@@ -194,7 +194,7 @@ PostgreSQL documentation
rolled back and clients are forcibly disconnected, then the
server is shut down.  Immediate mode will abort
all server processes immediately, without a clean shutdown.  This choice
-   will lead to a crash-recovery cycle during the next server start.
+   will lead to a crash recovery cycle during the next server start.
   
 
   
diff --git a/doc/src/sgml/ref/postgres-ref.sgml b/doc/src/sgml/ref/postgres-ref.sgml
index 55a3f6c69d..fc22fbc4fe 100644
--- a/doc/src/sgml/ref/postgres-ref.sgml
+++ b/doc/src/sgml/ref/postgres-ref.sgml
@@ -719,7 +719,7 @@ PostgreSQL documentation
is also unwise to send SIGKILL to a server
process — the main postgres process will
interpret this as a crash and will force all the sibling processes
-   to quit as part of its standard crash-recovery procedure.
+   to quit as part of its standard crash recovery procedure.
   
  
 
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index e08bd9a370..ce170b4c6e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -1887,7 +1887,7 @@ RelationBuildTriggers(Relation relation)
 	/*
 	 * Note: since we scan the triggers using TriggerRelidNameIndexId, we will
 	 * be reading the triggers in name order, except possibly during
-	 * emergency-recovery operations (ie, IgnoreSystemIndexes). This in turn
+	 * emergency recovery operations (ie, IgnoreSystemIndexes). This in turn
 	 * ensures that triggers will be fired in name order.
 	 */
 	ScanKeyInit(&skey,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index fccffce572..2419cf5285 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -6598,7 +6598,7 @@ RelationCacheInitFilePostInvalidate(void)
  * Remove the init files during postmaster startup.
  *
  * We used to keep the init files across restarts, but that is unsafe in PITR
- * scenarios, and even in simple crash-recovery cases there are windows for
+ * scenarios, and even in simple crash recovery cases there are windows for
  * the init files to become out-of-sync with the database.  So now we just
  * remove them during startup and expect the first backend launch to rebuild
  * them.  Of course, this has to happen in each database of the cluster.
diff --git a/src/test/recovery/t/023_pitr_prepared_xact.pl b/src/test/recovery/t/023_pitr_prepared_xact.pl
index 39e8a8fa17..0f48c20ed1 100644
--- a/src/test/recovery/t/023_pitr_prepared_xact.pl
+++ b/src/test/recovery/t/023_pitr_prepared_xact.pl
@@ -1,7 +1,7 @@
 
 # Copyright (c) 2021-2022, PostgreSQL Global Development Group
 
-# Test for point-in-time-recovery (PITR) with prepared transactions
+# Test for point-in-time recovery (PITR) with prepared transactions
 use strict;
 use warnings;
 use PostgreSQL::Test::Cluster;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6d8f524ae9..8251744bbf 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -13,7 +13,7 @@ test: test_setup
 
 # run tablespace by itself, and early, because it forces a checkpoint;
 # we'd prefer not to have checkpoints later in the tests because that
-# interferes with crash-recovery testing.
+# interferes with crash recovery testing.
 test: tablespace
 
 # --
-- 
2.27.0



Reword "WAL location" as "WAL LSN"

2022-03-15 Thread Kyotaro Horiguchi
Uh.. Sorry, I forgot to change the subject. Resent with the correct
subject.

=
Hello, this is a follow-up topic of [1] (add LSNs to checkpoint logs).

Many user-facing texts contain wording like "WAL location" or the
like. The attached WIP patches fix such wordings to "WAL LSN" or
similar.

The attached patches are v13, but nothing has changed from v12.
My latest comments on this version are as follows.

https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com

> The old 0003 (attached 0004):
> 
> 
> 
> +++ b/src/backend/access/rmgrdesc/xlogdesc.c
> - appendStringInfo(buf, "redo %X/%X; "
> + appendStringInfo(buf, "redo lsn %X/%X; "
> 
> 
> 
> It is shown in the context of a checkpoint record, so I think it is
> not needed or rather lengthning the dump line uselessly. 
> 
> 
> 
> +++ b/src/backend/access/transam/xlog.c
> - (errmsg("request to flush past end of generated 
> WAL; request %X/%X, current position %X/%X",
> + (errmsg("request to flush past end of generated 
> WAL; request lsn %X/%X, current lsn %X/%X",
> 
> 
> 
> +++ b/src/backend/replication/walsender.c
> - (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush position of this server %X/%X",
> + (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush LSN of this server %X/%X",
> 
> 
> 
> "WAL" is upper-cased. So it seems rather strange that the "lsn" is
> lower-cased.  In the first place the message doesn't look like a
> user-facing error message and I feel we don't need position or lsn
> there..
> 
> 
> 
> +++ b/src/bin/pg_rewind/pg_rewind.c
> - pg_log_info("servers diverged at WAL location %X/%X on timeline 
> %u",
> + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u",
> 
> 
> 
> I feel that we don't need "WAL" there.
> 
> 
> 
> +++ b/src/bin/pg_waldump/pg_waldump.c
> - printf(_("  -e, --end=RECPTR   stop reading at WAL location 
> RECPTR\n"));
> + printf(_("  -e, --end=RECPTR   stop reading at WAL LSN RECPTR\n"));
> 
> 
> 
> Mmm.. "WAL LSN RECPTR" looks strange to me.  In the first place I
> don't think "RECPTR" is a user-facing term. Doesn't something like the
> follows work?
> 
> 
> 
> + printf(_("  -e, --end=WAL-LSN   stop reading at WAL-LSN\n"));
> 
> 
> 
> In some changes in this patch shorten the main message text of
> fprintf-ish functions.  That makes the succeeding parameters can be
> inlined.
regards.

[1] 
https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From afed063f09b788bc07a0e2665499c3e3c3edc214 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 1 Feb 2022 03:57:04 +
Subject: [PATCH v13 1/2] Change "location" to "lsn" in pg_controldata

---
 src/bin/pg_controldata/pg_controldata.c| 10 +-
 src/test/recovery/t/016_min_consistency.pl |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d94..59f39267df 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -235,9 +235,9 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified: %s\n"),
 		   pgctime_str);
-	printf(_("Latest checkpoint location:   %X/%X\n"),
+	printf(_("Latest checkpoint LSN:%X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPoint));
-	printf(_("Latest checkpoint's REDO location:%X/%X\n"),
+	printf(_("Latest checkpoint's REDO LSN: %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo));
 	printf(_("Latest checkpoint's REDO WAL file:%s\n"),
 		   xlogfilename);
@@ -274,13 +274,13 @@ main(int argc, char *argv[])
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->unloggedLSN));
-	printf(_("Minimum recovery ending location: %X/%X\n"),
+	printf(_("Minimum recovery ending LSN:  %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint));
 	printf(_("Min recovery ending loc's timeline:   %u\n"),
 		   ControlFile->minRecoveryPointTLI

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 09:19:13 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> In short, I split out the two topics other than checkpoint log to
> other threads.

So, this is about the main topic of this thread, adding LSNs to the
checkpoint log.  Other topics have moved to other threads [1], [2],
[3].

I think this is no longer controversial on its own.  So this patch is
now really Ready-for-Committer and is waiting to be picked up.

regards.


[1] 
https://www.postgresql.org/message-id/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com
[2] 
https://www.postgresql.org/message-id/20220316.102900.2003692961119672246.horikyota.ntt%40gmail.com
[3] 
https://www.postgresql.org/message-id/20220316.102509.785466054344164656.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 4c3d40e7aaf29cb905f3561a291d35981d762456 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 4 Mar 2022 13:22:41 +0900
Subject: [PATCH v13] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN(end) and REDO LSN(start). It gives more
context while analyzing checkpoint-related issues. The pg_controldata
gives the last checkpoint LSN and REDO LSN, but having this info
alongside the log message helps analyze issues that happened
previously, connect the dots and identify the root cause.
---
 src/backend/access/transam/xlog.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ed16f279b1..b85c76d8f8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6121,7 +6121,8 @@ LogCheckpointEnd(bool restartpoint)
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6134,14 +6135,21 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0)));
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/*
+		 * ControlFileLock is not required as we are the only
+		 * updater of these variables.
+		 */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo)));
 	else
 		ereport(LOG,
 (errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6154,7 +6162,13 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0)));
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/*
+		 * ControlFileLock is not required as we are the only
+		 * updater of these variables.
+		 */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo)));
 }
 
 /*
-- 
2.27.0



Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 23:16:52 +1300, Thomas Munro  wrote 
in 
> On Tue, Mar 15, 2022 at 10:30 PM Michael Paquier  wrote:
> > So, which one of a relative path or an absolute path do you think
> > would be better for the user?  My preference tends toward the relative
> > path, as we know that all those tablespaces stay in pg_tblspc/ so one
> > can make the difference with normal tablespaces more easily.  The
> > barrier is thin, though :p
> 
> Sounds good to me.

+1. Doesn't the doc need to mention that?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov  
wrote in 
> On Tue, 15 Mar 2022 at 16:25 +0900, Kyotaro Horiguchi wrote:
> > Hmm. v8 returns stashed element with original partition index when the
> > element is *not* reused.  But what I saw in the previous test runs is
> > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > behaving the same way (or somehow even worse) with the previous
> > version.
> 
> v8 doesn't differ in REMOVE case neither from master nor from
> previous version. It differs in RETURNED case only.
> Or I didn't understand what you mean :(

In v7, HASH_ENTER returns the element stored in DynaHashReuse using
the freelist_idx of the new key.  v8 uses that of the old key (at the
time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
returns the stashed)" case the stashed element is returned to its
original partition.  But it is not what I mentioned.

On the other hand, once the stashed element is reused by HASH_ENTER,
it gives the same resulting state as the HASH_REMOVE->HASH_ENTER(borrow
from old partition) case.  I suspect that the frequent freelist
starvation comes from the latter case.

> > get_hash_entry continuously suffer lack of freelist
> > entry. (FWIW, attached are the test-output diff for both master and
> > patched)
> > 
> > master finally allocated 31 fresh elements for a 100s run.
> > 
> > > ALLOCED: 31;; freshly allocated
> > 
> > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > allocated (ah, this version changes that..)
> > Finally v8 results in:
> > 
> > > RETURNED: 50806;; returned stashed elements
> > > BORROWED: 33620;; borrowed from another freelist
> > > REUSED: 1812664;; stashed
> > > ASSIGNED: 1762377  ;; reused
> > >(ALLOCED: 0);; freshly allocated

(I misunderstood that v8 modified get_hash_entry's preference between
allocation and borrowing.)

I re-ran the same check for v7 and it showed different result.

RETURNED: 1
ALLOCED: 15
BORROWED: 0
REUSED: 505435
ASSIGNED: 505462 (-27)  ## the counters are not locked.

> Is there any measurable performance hit cause of borrowing?
> Looks like "borrowed" happened in 1.5% of time. And it is on 128kb
> shared buffers that is extremely small. (Or it was 128MB?)

It is intentionally set small to get extremely frequent buffer
replacements.  The point here was that the patch actually can induce
frequent freelist starvation.  And like you, I also doubt the
significance of the performance hit from that.  I just was not sure.

I re-ran the same for v8 and got a result largely different from the
previous trial on the same v8.

RETURNED: 2
ALLOCED: 0
BORROWED: 435
REUSED: 495444
ASSIGNED: 495467 (-23)

Now "BORROWED" happens 0.8% of REUSED.

> Well, I think some spare entries could reduce borrowing if there is
> a need. I'll test on 128MB with spare entries. If there is profit,
> I'll return some, but will keep SharedBufHash fixed.

I don't doubt the benefit of this patch.  And I have now convinced
myself that the downside is negligible compared to the benefit.

> Master branch does less freelist manipulations since it  tries to
> insert first and if there is collision it doesn't delete victim
> buffer.
> 
> > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > ...
> > > Strange thing: both master and patched version has higher
> > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > than in first october version [1]. But lower tps at higher
> > > connections number (>= 191 clients).
> > > I'll try to bisect on master this unfortunate change.
> > 
> > The reversing of the preference order between freshly-allocation and
> > borrow-from-another-freelist might affect.
> 
> `master` changed its behaviour as well.
> It is not problem of the patch at all.

Agreed.  So I think we should go on this direction.

There are some last comments on v8.

+ 
HASH_FIXED_SIZE);

Ah, now I understand that this prevents allocation of new elements.
I think this is good to do for SharedBufHash.



+   longnfree;  /* number of free entries in 
the list */
HASHELEMENT *freeList;  /* chain of free elements */
 } FreeListData;
 
+#if SIZEOF_LONG == 4
+typedef pg_atomic_uint32 nalloced_store_t;
+typedef uint32 nalloced_value_t;
+#define nalloced_read(a)   (long)pg_atomic_read_u32(a)
+#define nalloced_add(a, v) pg_atomic_fetch_add_u32((a), (uint32)(v))


I don't think nalloced needs to be the same width as long.  For the
platforms with 32-bit long, anyway 

Re: Add 64-bit XIDs into PostgreSQL 15

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 18:48:34 +0300, Maxim Orlov  wrote in 
> Hi Kyotaro!
> 
> 0001:
> >
> >  The XID_FMT has quite bad impact on the translatability of error
> >  messages.  3286065651 has removed INT64_FORMAT from translatable
> >  texts for the reason.  This re-introduces that in several places.
> >  0001 itself does not harm but 0005 replaces XID_FMT with
> >  INT64_FORMAT.  Other patches have the same issue, too.
> >
>  I do understand your concern and I wonder how I can do this better? My
> first intention was to replace XID_FMT with %llu and INT64_FORMAT with
> %lld. This should solve the translatability issue, but I'm not sure about
> portability of this. Should this work on Windows, etc? Can you advise me on
> the best solution?

Doesn't doing "errmsg("blah blah  %lld ..", (long long) xid)" work?

> We've fixed all the other things mentioned. Thanks!
> 
> Also added two fixes:
> - CF bot was unhappy with pg_upgrade test in v17 because I forgot to add a
> fix for computation of relminmxid during vacuum on a fresh database.
> - Replace frozen or invalid x_min with FrozenTransactionId or
> InvalidTransactionId respectively during tuple conversion to 64xid.
> 
> Reviews are welcome as always! Thanks!

My pleasure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Corruption during WAL replay

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 12:44:49 -0400, Robert Haas  wrote 
in 
> On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
>  wrote:
> > The attached is the fixed version and it surely works with the repro.
> 
> Hi,
> 
> I spent the morning working on this patch and came up with the
> attached version. I wrote substantial comments in RelationTruncate(),
> where I tried to make it more clear exactly what the bug is here, and
> also in storage/proc.h, where I tried to clarify both the use of the
> DELAY_CHKPT_* flags in general terms. If nobody is too sad about this
> version, I plan to commit it.

Thanks for taking this, and for your time.  The additional comments
seem to describe the flags more clearly.

storage.c:
+* Make sure that a concurrent checkpoint can't complete while 
truncation
+* is in progress.
+*
+* The truncation operation might drop buffers that the checkpoint
+* otherwise would have flushed. If it does, then it's essential that
+* the files actually get truncated on disk before the checkpoint record
+* is written. Otherwise, if reply begins from that checkpoint, the
+* to-be-truncated buffers might still exist on disk but have older
+* contents than expected, which can cause replay to fail. It's OK for
+* the buffers to not exist on disk at all, but not for them to have the
+* wrong contents.

FWIW, this seems slightly confusing between a buffer and its
contents.  I can read it correctly, so I don't mind if it is natural
enough.

Otherwise all the added/revised comments look fine. Thanks for the
labor.

> I think it should be back-patched, too, but that looks like a bit of a
> pain. I think every back-branch will require different adjustments.

I'll try that; if you are already working on it, please let me know.
(It may well be too late..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Accept IP addresses in server certificate SANs

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 21:41:49 +, Jacob Champion  wrote 
in 
> On Mon, 2022-03-14 at 15:30 +0900, Kyotaro Horiguchi wrote:
> > t/003_sslinfo.pl ... 1/? # Tests were run but no plan was declared and 
> > done_testing() was not seen.
> > # Looks like your test exited with 29 just after 6.
> > t/003_sslinfo.pl ... Dubious, test returned 29 (wstat 7424, 0x1d00)
> > All 6 subtests passed 
> > ...
> > Result: FAIL
> > 
> > The script complains like this:
> > 
> > ok 6 - ssl_client_cert_present() for connection with cert
> > connection error: 'psql: error: connection to server at "127.0.0.1", port 
> > 62656 failed: SSL error: tlsv1 alert unknown ca'
> > while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt 
> > sslmode=require dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser 
> > host=localhost -f - -v ON_ERROR_STOP=1' at 
> > /home/horiguti/work/worktrees/ipsan/src/test/ssl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
> >  line 1873.
> > 
> > So, psql looks like disliking the ca certificate.  I also will dig
> > into that.
> 
> Hmm, the sslinfo tests are failing? I wouldn't have expected that based
> on the patch changes. Just to confirm -- they pass for you without the
> patch?

Mmm. I'm not sure how I didn't notice that; master also fails for me
for the same reason.  In the past that could fail when valid client
certs existed in the user's ~/.postgresql, but I believed that had
been fixed.


> > > > Mmm. after the end of the loop, rc is non-zero only when the loop has
> > > > exited by the break and otherwise rc is zero. Why isn't it equivalent
> > > > to setting check_cn to false at the break?
> > > 
> > > check_cn can be false if rc is zero, too; it means that we found a SAN
> > > of the correct type but it didn't match.

I'm not discussing the meaning here, purely the logical equivalence
of the two ways to decide whether to visit the CN.

Concretely, the equivalence between this:

=
 check_cn = true;
 rc = 0;
 for (i < san_len)
 {
   if (type matches)  check_cn = false;
   if (some error happens)  rc = nonzero;
 
   if (rc != 0)
 break;
 }
!if ((rc == 0) && check_cn) {}
=

and this.

=
 check_cn = true;
 rc = 0;
 for (i < san_len)
 {
   if (type matches)  check_cn = false;
   if (some error happens)  rc = nonzero;

   if (rc != 0)
   {
!check_cn = false;
 break;
   }
 }
!if (check_cn) {}
=

The two are equivalent to me. And if so, the latter form seems
simpler and cleaner to me.

> > > check_cn can be false if rc is zero, too; it means that we found a SAN
> > > of the correct type but it didn't match.
> 
> > Don't we count unmatched certs as "existed"?  In that case I think we
> > don't go to CN.
> 
> Unmatched names, you mean? I'm not sure I understand.

Sorry, I was confused here. Please ignore that. I should have said
something like the following instead.

> check_cn can be false if rc is zero, too; it means that we found a SAN
> of the correct type but it didn't match.

Yes, in that case we don't visit the CN because check_cn is false,
even if we don't exit via (rc != 0), and that behavior is not changed
by my proposal.

> > > > I supposed that we only
> > > > be deviant in the case "IP address connhost and no SANs of any type
> > > > exists". What do you think about it?
> > > 
> > > We fall back in the case of "IP address connhost and dNSName SANs
> > > exist", which is prohibited by that part of RFC 6125. I don't think we
> > > deviate in the case you described; can you explain further?
> > 
In that case, i.e., when connhost is an IP address and no SANs exist
at all, we go to the CN.  On the other hand, in RFC 6125:
> > 
> > rfc6125> In some cases, the URI is specified as an IP address rather
> > rfc6125> than a hostname. In this case, the iPAddress subjectAltName
> > rfc6125> must be present in the certificate and must exactly match the
> > rfc6125> IP in the URI.
> > 
> > It (seems to me) denies that behavior.  Regardless of the existence of
> > other types of SANs, iPAddress is required if connhost is an IP
> > address.  (That is, it doesn't seem to me that there's any context
> > like "if any SANs exist", but I'm not so sure I read it perfectly.)
> 
> I see what you mean now. Yes, we deviate there as well (and have done
> so for a while now). I think breaking compatibility there would
> probably not go over well.

Agreed.
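For the record, the fallback rule we converged on above could be modeled by a tiny decision function. This is a simplified sketch under my own assumptions, not libpq's actual code: the CN is consulted only when the certificate carries no SAN entry of the type matching the connection host kind.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { HOST_DNS_NAME, HOST_IP_ADDRESS } host_kind;

/*
 * Simplified model (an assumption, not libpq's real implementation):
 * fall back to the certificate CN only when no SAN of the type matching
 * the connection host kind is present.
 */
static bool
falls_back_to_cn(host_kind kind, bool has_dns_sans, bool has_ip_sans)
{
	if (kind == HOST_DNS_NAME)
		return !has_dns_sans;
	return !has_ip_sans;
}
```

Under this model the deviation from RFC 6125 is visible directly: an IP-address connhost whose certificate has only dNSName SANs still falls back to CN, whereas the RFC would require an iPAddress SAN.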

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Accept IP addresses in server certificate SANs

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 15:56:02 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Mmm. I'm not sure how I didn't notice that; master also fails
> for me for the same reason.  In the past that could fail when valid
> client certs existed in the user's ~/.postgresql, but I believe that has
> been fixed.

And finally my fear turned out to be true..  The test doesn't fail after
removing files under ~/.postgresql, which should not happen.

I'll fix it apart from this.
Sorry for the confusion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Out-of-tree certificate interferes ssltest

2022-03-16 Thread Kyotaro Horiguchi
Hello.

003_sslinfo.pl fails for me.

ok 6 - ssl_client_cert_present() for connection with cert
connection error: 'psql: error: connection to server at "127.0.0.1", port 61688 
failed: could not read certificate file 
"/home/horiguti/.postgresql/postgresql.crt": no start line'
while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt sslmode=require 
dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser host=localhost -f - -v ON_ERR

I think we don't want this behavior.

The attached fixes that, and make world finishes successfully even if I
have a cert file in my home directory.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 308b55f06907ccaf4ac5669daacf04fea8a18fe1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 16 Mar 2022 16:20:46 +0900
Subject: [PATCH v1] Prevent out-of-tree certificates from interfering ssl
 tests

003_sslinfo.pl fails when there is a certificate file in ~/.postgresql
directory.  Prevent that failure by explicitly setting sslcert option
in connection string.
---
 src/test/ssl/t/003_sslinfo.pl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/test/ssl/t/003_sslinfo.pl b/src/test/ssl/t/003_sslinfo.pl
index 95742081f3..81da94f18d 100644
--- a/src/test/ssl/t/003_sslinfo.pl
+++ b/src/test/ssl/t/003_sslinfo.pl
@@ -93,7 +93,7 @@ $result = $node->safe_psql("certdb", "SELECT ssl_client_cert_present();",
 is($result, 't', "ssl_client_cert_present() for connection with cert");
 
 $result = $node->safe_psql("trustdb", "SELECT ssl_client_cert_present();",
-  connstr => "sslrootcert=ssl/root+server_ca.crt sslmode=require " .
+  connstr => "sslcert=invalid sslrootcert=ssl/root+server_ca.crt sslmode=require " .
   "dbname=trustdb hostaddr=$SERVERHOSTADDR user=ssltestuser host=localhost");
 is($result, 'f', "ssl_client_cert_present() for connection without cert");
 
@@ -108,7 +108,7 @@ $result = $node->psql("certdb", "SELECT ssl_client_dn_field('invalid');",
 is($result, '3', "ssl_client_dn_field() for an invalid field");
 
 $result = $node->safe_psql("trustdb", "SELECT ssl_client_dn_field('commonName');",
-  connstr => "sslrootcert=ssl/root+server_ca.crt sslmode=require " .
+  connstr => "sslcert=invalid sslrootcert=ssl/root+server_ca.crt sslmode=require " .
   "dbname=trustdb hostaddr=$SERVERHOSTADDR user=ssltestuser host=localhost");
 is($result, '', "ssl_client_dn_field() for connection without cert");
 
-- 
2.27.0



Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 15:42:52 +0900, Michael Paquier  wrote 
in 
> On Wed, Mar 16, 2022 at 10:34:15AM +0900, Kyotaro Horiguchi wrote:
> > +1. Doesn't the doc need to mention that?
> 
> Yes, I agree that it makes sense to add a note, even if
> allow_in_place_tablespaces is a developer option.  I have added the
> following paragraph in the docs:
> +A full path of the symbolic link in pg_tblspc/
> +is returned. A relative path to the data directory is returned
> +for tablespaces created with
> + enabled.

I'm not sure the "of the symbolic link in pg_tblspc/" part is
needed.  And allow_in_place_tablespaces alone doesn't create an in-place
tablespace, so this might need rethinking, at least on the second point.

> Another thing that was annoying in the first version of the patch is
> the useless call to lstat() on Windows, not needed because it is
> possible to rely just on pgwin32_is_junction() to check if readlink()
> should be called or not.

Agreed. And v2 looks cleaner.

The test detects the lack of the feature.
It successfully builds and runs on Rocky8 and Windows11.

> This leads me to the revised version attached.  What do you think?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Standby got invalid primary checkpoint after crashed right after promoted.

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 07:16:16 +, hao harry  wrote in 
> Hi, pgsql-hackers,
> 
> I think I found a case that database is not recoverable, would you please 
> give a look?
> 
> Here is how it happens:
> 
> - setup primary/standby
> - do a lots INSERT at primary
> - create a checkpoint at primary
> - wait until standby start doing restart point, it take about 3mins syncing 
> buffers to complete
> - before the restart point update ControlFile, promote the standby, that 
> changed ControlFile
>   ->state to DB_IN_PRODUCTION, this will skip update to ControlFile, leaving 
> the ControlFile
>   ->checkPoint pointing to a removed file

Yeah, it seems like exactly the same issue pointed out in [1].  A fix is
proposed in [2].  Maybe I can remove "possible" from the mail subject :p

[1] 
https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com
[2] 
https://www.postgresql.org/message-id/20220316.102444.2193181487576617583.horikyota@gmail.com

> - before the promoted standby request the post-recovery checkpoint (fast 
> promoted), 
>   one backend crashed, it will kill other server process, so the 
> post-recovery checkpoint skipped
> - the database restart startup process, which report: "could not locate a 
> valid checkpoint record"
> 
> I attached a test to reproduce it, it does not fail every time, it fails 
> every 10 times to me.
> To increase the chance CreateRestartPoint skip update ControlFile and to 
> simulate a crash,
> the patch 0001 is needed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Standby got invalid primary checkpoint after crashed right after promoted.

2022-03-16 Thread Kyotaro Horiguchi
(My previous mail has crossed with this one)

At Wed, 16 Mar 2022 08:21:46 +, hao harry  wrote in 
> Found this issue is duplicated to [1], after applied that patch, I cannot 
> reproduce it anymore.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com<https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota@gmail.com>

Glad to hear that.  Thanks for checking it!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Possible corruption by CreateRestartPoint at promotion

2022-03-16 Thread Kyotaro Horiguchi
Just for the record.

An instance of the corruption showed up in this mailing list [1].

[1] 
https://www.postgresql.org/message-id/flat/9EB4CF63-1107-470E-B5A4-061FB9EF8CC8%40outlook.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov  
wrote in 
> > On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:
> > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov  
> > wrote in 
> > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > the freelist_idx of the new key.  v8 uses that of the old key (at the
> > time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> > returns the stashed)" case the stashed element is returned to its
> > original partition.  But it is not what I mentioned.
> > 
> > On the other hand, once the stashed element is reused by HASH_ENTER,
> > it gives the same resulting state as the HASH_REMOVE->HASH_ENTER (borrow
> > from old partition) case.  I suspect that the frequent freelist
> > starvation comes from the latter case.
> 
> Doubtful. By probability theory, a single partition will hardly
> be too overflowed. Hence the freelist.

Yeah.  I think so generally.

> But! With 128kB shared buffers there are just 32 buffers. With 32 entries
> for 32 freelist partitions, certainly some freelist partition will
> have 0 entries even if all entries are in freelists.

Anyway, it's an extreme condition and the starvation happens only at a
negligible ratio.
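Incidentally, the 32-entries/32-partitions case quoted above can be checked numerically. Assuming uniform assignment (only a rough model of dynahash's partition hashing), the probability that no partition ends up empty is the surjection probability n!/n^n:

```c
#include <assert.h>

/*
 * Probability that n uniformly hashed entries leave none of n freelist
 * partitions empty (a surjection): n! / n^n, built factor by factor so
 * neither n! nor n^n is materialized and nothing overflows.
 * This assumes uniform assignment, a rough model only.
 */
static double
prob_no_empty_partition(int n)
{
	double		p = 1.0;

	for (int k = 1; k <= n; k++)
		p *= (double) k / (double) n;
	return p;
}
```

For n = 32 this comes out on the order of 1e-13, so some partition is empty essentially always, supporting the quoted observation.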

> > RETURNED: 2
> > ALLOCED: 0
> > BORROWED: 435
> > REUSED: 495444
> > ASSIGNED: 495467 (-23)
> > 
> > Now "BORROWED" happens 0.8% of REUSED
> 
> 0.08% actually :)

Mmm.  Doesn't matter:p

> > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > > ...
> > > > > Strange thing: both master and patched version has higher
> > > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > > than in first october version [1]. But lower tps at higher
> > > > > connections number (>= 191 clients).
> > > > > I'll try to bisect on master this unfortunate change.
...
> I've checked. Looks like something had changed on the server, since
> old master commit behaves now same to new one (and differently to
> how it behaved in October).
> I remember maintenance downtime of the server in November/December.
> Probably, the kernel was upgraded or some system settings were changed.

One thing I'm a little concerned about is that the numbers steadily show
1-2% degradation for connection counts < 17.

I think there are two possible causes of the degradation.

1. An additional branch from consolidating HASH_ASSIGN into HASH_ENTER.
  This might cause degradation for memory-contended use.

2. Might the nallocs operation cause degradation on non-shared dynahashes?
  I believe it doesn't, but I'm not sure.

  In a simple benchmark with pgbench on a laptop, dynahash
  allocation (shared and non-shared combined) happened about 50
  times per second with 10 processes and 200 times with 100 processes.

> > I don't think nalloced needs to be the same width to long.  For the
> > platforms with 32-bit long, anyway the possible degradation if any by
> > 64-bit atomic there doesn't matter.  So don't we always define the
> > atomic as 64bit and use the pg_atomic_* functions directly?
> 
> Some 32bit platforms has no native 64bit atomics. Then they are
> emulated with locks.
> 
> Well, and for 32bit platform long is just enough. Why spend other
> 4 bytes per each dynahash?

I don't think the additional bytes matter, but emulated atomic
operations can matter.  However, I'm not sure which platforms use the
fallback implementations.  (x86 seems to have had __sync_fetch_and_add()
since the P4.)
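To make the discussion concrete, here is a minimal sketch of such a shared allocation counter using plain C11 atomics. This is only an illustration; PostgreSQL itself would go through pg_atomic_uint64 and pg_atomic_fetch_add_u64(), which fall back to emulation where the platform lacks native 64-bit atomics:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Shared allocation counter, as a plain C11 atomic (illustration only). */
static _Atomic uint64_t nalloced;

/* Bump the counter on each allocation; returns the previous value. */
static uint64_t
count_alloc(void)
{
	/* relaxed ordering suffices for a statistics-only counter */
	return atomic_fetch_add_explicit(&nalloced, 1, memory_order_relaxed);
}

static uint64_t
nalloced_value(void)
{
	return atomic_load_explicit(&nalloced, memory_order_relaxed);
}
```

On platforms with native 64-bit atomics the fetch-add compiles to a single locked instruction; on others the C11 runtime (like PostgreSQL's fallback path) emulates it with a lock, which is where the feared degradation would come from.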

My opinion in the previous mail was that if that level of degradation
caused by emulated atomic operations matters, we shouldn't use atomics
there at all, since atomic operations are not free even on modern
platforms.

In relation to 2 above, if we observe that the degradation disappears
by (tentatively) using non-atomic operations for nalloced, we should go
back to the previous per-freelist nalloced.

I don't have access to such muscular machines, though..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 20:49:12 +0530, Ashutosh Sharma  
wrote in 
> I can see that the pg_get_wal_records_info function shows the details
> of the WAL record whose existence is beyond the user specified
> stop/end lsn pointer. See below:
> 
> ashu@postgres=# select * from pg_get_wal_records_info('0/0128',
> '0/0129');
> -[ RECORD 1 
> ]+--
> start_lsn| 0/128
> end_lsn  | 0/19F
> prev_lsn | 0/0
...
> record_length| 114
...
> In this case, the end lsn pointer specified by the user is
> '0/0129'. There is only one WAL record which starts before this
> specified end lsn pointer whose start pointer is at 0128, but that
> WAL record ends at 0/19F which is way beyond the specified end
> lsn. So, how come we are able to display the complete WAL record info?
> AFAIU, end lsn is the lsn pointer where you need to stop reading the
> WAL data. If that is true, then there exists no valid WAL record
> between the start and end lsn in this particular case.

You're right, considering how pg_waldump behaves; pg_waldump works
almost the same way as you described above.  The record above actually
ends at 199, and pg_waldump shows that record when specifying -s 0/128
-e 0/19a, but not for -e 0/199.

# I personally think the current behavior is fine, though..


It still suggests an end LSN that itself cannot be specified..

> select * from pg_get_wal_records_info('4/4B28EB68', '4/4C60');
> ERROR:  cannot accept future end LSN
> DETAIL:  Last known WAL LSN on the database system is 4/4C60.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Unhyphenation of crash-recovery

2022-03-17 Thread Kyotaro Horiguchi
At Thu, 17 Mar 2022 07:42:42 +0100, Peter Eisentraut 
 wrote in 
> On 16.03.22 02:25, Kyotaro Horiguchi wrote:
> > Hello, this is a derived topic from [1], summarized as $SUBJECT.
> > This just removes useless hyphens from the words
> > "(crash|emergency)-recovery". We don't have such wordings for "archive
> > recovery" This patch fixes non-user-facing texts as well as
> > user-facing ones.
> 
> Most changes in this patch are not the correct direction.  The hyphens
> are used to group compound adjectives before nouns.  For example,
> 
> simple crash-recovery cases
> 
> means
> 
> simple (crash recovery) cases
> 
> rather than
> 
> simple crash (recovery cases)
> 
> if it were without hyphens.

Really? The latter reading doesn't seem to make sense.  I might
be over-trained, so that I parse "(crash|archive|blah) recovery" as
an implicit compound word.  But anyway, there's no strong reason to be
aggressive about unhyphenating compound words.

"point-in-time-recovery" and "(during) emergency-recovery operations"
seemed like they would be better unhyphenated, but now I'm not sure that
is really so.

Thanks for the comments.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Out-of-tree certificate interferes ssltest

2022-03-17 Thread Kyotaro Horiguchi
At Thu, 17 Mar 2022 16:22:14 +0900, Michael Paquier  wrote 
in 
> On Thu, Mar 17, 2022 at 02:59:26PM +0900, Michael Paquier wrote:
> > In both cases, enforcing sslcrl to a value of "invalid" interferes
> > with the failure scenario we expect from sslcrldir.  It is possible to
> > bypass that with something like the attached, but that's a kind of
> > ugly hack.  Another alternative would be to drop those two tests, and
> > I am not sure how much we care about these two negative scenarios.
> 
> Actually, there is a trick I have recalled here: we can enforce sslcrl
> to an empty value in the connection string after the default.  This
> still ensures that the test won't pick up any SSL data from the local
> environment and avoids any interferences of OpenSSL's
> X509_STORE_load_locations().  This gives a much simpler and cleaner
> patch.
> 
> Thoughts?

Ah! It hadn't occurred to me that we can specify the same parameter
twice.  It looks clean and robust.  $default_ssl_connstr contains
all the required options in PQconninfoOptions[].
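The trick relies on the last occurrence of a repeated keyword winning, so that an explicit empty sslcrl overrides the default. A toy model of that rule (not libpq's actual parser, which lives in fe-connect.c, and with a hypothetical last_value() helper) looks like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of "last occurrence wins" for a repeated keyword in an
 * option list; purely an illustration, not libpq's real conninfo code.
 */
static const char *
last_value(const char *keys[], const char *vals[], int n, const char *key)
{
	const char *val = NULL;

	for (int i = 0; i < n; i++)
	{
		if (strcmp(keys[i], key) == 0)
			val = vals[i];		/* a later entry overrides an earlier one */
	}
	return val;
}
```

So appending "sslcrl=''" after a defaulted "sslcrl=root.crl" leaves the effective value empty, which is what keeps the test from picking up local SSL state.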

The same method worked for 003_sslinfo.pl, too. (of course).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-17 Thread Kyotaro Horiguchi
At Thu, 17 Mar 2022 16:39:52 +0900, Michael Paquier  wrote 
in 
> On Thu, Mar 17, 2022 at 07:55:30PM +1300, Thomas Munro wrote:
> > I don't really want to spill details of this developer-only stuff onto
> > more manual sections...  It's not really helping users if we confuse
> > them with irrelevant details of a feature they shouldn't be using, is
> > it?  And the existing treatment "Returns the file system path that
> > this tablespace is located in" is not invalidated by this special
> > case, so maybe we shouldn't mention it?
> 
> Right, I see your point.  The existing description is not wrong
> either.  Fine by me to just drop all that.

+1.  Sorry for my otiose comment..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: XID formatting and SLRU refactorings

2022-03-17 Thread Kyotaro Horiguchi
At Thu, 17 Mar 2022 19:25:00 +0300, Maxim Orlov  wrote in 
> Hi!
> 
> Here is the v20 patch. 0001 and 0002 are proposed into PG15 as
> discussed above.
> The whole set of patches is added into [1] to be committed into PG16.
> 
> In this version we've made a major revision related to printf/elog format
> compatible with localization
> as was proposed above.
> 
> We also think of adding 0003 patch related to 64 bit GUC's into this
> thread. Suppose it also may be delivered
> into PG15.
> 
> Aleksander Alekseev, we've done this major revision mentioned above and you
> are free to continue working on this patch set.
> 
> Reviews and proposals are very welcome!

(I'm afraid this thread is not the place for discussing the patch? :)

> [1]
> https://www.postgresql.org/message-id/flat/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com

+/* printf/elog format compatible with 32 and 64 bit xid. */
+typedef unsigned long long XID_TYPE;
...
+errmsg_internal("found multixact %llu from before relminmxid %llu",
+(XID_TYPE) multi, (XID_TYPE) relminmxid)));

"(XID_TYPE) x" is actually equivalent to "(long long) x" here, but the
point here is "%llu in format string accepts (long long)" so we should
use a literal (bare) (long long), and maybe all precedents do
that.
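The convention can be illustrated with a tiny snprintf sketch (the fmt_xid() helper is hypothetical, not from the patch): the value is cast to the exact type the format specifier promises, so format string and argument always agree regardless of the platform's underlying type.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a value whose C type may vary across
 * platforms by casting to the exact type %llu expects.
 */
static void
fmt_xid(char *buf, size_t buflen, uint64_t xid)
{
	snprintf(buf, buflen, "xid %llu", (unsigned long long) xid);
}
```

A typedef like XID_TYPE would work too, as noted above, but the bare cast keeps the connection between the %llu conversion and its argument type visible at the call site.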

And.. it looks like you applied the change in some inappropriate places?

-  elog(DEBUG2, "deleted page from block %u has safexid %u",
- blkno, opaque->btpo_level);
+  elog(DEBUG2, "deleted page from block %u has safexid %llu",
+ blkno, (XID_TYPE) opaque->btpo_level);

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Corruption during WAL replay

2022-03-17 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 10:14:56 -0400, Robert Haas  wrote 
in 
> Hmm. I think the last two instances of "buffers" in this comment
> should actually say "blocks".

Ok. I replaced them with "blocks" and it looks nicer. Thanks!

> > I'll try that, if you are already working on it, please inform me. (It
> > may more than likely be too late..)
> 
> If you want to take a crack at that, I'd be delighted.

In the end, no two of the versions from 10 to 14 accept the same patch.

As a cross-version check, I compared the patches for each pair of
adjacent versions and confirmed that no hunks are lost.

All versions pass check-world.


The differences between each two adjacent versions are as follows.

master->14:

 A hunk fails due to the change in how to access rel->rd_smgr.

14->13:

  Several hunks fail due to simple context differences.

13->12:

 Many hunks fail due to the migration of delayChkpt from PGPROC to
 PGXACT and context differences due to the change of FSM truncation
 logic in RelationTruncate.

12->11:

 Several hunks fail due to the removal of the volatile qualifier from
 pointers to PGPROC/PGXACT.

11->10:

 A hunk fails due to a context difference caused by the additional
 member tempNamespaceId of PGPROC.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c88858d3e5681005ba0396b7e7ebcde4322b3308 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Tue, 15 Mar 2022 12:29:14 -0400
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/byapr06mb6373bf50b469ca393c614257ab...@byapr06mb6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 
 src/backend/access/transam/xact.c   |  5 ++--
 src/backend/access/transam/xlog.c   | 16 +--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c   | 29 ++-
 src/backend/storage/buffer/bufmgr.c |  6 ++--
 src/backend/storage/ipc/procarray.c | 26 -
 src/backend/storage/lmgr/proc.c |  4 +--
 src/include/storage/proc.h  | 37 -
 src/include/storage/procarray.h |  5 ++--
 11 files changed, 120 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 6a70d49738..9f65c600d0 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3088,8 +3088,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	 * crash/basebackup, even though the state of the data directory would
 	 * require it.
 	 */
-	Assert(!MyProc->delayChkpt);
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3115,7 +3115,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
 	END_CRIT_SECTION();
 	LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 874c8ed125..4dc8ccc12b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -475,7 +475,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	}
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
-	proc->delayChkpt = false;
+	proc->delayChkpt = 0;
 	proc->statusFlags = 0;
 	proc->pid = 0;
 	proc->databaseId = databaseid;
@@ -1164,7 +1164,8 @@ EndPrepare(GlobalTransaction gxact)
 
 	START_CRIT_SECTION();
 
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	XLogBeginInsert();
 	for (record = records.head; record != NULL; record = record->next)
@@ -1207,7 +1208,7 @@ EndPrepare(GlobalTransaction gxa

Re: XID formatting and SLRU refactorings

2022-03-17 Thread Kyotaro Horiguchi
At Fri, 18 Mar 2022 10:20:08 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> "(XID_TYPE) x" is actually equivalent to "(long long) x" here, but the
> point here is "%llu in format string accepts (long long)" so we should

Of course that's a typo for
 "%llu in format string accepts (*unsigned* long long)".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Accept IP addresses in server certificate SANs

2022-03-18 Thread Kyotaro Horiguchi
At Thu, 17 Mar 2022 21:55:07 +, Jacob Champion  wrote 
in 
> On Wed, 2022-03-16 at 23:49 +, Jacob Champion wrote:
> > Thank you for the explanation -- the misunderstanding was all on my
> > end. I thought you were asking me to move the check_cn assignment
> > instead of copying it to the end. I agree that your suggestion is much
> > clearer, and I'll make that change tomorrow.
> 
> Done in v8. Thanks again for your suggestions (and for your
> perseverance when I didn't get it)!

Thanks!  .. and some nitpicks..(Sorry)

fe-secure-common.c doesn't need netinet/in.h.


+++ b/src/include/utils/inet.h
.. 
+#include "common/inet-common.h"

I'm not sure about the project policy on #include practice, but I
think it is common practice not to include headers that are not
required by the file itself.  In this case, fe-secure-common.h itself
doesn't need the include.  Instead, fe-secure-openssl.c and
fe-secure-common.c need it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Replication slot drop message is sent after pgstats shutdown.

2022-03-18 Thread Kyotaro Horiguchi
At Fri, 18 Mar 2022 00:28:37 -0700, Noah Misch  wrote in 
> ===
> diff -U3 
> /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out
>  
> /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out
> --- 
> /export/home/nm/farm/studio64v12_6/HEAD/pgsql/contrib/test_decoding/expected/slot_creation_error.out
>   Tue Feb 15 06:58:14 2022
> +++ 
> /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/contrib/test_decoding/output_iso/results/slot_creation_error.out
>   Tue Feb 15 11:38:14 2022


> I plan to fix this as attached, similar to how commit c04c767 fixed the same
> challenge in detach-partition-concurrently-[34].

It looks correct and I confirmed that it works.


It looks like a similar issue to [1], but this fix is cleaner and more stable.

[1] 
https://www.postgresql.org/message-id/20220304.113347.2105652521035137491.horikyota@gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Make mesage at end-of-recovery less scary.

2022-03-21 Thread Kyotaro Horiguchi
At Mon, 21 Mar 2022 17:01:19 -0700, Andres Freund  wrote in 
> Patch currently fails to apply, needs a rebase:
> http://cfbot.cputube.org/patch_37_2490.log

Thanks for letting me know about that.

Rebased to the current HEAD.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From a7c9f36e631eaba5078398598dae5d459e79add9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v17] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening.  Actually, if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary, as "reached end of WAL".  Instead, raise
the error level for other kinds of WAL failures to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 145 +-
 src/backend/access/transam/xlogrecovery.c |  92 ++
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 +-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 
 6 files changed, 306 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e437c42992..0942265408 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -46,6 +46,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 			 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -147,6 +149,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -552,6 +555,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -633,25 +637,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -661,18 +661,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-  "invalid record length at %X/%X: wanted %u, got %u",
-  LSN_FORMAT_ARGS(RecPtr),
-  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking. 
