Re: Adding CI to our tree (ccache)

2022-02-20 Thread Andres Freund
Hi,

On 2022-02-20 13:36:55 -0600, Justin Pryzby wrote:
> Have you tried to use the yet-to-be-released ccache with MSVC ?

Yes, it doesn't work, because ccache requires cl.exe to be used in a specific
way (only a single input file, specific output file naming), which would
require a decent amount of changes to src/tools/msvc. I think it's more
realistic with meson etc.


> Also, do you know about msbuild /outputResultsCache ?

I don't think it's really usable for what we need. But it's hard to tell.


> Did you ever try to use clcache (or others) ?
> 
> When I tried, it refused to cache because of our debug settings
> (DebugInformationFormat) - which seem to be enabled even in release mode.

> I wonder if that'll be an issue for ccache, too.  I think that line may need
> to be conditional on debug mode.

That's relatively easily solvable by using a different debug format IIRC (/Z7
or such).

Greetings,

Andres Freund




Re: initdb / bootstrap design

2022-02-20 Thread Andres Freund
Hi,

On 2022-02-19 20:46:26 -0500, Tom Lane wrote:
> I tried it like that (full patch attached) and the results are intensely
> disappointing.  On my Mac laptop, the time needed for 50 iterations of
> initdb drops from 16.8 sec to 16.75 sec.

Hm. I'd hoped for at least a little bit bigger win. But I think it enables
more, see below:


> Not sure that this is worth pursuing any further.

I experimented with moving all the bootstrapping into --boot mode and got it
working. Albeit definitely with a few hacks (more below).

While I had hoped for a bit more of a win, it's IMO a nice improvement.
Executing initdb -N --wal-segsize 1 ten times in a loop:

HEAD:

  assert:
  8.06user 1.17system 0:09.25elapsed 99%CPU (0avgtext+0avgdata 91724maxresident)k
  0inputs+549280outputs (40major+99824minor)pagefaults 0swaps

  opt:
  2.89user 0.99system 0:04.81elapsed 80%CPU (0avgtext+0avgdata 88864maxresident)k
  0inputs+549280outputs (40major+99792minor)pagefaults 0swaps


default to lz4:

  assert:
  7.61user 1.03system 0:08.69elapsed 99%CPU (0avgtext+0avgdata 91508maxresident)k
  0inputs+546400outputs (42major+99551minor)pagefaults 0swaps

  opt:
  2.55user 0.94system 0:03.49elapsed 99%CPU (0avgtext+0avgdata 88816maxresident)k
  0inputs+546400outputs (40major+99551minor)pagefaults 0swaps


bootstrap replace:

  assert:
  7.42user 1.00system 0:08.52elapsed 98%CPU (0avgtext+0avgdata 91656maxresident)k
  0inputs+546400outputs (40major+97737minor)pagefaults 0swaps

  opt:
  2.49user 0.98system 0:03.49elapsed 99%CPU (0avgtext+0avgdata 88700maxresident)k
  0inputs+546400outputs (40major+97728minor)pagefaults 0swaps


everything in bootstrap:

  assert:
  6.31user 0.94system 0:07.35elapsed 98%CPU (0avgtext+0avgdata 97812maxresident)k
  0inputs+547360outputs (30major+88617minor)pagefaults 0swaps

  opt:
  2.42user 0.85system 0:03.28elapsed 99%CPU (0avgtext+0avgdata 94572maxresident)k
  0inputs+547360outputs (30major+83712minor)pagefaults 0swaps


optimize WAL in bootstrap:

  assert:
  6.26user 0.96system 0:07.29elapsed 99%CPU (0avgtext+0avgdata 97844maxresident)k
  0inputs+547360outputs (30major+88586minor)pagefaults 0swaps

  opt:
  2.43user 0.80system 0:03.24elapsed 99%CPU (0avgtext+0avgdata 94436maxresident)k
  0inputs+547360outputs (30major+83664minor)pagefaults 0swaps


remove isatty in bootstrap:

  assert:
  6.15user 0.83system 0:06.99elapsed 99%CPU (0avgtext+0avgdata 97832maxresident)k
  0inputs+465120outputs (30major+88559minor)pagefaults 0swaps

  opt:
  2.28user 0.85system 0:03.14elapsed 99%CPU (0avgtext+0avgdata 94604maxresident)k
  0inputs+465120outputs (30major+83728minor)pagefaults 0swaps


That's IMO not bad.

On Windows I see higher gains, which makes sense, because filesystem IO is
slower there. FreeBSD as well, but the variance is oddly high, so I might be
doing something wrong.


The main reason I like this, however, isn't the speedup itself, but that after
this change initdb doesn't depend on single-user mode at all anymore.


About the prototype:

- Most of the bootstrap SQL is executed from bootstrap.c itself. But some
  still comes from the client. E.g. password, a few information_schema
  details and the database / authid changes.

- To execute the SQL I mostly used extension.c's
  read_whole_file()/execute_sql_string(). But VACUUM and CREATE DATABASE require
  all the transactional hacks in portal.c etc. So I wrapped
  exec_simple_query() for that phase.

  Might be better to just call vacuum.c / database.c directly.

- for indexed relcache access to work the phase of
  RelationCacheInitializePhase3() that's initially skipped needs to be
  executed. I hacked that up by adding a RelationCacheInitializePhase3b() that
  bootstrap.c can call, but that's obviously too ugly to live.

- InvalidateSystemCaches() is needed after bki processing. Otherwise I see an
  "row is too big:" error. Didn't investigate yet.

- I definitely removed some validation that we'd probably want. But that seems
  something to care about later...

- 0004 prevents a fair bit of WAL from being written. While XLogInsert did
  some of that, it didn't block FPIs, which obviously are bulky. This reduces
  WAL from ~5MB to ~100kB.


There's quite a bit of further speedup potential:

- One bottleneck, particularly in optimized mode, is the handling of huge node
  trees for views. stringToNode() and nodeRead() are > 10% alone

- Enabling index access sometime during the postgres.bki processing would make
  invalidation handling for subsequent indexes faster. Or maybe we can disable
  a few more invalidations. Inval processing is >10%

- more than 10% (assert) / 7% (optimized) is spent in
  compute_scalar_stats()->qsort_arg(). Something seems off with that to me.


Completely crazy?


Greetings,

Andres Freund
>From 45d63168ddeb8bdf3ed29ca150f453ffcd051697 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sun, 20 Feb 2022 12:20:42 -0800
Subject: [PATCH v1 1/

Re: do only critical work during single-user vacuum?

2022-02-20 Thread Andres Freund
Hi,

On 2022-02-19 20:57:57 -0800, Noah Misch wrote:
> On Wed, Feb 16, 2022 at 03:43:12PM +0700, John Naylor wrote:
> > On Wed, Feb 16, 2022 at 6:17 AM Peter Geoghegan  wrote:
> > > On Tue, Feb 15, 2022 at 9:28 AM Peter Geoghegan  wrote:
> > 
> > > > I did notice from my own testing of the failsafe (by artificially
> > > > inducing wraparound failure using an XID burning C function) that
> > > > autovacuum seemed to totally correct the problem, even when the system
> > > > had already crossed xidStopLimit - it came back on its own. I wasn't
> > > > completely sure of how robust this effect was, though.
> > 
> > I'll put some effort in finding any way that it might not be robust.
> 
> A VACUUM may create a not-trivially-bounded number of multixacts via
> FreezeMultiXactId().  In a cluster at multiStopLimit, completing VACUUM
> without error needs preparation something like:
> 
> 1. Kill each XID that might appear in a multixact.
> 2. Resolve each prepared transaction that might appear in a multixact.
> 3. Run VACUUM.  At this point, multiStopLimit is blocking new multixacts from
>other commands, and the lack of running multixact members removes the need
>for FreezeMultiXactId() to create multixacts.
> 
> Adding to the badness of single-user mode so well described upthread, one can
> enter it without doing (2) and then wrap the nextMXact counter.

If we collected the information along the lines of what I proposed in the
second half of
https://www.postgresql.org/message-id/20220204013539.qdegpqzvayq3d4y2%40alap3.anarazel.de
we should be able to handle such cases more intelligently, I think?

We could e.g. add an error if FreezeMultiXactId() needs to create a new
multixact for a far-in-the-past xid. That's not great, of course, but if we
include the precise cause (pid of backend / prepared xact name / slot name /
...) necessitating creating a new multi, it'd still be a significant
improvement over the status quo.
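
To illustrate, a minimal sketch of what such an error could look like (the
variable names, the errcode choice and the surrounding condition are
assumptions for illustration only, not existing FreezeMultiXactId() code):

    /*
     * Hypothetical sketch: refuse to create a new multixact while freezing if
     * doing so is only needed because of a still-live member, and name the
     * culprit.  "member_xid" and "blocker_pid" are assumed to have been
     * looked up by the caller.
     */
    if (need_replacement_multi && member_xid_far_in_the_past)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("cannot create new MultiXactId while freezing tuple"),
                 errdetail("Member transaction %u is still running in backend with PID %d.",
                           member_xid, blocker_pid),
                 errhint("Terminate or resolve the listed transaction, then retry the VACUUM.")));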

Greetings,

Andres Freund




Re: set TESTDIR from perl rather than Makefile

2022-02-20 Thread Andres Freund
Hi,

On 2022-02-19 17:53:09 -0600, Justin Pryzby wrote:
> I also meant to also attach it.

Is the patch actually independent of the other patches in your stack?


I like this concept a lot:

- I've had to use a wrapper around individual tap tests for meson, just to set
  the CWD etc.
- Being able to run all tap tests at once, instead of many separate prove
  invocations is a lot more readable. And can be faster.
- makes it easier to invoke tap tests "manually", which can be useful when
  debugging problems (not fun to run make in valgrind or rr)
- I'd like to put test data and test log files in different places than they
  are eventually. This seems like it gets us a tiny bit closer to that.



> - $expected = slurp_file_eval("traces/$testname.trace");
> + my $inputdir = "$ENV{'TESTDIR'}/tmp_check";
> + $expected = slurp_file_eval("$inputdir/traces/$testname.trace");

Why is this needed? Shouldn't we end up in exactly the same dir with/without
this patch?


> diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm 
> b/src/test/perl/PostgreSQL/Test/Utils.pm
> index 31e2b0315e..8a8d95ca8c 100644
> --- a/src/test/perl/PostgreSQL/Test/Utils.pm
> +++ b/src/test/perl/PostgreSQL/Test/Utils.pm
> @@ -184,19 +184,21 @@ INIT
>   # test may still fail, but it's more likely to report useful facts.
>   $SIG{PIPE} = 'IGNORE';
>  
> - # Determine output directories, and create them.  The base path is the
> - # TESTDIR environment variable, which is normally set by the invoking
> - # Makefile.
> - $tmp_check = $ENV{TESTDIR} ? "$ENV{TESTDIR}/tmp_check" : "tmp_check";
> + my $test_dir = File::Spec->rel2abs(dirname($0));
> + my $test_name = basename($0);
> + $test_name =~ s/\.[^.]+$//;
> +
> + # Determine output directories, and create them.
> + # TODO: set srcdir?
> + $tmp_check = "$test_dir/tmp_check";
>   $log_path = "$tmp_check/log";
> + $ENV{TESTDIR} = $test_dir;
>  
>   mkdir $tmp_check;
>   mkdir $log_path;
>  
>   # Open the test log file, whose name depends on the test name.
> - $test_logfile = basename($0);
> - $test_logfile =~ s/\.[^.]+$//;
> - $test_logfile = "$log_path/regress_log_$test_logfile";
> + $test_logfile = "$log_path/regress_log_$test_name";
>   open my $testlog, '>', $test_logfile
> or die "could not open STDOUT to logfile \"$test_logfile\": $!";
>  
> diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
> index e2b0db0879..63085506e0 100644
> --- a/src/tools/msvc/vcregress.pl
> +++ b/src/tools/msvc/vcregress.pl
> @@ -261,10 +261,8 @@ sub tap_check
>   $ENV{PG_REGRESS}= "$topdir/$Config/pg_regress/pg_regress";
>   $ENV{REGRESS_SHLIB} = "$topdir/src/test/regress/regress.dll";
>  
> - $ENV{TESTDIR} = "$dir";
>   my $module = basename $dir;
> - # add the module build dir as the second element in the PATH
> - $ENV{PATH} =~ s!;!;$topdir/$Config/$module;!;
> + #$ENV{VCREGRESS_MODE} = $Config;

Hm. How does the module build dir end up on PATH after this?

Greetings,

Andres Freund




Re: pg_upgrade verbosity when redirecting output to log file

2022-02-20 Thread Andres Freund
Hi,

On 2022-02-18 19:46:26 -0600, Justin Pryzby wrote:
> +* If outputting to a tty / or , append newline. pg_log_v() will put the
> +* individual progress items onto the next line.
> +*/
> +   if (log_opts.isatty || log_opts.verbose)
>
> I guess the comment should say "or in verbose mode".

Indeed. I think I got caught in a back-and-forth between different
formulations.

Barring that, anybody against committing this?

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-20 Thread Andres Freund
Hi,

On 2022-02-21 12:56:46 +0900, Masahiko Sawada wrote:
> > The patch you referenced [1] should just store the stats in the
> > pg_stat_subscription view, not pg_stat_subscription_workers.
> >
> > It *does* make sense to keep stats about the number of table syncs that
> > failed etc. But that should be a counter in pg_stat_subscription, not a row in
> > pg_stat_subscription_workers.
>
> We have discussed using pg_stat_subscription before but concluded it's
> not an appropriate place to store error information because it ends up
> keeping cumulative stats mixed with non-cumulative stats.

Well, as we've amply discussed, the non-cumulative stats shouldn't be in the
pgstat subsystem.


> To take a precedent, we used to store accumulative statistics such as
> spill_txns to pg_stat_replication, but then for the same reason we moved
> those statistics to the new stats view, pg_stat_replication_slot. New
> subscription statistics that we're introducing are cumulative statistics
> whereas pg_stat_subscription is a dynamic statistics view.

I'm happy to have cumulative subscriber stats somewhere in pgstats. But it
shouldn't be split by worker or relation, and it shouldn't contain
non-cumulative error information.

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-20 Thread Andres Freund
Hi,

On 2022-02-21 12:39:31 +0530, Amit Kapila wrote:
> Fair enough. Then, how about the following keeping the following information:

Mostly sounds good.


> * subid (subscription id)
> * subname (subscription name)

Coming from catalog, via join, I assume?


> * sync_error_count/sync_failure_count (number of times table sync failed)
> * apply_error_count/apply_failure_count (number of times apply failed)

Yep.


> * sync_success_count (number of table syncs successfully finished)

This one I'm not quite convinced by. You can't rely on precise counts with
pgstats, and we should be able to get a better idea of which relations
succeeded from somewhere more permanent?  But it also doesn't do much harm, so ...


> * apply_commit_count (number of transactions applied successfully)
> * apply_rollback_count (number of transactions explicitly rolled back)

What does "explicit" mean here?


> * stats_reset (Time at which these statistics were last reset)
> 
> The view name could be pg_stat_subscription_lrep,
> pg_stat_logical_replication, or something on those lines.

pg_stat_subscription_stats :)

(I really dislike that we have pg_stat_ stuff that's not actually stats, but
something describing the current state, but that ship has well sailed).

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-21 Thread Andres Freund
Hi,

On 2022-02-21 14:49:01 +0530, Amit Kapila wrote:
> On Mon, Feb 21, 2022 at 1:18 PM Andres Freund  wrote:
> > > * stats_reset (Time at which these statistics were last reset)
> > >
> > > The view name could be pg_stat_subscription_lrep,
> > > pg_stat_logical_replication, or something on those lines.
> >
> > pg_stat_subscription_stats :)

> Having *stat* two times in the name sounds slightly odd to me but let
> us see what others think. One more option could be
> pg_stat_subscription_replication.

It was a joke, making light of our bad naming in pg_stat_*, not a serious
suggestion...

Greetings,

Andres Freund




Re: pg_upgrade verbosity when redirecting output to log file

2022-02-21 Thread Andres Freund
On 2022-02-21 15:29:09 +0100, Daniel Gustafsson wrote:
> > On 21 Feb 2022, at 02:07, Andres Freund  wrote:
> 
> > Baring that, anybody against committing this?
> 
> LGTM. The above mentioned comment was the only thing I found as well.

Thanks for the review Justin and Daniel. Pushed.




making pg_regress less noisy by removing boilerplate

2022-02-21 Thread Andres Freund
Hi,

When running check-world, a good chunk of the output is just pg_regress
boilerplate. It doesn't matter when running tests individually or for tests
with a lot of individual tests like the main regression tests. But for lots of
the rest it is noisy. These days there are many more regression tests than
there used to be when the output was designed...


  == creating temporary instance==
  == initializing database system   ==
  == starting postmaster==
  running on port 51696 with PID 1156405
  == creating database "contrib_regression" ==
  CREATE DATABASE
  ALTER DATABASE
  == installing hstore  ==
  CREATE EXTENSION
  == running regression test queries==
  test python3/hstore_plpython  ... ok   50 ms
  == shutting down postmaster   ==
  == removing temporary instance==

  =
   All 1 tests passed.
  =

  == creating temporary instance==
  [...]


It seems we could

- combine "creating temporary instance", "initializing database system" and
  "starting postmaster"
- combine "creating database" and "installing *" sections
- combine "shutting down postmaster" and "removing temporary instance"
- put CREATE DATABASE, ALTER DATABASE, CREATE EXTENSION into a single line

without losing much information, but reducing the vertical space used a lot?

It might even be worth further collapsing "starting postmaster" and "creating
database" etc into one.


There's also Daniel's patch adding a tap compatible output mode [1]. I
still like that idea a lot - being able to parse test successes / failures in
a uniform way across tests would be great. The meson testrunner for example
would benefit from that.  But if I understood that patch correctly, it'd also
benefit from the reduction in unnecessary headers, albeit not as crucially.

Greetings,

Andres Freund

[1] https://postgr.es/m/62A6B9F0-C3D3-45CD-8E8B-90A8E5B08DFA%40yesql.se




Re: [RFC] building postgres with meson

2022-02-21 Thread Andres Freund
Hi,

On 2021-10-13 13:54:10 +0200, Daniel Gustafsson wrote:
> I added a --tap option for TAP output to pg_regress together with Jinbao Chen
> for giggles and killing some time a while back.

Sorry for not replying to this earlier. I somehow thought I had, but the
archives disagree.

I think this would be great.


> If it's helpful and there's any interest for this I'm happy to finish it up 
> now.

Yes!  Probably worth starting a new thread for...


> One thing that came out of this, is that we don't really handle the ignored
> tests in the way the code thinks it does for normal output, the attached
> treats ignored tests as SKIP tests.

I can't really parse the first sentence...


>   if (exit_status != 0)
>   log_child_failure(exit_status);
> @@ -2152,6 +2413,7 @@ regression_main(int argc, char *argv[],
>   {"config-auth", required_argument, NULL, 24},
>   {"max-concurrent-tests", required_argument, NULL, 25},
>   {"make-testtablespace-dir", no_argument, NULL, 26},
> + {"tap", no_argument, NULL, 27},
>   {NULL, 0, NULL, 0}
>   };

I'd make it a --format=(regress|tap) or such.
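
For illustration, a rough sketch of how such an option's argument could be
parsed (the enum, function name and error wording are made up here, not taken
from the posted patch):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef enum
    {
        OUTPUT_FORMAT_REGRESS,
        OUTPUT_FORMAT_TAP
    } OutputFormat;

    /* Parse the argument of a hypothetical --format option. */
    static OutputFormat
    parse_output_format(const char *arg)
    {
        if (strcmp(arg, "regress") == 0)
            return OUTPUT_FORMAT_REGRESS;
        if (strcmp(arg, "tap") == 0)
            return OUTPUT_FORMAT_TAP;

        fprintf(stderr, "invalid output format \"%s\", must be \"regress\" or \"tap\"\n", arg);
        exit(2);
    }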

Greetings,

Andres Freund




Re: making pg_regress less noisy by removing boilerplate

2022-02-21 Thread Andres Freund
Hi,

On 2022-02-21 12:05:42 -0500, Tom Lane wrote:
> Also, those steps typically run a lot faster than they did then
> (both software speedups, and most people use better hardware).
> We no longer need that output to reassure ourselves that progress
> is being made.

Indeed.


> > It seems we could [ combine printouts ]
>
> How about going further, and just not print *any* of this overhead
> stuff

+1


> except maybe the "running on port 51696 with PID 1156405" line (and I'm not
> too wedded to that)?

We still have a few issues with ports conflicts on windows. We should really
consider just desupporting all windows versions without unix socket
support. But until then it seems useful to be able to figure out random
failures.


> It'd also be a good idea to start using "make -s" by default for the
> preparatory steps.

For the temp-install and checkprep targets, or for something else? That'd
probably just be a few @'s in front of rm, mkdir?

One thing with temp-install that's been bothering me is that it often hides
compilation failures inside install.log, and that it ends up building contrib
with -j1. It'd be a bit of redundant work, but perhaps we should make it
depend on a top-directory world-bin?


This reminds me that a while ago I added the below to my Makefile.custom, to
avoid all the "Nothing to be done for xyz" noise. Not sure if it bothers
others as much as me...

# Targets that don't require building anything cause make to say "make:
# Nothing to be done for 'target'."  That's OK if it happens once at the
# top-level, but distracting for recursive make invocations of targets like
# distprep, generated-headers-symlinks etc.
#
# The message can be silenced by depending on another target that always does
# something. That works better than adding recipes to each of these common
# targets, because in some makefiles they have their own recipes, triggering
# warnings.
#
# To avoid repetition, use foreach/eval/call to create the dependencies, and a
# prefix rule for the recipe.
#
# @ silences, : is the shell do-nothing command.
%.silence:
	@:

define create_silence_target
$(1): | $(1).silence
endef
$(foreach t,$(standard_targets) $(standard_always_targets) generated-header-symlinks,$(eval $(call create_silence_target,$(t))))

Greetings,

Andres Freund




Re: Time to drop plpython2?

2022-02-21 Thread Andres Freund
Hi,

On 2022-02-21 09:49:32 -0800, Mark Wong wrote:
> Oops, made another pass for python3 dev libraries.

Thanks!


> I can't seem to find archived ppc repos OpenSUSE Leap 43.2.  I'm
> debating whether to disable python or upgrade/rebrand that animal for a
> newer SUSE release.  I've stopped my cron jobs on this animal for the
> time being.

I assume you mean Leap 42.3, which seemed to be the newest 4x.x? If so, that's
been archived since 2019-07-01 [1].

Leap's versioning is, uh, confusing. 13 -> 42 -> 15. Yea.

I don't think it's really useful to run out-of-support distribution
versions. Leaving security etc aside, not being able to install packages seems
sufficient reason to upgrade.

Greetings,

Andres Freund

[1] https://lists.opensuse.org/archives/list/security-annou...@lists.opensuse.org/message/6CIQPV3H6J4AIIXUKUGI4IESMTHIFFFB/




Re: making pg_regress less noisy by removing boilerplate

2022-02-21 Thread Andres Freund
Hi,

On 2022-02-21 12:40:18 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-02-21 12:05:42 -0500, Tom Lane wrote:
> >> except maybe the "running on port 51696 with PID 1156405" line (and I'm not
> >> too wedded to that)?
> 
> > We still have a few issues with ports conflicts on windows. We should really
> > consider just desupporting all windows versions without unix socket
> > support. But until then it seems useful to be able to figure out random
> > failures.
> 
> Yeah, also it seems useful to distinguish installcheck and check cases.
> With context removed, it might not make as much sense as it does today,
> so I'd vote for adding a word or two.  Perhaps the output could
> look like
> 
>   test postmaster started on port 51696 with PID 1156405
>   "testname ... ok" lines here
>   test postmaster stopped
>   "all tests passed" or "n tests passed" here

I'm all for something minimal like this.

I guess we could just try to go for a tap compatible output, but it'd probably
be too annoying to not see the test name when it starts.

Perhaps we could still take a page out of tap's book, and denote non-test
output lines by starting them with a #, so it's visually obvious which lines
are about tests?


It'd probably not be a good idea to backpatch this, even if it'd be nice to
have at least semi-consistent formatting across branches? Too likely somebody
built scripts depending on the old format?


> >> It'd also be a good idea to start using "make -s" by default for the
> >> preparatory steps.
> 
> > For the temp-install and checkprep targets, or for something else? That'd
> > probably just be a few @'s in front of rm, mkdir?
> 
> Well, all that stuff is interesting only in case of a failure.


> > One thing with temp-install that's been bothering me is that it often hides
> > compilation failures inside install.log,
> 
> Yeah, I've run into that too --- even if there's no failure, you'll
> never see any compiler warnings.  Perhaps if we started using
> "make -s" we'd not need install.log at all, and could just let
> errors/warnings spew to stderr where the user would see 'em.

WFM.  I'd still like to address the issue of building contrib with -j1 (due to
checkprep ending up building contrib). But we can do that separately too.

I assume we'd want to do this in all branches?

Greetings,

Andres Freund




Re: External data files possible?

2022-02-21 Thread Andres Freund
Hi,

On 2022-02-21 15:16:31 -0600, Chris Cleveland wrote:
> It's turning out to be difficult to store the data for my custom index
> access method in the main fork. Breaking up the data into pages with page
> headers means a lot of extra work, a big performance hit, and disk space
> management headaches. It's just not a good fit for my particular file
> format.

I assume you're planning to not go through shared buffers, right?


> It would be much better to store the index in a set of external data files.
> This seems possible so long as I put the files under the database's
> directory and name things properly.
> 
> But here's the one thing I haven't figured out: how to delete the files
> when the index, table, or database gets dropped. The IndexAmRoutine does
> not have an "amdrop" hook that gets called when the index gets dropped.

For some things it'd probably work to just use the normal files, but format
them differently. I.e. go through the smgr.c layer, but not bufmgr.

But unfortunately e.g. basebackup.c will assume they're the normal format and
complain about checksums etc. I don't think there's a way around that right
now.
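
To make the "go through smgr.c, but not bufmgr" idea above concrete, here is a
minimal sketch, assuming the pre-v16 RelFileNode-based smgr API; it ignores
WAL-logging, locking, and the basebackup/checksum problem just mentioned, and a
real index AM would normally use the rd_smgr of its own Relation rather than
smgropen():

    #include "postgres.h"

    #include "storage/backendid.h"
    #include "storage/smgr.h"

    /*
     * Sketch: write one block of custom-format data through smgr, bypassing
     * shared buffers entirely.
     */
    static void
    write_custom_block(RelFileNode rnode, BlockNumber blkno,
                       const char *data, Size len)
    {
        SMgrRelation    reln = smgropen(rnode, InvalidBackendId);
        PGAlignedBlock  buf;

        memset(buf.data, 0, BLCKSZ);
        memcpy(buf.data, data, Min(len, BLCKSZ));

        /* extend or overwrite, depending on where the block falls */
        if (blkno >= smgrnblocks(reln, MAIN_FORKNUM))
            smgrextend(reln, MAIN_FORKNUM, blkno, buf.data, false);
        else
            smgrwrite(reln, MAIN_FORKNUM, blkno, buf.data, false);
    }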


> Is there a hook I can use to clean these files up? More generally, can I
> get away with using my own data files without causing a problem?

Not currently. A plain hook wouldn't suffice, because it'd not integrate with
transactional DDL and crash recovery.

Greetings,

Andres Freund




Re: speed up a logical replica setup

2022-02-21 Thread Andres Freund
Hi,

On 2022-02-21 09:09:12 -0300, Euler Taveira wrote:
> Logical replication has been used for migration with minimal downtime. However,
> if you are dealing with a big database, the amount of required resources (disk
> -- due to WAL retention) increases as the backlog (WAL) increases. Unless you
> have a generous amount of resources and can wait for long period of time until
> the new replica catches up, creating a logical replica is impracticable on
> large databases.

Indeed.


> DESIGN
> 
> The conversion requires 8 steps.
> 
> 1. Check if the target data directory has the same system identifier than the
> source data directory.
> 2. Stop the target server if it is running as a standby server. (Modify
> recovery parameters requires a restart.)
> 3. Create one replication slot per specified database on the source server. 
> One
> additional replication slot is created at the end to get the consistent LSN
> (This consistent LSN will be used as (a) a stopping point for the recovery
> process and (b) a starting point for the subscriptions).
> 4. Write recovery parameters into the target data directory and start the
> target server (Wait until the target server is promoted).
> 5. Create one publication (FOR ALL TABLES) per specified database on the 
> source
> server.
> 6. Create one subscription per specified database on the target server (Use
> replication slot and publication created in a previous step. Don't enable the
> subscriptions yet).
> 7. Sets the replication progress to the consistent LSN that was got in a
> previous step.
> 8. Enable the subscription for each specified database on the target server.

I think the system identifier should also be changed, otherwise you can way
too easily get into situations trying to apply WAL from different systems to
each other. Not going to end well, obviously.


> This tool does not take a base backup. It can certainly be included later.
> There is already a tool do it: pg_basebackup.

It would make sense to allow calling pg_basebackup from the new tool. Perhaps
with a --pg-basebackup-parameters or such.



Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-10 14:26:59 -0800, Andres Freund wrote:
> On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> > It seems like I should go ahead and do that today, and we can study
> > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
> 
> Yes.

I wrote a test to show the problem. While looking at the code I unfortunately
found that CREATE DATABASE ... OID isn't the only problem. Two consecutive
ALTER DATABASE ... SET TABLESPACE also can cause corruption.

The test doesn't work on < 15 (due to PostgresNode renaming and use of
allow_in_place_tablespaces). But manually it's unfortunately quite
reproducible :(

Start a server with

shared_buffers = 1MB
autovacuum=off
bgwriter_lru_maxpages=1
bgwriter_delay=1

these are just to give more time / to require smaller tables.


Start two psql sessions

s1: \c postgres
s1: CREATE DATABASE conflict_db;
s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
s1: CREATE EXTENSION pg_prewarm;

s2: \c conflict_db
s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 4000) g(i);
s2: UPDATE large SET datab = 1;

-- prevent AtEOXact_SMgr
s1: BEGIN;

-- Fill shared_buffers with other contents. This needs to write out the prior dirty contents
-- leading to opening smgr rels / file descriptors
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

s2: \c postgres
-- unlinks the files
s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
-- now new files with the same relfilenode exist
s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;
s2: \c conflict_db
-- dirty buffers again
s2: UPDATE large SET datab = 2;

-- this again writes out the dirty buffers. But because nothing forced the smgr handles to be closed,
-- fds pointing to the *old* file contents are used. I.e. we'll write to the wrong file
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

-- verify that everything is [not] ok
s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;

┌───┬───┐
│ datab │ count │
├───┼───┤
│ 1 │  2117 │
│ 2 │67 │
└───┴───┘
(2 rows)


oops.


I don't yet have a good idea how to tackle this in the backbranches.



I've started to work on a few debugging aids to find problems like
these. Attached are two WIP patches:

1) use fstat(...).st_nlink == 0 to detect dangerous actions on fds with
   deleted files. That seems to work (almost) everywhere. Optionally, on
   linux, use readlink(/proc/$pid/fd/$fd) to get the filename.
2) On linux, iterate over PGPROC to get a list of pids. Then iterate over
   /proc/$pid/fd/ to check for deleted files. This only can be done after
   we've just made sure there's no fds open to deleted files.
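
As a rough illustration of the fstat()-based check in (1), something along
these lines (simplified and standalone; the real WIP patch wires this into
fd.c, which this sketch does not attempt):

    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Return 1 if fd refers to a file that has been unlinked, 0 if not, -1 on
     * error.  On Linux, also report the (deleted) path via /proc/self/fd/<fd>.
     */
    static int
    fd_points_at_deleted_file(int fd)
    {
        struct stat st;
        char        linkpath[64];
        char        target[1024];
        ssize_t     len;

        if (fstat(fd, &st) != 0)
            return -1;
        if (st.st_nlink != 0)
            return 0;

        snprintf(linkpath, sizeof(linkpath), "/proc/self/fd/%d", fd);
        len = readlink(linkpath, target, sizeof(target) - 1);
        if (len >= 0)
        {
            target[len] = '\0';
            fprintf(stderr, "fd %d refers to deleted file: %s\n", fd, target);
        }
        return 1;
    }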

Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 01:11:21 -0800, Andres Freund wrote:
> I've started to work on a few debugging aids to find problem like
> these. Attached are two WIP patches:

Forgot to attach. Also importantly, it includes a tap test for several of these
issues.

Greetings,

Andres Freund
>From 0bc64874f8e5faae9a38731a83aa7b001095cc35 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 21 Feb 2022 15:44:02 -0800
Subject: [PATCH v1 1/4] WIP: test for file reuse dangers around database and
 tablespace commands.

---
 src/test/recovery/t/029_relfilenode_reuse.pl | 233 +++
 1 file changed, 233 insertions(+)
 create mode 100644 src/test/recovery/t/029_relfilenode_reuse.pl

diff --git a/src/test/recovery/t/029_relfilenode_reuse.pl b/src/test/recovery/t/029_relfilenode_reuse.pl
new file mode 100644
index 000..22d8e85614c
--- /dev/null
+++ b/src/test/recovery/t/029_relfilenode_reuse.pl
@@ -0,0 +1,233 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Basename;
+
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', q[
+allow_in_place_tablespaces = true
+log_connections=on
+# to avoid "repairing" corruption
+full_page_writes=off
+log_min_messages=debug2
+autovacuum_naptime=1s
+shared_buffers=1MB
+]);
+$node_primary->start;
+
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(300);
+
+my %psql_primary = (stdin => '', stdout => '', stderr => '');
+$psql_primary{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_primary->connstr('postgres') ],
+	'<',
+	\$psql_primary{stdin},
+	'>',
+	\$psql_primary{stdout},
+	'2>',
+	\$psql_primary{stderr},
+	$psql_timeout);
+
+my %psql_standby = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_standby->connstr('postgres') ],
+	'<',
+	\$psql_standby{stdin},
+	'>',
+	\$psql_standby{stdout},
+	'2>',
+	\$psql_standby{stderr},
+	$psql_timeout);
+
+
+# Create template database with a table that we'll update, to trigger dirty
+# rows. Using a template database + preexisting rows makes it a bit easier to
+# reproduce, because there's no cache invalidations generated.
+
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db_template OID = 5;");
+$node_primary->safe_psql('conflict_db_template', q[
+CREATE TABLE large(id serial primary key, dataa text, datab text);
+INSERT INTO large(dataa, datab) SELECT g.i::text, 1 FROM generate_series(1, 4000) g(i);]);
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;");
+
+$node_primary->safe_psql('postgres', q[
+CREATE EXTENSION pg_prewarm;
+CREATE TABLE replace_sb(data text);
+INSERT INTO replace_sb(data) SELECT random()::text FROM generate_series(1, 15000);]);
+
+# Use longrunning transactions, so that AtEOXact_SMgr doesn't close files
+send_query_and_wait(
+	\%psql_primary,
+	q[BEGIN;],
+	qr/BEGIN/m);
+send_query_and_wait(
+	\%psql_standby,
+	q[BEGIN;],
+	qr/BEGIN/m);
+
+# Cause lots of dirty rows in shared_buffers
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 1;");
+
+# Now do a bunch of work in another database. That will end up needing to
+# write back dirty data from the previous step, opening the relevant file
+# descriptors
+cause_eviction(\%psql_primary, \%psql_standby);
+
+# drop and recreate database
+$node_primary->safe_psql('postgres', "DROP DATABASE conflict_db;");
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;");
+
+verify($node_primary, $node_standby, 1,
+	   "initial contents as expected");
+
+# Again cause lots of dirty rows in shared_buffers, but use a different update
+# value so we can check everything is OK
+$node_primary-&

Re: TAP output format in pg_regress

2022-02-22 Thread Andres Freund
Hi,

Thanks for the updated version!

On 2022-02-22 15:10:11 +0100, Daniel Gustafsson wrote:
> The errorpaths that exit(2) the testrun should be converted to "bail out" 
> lines
> when running with TAP output, but apart from that I think it's fairly spec
> compliant.

I'd much rather not use BAIL - I haven't gotten around to doing anything about
it, but I really want to get rid of nearly all our uses of bail:

https://www.postgresql.org/message-id/20220213232249.7sevhlioapydla37%40alap3.anarazel.de

Greetings,

Andres Freund




Re: bailing out in tap tests nearly always a bad idea

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-14 09:46:20 -0800, Andres Freund wrote:
> > t/die-immediately.t  aieee at t/die-immediately.t line 1.
> > t/die-immediately.t  Dubious, test returned 255 (wstat 65280, 0xff00)
> > No subtests run
> 
> Hm. Looks different when a test is including our test helpers.
> 
> t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
> No subtests run
> t/000_die_test_more.pl .. error at t/000_die_test_more.pl line 2.
> t/000_die_test_more.pl .. Dubious, test returned 255 (wstat 65280, 0xff00)
> No subtests run
> 
> So I think we broke something... I think it's the log file redirection stuff
> in INIT.

Any chance somebody with more perl knowledge could look into this? Clearly our
effort to copy all the output to the original file descriptors isn't successful
here.

$ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use PostgreSQL::Test::Utils; die 'blorb';"
$ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 'blorb';"
blorb at -e line 1.

(note that the former will create a tmp_check in your current directory)

Greetings,

Andres Freund




Re: bailing out in tap tests nearly always a bad idea

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 09:28:37 -0800, Andres Freund wrote:
> On 2022-02-14 09:46:20 -0800, Andres Freund wrote:
> > > t/die-immediately.t  aieee at t/die-immediately.t line 1.
> > > t/die-immediately.t  Dubious, test returned 255 (wstat 65280, 0xff00)
> > > No subtests run
> >
> > Hm. Looks different when a test is including our test helpers.
> >
> > t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
> > No subtests run
> > t/000_die_test_more.pl .. error at t/000_die_test_more.pl line 2.
> > t/000_die_test_more.pl .. Dubious, test returned 255 (wstat 65280, 
> > 0xff00)
> > No subtests run
> >
> > So I think we broke something... I think it's the log file redirection stuff
> > in INIT.
>
> Any chance somebody with more perl knowledge could look into this? Clearly our
> effort to copy all the output the original file descriptors isn't successful
> here.
>
> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use 
> PostgreSQL::Test::Utils; die 'blorb';"
> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 
> 'blorb';"
> blorb at -e line 1.

So the problem is that die just outputs to stderr, but we've decided to have
stderr connected to the output going to tap for some reason. I guess that
prevents us from messing up the tap output, but it's still weird. Because of
that redirection, die printing to stderr isn't visible to tap.

Seems we can use black magic to "fix" that... Putting the following into
INIT{} seems to do the trick:

	# Because of the above redirection the tap output wouldn't contain
	# information about tests failing due to die etc. Fix that by also
	# printing the failure to the original stderr.
	$SIG{__DIE__} = sub
	{
		# Don't redirect if it's a syntax error ($^S is undefined) or in an
		# eval block ($^S == 1).
		if (defined $^S and $^S == 0)
		{
			print $orig_stderr "$_[0]\n";
			#fail("died: $_[0]");
			#done_testing();
		}
	};


$ cat /tmp/die_pg_utils.pl
use PostgreSQL::Test::Utils;
use Test::More;

ok(1, "foo");
die 'blorb';
done_testing();

With the print we get something like:

$ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
ok 1 - foo
blorb at /tmp/die_pg_utils.pl line 5.

# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 25 just after 1.

With the fail() and done_testing() we get

$ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
ok 1 - foo
not ok 2 - died: blorb at /tmp/die_pg_utils.pl line 5.
#
#   Failed test 'died: blorb at /tmp/die_pg_utils.pl line 5.
# '
#   at /home/andres/src/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 235.
1..2
# Looks like your test exited with 25 just after 2.


The latter output doesn't confuse the reader with stuff about plans and exit
codes. But it contains redundant messages (whatever), and it doesn't seem
particularly clean to hijack a "test number".

Greetings,

Andres Freund




Re: bailing out in tap tests nearly always a bad idea

2022-02-22 Thread Andres Freund
On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote:
> I'll be surprised if we can't come up with something cleaner than that.

Suggestions?




Re: making pg_regress less noisy by removing boilerplate

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 08:55:23 -0500, Andrew Dunstan wrote:
> I'm pretty sure all my Windows machines with buildfarm animals are
> sufficiently modern except the XP machine, which only builds release 10
> nowadays.

Cool.


> What's involved in moving to require Unix socket support?

It works today (the CI scripts use it on windows for example).

But it's awkward because make_temp_sockdir() defaults to /tmp/ if TMPDIR isn't
set, which it is not by default on Windows. There's PG_REGRESS_SOCK_DIR, which
kind of works for the msvc build, because pg_regress tests aren't run
concurrently (whereas tap tests can be run concurrently with
PROVE_FLAGS=-jN).

I think we just make make_temp_sockdir() a tad smarter. Perhaps by lifting the
code in src/bin/psql/command.c:do_edit() to src/port/path.c or such?
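
Roughly along these lines, perhaps (a sketch only; the function name exists
nowhere today, and the Windows fallbacks are assumptions rather than a faithful
copy of what do_edit() actually does):

    #include <stdlib.h>

    /* Hypothetical helper that could live in src/port/path.c. */
    const char *
    pg_get_temp_dir(void)
    {
        const char *tmpdir = getenv("TMPDIR");

    #ifndef WIN32
        if (tmpdir == NULL || tmpdir[0] == '\0')
            tmpdir = "/tmp";
    #else
        if (tmpdir == NULL || tmpdir[0] == '\0')
            tmpdir = getenv("TMP");
        if (tmpdir == NULL || tmpdir[0] == '\0')
            tmpdir = getenv("TEMP");
        if (tmpdir == NULL || tmpdir[0] == '\0')
            tmpdir = ".";       /* last resort */
    #endif

        return tmpdir;
    }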

Greetings,

Andres Freund




Re: small development tip: Consider using the gold linker

2022-02-22 Thread Andres Freund
Hi,

On 2022-01-13 19:24:09 -0800, Peter Geoghegan wrote:
> On Fri, Jul 20, 2018 at 8:16 PM Peter Geoghegan  wrote:
> > On Mon, Jul 9, 2018 at 4:40 PM, Andres Freund  wrote:
> > > FWIW, I've lately noticed that I spend a fair time waiting for the
> > > linker during edit-compile-test cycles.  Due to an independent issue I
> > > just used the gold linker, and the speedup is quite noticable.
> > > For me just adding '-fuse-ld=gold' to CFLAGS works.
> >
> > I tried this out today. It makes quite a noticeable difference for me.
> > Thanks for the tip.
> 
> The 2022 version of the same tip: switching over to LLVM lld seems to
> offer a further significant speedup over gold. Not surprising, since
> lld is specifically promoted as a linker that is faster than gold [1].

I've switched back and forth between gold and lld a couple times. Eventually I
always get frustrated by lld causing slowdowns and other issues for gdb. Did
you ever hit one of those?


> Again, this appears to just be a case of installing lld, and adding
> '-fuse-ld=lld' to CFLAGS -- a very easy switch to make. Link time is
> still a noticeable contributor to overall build time for me, even with
> gold (I use ccache, of course).

Yea. gold also isn't really maintained anymore, so it'd be nice to move on...


> There is another linker called mold [2]. It claims to be faster than
> lld (which was faster than gold, which was faster than ld). I didn't
> bother trying it out.

Yea, it didn't quite seem there yet last time I looked, and lld seems plenty
fast for our purposes.

Greetings,

Andres Freund




Re: Race condition in InvalidateObsoleteReplicationSlots()

2022-02-22 Thread Andres Freund
Hi,

On 2021-06-11 12:27:57 -0400, Álvaro Herrera wrote:
> On 2021-Jun-10, Álvaro Herrera wrote:
> 
> > Here's a version that I feel is committable (0001).  There was an issue
> > when returning from the inner loop, if in a previous iteration we had
> > released the lock.  In that case we need to return with the lock not
> > held, so that the caller can acquire it again, but weren't.  This seems
> > pretty hard to hit in practice (I suppose somebody needs to destroy the
> > slot just as checkpointer killed the walsender, but before checkpointer
> > marks it as its own) ... but if it did happen, I think checkpointer
> > would block with no recourse.  Also added some comments and slightly
> > restructured the code.
> > 
> > There are plenty of conflicts in pg13, but it's all easy to handle.
> 
> Pushed, with additional minor changes.

I stared at this code, due to [1], and I think I found a bug. I think it's not
the cause of the failures in that thread, but we probably should still do
something about it.

I think the minor changes might unfortunately have introduced a race? Before,
the patch just used ConditionVariableSleep(), but now it also has a
ConditionVariablePrepareToSleep(), without re-checking the sleep condition
until
/* Wait until the slot is released. */
ConditionVariableSleep(&s->active_cv,
   WAIT_EVENT_REPLICATION_SLOT_DROP);

which directly violates what ConditionVariablePrepareToSleep() documents:

 * This can optionally be called before entering a test/sleep loop.
 * Doing so is more efficient if we'll need to sleep at least once.
 * However, if the first test of the exit condition is likely to succeed,
 * it's more efficient to omit the ConditionVariablePrepareToSleep call.
 * See comments in ConditionVariableSleep for more detail.
 *
 * Caution: "before entering the loop" means you *must* test the exit
 * condition between calling ConditionVariablePrepareToSleep and calling
 * ConditionVariableSleep.  If that is inconvenient, omit calling
 * ConditionVariablePrepareToSleep.


Afaics this means we can potentially sleep forever if the prior owner of the
slot releases it before the ConditionVariablePrepareToSleep().

There's a comment that's mentioning this danger:

/*
 * Prepare the sleep on the slot's condition variable before
 * releasing the lock, to close a possible race condition if the
 * slot is released before the sleep below.
 */
ConditionVariablePrepareToSleep(&s->active_cv);

LWLockRelease(ReplicationSlotControlLock);

but afaics that is bogus, because releasing a slot doesn't take
ReplicationSlotControlLock. That just protects against the slot being dropped,
not against it being released.

We can ConditionVariablePrepareToSleep() here, but we'd have to do it earlier,
before the checks at the start of the while loop.

Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20220218231415.c4plkp4i3reqcwip%40alap3.anarazel.de




Re: Race condition in InvalidateObsoleteReplicationSlots()

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 17:48:55 -0800, Andres Freund wrote:
> I think the minor changes might unfortunately have introduced a race? Before
> the patch just used ConditionVariableSleep(), but now it also has a
> ConditionVariablePrepareToSleep().  Without re-checking the sleep condition
> until
> /* Wait until the slot is released. */
> ConditionVariableSleep(&s->active_cv,
>WAIT_EVENT_REPLICATION_SLOT_DROP);
> 
> which directly violates what ConditionVariablePrepareToSleep() documents:
> 
>  * This can optionally be called before entering a test/sleep loop.
>  * Doing so is more efficient if we'll need to sleep at least once.
>  * However, if the first test of the exit condition is likely to succeed,
>  * it's more efficient to omit the ConditionVariablePrepareToSleep call.
>  * See comments in ConditionVariableSleep for more detail.
>  *
>  * Caution: "before entering the loop" means you *must* test the exit
>  * condition between calling ConditionVariablePrepareToSleep and calling
>  * ConditionVariableSleep.  If that is inconvenient, omit calling
>  * ConditionVariablePrepareToSleep.
> 
> 
> Afaics this means we can potentially sleep forever if the prior owner of the
> slot releases it before the ConditionVariablePrepareToSleep().
> 
> There's a comment that's mentioning this danger:
> 
> /*
>  * Prepare the sleep on the slot's condition variable before
>  * releasing the lock, to close a possible race condition if the
>  * slot is released before the sleep below.
>  */
>   ConditionVariablePrepareToSleep(&s->active_cv);
> 
>   LWLockRelease(ReplicationSlotControlLock);
> 
> but afaics that is bogus, because releasing a slot doesn't take
> ReplicationSlotControlLock. That just protects against the slot being dropped,
> not against it being released.
> 
> We can ConditionVariablePrepareToSleep() here, but we'd have to it earlier,
> before the checks at the start of the while loop.

Not at the start of the while loop, but outside of the while loop. Doing it in
the loop body doesn't make sense, even if it's at the top. Each
ConditionVariablePrepareToSleep() will unregister itself:

/*
 * If some other sleep is already prepared, cancel it; this is necessary
 * because we have just one static variable tracking the prepared sleep,
 * and also only one cvWaitLink in our PGPROC.  It's okay to do this
 * because whenever control does return to the other test-and-sleep loop,
 * its ConditionVariableSleep call will just re-establish that sleep as
 * the prepared one.
 */
if (cv_sleep_target != NULL)
ConditionVariableCancelSleep();

The intended use is documented in this comment:

 * This should be called in a predicate loop that tests for a specific exit
 * condition and otherwise sleeps, like so:
 *
 *   ConditionVariablePrepareToSleep(cv);  // optional
 *   while (condition for which we are waiting is not true)
 *   ConditionVariableSleep(cv, wait_event_info);
 *   ConditionVariableCancelSleep();

Greetings,

Andres Freund




Re: Race conditions in 019_replslot_limit.pl

2022-02-22 Thread Andres Freund
Hi,


I think I did find a bug related to the test, but afaics not the cause of the
test failures we're seeing. See
https://www.postgresql.org/message-id/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de

I don't think it's related to the problem of this thread, because the logs of
primary3 don't have a single mention of

ereport(LOG,
        (errmsg("terminating process %d to release replication slot \"%s\"",
                active_pid, NameStr(slotname))));

On 2022-02-18 15:14:15 -0800, Andres Freund wrote:
> I'm running out of ideas for how to try to reproduce this. I think we might
> need some additional debugging information to get more information from the
> buildfarm.

> I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
> to pg_basebackup in $node_primary3->backup(...).
>
> It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
> ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().

Planning to commit something like the attached.

Greetings,

Andres Freund
>From afdeff10526e29e3fc63b18c08100458780489d9 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 22 Feb 2022 18:02:34 -0800
Subject: [PATCH v1] Add temporary debug info to help debug
 019_replslot_limit.pl failures.

I have not been able to reproduce the occasional failures of
019_replslot_limit.pl we are seeing in the buildfarm and not for lack of
trying. The additional logging and increased log level will hopefully help.

Will be reverted once the cause is identified.

Discussion: https://postgr.es/m/20220218231415.c4plkp4i3reqc...@alap3.anarazel.de
---
 src/backend/replication/slot.c| 21 +
 src/bin/pg_basebackup/pg_basebackup.c | 10 +-
 src/test/recovery/t/019_replslot_limit.pl |  5 -
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 5da5fa825a2..3d39fddaaef 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -177,6 +177,10 @@ ReplicationSlotInitialize(void)
 static void
 ReplicationSlotShmemExit(int code, Datum arg)
 {
+	/* temp debugging aid to analyze 019_replslot_limit failures */
+	elog(DEBUG3, "replication slot exit hook, %s active slot",
+		 MyReplicationSlot != NULL ? "with" : "without");
+
 	/* Make sure active replication slots are released */
 	if (MyReplicationSlot != NULL)
 		ReplicationSlotRelease();
@@ -554,6 +558,9 @@ ReplicationSlotCleanup(void)
 	Assert(MyReplicationSlot == NULL);
 
 restart:
+	/* temp debugging aid to analyze 019_replslot_limit failures */
+	elog(DEBUG3, "temporary replication slot cleanup: begin");
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -579,6 +586,8 @@ restart:
 	}
 
 	LWLockRelease(ReplicationSlotControlLock);
+
+	elog(DEBUG3, "temporary replication slot cleanup: done");
 }
 
 /*
@@ -1284,6 +1293,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
 (void) kill(active_pid, SIGTERM);
 last_signaled_pid = active_pid;
 			}
+			else
+			{
+/* temp debugging aid to analyze 019_replslot_limit failures */
+elog(DEBUG3, "not signalling process %d during invalidation of slot \"%s\"",
+	 active_pid, NameStr(slotname));
+			}
 
 			/* Wait until the slot is released. */
 			ConditionVariableSleep(&s->active_cv,
@@ -1347,6 +1362,10 @@ InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
 	XLogSegNoOffsetToRecPtr(oldestSegno, 0, wal_segment_size, oldestLSN);
 
 restart:
+	/* temp debugging aid to analyze 019_replslot_limit failures */
+	elog(DEBUG3, "begin invalidating obsolete replication slots older than %X/%X",
+		 LSN_FORMAT_ARGS(oldestLSN));
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 	for (int i = 0; i < max_replication_slots; i++)
 	{
@@ -1372,6 +1391,8 @@ restart:
 		ReplicationSlotsComputeRequiredLSN();
 	}
 
+	elog(DEBUG3, "done invalidating obsolete replication slots");
+
 	return invalidated;
 }
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 08b07d5a06e..8c77c533e64 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -700,8 +700,16 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 	bgchild = fork();
 	if (bgchild == 0)
 	{
+		int			ret;
+
 		/* in child process */
-		exit(LogStreamerMain(param));
+		ret = LogStreamerMain(param);
+
+		/* temp debugging aid to analyze 019_replslot_limit failures */
+		if (verbose)
+			pg_log_info("log streamer with pid %d exiting", getpid());
+
+		exit(ret);
 	}
 	else if (bgchild < 0)
 	{
diff --git a/src/test/recovery/t/019_replslot_limit.pl 

Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 14:45:19 +0900, Masahiko Sawada wrote:
> I've attached a patch that changes pg_stat_subscription_workers view.

Thanks for working on this!

Why are the stats stored in the per-database stats file / as a second level
below the database? While they're also associated with a database, it's a
global catalog, so it seems to make more sense to have them "live" globally as
well?

Not just from an aesthetic perspective, but there might also be cases where
it's useful to send stats from the stats launcher. E.g. the number of times
the launcher couldn't start a worker because the max number of workers was
already active or such.

Greetings,

Andres Freund




Re: Use generation context to speed up tuplesorts

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-18 12:10:51 +1300, David Rowley wrote:
> The other way I thought to fix it was by changing the logic for when
> generation blocks are freed.  In the problem case mentioned above, the
> block being freed is the current block (which was just allocated).  I
> made some changes to adjust this behaviour so that we no longer free
> the block when the final chunk is pfree()'d. Instead, that now lingers
> and can be reused by future allocations, providing they fit inside it.

That makes sense to me, as long as we keep just one such block.


> The problem I see with this method is that there still could be some
> pathological case that causes us to end up storing just a single tuple per
> generation block.

Crazy idea: Detect the situation, and recompact. Create a new context, copy
all the tuples over, delete the old context. That could be a win even in less
adversarial situations than "a single tuple per generation block".
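
A minimal sketch of that recompaction idea, assuming the caller tracks its
tuples in an array of pointers and assuming the three-argument
GenerationContextCreate() signature (everything here is illustrative, not
tuplesort.c code):

    /*
     * Copy all live tuples into a freshly created generation context and drop
     * the old, fragmented one.  Returns the new context; the tuple pointers in
     * the array are updated in place.
     */
    static MemoryContext
    recompact_tuples(MemoryContext parent, MemoryContext oldcxt,
                     void **tuples, Size *lengths, int ntuples)
    {
        MemoryContext newcxt = GenerationContextCreate(parent, "recompacted tuples",
                                                       ALLOCSET_DEFAULT_MAXSIZE);

        for (int i = 0; i < ntuples; i++)
        {
            void   *copy = MemoryContextAlloc(newcxt, lengths[i]);

            memcpy(copy, tuples[i], lengths[i]);
            tuples[i] = copy;
        }

        MemoryContextDelete(oldcxt);
        return newcxt;
    }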


Greetings,

Andres Freund




Re: catalog access with reset GUCs during parallel worker startup

2022-02-22 Thread Andres Freund
Hi,

I think we need to do something about this, because afaics this has data
corruption risks, albeit with a narrow window.

E.g. consider what happens if the value of wal_sync_method is reset to the
boot value and one of the catalog accesses in GUC check hooks causes WAL
writes (e.g. due to pruning, buffer replacement). One backend doing
fdatasync() and the rest O_DSYNC, or vice versa, won't lead to reliable
durability.


On 2022-02-09 21:47:09 -0500, Robert Haas wrote:
> I remember trying that, and if I remember correctly, it broke with
> core GUCs, without any need for extensions in the picture. I don't
> remember the details too well, unfortunately, but I think it had to do
> with the behavior of some of the check and/or assign hooks.
>
> It's probably time to try it again, because (a) maybe things are
> different now, or (b) maybe I did it wrong, and in any event (c) I
> can't really remember why it didn't work and we probably want to know
> that.

The tests fail:
+ERROR:  invalid value for parameter "session_authorization": "andres"
+CONTEXT:  while setting parameter "session_authorization" to "andres"
+parallel worker
+while scanning relation "public.pvactst"

but that's easily worked around. In fact, there's already code to do so, it
just is lacking awareness of session_authorization:

static bool
can_skip_gucvar(struct config_generic *gconf)
...
 * Role must be handled specially because its current value can be an
 * invalid value (for instance, if someone dropped the role since we set
 * it).  So if we tried to serialize it normally, we might get a failure.
 * We skip it here, and use another mechanism to ensure the worker has the
 * right value.

Don't those reasons apply just as well to session_authorization as they do to
role?

If I skip session_authorization in can_skip_gucvar() and move
RestoreGUCState() out of the transaction, all tests pass.

Unsurprisingly we get bogus results for parallel accesses to
session_authorization:

postgres[3312622][1]=> SELECT current_setting('session_authorization');
┌─────────────────┐
│ current_setting │
├─────────────────┤
│ frak            │
└─────────────────┘
(1 row)

postgres[3312622][1]=> SET force_parallel_mode = on;
SET
postgres[3312622][1]=> SELECT current_setting('session_authorization');
┌─────────────────┐
│ current_setting │
├─────────────────┤
│ andres          │
└─────────────────┘
(1 row)

but that could be remedied just as already done for role.

Greetings,

Andres Freund




Re: Frontend error logging style

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 18:23:19 -0500, Tom Lane wrote:
> Robert Haas  writes:
> > On Fri, Nov 19, 2021 at 5:17 AM Peter Eisentraut
> >  wrote:
> >> If people want to do that kind of thing (I'm undecided whether the
> >> complexity is worth it), then make it a different API.  The pg_log_*
> >> calls are for writing formatted output.  They normalized existing
> >> hand-coded patterns at the time.  We can wrap another API on top of them
> >> that does flow control and output.  The pg_log_* stuff is more on the
> >> level of syslog(), which also just outputs stuff.  Nobody is suggesting
> >> that syslog(LOG_EMERG) should exit the program automatically.  But you
> >> can wrap higher-level APIs such as ereport() on top of that that might
> >> do that.

That'd maybe be a convincing argument in a project that didn't have
elog(FATAL).


> > Yeah, that might be a way forward.
> 
> This conversation seems to have tailed off without full resolution,
> but I observe that pretty much everyone except Peter is on board
> with defining pg_log_fatal as pg_log_error + exit(1).  So I think
> we should just do that, unless Peter wants to produce a finished
> alternative proposal.

What about adding a pg_fatal() that's pg_log_fatal() + exit()? That keeps
pg_log_* stuff "log only", but adds something adjacent enough to hopefully
reduce future misunderstandings?

To further decrease the chance of such mistakes, we could rename
pg_log_fatal() to something else. Or add a pg_log_fatal() function that's
marked pg_nodiscard(), so that each callsite explicitly has to be (void)
pg_log_fatal().
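
To make the shape concrete, here's a tiny standalone sketch (invented toy_*
names and a trivial stderr logger, not the actual src/common/logging.c API)
of "log at fatal level, then exit(1)" as a wrapper on top of a log-only
primitive:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* toy stand-in for pg_log_fatal(): formats to stderr, nothing more */
static void
toy_log_fatal(const char *fmt, ...)
{
	va_list		ap;

	fprintf(stderr, "fatal: ");
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
}

/* the proposed adjacent wrapper: the log-only call plus exit(1) */
#define toy_fatal(...) \
	do { toy_log_fatal(__VA_ARGS__); exit(1); } while (0)

int
main(void)
{
	toy_fatal("could not open file \"%s\"", "nonexistent.conf");
	return 0;				/* unreachable */
}

That keeps the pg_log_* calls "log only", while the *_fatal() wrapper is
clearly a different, adjacent thing, which is all the distinction being
argued for here.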


> I've now gone through and fleshed out the patch I sketched upthread.
> This exercise convinced me that we absolutely should do something
> very like this, because
> 
> (1) As it stands, this patch removes a net of just about 1000 lines
> of code.  So we clearly need *something*.

> (2) The savings would be even more, except that people have invented
> macros for pg_log_error + exit(1) in at least four places already.
> (None of them quite the same of course.)  Here I've just fixed those
> macro definitions to use pg_log_fatal.  In the name of consistency, we
> should probably get rid of those macros in favor of using pg_log_fatal
> directly; but I've not done so here, since this patch is eyewateringly
> long and boring already.

Agreed.


I don't care that much about the naming here.


For a moment I was wondering whether there'd be a benefit in having the "pure
logging" stuff be in a different static library than the erroring variant. So
that e.g. libpq's check for exit() doesn't run into trouble. But then I
remembered that libpq doesn't link against common...

Greetings,

Andres Freund




Re: Frontend error logging style

2022-02-22 Thread Andres Freund
On 2022-02-22 22:44:25 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > What about adding a pg_fatal() that's pg_log_fatal() + exit()? That keeps
> > pg_log_* stuff "log only", but adds something adjacent enough to hopefully
> > reduce future misunderstandings?
> 
> I'd be okay with that, except that pg_upgrade already has a pg_fatal
> (because it has its *own* logging system, just in case you thought
> this wasn't enough of a mess yet).  I'm in favor of aligning
> pg_upgrade's logging with the rest, but I'd hoped to leave that for
> later.  Making the names collide would be bad even as a short-term
> thing, I fear.

I guess we could name pg_upgrade's out of the way...


> I'm not against choosing some name other than pg_log_fatal, but that
> particular suggestion has got conflicts.  Got any other ideas?

Maybe pg_fatal_exit(), pg_exit_fatal() or pg_fatal_error()?




Re: bailing out in tap tests nearly always a bad idea

2022-02-23 Thread Andres Freund
Hi,

On 2022-02-23 08:43:38 -0500, Andrew Dunstan wrote:
> 
> On 2/22/22 15:54, Andres Freund wrote:
> > On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote:
> >> I'll be surprised if we can't come up with something cleaner than that.
> > Suggestions?
> 
> 
> If we just have the sig handler actions as:
> 
>     diag("died: $_[0]");
>     done_testing();   
> 
> we get:
> 
>     ok 1 - foo
>     # died: blorb at tst/tst.pl line 5.
>     1..1
>     # Looks like your test exited with 25 just after 1.
> 
> 
> Would that work?

Well, the if condition I had is needed, afaics. Otherwise we break eval() and
/ or risk crashing on syntax errors.  If you're just talking about diag() vs
ok(0, ...), I'm good with that.

Greetings,

Andres Freund




Re: Frontend error logging style

2022-02-23 Thread Andres Freund
Hi,

On 2022-02-23 10:30:02 -0500, Tom Lane wrote:
> So I now propose modifying yesterday's patch thus:
> 
> * Reinstantiate the PG_LOG_FATAL enum value, add support macros
> pg_log_fatal, pg_log_fatal_hint, pg_log_fatal_detail.
> 
> * Define pg_fatal as pg_log_fatal + exit(1).  (This would essentially
> move pg_rewind's definition into logging.h.  pg_upgrade will
> define it slightly differently, but the semantics end up the same.)
> 
> * Adjust call sites to match.

+1


> I do like this idea because it would not break any existing code
> that expects pg_log_fatal to return.  There is likely to be some
> of that in outstanding patches, and this approach would merely
> render it less-than-idiomatic rather than outright broken.
> 
> Updating the patch is going to be a bit tedious, so I'm not
> going to do it without buy-in that this solution would be
> okay to commit.  Any objections?

+1

Greetings,

Andres Freund




convert libpq uri-regress tests to tap test

2022-02-23 Thread Andres Freund
Hi,

When verifying that the meson port actually runs all perl based tests I came
across src/interfaces/libpq/test/regress.pl.  Instead of running tests yet
another way, it seems better to convert it to a tap test.

I hope others agree?

Where would we want that test to live? Right now we have the slightly odd
convention that some tap tests live in src/test/{misc,modules,...}. But
e.g. frontend binary ones are below src/bin/.

For now I've left it in src/interfaces/libpq/test, with the test in
t/001_uri.pl. But we should at least get rid of the test/...

Perhaps we should just rename src/test/modules/libpq_pipeline to
src/test/modules/libpq and move uri-regress in there as well?


The tap test needs a bit more polish, mostly posted this to get some feedback
before wasting more time :)

Greetings,

Andres Freund
From 77b3666c9a7f17f98120cfc9c2a8e99f7d4ab491 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 23 Feb 2022 12:22:56 -0800
Subject: [PATCH v1] Convert src/interfaces/libpq/test to a tap test.

---
 src/interfaces/libpq/Makefile  |   2 +-
 src/interfaces/libpq/test/Makefile |  10 +-
 src/interfaces/libpq/test/README   |   7 -
 src/interfaces/libpq/test/expected.out | 171 
 src/interfaces/libpq/test/regress.in   |  57 --
 src/interfaces/libpq/test/regress.pl   |  65 --
 src/interfaces/libpq/test/t/001_uri.pl | 266 +
 7 files changed, 274 insertions(+), 304 deletions(-)
 delete mode 100644 src/interfaces/libpq/test/README
 delete mode 100644 src/interfaces/libpq/test/expected.out
 delete mode 100644 src/interfaces/libpq/test/regress.in
 delete mode 100644 src/interfaces/libpq/test/regress.pl
 create mode 100644 src/interfaces/libpq/test/t/001_uri.pl

diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index 844c95d47de..8920f7e6365 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -137,7 +137,7 @@ install: all installdirs install-lib
 	$(INSTALL_DATA) $(srcdir)/pqexpbuffer.h '$(DESTDIR)$(includedir_internal)'
 	$(INSTALL_DATA) $(srcdir)/pg_service.conf.sample '$(DESTDIR)$(datadir)/pg_service.conf.sample'
 
-installcheck:
+installcheck check:
 	$(MAKE) -C test $@
 
 installdirs: installdirs-lib
diff --git a/src/interfaces/libpq/test/Makefile b/src/interfaces/libpq/test/Makefile
index 4832fab9d23..5dbfe5d09ae 100644
--- a/src/interfaces/libpq/test/Makefile
+++ b/src/interfaces/libpq/test/Makefile
@@ -1,3 +1,5 @@
+# src/interfaces/libpq/test/Makefile
+
 subdir = src/interfaces/libpq/test
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
@@ -13,10 +15,12 @@ PROGS = uri-regress
 
 all: $(PROGS)
 
+check: all
+	$(prove_check)
+
 installcheck: all
-	SRCDIR='$(top_srcdir)' SUBDIR='$(subdir)' \
-		   $(PERL) $(top_srcdir)/$(subdir)/regress.pl
+	$(prove_installcheck)
 
 clean distclean maintainer-clean:
 	rm -f $(PROGS) *.o
-	rm -f regress.out regress.diff
+	rm -rf tmp_check
diff --git a/src/interfaces/libpq/test/README b/src/interfaces/libpq/test/README
deleted file mode 100644
index a05eb6bb3bc..000
--- a/src/interfaces/libpq/test/README
+++ /dev/null
@@ -1,7 +0,0 @@
-This is a testsuite for testing libpq URI connection string syntax.
-
-To run the suite, use 'make installcheck' command.  It works by
-running 'regress.pl' from this directory with appropriate environment
-set up, which in turn feeds up lines from 'regress.in' to
-'uri-regress' test program and compares the output against the correct
-one in 'expected.out' file.
diff --git a/src/interfaces/libpq/test/expected.out b/src/interfaces/libpq/test/expected.out
deleted file mode 100644
index d375e82b4ac..000
--- a/src/interfaces/libpq/test/expected.out
+++ /dev/null
@@ -1,171 +0,0 @@
-trying postgresql://uri-user:secret@host:12345/db
-user='uri-user' password='secret' dbname='db' host='host' port='12345' (inet)
-
-trying postgresql://uri-user@host:12345/db
-user='uri-user' dbname='db' host='host' port='12345' (inet)
-
-trying postgresql://uri-user@host/db
-user='uri-user' dbname='db' host='host' (inet)
-
-trying postgresql://host:12345/db
-dbname='db' host='host' port='12345' (inet)
-
-trying postgresql://host/db
-dbname='db' host='host' (inet)
-
-trying postgresql://uri-user@host:12345/
-user='uri-user' host='host' port='12345' (inet)
-
-trying postgresql://uri-user@host/
-user='uri-user' host='host' (inet)
-
-trying postgresql://uri-user@
-user='uri-user' (local)
-
-trying postgresql://host:12345/
-host='host' port='12345' (inet)
-
-trying postgresql://host:12345
-host='host' port='12345' (inet)
-
-trying postgresql://hos

Re: convert libpq uri-regress tests to tap test

2022-02-23 Thread Andres Freund
Hi,

On 2022-02-23 15:57:08 -0500, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On 2/23/22 15:30, Andres Freund wrote:
> >> Perhaps we should just rename src/test/modules/libpq_pipeline to
> >> src/test/modules/libpq and move uri-regress in there as well?
> 
> > WFM
> 
> +1

Cool.


One question on making that happen: Right now the makefiles building C stuff
in src/test/modules all have the contrib-style stanza to build via pgxs.  But
afaics pgxs doesn't support building multiple programs. Which
src/test/modules/libpq would need to.

Afaics there's currently no way to have Mkvcbuild.pm build multiple programs in
one dir via its makefile parsing stuff? Andrew, any suggestions for not
needing to spell this out in both the makefile and Mkvcbuild.pm?


Separately: I don't really understand why we do the whole if USE_PGXS dance in
src/test/modules?

- Andres




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 13:23:55 +0100, Peter Eisentraut wrote:
> On 24.02.22 12:46, Masahiko Sawada wrote:
> > > We have a view called pg_stat_activity, which is very well known.  From
> > > that perspective, "activity" means what is happening right now or what
> > > has happened most recently.  The reworked view in this patch does not
> > > contain that (we already have pg_stat_subscription for that), but it
> > > contains accumulated counters.
> > Right.
> > 
> > What pg_stat_subscription shows is rather suitable for the name
> > pg_stat_subscription_activity than the reworked view. But switching
> > these names would also not be a good idea.  I think it's better to use
> > "subscription" in the view name since it shows actually statistics for
> > subscriptions and subscription OID is the key. I personally prefer
> > pg_stat_subscription_counters among the ideas that have been proposed
> > so far, but I'd like to hear opinions and votes.
> 
> _counters will fail if there is something not a counter (such as
> last-timestamp-of-something).
> 
> Earlier, pg_stat_subscription_stats was mentioned, which doesn't have that
> problem.

We really should try to fix this in a more general way at some point. We have
way too many different things mixed up in pg_stat_*.

I'd like to get something like the patch in soon though, we can still change
the name later. I've been blocked behind this stuff for weeks, and it's
getting really painful...

Greetings,

Andres Freund




Re: convert libpq uri-regress tests to tap test

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 13:31:40 +0100, Peter Eisentraut wrote:
> I think the TAP scripts should be in src/interfaces/libpq/t/, as usual. The
> supporting code snippets could live in some other directory under
> src/interfaces/libpq/, which might be called "test" or something else, not
> that important.

Why not in t/? We can't easily build the test programs in libpq/ itself, but
libpq/t should be fairly doable.

Greetings,

Andres Freund




Re: Frontend error logging style

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 14:06:18 +0100, Peter Eisentraut wrote:
> My suggestion is to just get rid of pg_log_fatal() and replace them all with
> pg_log_error().

-1. This ignores that several places have already come up with their own slightly
different versions of fatal exit handlers. We don't gain anything by not
standardizing on one notion of a fatal error wrapper.

Greetings,

Andres Freund




Re: Readd use of TAP subtests

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 11:16:03 +0100, Peter Eisentraut wrote:
> Now that we have switched everything to done_testing(), the subtests feature
> isn't that relevant anymore, but it might still be useful to get better
> output when running with PROVE_FLAGS=--verbose.

I've incidentally played with subtests yesterday, when porting
src/interfaces/libpq/test/regress.pl to a tap test. Unfortunately it seems
that subtests aren't actually specified in the tap format, and that different
libraries generate different output formats. The reason this matters somewhat
is that meson's testrunner can parse tap and give nicer progress / error
reports. But since subtests aren't in the spec it can't currently parse
them...

Open issue since 2015:
https://github.com/TestAnything/Specification/issues/2

The perl ecosystem is so moribund :(.


> t/001_basic.pl ..
> # Subtest: vacuumlo --help
> ok 1 - exit code 0
> ok 2 - goes to stdout
> ok 3 - nothing to stderr
> 1..3

It's clearly better.

We can approximate some of it by using is_deeply() and comparing exit, stdout,
stderr at once. Particularly for helpers like program_help() that are used in
a lot of places.

Greetings,

Andres Freund




Re: create_index test fails when synchronous_commit = off @ master

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 16:47:25 +0300, Aleksander Alekseev wrote:
> -  QUERY PLAN
> 
> - Index Only Scan using tenk1_thous_tenthous on tenk1
> -   Index Cond: (thousand < 2)
> -   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
> -(3 rows)
> +  QUERY PLAN
> +--
> + Sort
> +   Sort Key: thousand
> +   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
> + Index Cond: ((thousand < 2) AND (tenthous = ANY
> ('{1001,3000}'::integer[])))
> +(4 rows)

Heh. We've been having a lot of fights with exactly this plan change in the
AIO branch, before cc50080a82, and without synchronous_commit =
off. Interestingly near-exclusively with the regression run within
pg_upgrade's tests.

For aio we (David did a lot of that IIRC) finally hunted it down to being due to
vacuum skipping pages due to the inability to get a cleanup lock. If that happens
enough, pg_class.relallvisible changes enough to lead to the different plan.


I first was going to suggest that we should just use VACUUM FREEZE to prevent
the issue.

But in this instance the cause isn't cleanup locks, probably that we can't yet
set hint bits due to synchronous_commit=off? But I don't *fully* understand
how it leads to this.

I added the SELECT relpages, reltuples, relallvisible FROM pg_class WHERE oid = 
'tenk1'::regclass;
just after the
VACUUM ANALYZE tenk1;

synchronous_commit=on
+ relpages | reltuples | relallvisible
+--+---+---
+  345 | 1 |   345
+(1 row)

synchronous_commit=off
+ relpages | reltuples | relallvisible
+--+---+---
+  345 | 1 | 0
+(1 row)

So it clearly is the explanation for the issue.


Obviously we can locally work around it by adding a
SET LOCAL synchronous_commit = local;
to the COPY. But I'd like to fully understand what's going on.


> I didn't investigate further. Do we assume that `make installcheck` suppose
> to pass with a different postgresql.conf options?

Depends on the option, I think... There's some where it's interesting to run
tests with different options and where the effort required is reasonable. And
some cases where it's not... synchronous_commit=off worked until recently, and
I think we should keep it working.

Greetings,

Andres Freund




Re: create_index test fails when synchronous_commit = off @ master

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 07:33:39 -0800, Andres Freund wrote:
> I added the SELECT relpages, reltuples, relallvisible FROM pg_class WHERE oid 
> = 'tenk1'::regclass;
> just after the
> VACUUM ANALYZE tenk1;
>
> synchronous_commit=on
> + relpages | reltuples | relallvisible
> +--+---+---
> +  345 | 1 |   345
> +(1 row)
>
> synchronous_commit=off
> + relpages | reltuples | relallvisible
> +--+---+---
> +  345 | 1 | 0
> +(1 row)
>
> So it clearly is the explanation for the issue.
>
>
> Obviously we can locally work around it by adding a
> SET LOCAL synchronous_commit = local;
> to the COPY. But I'd like to fully understand what's going on.

It is the hint bit setting that is delayed by asynchronous commit. I traced execution and
we do end up not setting all visible due to reaching the
!HeapTupleHeaderXminCommitted() path in lazy_scan_prune()

			case HEAPTUPLE_LIVE:
				...
				/*
				 * Is the tuple definitely visible to all transactions?
				 *
				 * NB: Like with per-tuple hint bits, we can't set the
				 * PD_ALL_VISIBLE flag if the inserter committed
				 * asynchronously. See SetHintBits for more info. Check that
				 * the tuple is hinted xmin-committed because of that.
				 */
				if (prunestate->all_visible)
				{
					TransactionId xmin;

					if (!HeapTupleHeaderXminCommitted(tuple.t_data))


So it might be reasonable to use synchronous_commit=on in test_setup.sql?

It's not super satisfying, but I don't immediately see what else could prevent
all-visible from being set in this case? There are no dead rows, so concurrent
snapshots shouldn't prevent cleanup.

I guess we could alternatively try doing something like flushing pending async
commits at the start of vacuum. But that probably isn't warranted.

Greetings,

Andres Freund




Re: convert libpq uri-regress tests to tap test

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 10:17:28 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-02-24 13:31:40 +0100, Peter Eisentraut wrote:
> >> I think the TAP scripts should be in src/interfaces/libpq/t/, as usual. The
> >> supporting code snippets could live in some other directory under
> >> src/interfaces/libpq/, which might be called "test" or something else, not
> >> that important.
>
> > Why not in t/? We can't easily build the test programs in libpq/ itself, but
> > libpq/t should be fairly doable.
>
> I think that having t/ directories contain only Perl test scripts
> is a good convention that we should stick to.  Peter's proposal
> of a separate test/ subdirectory for C test scaffolding is
> probably fine.

That exists today and continues to exist in the patch upthread, so it's easy
;). I just need to move the libpq/test/t to libpq/t and adjust the binary
path.

One annoying bit is that our current tap invocation infrastructure for msvc
won't know how to deal with that. We put the build directory containing t/
onto PATH, but that won't work for test/. But we also don't want to install
test binaries. Not sure what the solution for that is.

Even on !windows, we only know how to find "test executables" in tap tests via
PATH. We're in the source dir, so we can't just do test/executable.

I probably just need another coffee, but right now I don't even see how to add
anything to PATH given $(prove_check)'s definition - it ends up with multiple
shells. We can fix that by using && in the definition, which might be a good
thing anyway?

Attached three patches:

0001: Convert src/interfaces/libpq/test to a tap test
0002: Add tap test infrastructure to src/interfaces/libpq
0003: Move libpq_pipeline test into src/interfaces/libpq.

I did 0001 and 0002 in that order because prove errors out with a stacktrace
if no tap tests exist... It might make more sense to commit them together, but
for review it's easier to keep them separate I think.


Andrew, do you have an idea about the feasibility of supporting any of this
with the msvc build?

I'm mildly inclined to only do 0001 and 0002 for now. We'd not lose msvc
coverage, because it already doesn't build the test. Once we've ironed that
stuff out, we could do 0003?

Greetings,

Andres Freund
From 71fa1532a1540e8bbf8bdee3ec0b64e863f212f2 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 23 Feb 2022 12:22:56 -0800
Subject: [PATCH v2 1/3] Convert src/interfaces/libpq/test to a tap test.

The invocation of the tap test will be added in the next commit.
---
 src/interfaces/libpq/t/001_uri.pl  | 265 +
 src/interfaces/libpq/test/.gitignore   |   2 -
 src/interfaces/libpq/test/Makefile |   7 +-
 src/interfaces/libpq/test/README   |   7 -
 src/interfaces/libpq/test/expected.out | 171 
 src/interfaces/libpq/test/regress.in   |  57 --
 src/interfaces/libpq/test/regress.pl   |  65 --
 7 files changed, 267 insertions(+), 307 deletions(-)
 create mode 100644 src/interfaces/libpq/t/001_uri.pl
 delete mode 100644 src/interfaces/libpq/test/README
 delete mode 100644 src/interfaces/libpq/test/expected.out
 delete mode 100644 src/interfaces/libpq/test/regress.in
 delete mode 100644 src/interfaces/libpq/test/regress.pl

diff --git a/src/interfaces/libpq/t/001_uri.pl b/src/interfaces/libpq/t/001_uri.pl
new file mode 100644
index 000..0225666d9f6
--- /dev/null
+++ b/src/interfaces/libpq/t/001_uri.pl
@@ -0,0 +1,265 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use Test::More;
+use IPC::Run;
+
+my @tests = (
+	[q{postgresql://uri-user:secret@host:12345/db},
+	 q{user='uri-user' password='secret' dbname='db' host='host' port='12345' (inet)},
+ q{},
+],
+	[q{postgresql://uri-user@host:12345/db},
+	 q{user='uri-user' dbname='db' host='host' port='12345' (inet)},
+ q{},
+],
+	[q{postgresql://uri-user@host/db},
+	 q{user='uri-user' dbname='db' host='host' (inet)},
+ q{},
+],
+	[q{postgresql://host:12345/db},
+	 q{dbname='db' host='host' port='12345' (inet)},
+ q{},
+],
+	[q{postgresql://host/db},
+	 q{dbname='db' host='host' (inet)},
+ q{},
+],
+	[q{postgresql://uri-user@host:12345/},
+	 q{user='uri-user' host='host' port='12345' (inet)},
+ q{},
+],
+	[q{postgresql://uri-user@host/},
+	 q{user='uri-user' host='host' (inet)},
+ q{},
+],
+	[q{postgresql://uri-user@},
+	 q{user='uri-user' (local)},
+ q{},
+],
+	[q{postgresql://host:12345/},
+	 q{host='host' port=

Re: convert libpq uri-regress tests to tap test

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 17:03:33 +, Jacob Champion wrote:
> On Thu, 2022-02-24 at 08:46 -0800, Andres Freund wrote:
> > One annoying bit is that our current tap invocation infrastructure for msvc
> > won't know how to deal with that. We put the build directory containing t/
> > onto PATH, but that won't work for test/. But we also don't want to install
> > test binaries. Not sure what the solution for that is.
> 
> Would it help if the C executable, not Perl, was the thing actually
> producing the TAP output? The binaries built from test/ could be placed
> into t/. Or does that just open up a new set of problems?

I don't think it would help that much. And for the libpq pipeline test it
doesn't really make sense anyway, because we intentionally use it with
different traces and such.

I've thought about a few C tap tests that I'd like, but I think that's a
separate discussion.

Greetings,

Andres Freund




Re: ltree_gist indexes broken after pg_upgrade from 12 to 13

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 22:15:20 +0100, Tomas Vondra wrote:
> After thinking about this I only see two options:
> 
> 1) Don't apply the patch and tell everyone using ltree_gist they need to
> rebuild the index after pg_upgrade from 12 to 13+. The downside of this
> is larger indexes (because some tuples are 20B larger).
> 
> 2) Apply the patch + tell those who already upgraded from 12 to rebuild
> ltree_gist indexes, because those might be broken due to new inserts.
> 
> 
> My opinion is to do (2), because at least those who'll upgrade later
> (which is a lot of people) will be fine without a rebuild. And it also
> makes the indexes a bit smaller, thanks to saving 20B.

Won't 2) also break indexes created without a pg_upgrade? "already upgraded
from 12" sounds like it wouldn't but I don't see why?

Greetings,

Andres Freund




Re: ltree_gist indexes broken after pg_upgrade from 12 to 13

2022-02-24 Thread Andres Freund
Hi, 

On February 24, 2022 5:44:57 PM PST, Tom Lane  wrote:
>Tomas Vondra  writes:
>> I wonder if we could check the index tuple length, and adjust the siglen
>> based on that, somehow ...
>
>In an already-corrupted index, there won't be a unique answer.

If we're breaking every ltree gist index, can we detect that it's not an index 
generated by the new version? Most people don't read release notes, and this is 
just about guaranteed to break all of them.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Proposal: Support custom authentication methods using hooks

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 17:02:45 -0800, Jeff Davis wrote:
> On Thu, 2022-02-17 at 11:25 -0800, samay sharma wrote:
> One caveat is that this only works given information available from
> existing authentication methods, because that's all the client
> supports. In practice, it seems to only be useful with plaintext
> password authentication over an SSL connection.

Why is it restricted to that? You could do sasl negotiation as well from what
I can see? And that'd theoretically also allow negotiating whether the client
supports different ways of doing auth?  Not saying that that's easy, but I
don't think it's a fundamental restriction.

I also can imagine things like using selinux labeling of connections.

We have several useful authentication technologies built on top of plaintext
exchange. Radius, Ldap, Pam afaics could be implemented as an extension?

Greetings,

Andres Freund




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-25 13:01:26 +0800, Julien Rouhaud wrote:
> On Thu, Feb 24, 2022 at 08:44:08PM +, Jacob Champion wrote:
> > 
> > Yeah... I was following a similar track with the initial work last
> > year, but I dropped it when the cost of implementation started to grow
> > considerably. At the time, though, it looked like some overhauls to the
> > stats framework were being pursued? I should read up on that thread.
> 
> Do you mean the shared memory stats patchset?  IIUC this is unrelated, as the
> beentry stuff Michael was talking about is a different infrastructure
> (backend_status.[ch]), and I don't think there are any plans to move that to
> dynamic shared memory.

Until a year ago the backend_status.c stuff was in pgstat.c - I just split
them out because of the shared memory status work - so it'd not be surprising
for them to mentally be thrown in one bucket.

Basically the type of stats we're trying to move to dynamic shared memory is
about counters that should persist for a while and are accumulated across
connections etc. Whereas backend_status.c is more about tracking the current
state (what query is a backend running, what user, etc).

They're not unrelated though: backend_status.c feeds pgstat.c with some
information occasionally.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-24 Thread Andres Freund
 they were still all-frozen at that
> time.  But it skipped the page either way (since we already committed to
> skipping the page at the point of the recheck).  This was correct, but
> sometimes resulted in non-aggressive VACUUMs needlessly wasting an
> opportunity to advance relfrozenxid (when a page was modified in just
> the wrong way, at just the wrong time).  It also resulted in a needless
> recheck of the visibility map for each and every page skipped during
> non-aggressive VACUUMs.
>
> Avoid these problems by conditioning the "skippable page was definitely
> all-frozen when range of skippable pages was first determined" behavior
> on what the visibility map _actually said_ about the range as a whole
> back when we first determined the extent of the range (don't deduce what
> must have happened at that time on the basis of aggressive-ness).  This
> allows us to reliably count skipped pages in frozenskipped_pages when
> they were initially all-frozen.  In particular, when a page's visibility
> map bit is unset after the point where a skippable range of pages is
> initially determined, but before the point where the page is actually
> skipped, non-aggressive VACUUMs now count it in frozenskipped_pages,
> just like aggressive VACUUMs always have [1].  It's not critical for the
> non-aggressive case to get this right, but there is no reason not to.
>
> [1] Actually, it might not work that way when there happens to be a mix
> of all-visible and all-frozen pages in a range of skippable pages.
> There is no chance of VACUUM advancing relfrozenxid in this scenario
> either way, though, so it doesn't matter.

I think this commit message needs a good amount of polishing - it's very
convoluted. It's late and I didn't sleep well, but I've tried to read it
several times without really getting a sense of what this precisely does.




> From 15dec1e572ac4da0540251253c3c219eadf46a83 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan 
> Date: Thu, 24 Feb 2022 17:21:45 -0800
> Subject: [PATCH v9 4/4] Avoid setting a page all-visible but not all-frozen.

To me the commit message body doesn't actually describe what this is doing...


> This is pretty much an addendum to the work in the "Make page-level
> characteristics drive freezing" commit.  It has been broken out like
> this because I'm not even sure if it's necessary.  It seems like we
> might want to be paranoid about losing out on the chance to advance
> relfrozenxid in non-aggressive VACUUMs, though.

> The only test that will trigger this case is the "freeze-the-dead"
> isolation test.  It's incredibly narrow.  On the other hand, why take a
> chance?  All it takes is one heap page that's all-visible (and not also
> all-frozen) nestled between some all-frozen heap pages to lose out on
> relfrozenxid advancement.  The SKIP_PAGES_THRESHOLD stuff won't save us
> then [1].

FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up
spending a lot of time doing IO that we never need, completely trashing all CPU
caches, while not actually producing decent readahead IO from what I've seen.

Greetings,

Andres Freund




Re: [Proposal] Global temporary tables

2022-02-24 Thread Andres Freund
> +	 * this session, its index does not have a root page, just returns 0.
> +  */
> + if (RELATION_IS_GLOBAL_TEMP(rel) &&
> + !gtt_storage_attached(RelationGetRelid(rel)))
> + return 0;
> +
>   metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
>   metad = _bt_getmeta(rel, metabuf);

Stuff like this seems not acceptable. Accesses would have to be prevented much
earlier. Otherwise each index method is going to need copies of this logic. I
also doubt that _bt_getrootheight() is the only place that'd need this.


>  static void
>  index_update_stats(Relation rel,
>  bool hasindex,
> -double reltuples)
> +double reltuples,
> +bool isreindex)
>  {
>   Oid relid = RelationGetRelid(rel);
>   Relationpg_class;
> @@ -2797,6 +2824,13 @@ index_update_stats(Relation rel,
>   Form_pg_class rd_rel;
>   booldirty;
>  
> + /*
> +  * Most of the global Temp table data is updated to the local hash, and reindex
> +  * does not refresh relcache, so call a separate function.
> +  */
> + if (RELATION_IS_GLOBAL_TEMP(rel))
> + return index_update_gtt_relstats(rel, hasindex, reltuples, isreindex);
> +

So basically every single place in the code that does catalog accesses is
going to need a completely separate implementation for GTTs? That seems
unmaintainable.



> +/*-
> + *
> + * storage_gtt.c
> + * The body implementation of Global temparary table.
> + *
> + * IDENTIFICATION
> + * src/backend/catalog/storage_gtt.c
> + *
> + * See src/backend/catalog/GTT_README for Global temparary table's
> + * requirements and design.
> + *
> + *-
> + */

I don't think that path to the readme is correct.

Greetings,

Andres Freund




Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:
> From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
> From: Yura Sokolov 
> Date: Mon, 21 Feb 2022 08:49:03 +0300
> Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.
> 
> Acquiring two partition locks leads to complex dependency chain that hurts
> at high concurrency level.
> 
> There is no need to hold both lock simultaneously. Buffer is pinned so
> other processes could not select it for eviction. If tag is cleared and
> buffer removed from old partition other processes will not find it.
> Therefore it is safe to release old partition lock before acquiring
> new partition lock.

Yes, the current design is pretty nonsensical. It leads to really absurd stuff
like holding the relation extension lock while we write out old buffer
contents etc.



> +  * We have pinned buffer and we are single pinner at the moment so there
> +  * is no other pinners.

Seems redundant.


> We hold buffer header lock and exclusive partition
> +  * lock if tag is valid. Given these statements it is safe to clear tag
> +  * since no other process can inspect it to the moment.
> +  */

Could we share code with InvalidateBuffer here? It's not quite the same code,
but nearly the same.


> +  * The usage_count starts out at 1 so that the buffer can survive one
> +  * clock-sweep pass.
> +  *
> +  * We use direct atomic OR instead of Lock+Unlock since no other backend
> +  * could be interested in the buffer. But StrategyGetBuffer,
> +  * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them 
> to
> +  * compare tag, and UnlockBufHdr does raw write to state. So we have to
> +  * spin if we found buffer locked.

So basically the first half of the paragraph is wrong, because no, we
can't?


> +  * Note that we write tag unlocked. It is also safe since there is always
> +  * check for BM_VALID when tag is compared.



>*/
>   buf->tag = newTag;
> - buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> -BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
> -BUF_USAGECOUNT_MASK);
>   if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
> - buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> + new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
>   else
> - buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> -
> - UnlockBufHdr(buf, buf_state);
> + new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
>  
> - if (oldPartitionLock != NULL)
> + buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> + while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomic in arbitrary bits. If somebody else has
locked the buffer header in this moment, it'll lead to completely bogus
results, because unlocking overwrites concurrently written contents (which
there shouldn't be any, but here there are)...

And or'ing contents in also doesn't make sense because it doesn't work to
actually unset any contents?

Why don't you just use LockBufHdr/UnlockBufHdr?

Greetings,

Andres Freund




Re: Add parameter jit_warn_above_fraction

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 16:16:01 +0100, Magnus Hagander wrote:
> This patch adds a configuration parameter jit_warn_above_fraction that
> will cause a warning to be logged if the fraction of time spent on
> doing JIT is bigger than the specified one. For example, this can be
> used to track down those cases where JIT ends up taking 90% of the
> query runtime because of bad estimates...

Hm. Could it make sense to do this as a auto_explain feature?

Greetings,

Andres Freund




Re: Readd use of TAP subtests

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 14:39:15 +0100, Peter Eisentraut wrote:
> On 24.02.22 16:00, Andres Freund wrote:
> > I've incidentally played with subtests yesterdays, when porting
> > src/interfaces/libpq/test/regress.pl to a tap test. Unfortunately it seems
> > that subtests aren't actually specified in the tap format, and that 
> > different
> > libraries generate different output formats. The reason this matters 
> > somewhat
> > is that meson's testrunner can parse tap and give nicer progress / error
> > reports. But since subtests aren't in the spec it can't currently parse
> > them...
>
> Ok that's good to know.  What exactly happens when it tries to parse them?
> Does it not count them or does it fail somehow?  The way the output is
> structured

Says that it can't pase a line of the tap output:
16:06:55 MALLOC_PERTURB_=156 /usr/bin/perl /tmp/meson-test/build/../subtest.pl
--- output ---
stdout:
# Subtest: a
ok 1 - a: a
ok 2 - a: b
1..2
ok 1 - a
1..1
stderr:

TAP parsing error: unexpected input at line 4
--


> t/001_basic.pl ..
> # Subtest: vacuumlo --help
> ok 1 - exit code 0
> ok 2 - goes to stdout
> ok 3 - nothing to stderr
> 1..3
> ok 1 - vacuumlo --help
>
> it appears that it should be able to parse it nonetheless and should just
> count the non-indented lines.

It looks like it's not ignoring indented lines...

Greetings,

Andres Freund




Re: Readd use of TAP subtests

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 09:43:20 -0500, Andrew Dunstan wrote:
> AIUI TAP consumers are supposed to ignore lines they don't understand.

Are they?

In http://testanything.org/tap-version-13-specification.html there's:

"Lines written to standard output matching /^(not )?ok\b/ must be interpreted
as test lines. [...]All other lines must not be considered test output."

That says that all other lines aren't "test output". But what does that mean?
It certainly doesn't mean they can just be ignored, because obviously
^(TAP version|#|1..|Bail out) isn't to be ignored.

And then there's:


"
Anything else

  Any output that is not a version, a plan, a test line, a YAML block, a
  diagnostic or a bail out is incorrect. How a harness handles the incorrect
  line is undefined. Test::Harness silently ignores incorrect lines, but will
  become more stringent in the future. TAP::Harness reports TAP syntax errors at
  the end of a test run.
"

If I were to implement a tap parser this wouldn't make me ignore lines.


Contrasting to that:
http://testanything.org/tap-specification.html

"
Anything else

  A TAP parser is required to not consider an unknown line as an error but may
  optionally choose to capture said line and hand it to the test harness,
  which may have custom behavior attached. This is to allow for forward
  compatability. Test::Harness silently ignores incorrect lines, but will
  become more stringent in the future. TAP::Harness reports TAP syntax errors
  at the end of a test run.
"

I honestly don't know what to make of that. Parsers are supposed to ignore
unknown lines, Test::Harness silently ignores, but also "TAP::Harness reports
TAP syntax errors"?  This seems like a self contradictory mess.

Greetings,

Andres Freund




Re: Add parameter jit_warn_above_fraction

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 17:28:41 +0100, Magnus Hagander wrote:
> On Fri, Feb 25, 2022 at 5:20 PM Andres Freund  wrote:
> > On 2022-02-25 16:16:01 +0100, Magnus Hagander wrote:
> > > This patch adds a configuration parameter jit_warn_above_fraction that
> > > will cause a warning to be logged if the fraction of time spent on
> > > doing JIT is bigger than the specified one. For example, this can be
> > > used to track down those cases where JIT ends up taking 90% of the
> > > query runtime because of bad estimates...
> >
> > Hm. Could it make sense to do this as a auto_explain feature?
> 
> It could be. But I was looking for something a lot more "light weight"
> than having to install an extension. But yes, if we wanted to, we
> could certainly change jit_warn_above_fraction to be
> auto_explain.log_min_jit_fraction or something like that, and do
> basically the same thing.  But then, we could also have
> log_min_duration_statement be part of auto_explain instead, so it's
> all about where to draw the line :)

I guess it feels a tad on the "too narrow/specific" side of things for the
general code. We don't have log_min_duration_{parsing,planning,execution}
either.  But I also get it. So I just wanted to raise it ;)

Greetings,

Andres Freund




Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 12:51:22 +0300, Yura Sokolov wrote:
> > > +  * The usage_count starts out at 1 so that the buffer can survive one
> > > +  * clock-sweep pass.
> > > +  *
> > > +  * We use direct atomic OR instead of Lock+Unlock since no other backend
> > > +  * could be interested in the buffer. But StrategyGetBuffer,
> > > +  * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
> > > +  * compare tag, and UnlockBufHdr does raw write to state. So we have to
> > > +  * spin if we found buffer locked.
> > 
> > So basically the first half of the paragraph is wrong, because no, we
> > can't?
> 
> Logically, there are no backends that could be interesting in the buffer.
> Physically they do LockBufHdr/UnlockBufHdr just to check they are not 
> interesting.

Yea, but that's still being interested in the buffer...


> > > +  * Note that we write tag unlocked. It is also safe since there is always
> > > +  * check for BM_VALID when tag is compared.
> > 
> > 
> > >*/
> > >   buf->tag = newTag;
> > > - buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> > > -BM_CHECKPOINT_NEEDED | BM_IO_ERROR | 
> > > BM_PERMANENT |
> > > -BUF_USAGECOUNT_MASK);
> > >   if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == 
> > > INIT_FORKNUM)
> > > - buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > > + new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > >   else
> > > - buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > > -
> > > - UnlockBufHdr(buf, buf_state);
> > > + new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > >  
> > > - if (oldPartitionLock != NULL)
> > > + buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> > > + while (unlikely(buf_state & BM_LOCKED))
> > 
> > I don't think it's safe to atomic in arbitrary bits. If somebody else has
> > locked the buffer header in this moment, it'll lead to completely bogus
> > results, because unlocking overwrites concurrently written contents (which
> > there shouldn't be any, but here there are)...
> 
> That is why there is safety loop in the case buf->state were locked just
> after first optimistic atomic_fetch_or. 99.999% times this loop will not
> have a job. But in case other backend did lock buf->state, loop waits
> until it releases lock and retry atomic_fetch_or.

> > And or'ing contents in also doesn't make sense because it doesn't work to
> > actually unset any contents?
> 
> Sorry, I didn't understand sentence :((


You're OR'ing multiple bits into buf->state. LockBufHdr() only ORs in
BM_LOCKED. ORing BM_LOCKED is fine:
Either the buffer is not already locked, in which case it just sets the
BM_LOCKED bit, acquiring the lock. Or it doesn't change anything, because
BM_LOCKED already was set.

But OR'ing in multiple bits is *not* fine, because it'll actually change the
contents of ->state while the buffer header is locked.
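
To make the lost-update hazard concrete, here's a toy standalone program
(plain C11 atomics and made-up flag values, not pg_atomic_* or the real BM_*
definitions) replaying the interleaving: A holds the "header lock" with a
private copy of state, B ORs in extra bits, A's unlock writes the stale copy
back:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_LOCKED		(1u << 0)	/* invented values, illustration only */
#define TOY_TAG_VALID	(1u << 1)
#define TOY_PERMANENT	(1u << 2)

int
main(void)
{
	_Atomic uint32_t state = 0;
	uint32_t	a_state;

	/* backend A: takes the "header lock" and works on a private copy */
	a_state = atomic_fetch_or(&state, TOY_LOCKED) | TOY_LOCKED;

	/* backend B: ORs in new bits without looking at the lock bit */
	atomic_fetch_or(&state, TOY_TAG_VALID | TOY_PERMANENT);

	/* backend A: unlock writes back its (now stale) private copy */
	atomic_store(&state, a_state & ~TOY_LOCKED);

	printf("TOY_TAG_VALID %s\n",
		   (atomic_load(&state) & TOY_TAG_VALID) ? "survived" : "was lost");
	return 0;
}

This prints "was lost": the unlocked OR appears to succeed but is erased by
A's unlock, which is exactly the clobbering described above.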


> > Why don't you just use LockBufHdr/UnlockBufHdr?
> 
> This pair makes two atomic writes to memory. Two writes are heavier than
> one write in this version (if optimistic case succeed).

UnlockBufHdr doesn't use a locked atomic op. It uses a write barrier and an
unlocked write.

Greetings,

Andres Freund




Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 09:33:45 -0800, Jeff Davis wrote:
> On Thu, 2022-02-24 at 20:47 -0500, Tom Lane wrote:
> > ... and, since we can't readily enforce that the client only sends
> > those cleartext passwords over suitably-encrypted connections, this
> > could easily be a net negative for security.  Not sure that I think
> > it's a good idea.
> 
> I don't understand your point. Can't you just use "hostssl" rather than
> "host"?

And the extension could check Port->ssl_in_use before 
sendAuthRequest(AUTH_REQ_PASSWORD)
if it wanted to restrict it.

Greetings,

Andres Freund




Re: convert libpq uri-regress tests to tap test

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-24 08:46:23 -0800, Andres Freund wrote:
> I'm mildly inclined to only do 0001 and 0002 for now. We'd not loose msvc
> coverage, because it already doesn't build the test. Once we've ironed that
> stuff out, we could do 0003?

From what I can see in the buildfarm client, we'd not lose (nor gain) any
buildfarm coverage either. It doesn't run the test today.

Greetings,

Andres Freund




Re: Two noncritical bugs of pg_waldump

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-14 18:18:47 +0900, Kyotaro Horiguchi wrote:
> > If I give an empty file to the tool it complains as the follows.
> > 
> > > pg_waldump: fatal: could not read file "hoge": No such file or directory
> > 
> > No, the file exists.  The cause is it reads uninitialized errno to
> > detect errors from the system call.  read(2) is defined to set errno
> > always when it returns -1 and doesn't otherwise. Thus it seems to me
> > that it is better to check that the return value is less than zero
> > than to clear errno before the call to read().
> 
> So I post a patch contains only the indisputable part.

Thanks for the report and fix. Pushed. This was surprisingly painful, all but
one branch had conflicts...

Greetings,

Andres Freund




Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-22 18:06:24 -0800, Andres Freund wrote:
> On 2022-02-18 15:14:15 -0800, Andres Freund wrote:
> > I'm running out of ideas for how to try to reproduce this. I think we might
> > need some additional debugging information to get more information from the
> > buildfarm.
> 
> > I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing 
> > --verbose
> > to pg_basebackup in $node_primary3->backup(...).
> >
> > It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
> > ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().
> 
> Planning to commit something like the attached.

This did provide us a bit more detail.

Seems to suggest something is holding a problematic lock in a way that I do not 
understand yet:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 
019_replslot_limit.pl LOG:  received replication command: 
CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( 
RESERVE_WAL)
...
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 
019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to 
make
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 
019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 
019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin

last message from 1997084 until the immediate shutdown. Just checking for
temporary replication slots that need to be dropped requires
ReplicationSlotControlLock. Actually dropping also requires
ReplicationSlotAllocationLock.

1997084 does have a temporary replication slot to clean up.


2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:35] 
019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to 
make
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:36] 
019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:37] 
019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:38] 
019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: done

1997083 succeeds in doing the cleanup. It does not have a temporary
replication slot. Making it look like somehow ReplicationSlotAllocationLock
hasn't been released.


2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:39] 
019_replslot_limit.pl DEBUG:  shmem_exit(0): 7 on_shmem_exit callbacks to make
...
2022-02-23 09:09:53.076 EST [2022-02-23 09:09:52 EST 1997072:87] LOG:  received 
immediate shutdown request
...
2022-02-23 09:09:53.095 EST [2022-02-23 09:09:52 EST 1997072:90] DEBUG:  server 
process (PID 1997084) exited with exit code 2

It's *possible*, but not likely, that somehow 1997084 just doesn't get
scheduled for a prolonged amount of time.


We could be more certain if we shut down the cluster in fast rather than
immediate mode. So I'm thinking of doing something like

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
my ($stdout, $stderr);
$node_primary3->psql('postgres',
 "\\a\\t\nSELECT * FROM pg_stat_activity",
 stdout => \$stdout, stderr => \$stderr);
diag $stdout, $stderr;
$node_primary3->stop('fast');
$node_standby3->stop('fast');
die "could not determine walsender pid, can't continue";
}

Does that make sense? Better ideas?

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 16:25:01 -0300, Euler Taveira wrote:
> On Fri, Feb 25, 2022, at 11:52 AM, Greg Stark wrote:
> > On Tue, 25 Jan 2022 at 01:32, Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > I was looking the shared memory stats patch again.
> > 
> > Can you point me to this thread? I looked for it but couldn't find it.

> https://postgr.es/m/20180629.173418.190173462.horiguchi.kyot...@lab.ntt.co.jp

I'll post a rebased version as soon as this is resolved... I have a local one,
but it just works by nuking a bunch of tests / #ifdefing out code related to
this.

Greetings,

Andres Freund




Re: Proposal: Support custom authentication methods using hooks

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 14:10:39 -0500, Tom Lane wrote:
> I'm happy to add support for custom auth methods if they can use
> a protocol that's safer than cleartext-password.  But if that's the
> only feasible option, then we're just encouraging people to use
> insecure methods.

It looks like scram can be used without much trouble. All the necessary
infrastructure to implement it without duplication appears to be public.  The
plugin would need to get a secret from whereever and call
  CheckSASLAuth(&pg_be_scram_mech, port, shadow_pass, logdetail);

or if it needs to do something more complicated than CheckSASLAuth(), it can
use AUTH_REQ_SASL{,_CONT,_FIN} itself.

Of course whether it's viable depends on what the custom auth method wants to
do. But it's not a restriction of the authentication plugin model.

Greetings,

Andres Freund




Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Seems to suggest something is holding a problematic lock in a way that I do 
> > not understand yet:
> 
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
> > 2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 
> > 019_replslot_limit.pl LOG:  received replication command: 
> > CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( 
> > RESERVE_WAL)
> > ...
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 
> > 019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks 
> > to make
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 
> > 019_replslot_limit.pl DEBUG:  replication slot exit hook, without active 
> > slot
> > 2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 
> > 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin
> 
> > last message from 1997084 until the immediate shutdown.
> 
> Hmm.  Maybe put a couple more debug messages into ReplicationSlotCleanup
> and/or ReplicationSlotDropPtr?  It doesn't seem very clear where in that
> sequence it's hanging up.

Yea, was thinking that as well.


I'm also wondering whether it's worth adding an assert, or at least a WARNING,
about no lwlocks being held, at the tail end of ShutdownPostgres?  I don't want to
add an LWLockReleaseAll() yet, before I understand what's actually happening.


> > We could be more certain if we shut down the cluster in fast rather than
> > immediate mode. So I'm thinking of doing something like
> 
> Does that risk an indefinite hangup of the buildfarm run?

I think not. The pg_ctl stop -m fast should time out after PGCTLTIMEOUT,
$self->_update_pid(-1); should notice it's not dead. The END{} block should
then shut it down in immediate mode.

Greetings,

Andres Freund




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-02-25 Thread Andres Freund
On 2022-02-25 20:19:24 +, Jacob Champion wrote:
> From 2fde60a6bc3739f1894c8c264120e4fa0f04df64 Mon Sep 17 00:00:00 2001
> From: Jacob Champion 
> Date: Mon, 14 Feb 2022 08:10:53 -0800
> Subject: [PATCH v3] Add API to retrieve authn_id from SQL

> The authn_id field in MyProcPort is currently only accessible to the
> backend itself.  Add a SQL function, session_authn_id(), to expose the
> field to triggers that may want to make use of it.

Looks to me like authn_id isn't synchronized to parallel workers right now. So
the function will return the wrong thing when executed as part of a parallel
query.

I don't think we should add further functions not prefixed with pg_.


Perhaps a few tests for less trivial authn_ids could be worthwhile?
E.g. certificate DNs.

Greetings,

Andres Freund




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 14:25:52 +0800, Julien Rouhaud wrote:
> > Basically the type of stats we're trying to move to dynamic shared memory is
> > about counters that should persist for a while and are accumulated across
> > connections etc. Whereas backend_status.c is more about tracking the current
> > state (what query is a backend running, what user, etc).
> 
> But would it be acceptable to use dynamic shared memory in backend_status and
> e.g. have a dsa_pointer rather than a fixed length array?

Might be OK, but it does add a fair bit of complexity. Suddenly there's a
bunch more initialization order dependencies that you don't have right
now. I'd not go there for just this.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 14:00:12 -0800, Peter Geoghegan wrote:
> On Thu, Feb 24, 2022 at 11:14 PM Andres Freund  wrote:
> > I am not a fan of the backstop terminology. It's still the reason we need to
> > do freezing for correctness reasons.
> 
> Thanks for the review!
> 
> I'm not wedded to that particular terminology, but I think that we
> need something like it. Open to suggestions.
>
> How about limit-based? Something like that?

freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or
s/limit/below/? I kind of like below because that answers < vs <= which I find
hard to remember around freezing.


> > I'm a tad concerned about replacing mxids that have some members that are
> > older than OldestXmin but not older than FreezeLimit. It's not too hard to
> > imagine that accelerating mxid consumption considerably.  But we can 
> > probably,
> > if not already done, special case that.
> 
> Let's assume for a moment that this is a real problem. I'm not sure if
> it is or not myself (it's complicated), but let's say that it is. The
> problem may be more than offset by the positive impact on relminxmid
> advancement. I have placed a large emphasis on enabling
> relfrozenxid/relminxmid advancement in every non-aggressive VACUUM,
> for a number of reasons -- this is one of the reasons. Finding a way
> for every VACUUM operation to be "vacrel->scanned_pages +
> vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some*
> amount of relfrozenxid/relminxmid advancement possible in every
> VACUUM) has a great deal of value.

That may be true, but I think working more incrementally is better in this
area. I'd rather have a smaller improvement for a release, collect some data,
and get another improvement in the next, than see a bunch of reports of large
wins and large regressions.


> As I said recently on the "do only critical work during single-user
> vacuum?" thread, why should the largest tables in databases that
> consume too many MXIDs do so evenly, across all their tables? There
> are usually one or two large tables, and many more smaller tables. I
> think it's much more likely that the largest tables consume
> approximately zero MultiXactIds in these databases -- actual
> MultiXactId consumption is probably concentrated in just one or two
> smaller tables (even when we burn through MultiXacts very quickly).
> But we don't recognize these kinds of distinctions at all right now.

Recognizing those distinctions seems independent of freezing multixacts with
live members. I am happy with freezing them more aggressively if they don't
have live members. It's freezing mxids with live members that has me
concerned.  The limits you're proposing are quite aggressive and can advance
quickly.

I've seen large tables with plenty of multixacts. Typically concentrated over a
value range (often changing over time).


> Under these conditions, we will have many more opportunities to
> advance relminmxid for most of the tables (including the larger
> tables) all the way up to current-oldestMxact with the patch series.
> Without needing to freeze *any* MultiXacts early (just freezing some
> XIDs early) to get that benefit. The patch series is not just about
> spreading the burden of freezing, so that non-aggressive VACUUMs
> freeze more -- it's also making relfrozenxid and relminmxid more
> recent and therefore *reliable* indicators of which tables any
> wraparound problems *really* are.

My concern was explicitly about the case where we have to create new
multixacts...


> Does that make sense to you?

Yes.


> > > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM 
> > > operation's
> > > OldestXmin at all (it actually just gets FreezeLimit passed as its
> > > cutoff_xid argument). It cannot possibly recognize any of this for itself.
> >
> > It does recognize something like OldestXmin in a more precise and expensive
> > way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().
> 
> It doesn't look that way to me.
> 
> While it's true that FreezeMultiXactId() will call MultiXactIdIsRunning(),
> that's only a cross-check.

> This cross-check is made at a point where we've already determined that the
> MultiXact in question is < cutoff_multi. In other words, it catches cases
> where a "MultiXactId < cutoff_multi" Multi contains an XID *that's still
> running* -- a correctness issue. Nothing to do with being smart about
> avoiding allocating new MultiXacts during freezing, or exploiting the fact
> that "FreezeLimit < OldestXmin" (which is almost always true, very true).

If there

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 15:28:17 -0800, Peter Geoghegan wrote:
> But why should we ever get to the FreezeMultiXactId() loop with the
> stuff from 0002 in place? The whole purpose of the loop is to handle
> cases where we have to remove *some* (not all) XIDs from before
> cutoff_xid that appear in a MultiXact, which requires careful checking
> of each XID (this is only possible when the MultiXactId is <
> cutoff_multi to begin with, which is OldestMxact in the patch, which
> is presumably very recent).
> 
> It's not impossible that we'll get some number of "skewed MultiXacts"
> with the patch -- cases that really do necessitate allocating a new
> MultiXact, just to "freeze some XIDs from a MultiXact". That is, there
> will sometimes be some number of XIDs that are < OldestXmin, but
> nevertheless appear in some MultiXactIds >= OldestMxact. This seems
> likely to be rare with the patch, though, since VACUUM calculates its
> OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi
> really are in the patch) at the same point in time. Which was the
> point I made in my email yesterday.

I don't see why it matters that OldestXmin and OldestMxact are computed at the
same time?  It's a question of the workload, not vacuum algorithm.

OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all
members are older than OldestXmin (not quite true, but that's the bound), and
they always have more than one member.


> How many of these "skewed MultiXacts" can we really expect?

I don't think they're skewed in any way. It's a fundamental aspect of
multixacts.

Greetings,

Andres Freund




Re: Mingw task for Cirrus CI

2022-02-25 Thread Andres Freund
Hi,

Andrew, CCing you both because you might be interested in the CI bit, and
because you might know the answer.

On 2022-02-25 19:44:27 +0300, Melih Mutlu wrote:
> I've been working on adding Windows+MinGW environment into cirrus-ci tasks
> (discussion about ci is here [1]).
> It uses MSYS2 to set the environment. UCRT is chosen as C standard library,
> instead of MSVCRT.
> The task configures postgres with features that are available in MSYS2 (see
> available packages [2]) and tap tests are enabled.
> I already added the necessary docker image, you can find the related PR at
> [3] and a successful cirruc-ci run with these changes at [4].
> Attached patch adds a task runs on Windows with MinGW for cirrus-ci
>
> However, I cannot run configure with --with-python, --with-perl or
> --with-tcl.
> There are two issues I encountered while trying to enable.  e.g. for
> --with-python
>
> 1-  python_ldlibrary is set to "libpython3.9.dll.a". So the related line in
> configure cannot retrieve "ldlibrary"

This presumably is due to using mingw's python rather than python.org's
python. Seems like a reasonable thing to support for the mingw build.

Melih, you could try to build against the python.org python (I installed it in
the CI container).


> 2-  src/pl/plpython/Makefile looks under "C:/Windows/system32/" for
> PYTHONDLL. gendef cannot open that file, give an error like " failed to
> open()" when creating a .def file.

On my win10 VM in which I installed python.org python I don't see a python dll
in c:/windows/system32 either. Looks like none of our windows mingw animals
build with python. So it might just be bitrot.


> In the end, make check-world still fails, even though I was able to run
> configure and make without any obvious error.
> I see bunch of errors in tests like:
> +ERROR:  language "plpython3u" does not exist
> +HINT:  Use CREATE EXTENSION to load the language into the database.

> Here is the logs from failed ci run:
> https://api.cirrus-ci.com/v1/artifact/task/4645682031099904/log/build/src/pl/plpython/regression.diffs

The URL to the rest of the CI run is https://cirrus-ci.com/task/4645682031099904


The relevant failure is earlier:

+ERROR:  could not load library 
"C:/cirrus/build/tmp_install/ucrt64/lib/postgresql/plpython3.dll": The 
specified module could not be found.

Clearly plpython is being built, because we see some warnings:

[22:44:24.456] C:/msys64/ucrt64/include/python3.9/pyconfig.h:1474: warning: 
"SIZEOF_OFF_T" redefined
[22:44:24.456]  1474 | #define SIZEOF_OFF_T 8
[22:44:24.456]   |
[22:44:24.456] In file included from c:/cirrus/src/include/c.h:54,
[22:44:24.456]  from c:/cirrus/src/include/postgres.h:46,
[22:44:24.456]  from 
c:/cirrus/contrib/jsonb_plpython/jsonb_plpython.c:1:
[22:44:24.456] ../../src/include/pg_config.h:875: note: this is the location of 
the previous definition
[22:44:24.456]   875 | #define SIZEOF_OFF_T 4

Seems we're doing something wrong and end up with a 4 byte off_t, whereas
python ends up with an 8byte one. We probably need to fix that. But it's not
the cause of this problem.


You could take out -s from the make flags and see whether plpython3.dll is
being built and installed, and where to.


Greetings,

Andres Freund




Re: convert libpq uri-regress tests to tap test

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 09:56:47 -0800, Andres Freund wrote:
> On 2022-02-24 08:46:23 -0800, Andres Freund wrote:
> > I'm mildly inclined to only do 0001 and 0002 for now. We'd not loose msvc
> > coverage, because it already doesn't build the test. Once we've ironed that
> > stuff out, we could do 0003?
> 
> From what I can see in the buildfarm client, we'd not loose (nor gain) any
> buildfarm coverage either. It doesn't run the test today.

Attached are rebased patches. I polished 0001, the regress.pl -> 001_uri.pl
conversion some more (although some of perltidy's changes aren't clearly an
improvement).

I'd like to commit 0001 and 0002 soon, unless somebody sees a reason not to?

Greetings,

Andres Freund
>From 87017e4b0a78c486eca24aab96f32e1168c82b93 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 23 Feb 2022 12:22:56 -0800
Subject: [PATCH v3 1/3] Convert src/interfaces/libpq/test to a tap test.

The old form of the test needed a bunch of custom infrastructure. These days
tap tests provide the necessary infrastructure to do better.

We discussed whether to move this test to src/test/modules, alongside
libpq_pipeline, but concluded that the opposite direction would be
better. libpq_pipeline will be moved at a later date, once the buildfarm and
msvc build infrastructure is ready for it.

The invocation of the tap test will be added in the next commit. It involves
just enough buildsystem changes to be worth committing separately. Can't happen
the other way round because prove errors out when invoked without tests.

Discussion: https://postgr.es/m/20220223203031.ezrd73ohvjgfk...@alap3.anarazel.de
---
 src/interfaces/libpq/t/001_uri.pl  | 244 +
 src/interfaces/libpq/test/.gitignore   |   2 -
 src/interfaces/libpq/test/Makefile |   7 +-
 src/interfaces/libpq/test/README   |   7 -
 src/interfaces/libpq/test/expected.out | 171 -
 src/interfaces/libpq/test/regress.in   |  57 --
 src/interfaces/libpq/test/regress.pl   |  65 ---
 7 files changed, 246 insertions(+), 307 deletions(-)
 create mode 100644 src/interfaces/libpq/t/001_uri.pl
 delete mode 100644 src/interfaces/libpq/test/README
 delete mode 100644 src/interfaces/libpq/test/expected.out
 delete mode 100644 src/interfaces/libpq/test/regress.in
 delete mode 100644 src/interfaces/libpq/test/regress.pl

diff --git a/src/interfaces/libpq/t/001_uri.pl b/src/interfaces/libpq/t/001_uri.pl
new file mode 100644
index 000..90f370f8fd6
--- /dev/null
+++ b/src/interfaces/libpq/t/001_uri.pl
@@ -0,0 +1,244 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use Test::More;
+use IPC::Run;
+
+
+# List of URIs tests. For each test the first element is the input string, the
+# second the expected stdout and the third the expected stderr.
+my @tests = (
+	[
+		q{postgresql://uri-user:secret@host:12345/db},
+		q{user='uri-user' password='secret' dbname='db' host='host' port='12345' (inet)},
+		q{},
+	],
+	[
+		q{postgresql://uri-user@host:12345/db},
+		q{user='uri-user' dbname='db' host='host' port='12345' (inet)}, q{},
+	],
+	[
+		q{postgresql://uri-user@host/db},
+		q{user='uri-user' dbname='db' host='host' (inet)}, q{},
+	],
+	[
+		q{postgresql://host:12345/db},
+		q{dbname='db' host='host' port='12345' (inet)}, q{},
+	],
+	[ q{postgresql://host/db}, q{dbname='db' host='host' (inet)}, q{}, ],
+	[
+		q{postgresql://uri-user@host:12345/},
+		q{user='uri-user' host='host' port='12345' (inet)},
+		q{},
+	],
+	[
+		q{postgresql://uri-user@host/},
+		q{user='uri-user' host='host' (inet)},
+		q{},
+	],
+	[ q{postgresql://uri-user@},   q{user='uri-user' (local)}, q{}, ],
+	[ q{postgresql://host:12345/}, q{host='host' port='12345' (inet)}, q{}, ],
+	[ q{postgresql://host:12345},  q{host='host' port='12345' (inet)}, q{}, ],
+	[ q{postgresql://host/db}, q{dbname='db' host='host' (inet)},  q{}, ],
+	[ q{postgresql://host/},   q{host='host' (inet)},  q{}, ],
+	[ q{postgresql://host},q{host='host' (inet)},  q{}, ],
+	[ q{postgresql://},q{(local)}, q{}, ],
+	[
+		q{postgresql://?hostaddr=127.0.0.1}, q{hostaddr='127.0.0.1' (inet)},
+		q{},
+	],
+	[
+		q{postgresql://example.com?hostaddr=63.1.2.4},
+		q{host='example.com' hostaddr='63.1.2.4' (inet)},
+		q{},
+	],
+	[ q{postgresql://%68ost/}, q{host='host' (inet)}, q{}, ],
+	[
+		q{postgresql://host/db?user=uri-user},
+		q{user='uri-user' dbname='db' host='ho

Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 12:15:58 -0800, Andres Freund wrote:
> On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > Hmm.  Maybe put a couple more debug messages into ReplicationSlotCleanup
> > and/or ReplicationSlotDropPtr?  It doesn't seem very clear where in that
> > sequence it's hanging up.
> 
> Yea, was thinking that as well.
> 
> 
> I'm also wondering whether it's worth adding an assert, or at least a WARNING,
> about no lwlocks held to the tail end of ShutdownPostgres?  I don't want to
> add an LWLockReleaseAll() yet, before I understand what's actually happening.

Did those things. Sure looks like there's some interaction with
checkpointer. There's a similar sequence in the two failures since the
additional debugging.

2022-02-26 06:35:52.453 CET [6219bc37.8717f:20] DEBUG:  replication slot drop: 
pg_basebackup_553343: removed on-disk
2022-02-26 06:35:52.453 CET [6219bc37.87168:17] DEBUG:  snapshot of 0+0 running 
transaction ids (lsn 0/700060 oldest xid 720 latest complete 719 next xid 720)
2022-02-26 06:35:52.453 CET [6219bc37.87168:18] DEBUG:  begin invalidating 
obsolete replication slots older than 0/70
2022-02-26 06:35:52.453 CET [6219bc37.87168:19] DEBUG:  done invalidating 
obsolete replication slots
2022-02-26 06:35:52.453 CET [6219bc37.87168:20] DEBUG:  attempting to remove 
WAL segments older than log file 0006
2022-02-26 06:35:52.453 CET [6219bc37.87168:21] DEBUG:  removing write-ahead 
log file "00010006"
2022-02-26 06:35:52.453 CET [6219bc37.87168:22] DEBUG:  SlruScanDirectory 
invoking callback on pg_subtrans/
2022-02-26 06:35:52.453 CET [6219bc37.87168:23] LOG:  checkpoint complete: 
wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 
s, sync=0.001 s, total=0.351 s; sync files=0, longest=0.000 s, average=0.000 s; 
distance=1024 kB, estimate=1024 kB
[...]
2022-02-26 06:35:52.970 CET [6219bc37.87139:71] LOG:  received fast shutdown 
request
2022-02-26 06:35:52.970 CET [6219bc37.87139:72] LOG:  aborting any active 
transactions
[...]
2022-02-26 06:35:52.971 CET [6219bc37.87168:24] LOG:  shutting down
2022-02-26 06:35:52.976 CET [6219bc37.8717f:21] DEBUG:  replication slot drop: 
pg_basebackup_553343: done
2022-02-26 06:35:52.976 CET [6219bc37.8717f:22] DEBUG:  temporary replication 
slot cleanup: begin
2022-02-26 06:35:52.976 CET [6219bc37.8717f:23] DEBUG:  temporary replication 
slot cleanup: 0 in use, active_pid: 553385
2022-02-26 06:35:52.976 CET [6219bc37.8717f:24] DEBUG:  temporary replication 
slot cleanup: done
2022-02-26 06:35:52.977 CET [6219bc37.8717f:25] DEBUG:  shmem_exit(0): 7 
on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:26] DEBUG:  proc_exit(0): 3 
callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:27] LOG:  disconnection: session 
time: 0:00:00.996 user=bf database= host=[local]
2022-02-26 06:35:52.977 CET [6219bc37.8717f:28] DEBUG:  exit(0)
2022-02-26 06:35:52.977 CET [6219bc37.8717f:29] DEBUG:  shmem_exit(-1): 0 
before_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:30] DEBUG:  shmem_exit(-1): 0 
on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:31] DEBUG:  proc_exit(-1): 0 
callbacks to make

So the backend (8717f/553343) is in the middle of ReplicationSlotDrop(), after
removing on-disk data. And then for 500ms nothing happens until checkpointer
wakes up again. As soon as it does, the slot drop continues.

Just before calling ReplicationSlotDrop() we were able to acquire
ReplicationSlotControlLock in share mode. Just after the log message after
which there's the delay, ReplicationSlotControlLock is locked in exclusive
mode.

Too tired now..

Greetings,

Andres Freund




Re: Missed condition-variable wakeups on FreeBSD

2022-02-26 Thread Andres Freund
Hi,

On 2022-02-26 14:07:05 -0500, Tom Lane wrote:
> About once a month over the last six months, my buildfarm animal
> florican has gotten stuck while running the core regression tests.
> The symptoms have looked very much the same each time: there is
> a backend with two parallel worker processes that are just sitting
> and not consuming any CPU time.  Each time I've attached to these
> processes with gdb to check their stack traces, and seen pretty
> much the same story every time (traces below).  What is really
> interesting is that after I detach from the second worker, the
> processes resume running and finish out the test successfully!
> I don't know much about how gdb interacts with kernel calls on
> FreeBSD, but I speculate that the poll(2) call returns with EINTR
> after gdb releases the process, and then things resume fine,
> suggesting that we lost an interrupt somewhere.
>
> I have observed this three times in the REL_11 branch, once
> in REL_12, and a couple of times last summer before it occurred
> to me to start keeping notes.  Over that time the machine has
> been running various patchlevels of FreeBSD 13.0.

It's certainly interesting that it appears to happen only in the branches
using poll rather than kqueue to implement latches. That changed between 12
and 13.

Have you tried running the core regression tests with force_parallel_mode =
on, or with the parallel costs lowered, to see if that makes the problem
appear more often?


> Here's the stack trace from the leader process in the most
> recent event (REL_11 as of yesterday).  It's not always the
> same query that gets stuck, but it's always a parallel hash join:

Which does make me wonder if it's a latch issue or a logic issue in hashjoin /
barriers. HJ is the only user of barrier.c...


There's this commit, which subsequently was reverted due to some issue, and
not re-applied since...

commit 4e0f0995e923948631c4114ab353b256b51b58ad
Author: Thomas Munro 
Date:   2021-03-17 17:46:39 +1300
 
Fix race in Parallel Hash Join batch cleanup.

It doesn't seem super likely to be related, but...


> (gdb) p debug_query_string
> $1 = 0x21873090 "select count(*) from simple r join simple s using (id);"
> (gdb) bt
> #0  _poll () at _poll.S:4
> #1  0x21701361 in __thr_poll (fds=0x219dc170, nfds=2, timeout=-1) at 
> /usr/src/lib/libthr/thread/thr_syscalls.c:338
> #2  0x215eaf3f in poll (pfd=0x219dc170, nfds=2, timeout=-1) at 
> /usr/src/lib/libc/sys/poll.c:47
> #3  0x0097b0fd in WaitEventSetWaitBlock (set=, cur_timeout=-1, 
> occurred_events=, nevents=) at latch.c:1171

The next time this happens / if you still have this open, perhaps it could be
worth checking if there's a byte in the self pipe?


> Thoughts?  Ideas on debugging this?

Besides trying to make the issue more likely as suggested above, it might be
worth checking if signalling the stuck processes with SIGUSR1 gets them
unstuck.

Greetings,

Andres Freund




Checkpointer sync queue fills up / loops around pg_usleep() are bad

2022-02-26 Thread Andres Freund
Hi,

In two recent investigations of occasional test failures
(019_replslot_limit.pl failures, AIO rebase) the problems are somehow tied to
checkpointer.

I don't yet know if it's actually causally related to precisely those failures,
but when running e.g. 027_stream_regress.pl, I see phases in which many
backends are looping in RegisterSyncRequest() repeatedly, each time sleeping
with pg_usleep(10000L).
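
The loop in question looks roughly like this (sync.c, slightly simplified):

  for (;;)
  {
      /* ask the checkpointer to absorb the fsync request */
      ret = ForwardSyncRequest(ftag, type);

      /* done if it was queued, or if the caller doesn't want retries */
      if (ret || !retryOnError)
          break;

      /* otherwise sleep and retry - no log message, no wait event */
      pg_usleep(10000L);
  }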


Without adding instrumentation this is completely invisible at any log
level. There's no log messages, there's no wait events, nothing.

ISTM, we should not have any loops around pg_usleep(). And shorter term, we
shouldn't have any loops around pg_usleep() that don't emit log messages / set
wait events. Therefore I propose that we "prohibit" such loops without at
least a DEBUG2 elog() or so. It's just too hard to debug.


The reason for the sync queue filling up in 027_stream_regress.pl is actually
fairly simple:

1) The test runs with shared_buffers = 1MB, leading to a small sync queue of
   128 entries.
2) CheckpointWriteDelay() does pg_usleep(100000L)

ForwardSyncRequest() wakes up the checkpointer using SetLatch() if the sync
queue is more than half full.

But at least on linux and freebsd that doesn't actually interrupt pg_usleep()
anymore (due to using signalfd / kqueue rather than a signal handler). And on
all platforms the signal might arrive just before the pg_usleep() rather than
during, also not causing usleep to be interrupted.

If I shorten the sleep in CheckpointWriteDelay() the problem goes away. This
actually reduces the time for a single run of 027_stream_regress.pl on my
workstation noticably.  With default sleep time it's ~32s, with shortened time
it's ~27s.

I suspect we need to do something about this concrete problem for 14 and
master, because it's certainly worse than before on linux / freebsd.

I suspect the easiest is to just convert that usleep to a WaitLatch(). That'd
require adding a new enum value to WaitEventTimeout in 14. Which probably is
fine?



Greetings,

Andres Freund




Re: Fix a typo in pg_receivewal.c's get_destination_dir()

2022-02-26 Thread Andres Freund
Hi,

On 2022-02-26 22:04:13 +0530, Bharath Rupireddy wrote:
> It looks like there's a typo in pg_receivewal.c's
> get_destination_dir(), that is, use the function parameter dest_folder
> instead of global variable basedir in the error message. Although
> there's no harm in it as basedir is a global variable in
> pg_receivewal.c, let's use the function parameter to be in sync with
> close_destination_dir.
> 
> Attaching a tiny patch to fix it.

Good catch, pushed.

Greetings,

Andres Freund




Re: Adding CI to our tree

2022-02-26 Thread Andres Freund
Hi,

> Subject: [PATCH 2/7] cirrus: upload changed html docs as artifacts

I still think the determination of the base branch needs to be resolved before
this can be considered.


> Always run doc build; to allow them to be shown in cfbot, they should not be
> skipped if the linux build fails.
> 
> This could be done on the client side (cfbot).  One advantage of doing it here
> is that fewer docs are uploaded - many patches won't upload docs at all.

Imo this stuff is largely independent from the commit subject


> XXX: if this is run in the same task, the configure flags should probably be
> consistent ?

What do you mean?


> Subject: [PATCH 3/7] s!build docs as a separate task..

Could you reorder this to earlier, then we can merge it before resolving the
branch issues. And we don't waffle on the CompilerWarnings dependency.


> I believe this'll automatically show up as a separate "column" on the cfbot
> page.

Yup.


> +# Verify docs can be built, and upload changed docs as artifacts
> +task:
> +  name: HTML docs
> +
> +  env:
> +CPUS: 1
> +
> +  only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || 
> $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*(docs|html).*'
> +
> +  container:
> +image: $CONTAINER_REPO/linux_debian_bullseye_ci:latest
> +cpu: $CPUS
> +

how about using something like (the syntax might be slightly off)
  skip: !changesInclude('doc/**')
to avoid running it for the many pushes where no docs are changed?


> +  sysinfo_script: |
> +id
> +uname -a
> +cat /proc/cmdline
> +ulimit -a -H && ulimit -a -S
> +export
> +
> +git remote -v
> +git branch -a
> +git remote add postgres https://github.com/postgres/postgres
> +time git fetch -v postgres master
> +git log -1 postgres/master
> +git diff --name-only postgres/master..

Hardly "sysinfo"?


> Subject: [PATCH 4/7] wip: cirrus: code coverage
> 
> XXX: lcov should be installed in the OS image

FWIW, you can open a PR in https://github.com/anarazel/pg-vm-images/
both the debian docker and VM have their packages installed
via scripts/linux_debian_install_deps.sh


> From 226699150e3e224198fc297689add21bece51c4b Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Sun, 9 Jan 2022 18:25:02 -0600
> Subject: [PATCH 5/7] cirrus/vcregress: test modules/contrib with
>  NO_INSTALLCHECK=1

I don't want to commit the vcregress.pl part myself. But if you split it off I'm
happy to push the --temp-config bits.


> From 08933bcd93d4f57ad73ab6df2f1627b93e61b459 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Sun, 16 Jan 2022 12:51:13 -0600
> Subject: [PATCH 6/7] wip: cirrus/windows: save tmp_install as an artifact..
> 
> ..to allow users to easily download compiled binaries to try a patch.
> If they don't have a development environment handy or not familiar with
> compilation.
> 
> XXX: maybe this should be conditional or commented out ?

Yea, I don't want to do this by default, that's a fair bit of data that very
likely nobody will ever access. One can make entire tasks triggered manually,
but that'd then require building again :/.



> From a7d2bba6f51d816412fb645b0d4821c36ee5c400 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Sun, 20 Feb 2022 15:01:59 -0600
> Subject: [PATCH 7/7] wip: cirrus/windows: add compiler_warnings_script
> 
> I'm not sure how to write this test in windows shell; it's also not easy to
> write it in posix sh, since windows shell is somehow interpretting && and 
> ||...

You could put the script in src/tools/ci and call it from the script to avoid
the quoting issues.

Would be good to add a comment explaining the fileLoggerParameters1 thing and
a warning that compiler_warnings_script should be the last script.


Greetings,

Andres Freund




Re: Adding CI to our tree

2022-02-26 Thread Andres Freund
On 2022-02-26 17:09:08 -0800, Andres Freund wrote:
> You could put the script in src/tools/ci and call it from the script to avoid
> the quoting issues.

Might also be a good idea for the bulk of the docs / coverage stuff, even if
there are no quoting issues.




Re: convert libpq uri-regress tests to tap test

2022-02-26 Thread Andres Freund
Hi,

On 2022-02-25 17:52:29 -0800, Andres Freund wrote:
> I'd like to commit 0001 and 0002 soon, unless somebody sees a reason not to?

Pushed.  Attached is the remainder, 0003, the move of libpq_pipeline to
src/interfaces/libpq that I'm not planning to push for now.

Regards,

Andres
>From 61a89721c8aab3a1fd2300067c470807b4fe87bc Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 24 Feb 2022 08:27:41 -0800
Subject: [PATCH v4] Move libpq_pipeline test into src/interfaces/libpq.

---
 .../libpq/t/002_pipeline.pl}  |  2 +-
 src/interfaces/libpq/test/.gitignore  |  1 +
 src/interfaces/libpq/test/Makefile|  2 +-
 .../libpq/test}/libpq_pipeline.c  |  0
 .../test}/traces/disallowed_in_pipeline.trace |  0
 .../libpq/test}/traces/multi_pipelines.trace  |  0
 .../libpq/test}/traces/nosync.trace   |  0
 .../libpq/test}/traces/pipeline_abort.trace   |  0
 .../libpq/test}/traces/prepared.trace |  0
 .../libpq/test}/traces/simple_pipeline.trace  |  0
 .../libpq/test}/traces/singlerow.trace|  0
 .../libpq/test}/traces/transaction.trace  |  0
 src/test/modules/Makefile |  1 -
 src/test/modules/libpq_pipeline/.gitignore|  5 
 src/test/modules/libpq_pipeline/Makefile  | 25 ---
 src/test/modules/libpq_pipeline/README|  1 -
 16 files changed, 3 insertions(+), 34 deletions(-)
 rename src/{test/modules/libpq_pipeline/t/001_libpq_pipeline.pl => interfaces/libpq/t/002_pipeline.pl} (96%)
 rename src/{test/modules/libpq_pipeline => interfaces/libpq/test}/libpq_pipeline.c (100%)
 rename src/{test/modules/libpq_pipeline => interfaces/libpq/test}/traces/disallowed_in_pipeline.trace (100%)
 rename src/{test/modules/libpq_pipeline => interfaces/libpq/test}/traces/multi_pipelines.trace (100%)
 rename src/{test/modules/libpq_pipeline => interfaces/libpq/test}/traces/nosync.trace (100%)
 rename src/{test/modules/libpq_pipeline => interfaces/libpq/test}/traces/pipeline_abort.trace (100%)
 rename src/{test/modules/libpq_pipeline => interfaces/libpq/test}/traces/prepared.trace (100%)
 rename src/{test/modules/libpq_pipeline => interfaces/libpq/test}/traces/simple_pipeline.trace (100%)
 rename src/{test/modules/libpq_pipeline => interfaces/libpq/test}/traces/singlerow.trace (100%)
 rename src/{test/modules/libpq_pipeline => interfaces/libpq/test}/traces/transaction.trace (100%)
 delete mode 100644 src/test/modules/libpq_pipeline/.gitignore
 delete mode 100644 src/test/modules/libpq_pipeline/Makefile
 delete mode 100644 src/test/modules/libpq_pipeline/README

diff --git a/src/test/modules/libpq_pipeline/t/001_libpq_pipeline.pl b/src/interfaces/libpq/t/002_pipeline.pl
similarity index 96%
rename from src/test/modules/libpq_pipeline/t/001_libpq_pipeline.pl
rename to src/interfaces/libpq/t/002_pipeline.pl
index 0c164dcaba5..2e288d8ba7c 100644
--- a/src/test/modules/libpq_pipeline/t/001_libpq_pipeline.pl
+++ b/src/interfaces/libpq/t/002_pipeline.pl
@@ -49,7 +49,7 @@ for my $testname (@tests)
 		my $expected;
 		my $result;
 
-		$expected = slurp_file_eval("traces/$testname.trace");
+		$expected = slurp_file_eval("test/traces/$testname.trace");
 		next unless $expected ne "";
 		$result = slurp_file_eval($traceout);
 		next unless $result ne "";
diff --git a/src/interfaces/libpq/test/.gitignore b/src/interfaces/libpq/test/.gitignore
index 5e803d8816a..e24d7f64dc3 100644
--- a/src/interfaces/libpq/test/.gitignore
+++ b/src/interfaces/libpq/test/.gitignore
@@ -1 +1,2 @@
 /uri-regress
+/libpq_pipeline
diff --git a/src/interfaces/libpq/test/Makefile b/src/interfaces/libpq/test/Makefile
index 54212159065..9f99309653f 100644
--- a/src/interfaces/libpq/test/Makefile
+++ b/src/interfaces/libpq/test/Makefile
@@ -11,7 +11,7 @@ endif
 override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
 LDFLAGS_INTERNAL += $(libpq_pgport)
 
-PROGS = uri-regress
+PROGS = uri-regress libpq_pipeline
 
 all: $(PROGS)
 
diff --git a/src/test/modules/libpq_pipeline/libpq_pipeline.c b/src/interfaces/libpq/test/libpq_pipeline.c
similarity index 100%
rename from src/test/modules/libpq_pipeline/libpq_pipeline.c
rename to src/interfaces/libpq/test/libpq_pipeline.c
diff --git a/src/test/modules/libpq_pipeline/traces/disallowed_in_pipeline.trace b/src/interfaces/libpq/test/traces/disallowed_in_pipeline.trace
similarity index 100%
rename from src/test/modules/libpq_pipeline/traces/disallowed_in_pipeline.trace
rename to src/interfaces/libpq/test/traces/disallowed_in_pipeline.trace
diff --git a/src/test/modules/libpq_pipeline/traces/multi_pipelines.trace b/src/interfaces/libpq/test/traces/multi_pipelines.trace
similarity index 100%
rename from src/test/modules/libpq_pipeline/traces/multi_pipelines.trace
rename to src/interfaces/libpq/test/traces/multi_pipelines.trace
diff --git a/src/test/modules/libpq_pipeline/traces/nosync.trace b/src/interfaces/libpq/test/traces

Re: Adding CI to our tree

2022-02-26 Thread Andres Freund
Hi,

On 2022-02-26 20:43:52 -0600, Justin Pryzby wrote:
> This doesn't do the right thing - I just tried.
> https://cirrus-ci.org/guide/writing-tasks/#environment-variables
> | changesInclude function can be very useful for skipping some tasks when no 
> changes to sources have been made since the last successful Cirrus CI build.

> That means it will not normally rebuild docs (and then this still requires
> resolving the "base branch").

Why would we want to rebuild docs if they're the same as in the last build for
the same branch?  For cfbot purposes each commit is independent from the prior
commit, so it should rebuild it every time if the CF entry has changes to the
docs.

Greetings,

Andres Freund




Re: Adding CI to our tree

2022-02-26 Thread Andres Freund
Hi,

On 2022-02-26 21:10:57 -0600, Justin Pryzby wrote:
> I did git commit --amend --no-edit and repushed to github to trigger a new CI
> run, and it did this: https://github.com/justinpryzby/postgres/runs/5347878714
>
> This is in a branch with changes to doc.  I wasn't intending it to skip
> building docs on this branch just because the same, changed docs were
> previously built.

But why not? If nothing in docs/ changes, there's little point? It'd probably
be good to check .cirrus.yml and docs/**, to make sure that .cirrus logic
changes rerun as well.


> Why wouldn't the docs be built following the same logic as the rest of the
> sources?

Tests have a certain rate of spurious failure, so rerunning them makes
sense. But what do we gain by rebuilding the docs? And especially, what do we
gain by uploading the docs e.g. in the postgres/postgres repo?


> If someone renames or removes an xref target, shouldn't CI fail on its next
> run for a patch which tries to reference it ?

Why wouldn't it?


> It would fail on the buildfarm, and I think one major use for the CI is to
> minimize the post-push cleanup cycles.

I personally see it more as access to a "mini buildfarm" before patches are
committable, but that's of course compatible.


> Are you sure about cfbot ?  AIUI cirrus would see that docs didn't change
> relative to the previous run for branch: commitfest/NN/.

Not entirely sure, but it's what I remember observing when amending commits
in a repo using changesInclude. If I were to build a feature like it I'd look
at the list of files of
  git diff $(git merge-base last_green new_commit)..new_commit

or such.  cfbot doesn't commit incrementally but replaces the prior commit, so
I suspect it'll always be viewed as new. But who knows, shouldn't be hard to
figure out.

Greetings,

Andres Freund




Re: [Proposal] Global temporary tables

2022-02-26 Thread Andres Freund
Hi,

On 2022-02-27 04:17:52 +0100, Pavel Stehule wrote:
> > You redirect stats from pg_class and pg_statistics to a local hash table.
> > This is pretty hairy :(

As is I think the patch is architecturally completely unacceptable. Having
code everywhere to redirect to manually written in-memory catalog table code
isn't maintainable.


> > I guess you'd also need to handle pg_statistic_ext and ext_data.
> > pg_stats doesn't work, since the data isn't in pg_statistic - it'd need to
> > look
> > at pg_get_gtt_statistics.
>
> Without this, the GTT will be terribly slow like current temporary tables
> with a lot of problems with bloating of pg_class, pg_attribute and
> pg_depend tables.

I think it's not a great idea to solve multiple complicated problems at
once...

Greetings,

Andres Freund




Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

2022-02-27 Thread Andres Freund
Hi, 

On February 27, 2022 4:19:21 PM PST, Thomas Munro  
wrote:
>With the attached, 027_stream_regress.pl drops from ~29.5s to ~19.6s
>on my FreeBSD workstation!

That's impressive - wouldn't have guessed it to make that much of a difference. 
I assume that running the tests on freebsd for an older pg with a similar s_b & 
max_wal_size  doesn't benefit as much? I wonder how much windows will improve.


>It seems a little strange to introduce a new wait event that will very
>often appear into a stable branch, but ... it is actually telling the
>truth, so there is that.

In the back branches it needs to be at the end of the enum - I assume you 
intended that just to be for HEAD.

I wonder whether in HEAD we shouldn't make that sleep duration be computed from 
the calculation in IsOnSchedule...


>The sleep/poll loop in RegisterSyncRequest() may also have another
>problem.  The comment explains that it was a deliberate choice not to
>do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
>think there's an excuse to ignore postmaster death in a loop that
>presumably becomes infinite if the checkpointer exits.  I guess we
>could do:
>
>-   pg_usleep(10000L);
>+   WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
>WAIT_EVENT_SYNC_REQUEST);
>
>But... really, this should be waiting on a condition variable that the
>checkpointer broadcasts on when the queue goes from full to not full,
>no?  Perhaps for master only?

Looks worth improving, but yes, I'd not do it in the back branches. 

I do think it's worth giving that sleep a proper wait event though, even in the 
back branches.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: convert libpq uri-regress tests to tap test

2022-03-01 Thread Andres Freund
Hi, 

On March 1, 2022 7:44:27 AM PST, Andrew Dunstan  wrote:
>
>On 2/24/22 07:19, Andrew Dunstan wrote:
>> On 2/23/22 20:52, Tom Lane wrote:
>>> Peter Eisentraut  writes:
 On 23.02.22 23:58, Tom Lane wrote:
> Peter Eisentraut  writes:
>> libpq TAP tests should be in src/interfaces/libpq/t/.
> That's failing to account for the fact that a libpq test can't
> really be a pure-perl TAP test; you need some C code to drive the
> library.
 Such things could be put under src/interfaces/libpq/test, or some other 
 subdirectory.  We already have src/interfaces/ecpg/test.
>>> OK, but then the TAP scripts are under src/interfaces/libpq/test/t,
>>> which isn't what you said.  I have no great objection to moving
>>> src/test/modules/libpq_pipeline/ to src/interfaces/libpq/test/,
>>> though, as long as the buildfarm will cope.
>>>
>>> 
>>
>> It won't without some adjustment.
>
>
>
>See
>
>and
>

Thanks! 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Proposal: Support custom authentication methods using hooks

2022-03-01 Thread Andres Freund
Hi,

On 2022-02-28 11:26:06 +0100, Peter Eisentraut wrote:
> We already have a variety of authentication mechanisms that support central
> management: LDAP, PAM, Kerberos, Radius.

LDAP, PAM and Radius all require cleartext passwords, so aren't a great
solution based on the concerns voiced in this thread. IME Kerberos is
operationally too complicated to really be used, unless it's already part of
the operating environment.


> What other mechanisms are people thinking about implementing using these
> hooks?

The cases I've heard about are about centralizing auth across multiple cloud
services. Including secret management in some form. E.g. allowing an
application to auth to postgres, redis and having the secret provided by
infrastructure, rather than having to hardcode it in config or such.

I can't see application developers configuring kerberos and I don't think
LDAP, PAM, Radius are a great answer either, due to the plaintext requirement
alone? LDAP is pretty clearly dying technology, PAM is fragile complicated C
stuff that's not portable across OSs. Radius is probably the most realistic,
but at least as implemented doesn't seem flexible enough (e.g. no access to
group memberships etc).

Nor does baking stuff like that in seem realistic to me, it'll presumably be
too cloud provider specific.

Greetings,

Andres Freund




Re: Proposal: Support custom authentication methods using hooks

2022-03-01 Thread Andres Freund
Hi,

On 2022-02-25 13:40:54 -0500, Jonathan S. Katz wrote:
> On 2/25/22 12:39 PM, Tom Lane wrote:
> > My point is that sending cleartext passwords over the wire is an
> > insecure-by-definition protocol that we shouldn't be encouraging
> > more use of.
> 
> This is my general feeling as well. We just spent a bunch of effort adding,
> refining, and making SCRAM the default method. I think doing anything that
> would drive more use of sending plaintext passwords, even over TLS, is
> counter to that.

I want to again emphasize that, as proposed, a custom auth method can use
SCRAM if relevant for it, with a small amount of code. So the whole plaintext
discussion seems independent.

Samay, what do you think about updating the test plugin to do SCRAM instead of
plaintext, just to highlight that fact?

Greetings,

Andres Freund




Re: Checkpointer sync queue fills up / loops around pg_usleep() are bad

2022-03-01 Thread Andres Freund
Hi,

On 2022-03-02 06:46:23 +1300, Thomas Munro wrote:
> > I do think it's worth giving that sleep a proper wait event though, even in 
> > the back branches.
> 
> I'm thinking that 0002 should be back-patched all the way, but 0001
> could be limited to 14.

No strong opinion on how far back to backpatch. It'd be nice to have a proper
wait event everywhere, but especially < 12 it looks different enough to be
some effort.


> From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Mon, 28 Feb 2022 11:27:05 +1300
> Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().

LGTM. Would be nice to have this fixed soon, even if it's just to reduce test
times :)



> From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Tue, 1 Mar 2022 11:38:27 +1300
> Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().

LGTM.


> From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Tue, 1 Mar 2022 17:34:43 +1300
> Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
>  is full.
> 
> Previously, in the (hopefully) rare case that we need to wait for the
> checkpointer to create space in the sync request queue, we'd enter a
> sleep/retry loop.  Instead, create a condition variable so the
> checkpointer can wake us up whenever there is a transition from 'full'
> to 'not full'.


> @@ -1076,10 +1078,11 @@ RequestCheckpoint(int flags)
>   * to perform its own fsync, which is far more expensive in practice.  It
>   * is theoretically possible a backend fsync might still be necessary, if
>   * the queue is full and contains no duplicate entries.  In that case, we
> - * let the backend know by returning false.
> + * let the backend know by returning false, or if 'wait' is true, then we
> + * wait for space to become available.
>   */
>  bool
> -ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> +ForwardSyncRequest(const FileTag *ftag, SyncRequestType type, bool wait)
>  {
>   CheckpointerRequest *request;
>   booltoo_full;
> @@ -1101,9 +1104,9 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType 
> type)
>* backend will have to perform its own fsync request.  But before 
> forcing
>* that to happen, we can try to compact the request queue.
>*/
> - if (CheckpointerShmem->checkpointer_pid == 0 ||
> - (CheckpointerShmem->num_requests >= 
> CheckpointerShmem->max_requests &&
> -  !CompactCheckpointerRequestQueue()))
> + if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests 
> &&
> + !CompactCheckpointerRequestQueue() &&
> + !wait)

Bit confused about the addition of the wait parameter / removal of the
CheckpointerShmem->checkpointer_pid check. What's that about?


> + /*
> +  * If we still don't have enough space and the caller asked us to wait,
> +  * wait for the checkpointer to advertise that there is space.
> +  */
> + if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests)
> + {
> + 
> ConditionVariablePrepareToSleep(&CheckpointerShmem->requests_not_full_cv);
> + while (CheckpointerShmem->num_requests >=
> +CheckpointerShmem->max_requests)
> + {
> + LWLockRelease(CheckpointerCommLock);
> + 
> ConditionVariableSleep(&CheckpointerShmem->requests_not_full_cv,
> +
> WAIT_EVENT_FORWARD_SYNC_REQUEST);
> + LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
> + }
> + ConditionVariableCancelSleep();
> + }

Could there be a problem with a lot of backends trying to acquire
CheckpointerCommLock in exclusive mode? I'm inclined to think it's rare enough
to not worry.

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-03-01 Thread Andres Freund
Hi,

On 2022-03-02 07:35:46 +0530, Amit Kapila wrote:
> Pushed.

Thanks!

Working on rebasing shared memory stats over this. Feels *much* better so far.

While rebasing, I was wondering why pgstat_reset_subscription_counter() has
"all subscription counters" support. We don't have a function to reset all
function stats or such either.

I'm asking because support for that is what currently prevents sub stats from
using the more general function for reset. It's an acceptable amount of code,
but if we don't really need it I'd rather not have it / add it in a more
general way if we want it.

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-03-01 Thread Andres Freund
Hi,

On 2022-03-02 12:39:57 +0530, Amit Kapila wrote:
> On Wed, Mar 2, 2022 at 10:39 AM Andres Freund  wrote:
> >
> > Working on rebasing shared memory stats over this. Feels *much* better so 
> > far.
> >
> 
> Good to hear that it helps. BTW, there is another patch [1] that
> extends this view. I think that patch is still not ready but once it
> is ready (which I expect to happen sometime in this CF), it might be
> good if you would be able to check whether it has any major problem
> with your integration.

I skimmed it briefly, and I don't see an architectural conflict. I'm not
convinced it's worth adding that information, but that's a separate
discussion.


> > While rebasing, I was wondering why pgstat_reset_subscription_counter() has
> > "all subscription counters" support. We don't have a function to reset all
> > function stats or such either.
> >
> 
> We have similar thing for srlu (pg_stat_reset_slru) and slots
> (pg_stat_reset_replication_slot).

Neither should have been added imo. We're already at 9 different reset
functions, yet there is no unified function to reset all stats, which is pretty
much the only actually relevant operation. And having pg_stat_reset_shared()
with variable 'reset' systems but then also pg_stat_reset_slru() and
pg_stat_reset_subscription_stats() is absurd.

This is just making something incomprehensible evermore incomprehensible.


> For functions and tables, one can use pg_stat_reset.

Not in isolation.

Greetings,

Andres Freund




Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set

2022-03-02 Thread Andres Freund
On 2022-02-15 13:02:41 +0900, Michael Paquier wrote:
> >> +  @regress_command = (@regress_command, @extra_opts);
> >> +
> >> +  $oldnode->command_ok(@regress_command,
> >> +  'regression test run on old instance');
> > 
> > I also think this should take EXTRA_REGRESS_OPTS into account - test.sh did.
> 
> This is already taken into account, as of the @extra_opts bits.

But in a bad way, because EXTRA_REGRESS_OPTS now always wins, even for stuff
we want to override. Note how test.sh explicitly specifies port, bindir etc
after the pre-existing EXTRA_REGRESS_OPTS.




Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set

2022-03-02 Thread Andres Freund
Hi,

On 2022-03-02 15:57:23 +0900, Michael Paquier wrote:
> Do others have an opinion about a backpatch of the bugfix?  Nobody has
> complained about that since pg_upgrade exists, so I have just done the
> change on HEAD.

WFM.



> +++ b/src/bin/pg_upgrade/t/001_basic.pl
> @@ -0,0 +1,9 @@
> +use strict;
> +use warnings;
> +
> +use PostgreSQL::Test::Utils;
> +use Test::More tests => 8;

Outdated.

> +program_help_ok('pg_upgrade');
> +program_version_ok('pg_upgrade');
> +program_options_handling_ok('pg_upgrade');

Unrelated. But I kinda wish we'd do this in a saner manner than copying this
test into every binary. E.g. by ensuring that all tools installed in the temp
install are tested or such.


> +# The test of pg_upgrade consists in setting up an instance.  This is the
> +# source instance used for the upgrade. Then a new and fresh instance is
> +# created, and is used as the target instance for the upgrade.

This seems a bit repetitive. Lots of "instance".

> Before
> +# running an upgrade, a logical dump of the old instance is taken, and a
> +# second logical dump of the new instance is taken after the upgrade.
> +# The upgrade test passes if there are no differences in these two dumps.
> +
> +# Testing upgrades with an older instance of PostgreSQL requires setting up
> +# two environment variables, as of:
> +# - "olddump", to point to a dump file that will be used to set
> +#   up the old instance to upgrade from, the dump being restored in the
> +#   old cluster.
> +# - "oldinstall", to point to the installation path of the old
> +#   instance.
> +if (   (defined($ENV{olddump}) && !defined($ENV{oldinstall}))
> + || (!defined($ENV{olddump}) && defined($ENV{oldinstall})))

Odd indentation. Spaces between parens?


> +$newnode->init(extra => [ '--wal-segsize', '1', '--allow-group-access' ]);

I'd copy the comments from test.sh wrt --wal-segsize,
--allow-group-access.


Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-03-02 Thread Andres Freund
Hi,

On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> - I am having some trouble understanding clearly what 0001 is doing.
> I'll try to study it further.

It tests for the various scenarios I could think of that could lead to FD
reuse, to state the obvious ;). Anything particularly unclear?


> - In general, 0003 makes a lot of sense to me.
> 
> +   /*
> +* Finally tell the kernel to write the data to
> storage. Don't smgr if
> +* previously closed, otherwise we could end up evading 
> fd-reuse
> +* protection.
> +*/
> 
> - I think the above hunk is missing a word, because it uses smgr as a
> verb. I also think that it's easy to imagine this explanation being
> insufficient for some future hacker to understand the issue.

Yea, it's definitely not sufficient or grammatically correct. I think I wanted
to send something out, but was very tired by that point..


> - While 0004 seems useful for testing, it's an awfully big hammer. I'm
> not sure we should be thinking about committing something like that,
> or at least not as a default part of the build. But ... maybe?

Agreed. I think it's racy anyway, because further files could get deleted
(e.g. segments > 0) after the barrier has been processed.


What I am stuck on is what we can do for the released branches. Data
corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like
something we need to address.

Greetings,

Andres Freund




Re: Proposal: Support custom authentication methods using hooks

2022-03-02 Thread Andres Freund
Hi,

On 2022-03-02 09:32:26 +0100, Peter Eisentraut wrote:
> On 01.03.22 22:34, Andres Freund wrote:
> > The cases I've heard about are about centralizing auth across multiple cloud
> > services. Including secret management in some form. E.g. allowing an
> > application to auth to postgres, redis and having the secret provided by
> > infrastructure, rather than having to hardcode it in config or such.
> > 
> > I can't see application developers configuring kerberos and I don't think
> > LDAP, PAM, Radius are a great answer either, due to the plaintext 
> > requirement
> > alone? LDAP is pretty clearly dying technology, PAM is fragile complicated C
> > stuff that's not portable across OSs. Radius is probably the most realistic,
> > but at least as implemented doesn't seem flexible enough (e.g. no access to
> > group memberships etc).
> > 
> > Nor does baking stuff like that in seem realistic to me, it'll presumably be
> > too cloud provider specific.
> 
> Let's gather some more information on this.  PostgreSQL should support the
> authentication that many people want to use out of the box.  I don't think
> it would be good to be at a point where all the built-in methods are
> outdated and if you want to use the good stuff you have to hunt for plugins.
> The number of different cloud APIs is effectively small.  I expect that
> there are a lot of similarities, like they probably all need support for
> http calls, they might need support for caching lookups, etc.  OIDC was
> mentioned elsewhere.  That's a standard.  Is that compatible with any cloud
> providers?  Would that be enough for many users?

I'm not opposed to putting something into the source tree eventually, if we
can figure out an overlapping set of capabilities that's useful enough. But I
don't see that as a reason to not make auth extensible? If anything, the
contrary - it's going to be much easier to figure out what this should look
like if it can be iteratively worked on with unmodified postgres.

Even if we were to put something in core, it seems that contrib/ would be a
better place than auth.c for something like it.

I think we should consider moving pam, ldap, radius to contrib
eventually. That way we'd not need to entirely remove them, but a "default"
postgres wouldn't have support for auth methods requiring plaintext secret
transmission.

Greetings,

Andres Freund




Re: Proposal: Support custom authentication methods using hooks

2022-03-02 Thread Andres Freund
Hi,

On 2022-03-02 15:26:32 -0500, Stephen Frost wrote:
> Part of the point, for my part anyway, of dropping support for plaintext
> transmission would be to remove support for that from libpq, otherwise a
> compromised server could still potentially convince a client to provide
> a plaintext password be sent to it.

IMO that's an argument for an opt-in option to permit plaintext, not an
argument for removal of the code alltogether. Even that will need a long
transition time, because it's effectively a form of an ABI break. Upgrading
libpq will suddenly cause applications to stop working. So adding an opt-out
option to disable plaintext is the next step...

I don't think it makes sense to discuss this topic as part of this thread
really. It seems wholly independent of making authentication pluggable.


> I also just generally disagree with the idea that it makes sense for
> these things to be in contrib.  We should be dropping them because
> they're insecure- moving them to contrib doesn't change the issue that
> we're distributing authentication solutions that send (either through an
> encrypted tunnel, or not, neither is good) that pass plaintext passwords
> around.

Shrug. I don't think it's a good idea to just leave people hanging without a
replacement. It's OK to make it a bit harder and require explicit
configuration, but dropping support for reasonable configurations IMO is
something we should be very hesitant to do.

Greetings,

Andres Freund




Re: [Proposal] Global temporary tables

2022-03-02 Thread Andres Freund
Hi,

On 2022-02-27 06:09:54 +0100, Pavel Stehule wrote:
> On Sun 27. 2. 2022 at 5:13, Andres Freund  wrote:
> > On 2022-02-27 04:17:52 +0100, Pavel Stehule wrote:
> > > Without this, the GTT will be terribly slow like current temporary tables
> > > with a lot of problems with bloating of pg_class, pg_attribute and
> > > pg_depend tables.
> >
> > I think it's not a great idea to solve multiple complicated problems at
> > once...

> I thought about this issue for a very long time, and I didn't find anything
> better (without more significantly rewriting pg storage). In a lot of
> projects that I know, temporary tables are strictly prohibited due to the
> possible devastating impact on system catalog bloat.  It is a serious
> problem. So any implementation of GTT should solve these questions: a) how
> to reduce catalog bloating, b) how to allow session-related statistics for
> GTT. I agree that an implementation of GTT as template-based LTT (local
> temporary tables) can be very simple (it is possible as an extension), but
> with the same unhappy performance impacts.

> I don't say the current design should be accepted without any discussions
> and without changes. Maybe GTT based on LTT can be better than nothing
> (what we have now), and can be good enough for a lot of projects where the
> load is not too high (and almost all projects have low load).

I think there's just no way that it can be merged with anything close to the
current design - it's unmaintainable. The need for the feature doesn't change
that.

That's not to say it's impossible to come up with a workable design. But it's
definitely not easy. If I were to work on this - which I am not planning to -
I'd try to solve the problems of "LTT" first, with an eye towards using the
infrastructure for GTT.

I think you'd basically have to come up with a generic design for partitioning
catalog tables into local / non-local storage, without needing explicit code
for each catalog. That could also be used to store the default catalog
contents separately from user defined ones (e.g. pg_proc is pretty large).
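
For illustration only, a rough SQL-level sketch of that idea - none of these
relations exist, and a real design would live in generic catalog access code
rather than in views:

-- Hypothetical sketch: one persistent part plus one session-local part per
-- catalog, with lookups reading the union of the two.
CREATE TABLE catalog_shared (relname name, relnamespace oid, relkind "char");
CREATE TEMP TABLE catalog_local (LIKE catalog_shared);

CREATE TEMP VIEW catalog_all AS
    SELECT * FROM catalog_shared
    UNION ALL
    SELECT * FROM catalog_local;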

Greetings,

Andres Freund




Re: Add 64-bit XIDs into PostgreSQL 15

2022-03-02 Thread Andres Freund
Hi,

On 2022-03-02 16:25:33 +0300, Aleksander Alekseev wrote:
> I agree with Bruce it would be great to deliver this in PG15.

> Please let me know if you believe it's unrealistic for any reason so I will
> focus on testing and reviewing other patches.

I don't see 15 as a realistic target for this patch. There are huge amounts of
work left, and it has gotten very little review.

I encourage trying to break the patch down into smaller, incrementally useful
pieces. E.g. making all the SLRUs 64-bit would be a substantial and
independently committable piece.

Greetings,

Andres Freund




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-03-02 Thread Andres Freund
Hi,

On 2022-03-01 08:35:27 -0500, Stephen Frost wrote:
> I'm not really sure why we're arguing about this, but clearly the authn
> ID of the leader process is what should be used because that's the
> authentication under which the parallel worker is running, just as much
> as the effective role is the authorization.  Having this be available in
> worker processes would certainly be good as it would allow more query
> plans to be considered when these functions are used.  At this time, I
> don't think that outweighs the complications around having it and I'm
> not suggesting that Jacob needs to go do that, but surely it would be
> better.

I don't think we should commit this without synchronizing the authn_id between
worker and leader (in a separate commit). It's too likely that some function
that's marked parallel safe queries the authn_id, opening up a
security/monitoring hole or such because of a bogus return value.
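
To illustrate the risk, a small sketch - the accessor name here is an
assumption based on the patch, not necessarily what gets committed:

-- Hypothetical: a PARALLEL SAFE wrapper around the patch's authn_id accessor.
CREATE FUNCTION current_authn() RETURNS text
    LANGUAGE sql PARALLEL SAFE
    AS 'SELECT pg_session_authn_id()';

-- If this runs under a parallel plan, workers evaluating current_authn()
-- would return a bogus (NULL) value unless the leader's authn_id is copied
-- over to them.
SELECT current_authn(), count(*) FROM pg_class GROUP BY 1;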

Greetings,

Andres Freund




Re: Design of pg_stat_subscription_workers vs pgstats

2022-03-02 Thread Andres Freund
Hi,

On 2022-02-25 11:32:24 -0800, Andres Freund wrote:
> On 2022-02-25 16:25:01 -0300, Euler Taveira wrote:
> > On Fri, Feb 25, 2022, at 11:52 AM, Greg Stark wrote:
> > > On Tue, 25 Jan 2022 at 01:32, Andres Freund  wrote:
> > > >
> > > > Hi,
> > > >
> > > > I was looking at the shared memory stats patch again.
> > > 
> > > Can you point me to this thread? I looked for it but couldn't find it.
> 
> > https://postgr.es/m/20180629.173418.190173462.horiguchi.kyot...@lab.ntt.co.jp
> 
> I'll post a rebased version as soon as this is resolved... I have a local one,
> but it just works by nuking a bunch of tests / #ifdefing out code related to
> this.

Now that the pg_stat_subscription_workers changes have been committed, I've
posted a rebased version to the above thread.

Greetings,

Andres Freund




Re: casting operand to proper type in BlockIdGetBlockNumber

2022-03-03 Thread Andres Freund
Hi,

On 2022-03-03 12:13:40 -0500, Tom Lane wrote:
> Zhihong Yu  writes:
> > On Thu, Mar 3, 2022 at 8:24 AM Tom Lane  wrote:
> >> Oh, I misread this as a compile-time warning, but it must be from ASAN.
> >> Was the test case one of your own, or just our normal regression tests?
> 
> > The Jenkins test is ported from contrib/postgres_fdw/sql/postgres_fdw.sql -
> > so theoretically PG would see the same error for clang12 on Alma.
> 
> Hmph.  I tried enabling -fsanitize=undefined here, and I get some
> complaints about passing null pointers to memcmp and the like, but
> nothing about this shift (tested with clang 12.0.1 on RHEL8 as well
> as clang 13.0.0 on Fedora 35).

We should fix these passing-null-pointer cases...


> What compiler switches are being used exactly?

FWIW, I've successfully used:
-Og -fsanitize=alignment,undefined -fno-sanitize=nonnull-attribute 
-fno-sanitize=float-cast-overflow -fno-sanitize-recover=all

Need to manually add -ldl, because -fsanitize breaks our dl test (it uses
dlopen, but not dlsym). Was planning to submit a fix for that...
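
For reference, one way to wire that up when running configure (passing -ldl
via LIBS is just one possible way to add it manually):

SAN='-fsanitize=alignment,undefined -fno-sanitize=nonnull-attribute'
SAN="$SAN -fno-sanitize=float-cast-overflow -fno-sanitize-recover=all"
./configure CC=clang CFLAGS="-Og $SAN" LIBS='-ldl'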

Greetings,

Andres Freund




Re: casting operand to proper type in BlockIdGetBlockNumber

2022-03-03 Thread Andres Freund
Hi,

On 2022-03-03 12:45:22 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-03-03 12:13:40 -0500, Tom Lane wrote:
> >> Hmph.  I tried enabling -fsanitize=undefined here, and I get some
> >> complaints about passing null pointers to memcmp and the like, but
> >> nothing about this shift (tested with clang 12.0.1 on RHEL8 as well
> >> as clang 13.0.0 on Fedora 35).
> 
> > We should fix these passing-null-pointer cases...
> 
> Yeah, working on that now.  But I'm pretty confused about why I can't
> duplicate this shift complaint.  Alma is a Red Hat clone no?  Why
> doesn't its compiler act the same as RHEL8's?

I didn't see that either. It could be a question of building with full
optimizations / asserts vs without?


> > Need to manually add -ldl, because -fsanitize breaks our dl test (it uses
> > dlopen, but not dlsym). Was planning to submit a fix for that...
> 
> Hmm ... didn't get through check-world yet, but I don't see that
> so far.

Oh, for me it doesn't even build. Perhaps one of the dependencies injects it
as well?

Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-03 Thread Andres Freund
Hi,

On 2022-03-03 13:11:17 -0500, Robert Haas wrote:
> On Wed, Mar 2, 2022 at 3:00 PM Andres Freund  wrote:
> > On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> > > - I am having some trouble understanding clearly what 0001 is doing.
> > > I'll try to study it further.
> >
> > It tests for the various scenarios I could think of that could lead to FD
reuse, to state the obvious ;). Anything particularly unclear?
> 
> As I understand it, the basic idea of this test is:
> 
> 1. create a database called 'conflict_db' containing a table called 'large'
> 2. also create a table called 'replace_sb' in the postgres db.
> 3. have a long running transaction open in 'postgres' database that
> repeatedly accesses the 'replace_sb' table to evict the 'conflict_db'
> table, sometimes causing write-outs
> 4. try to get it to write out the data to a stale file descriptor by
> fiddling with various things, and then detect the corruption that
> results
> 5. use both a primary and a standby not because they are intrinsically
> necessary but because you want to test that both cases work

Right.


I wasn't proposing to commit the test as-is, to be clear. It was meant as a
demonstration of the types of problems I can see, and what's needed to fix
them...
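
As a heavily simplified, hypothetical sketch of steps 1-3 from the summary
above (not the posted test, and without the standby or the FD-reuse
fiddling):

# Hypothetical sketch; database/table names follow the summary above.
use strict;
use warnings;
use PostgreSQL::Test::Cluster;

my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init;
$primary->append_conf('postgresql.conf', 'shared_buffers = 1MB');
$primary->start;

$primary->safe_psql('postgres', 'CREATE DATABASE conflict_db');
$primary->safe_psql('conflict_db',
	'CREATE TABLE large AS SELECT g AS i, 1 AS datab FROM generate_series(1, 100000) g');
$primary->safe_psql('postgres',
	'CREATE TABLE replace_sb AS SELECT g FROM generate_series(1, 100000) g');

# Step 3: push conflict_db's buffers out of shared_buffers by scanning a
# table that is larger than shared_buffers.
$primary->safe_psql('postgres', 'SELECT count(*) FROM replace_sb');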


> I can't remember whether verify() is the one that accesses conflict_db.large
> while cause_eviction() is the one that accesses postgres.replace_sb for more
> than like 15 seconds.

For more than 15 seconds? The whole test runs in a few seconds for me...


> As to (5), the identifiers are just primary and standby, without
> mentioning the database name, which adds to the confusion, and there
> are no comments explaining why we have two nodes.

I don't follow this one - physical rep doesn't do anything below the database
level?



> I think it would be helpful to find a way to divide the test case up
> into sections that are separated from each other visually, and explain
> the purpose of each test a little more in a comment. For example I'm
> struggling to understand the exact purpose of this bit:
> 
> +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
> +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
> +
> +verify($node_primary, $node_standby, 3,
> +  "restored contents as expected");
> 
> I'm all for having test coverage of VACUUM FULL, but it isn't clear to
> me what this does that is related to FD reuse.

It's just trying to clean up prior damage, so the test continues to later
ones. Definitely not needed once there's a fix. But it's useful while trying
to make the test actually detect various corruptions.


> Similarly for the later ones. I generally grasp that you are trying to
> make sure that things are OK with respect to FD reuse in various
> scenarios, but I couldn't explain to you what we'd be losing in terms
> of test coverage if we removed this line:
> 
> +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");
> 
> I am not sure how much all of this potential cleanup matters because
> I'm pretty skeptical that this would be stable in the buildfarm.

Which "potential cleanup" are you referring to?


> It relies on a lot of assumptions about timing and row sizes and details of
> when invalidations are sent that feel like they might not be entirely
> predictable at the level you would need to have this consistently pass
> everywhere. Perhaps I am too pessimistic?

I don't think the test passing should be dependent on row size, invalidations
etc to a significant degree (unless the tables are too small to reach s_b
etc)? The exact symptoms of failures are highly unstable, but obviously we'd
fix them in-tree before committing a test. But maybe I'm missing a dependency?

FWIW, the test did pass on FreeBSD, Linux, macOS and Windows with the
fix. After a few iterations of improving the fix ;)

Greetings,

Andres Freund




Re: casting operand to proper type in BlockIdGetBlockNumber

2022-03-03 Thread Andres Freund
Hi,

On 2022-03-03 14:00:14 -0500, Tom Lane wrote:
> The attached is enough to get through check-world with
> "-fsanitize=undefined" using RHEL8's clang 12.0.1.

Cool.


> I'm not sure whether to back-patch --- looking through the
> git logs, it seems we've back-patched some fixes like these
> and not others.  Thoughts?

It'd be easier to run a BF animal if we fixed it everywhere.


> In any case, if we're going to take this seriously it seems like we need a
> buildfarm machine or two testing this option.

I was planning to add it to the CI runs, just didn't have energy to fix the
failures yet. But you just did (although I think there might be a failure or two
more on new-ish debians).

For the buildfarm, I could enable it on flaviventris? That runs an
experimental gcc, without optimization (whereas serinus runs with
optimization). Which seems reasonable to combine with sanitizers?

For CI I compared the cost of the different sanitizers. It looks like
alignment sanitizer is almost free, undefined is pretty cheap, and address
sanitizer is pretty expensive (but still much cheaper than valgrind).

Greetings,

Andres Freund

PS: Hm, seems mylodon died a while ago... Need to check what's up with that.




Re: casting operand to proper type in BlockIdGetBlockNumber

2022-03-03 Thread Andres Freund
Hi,

On 2022-03-03 15:31:51 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-03-03 14:00:14 -0500, Tom Lane wrote:
> > For the buildfarm, I could enable it on flaviventris? That runs an
> > experimental gcc, without optimization (whereas serinus runs with
> > optimization). Which seems reasonable to combine with sanitizers?
> 
> Dunno.  I already found out that my Mac laptop (w/ clang 13) detects
> the numeric.c problem but not any of the other ones.  The messages
> on RHEL8 cite where the system headers declare memcmp and friends
> with "attribute nonnull", so I'm betting that Apple's headers lack
> that annotation.

The sanitizers are documented to work best on linux... As flaviventris runs
linux, I'm not sure what your concern is?

I think basically newer glibc versions have more annotations, so ubsan will
have more things to fail against. So it'd be good to have a fairly regularly
updated OS.


> I also tried adding the various -m switches shown in Zhihong's
> CFLAGS setting, but that still didn't repro the Alma warning
> for me.

The compilation flags make it look like it's from a run of Yugabyte's fork,
rather than plain Postgres.

The message says:
src/backend/utils/adt/tid.c:112:16: runtime error: left shift of 65535 by 16 
places cannot be represented in type 'int'

Afaics that means bi_hi is 65535. So either we're dealing with a very large
relation or BlockIdGetBlockNumber() is getting passed InvalidBlockNumber?

It might be enough to do something like
SELECT * FROM pg_class WHERE ctid = '(65535, 17)';
to trigger the problem?
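
To illustrate the undefined behaviour and the kind of cast the subject line
is about (a sketch, not necessarily the exact macro change):

#include "postgres.h"
#include "storage/block.h"

/*
 * bi_hi is a uint16 and gets promoted to (signed) int before the shift, so
 * with bi_hi = 65535 the expression (bi_hi << 16) overflows int - undefined
 * behaviour.  Casting the operand to BlockNumber (uint32) first keeps the
 * shift in unsigned arithmetic.
 */
static inline BlockNumber
block_id_get_block_number(const BlockIdData *blockId)
{
	return (((BlockNumber) blockId->bi_hi) << 16) | ((uint16) blockId->bi_lo);
}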

Greetings,

Andres Freund



