Re: typos

2023-01-10 Thread Amit Kapila
On Tue, Jan 10, 2023 at 1:18 PM Michael Paquier  wrote:
>
> On Tue, Jan 10, 2023 at 12:24:40PM +0530, Amit Kapila wrote:
> > Thanks for noticing this. I'll take care of this and some other typo
> > patches together.
>
> Does this include 0010?  I was just looking at the whole set and this
> one looked like a cleanup worth on its own so I was going to handle
> it, until I saw your update.  If you are also looking at that, I won't
> stand in your way, of course :)
>

I have not yet started, so please go ahead.

-- 
With Regards,
Amit Kapila.




Re: Add a new pg_walinspect function to extract FPIs from WAL records

2023-01-10 Thread Drouvot, Bertrand

Hi,

On 1/6/23 6:41 PM, Bharath Rupireddy wrote:

On Fri, Jan 6, 2023 at 11:47 AM Bharath Rupireddy
 wrote:


On Thu, Jan 5, 2023 at 6:51 PM Bharath Rupireddy
 wrote:



I'm also wondering if it would make sense to extend the test coverage of it (and
pg_waldump) to "validate" that both extracted images are the same and match the
one modified right after the checkpoint.

What do you think? (could be done later in another patch though).


I think pageinspect can be used here. We can fetch the raw page from
the table after the checkpoint and raw FPI from the WAL record logged
as part of the update. I've tried to do so [1], but I see a slight
difference in the raw output. The expectation is that they both be the
same. It might be that the update operation logs the FPI with some
more info set (prune_xid). I'll try to see why it is so.

I'm attaching the v2 patch for further review.

[1]
SELECT * FROM page_header(:'page_from_table');
 lsn| checksum | flags | lower | upper | special | pagesize |
version | prune_xid
---+--+---+---+---+-+--+-+---
  0/1891D78 |0 | 0 |40 |  8064 |8192 | 8192 |
 4 | 0
(1 row)

SELECT * FROM page_header(:'page_from_wal');
 lsn| checksum | flags | lower | upper | special | pagesize |
version | prune_xid
---+--+---+---+---+-+--+-+---
  0/1891D78 |0 | 0 |44 |  8032 |8192 | 8192 |
 4 |   735
(1 row)


Ugh, v2 patch missed the new file added, I'm attaching v3 patch for
further review. Sorry for the noise.


I took a stab at how and what gets logged as FPI in WAL records:

Option 1:
The WAL record with FPI contains both the unmodified table page from the
disk after the checkpoint and the new tuple (not applied to the unmodified
page), and recovery (redo) applies the new tuple to the unmodified page.
A bit more WAL is needed to store both the unmodified page and the new
tuple data in the WAL record, and recovery can also get a bit slower as it
needs to stitch together the modified page.

Option 2:
The WAL record with FPI contains only the modified page (the new tuple applied
to the unmodified page from the disk after the checkpoint), and recovery (redo)
just returns the applied block as BLK_RESTORED. Recovery can be faster with
this approach, and less WAL is needed since only the modified page is stored.

My earlier understanding was that postgres does option (1); however, I
was wrong, option (2) is what postgres actually implements, for the
obvious advantages specified above.

I now made the tests a bit stricter in checking the FPI contents
(tuple values) pulled from the WAL record with raw page contents
pulled from the table using the pageinspect extension. Please see the
attached v4 patch.
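For example, using the same psql variables as above, the tuple-level comparison
could look something like this (a rough sketch, not the exact test in the patch;
heap_page_items() is from the pageinspect extension):

SELECT lp, t_data FROM heap_page_items(:'page_from_table')
EXCEPT
SELECT lp, t_data FROM heap_page_items(:'page_from_wal');
-- an empty result here (and with the two pages swapped) suggests the tuple
-- data matches, even if header fields such as lower/upper or prune_xid differ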



Thanks for updating the patch!

+-- Compare FPI from WAL record and page from table, they must be same

I think "must be the same" or "must be identical" sounds better (but not 100% 
sure).

Except this nit, V4 looks good to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-10 Thread Ankit Kumar Pandey




On 10/01/23 10:53, David Rowley wrote:



On Tue, 10 Jan 2023 at 06:15, Ankit Kumar Pandey  wrote:
> Do we have any pending items for this patch now?

I'm just wondering if not trying this when the query has a DISTINCT
clause is a copout.  What I wanted to avoid was doing additional
sorting work for WindowAgg just to have it destroyed by Hash
Aggregate.  I'm now wondering: if we added both the original
slightly-less-sorted path plus the new slightly-more-sorted path, then
if DISTINCT decides to Hash Aggregate it'll still be able to pick
the cheapest input path to do that on.  Unfortunately, our sort
costing just does not seem to be advanced enough to know that sorting
by fewer columns might be cheaper, so adding the additional path is
likely just going to result in add_path() ditching the old
slightly-less-sorted path due to the new slightly-more-sorted path
having better pathkeys. So, we'd probably be wasting our time if we
added both paths with the current sort costing code.



Maybe we should try and do this for DISTINCT queries if the
distinct_pathkeys match the orderby_pathkeys. That seems a little less
copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
likely that the ORDER BY might opt to use the Unique path for DISTINCT
since it'll already have the correct pathkeys.  However, if the ORDER
BY has fewer columns then it might be cheaper to Hash Aggregate and
then sort all over again, especially so when the DISTINCT removes a
large proportion of the rows.

Ideally, our sort costing would just be better, but I think that
raises the bar a little too high to start thinking of making
improvements to that for this patch.
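For illustration, roughly the kind of query shape where this comes up, with a
window function plus DISTINCT plus ORDER BY (using pg_class only as an example):

SELECT DISTINCT relkind, relnamespace,
       count(*) OVER (PARTITION BY relkind) AS per_kind
FROM pg_class
ORDER BY relkind, relnamespace;
-- presorting the WindowAgg input by (relkind, relnamespace) is only worthwhile
-- if DISTINCT doesn't end up hashing and re-sorting anyway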


Let me take a stab at this. Depending on the complexity, we can take
a call on whether to address this in the current patch or a follow-up.

--
Regards,
Ankit Kumar Pandey





Re: [PATCH] random_normal function

2023-01-10 Thread Dean Rasheed
On Mon, 9 Jan 2023 at 23:38, Tom Lane  wrote:
>
> I tried this on some 32-bit big-endian hardware (NetBSD on macppc)
> to verify my thesis that the results of random() are now machine
> independent.  That part works, but the random_normal() tests didn't;

Ah yes, I wondered about that.

> I saw low-order-bit differences from the results on x86_64 Linux.
> Presumably, that's because one or more of sqrt()/log()/sin() are
> rounding off a bit differently there.  v2 attached deals with this by
> backing off to "extra_float_digits = 0" for that test.

Makes sense.

I double-checked the one-in-a-billion claim, and it looks about right
for each test.

The one I wasn't sure about was the chance of duplicates for
random_normal(). Analysing it more closely, it actually has a smaller
chance of duplicates, since the difference of two standard normal
variates is another normal distribution with a standard deviation
of sqrt(2), and so the probability of a pair of random_normal()'s
being the same is about 2*sqrt(pi) ~ 3.5449 times lower than for
random(). So you can call random_normal() around 5600 times (rather
than 3000 times) before having a 1e-9 chance of duplicates. So, as
with the random() duplicates test, the probability of failure with
just 1000 values should be well below 1e-9. Intuitively, that was
always going to be true, but I wanted to know the details.
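Spelling that out (treating the spacing of double-precision values as roughly
uniform over the relevant range, as in the random() analysis): for independent
standard normals X and Y, X - Y is normal with standard deviation sqrt(2), whose
density at zero is 1/(sqrt(2)*sqrt(2*pi)) = 1/(2*sqrt(pi)) ~ 0.282, versus a
density of 1 at zero for the difference of two uniform(0,1) draws; hence the
factor of 2*sqrt(pi) ~ 3.5449. The expected number of duplicate pairs among n
values is roughly n*(n-1)/2 times the per-pair probability, so holding that at
1e-9 lets n grow by sqrt(2*sqrt(pi)) ~ 1.88, i.e. 3000 * 1.88 ~ 5600 calls.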

The rest looks good to me, except there's a random non-ASCII character
instead of a hyphen in "Kolmogorov-Smirnov" (because I copy-pasted the
name from some random website).

Regards,
Dean




Re: split TOAST support out of postgres.h

2023-01-10 Thread Peter Eisentraut

On 10.01.23 08:39, Noah Misch wrote:

On Tue, Jan 10, 2023 at 06:07:49AM +0100, Peter Eisentraut wrote:

On 30.12.22 17:50, Tom Lane wrote:

Peter Eisentraut  writes:

On 28.12.22 16:07, Tom Lane wrote:

I dunno, #3 seems kind of unprincipled.  Also, since fmgr.h is included
so widely, I doubt it is buying very much in terms of reducing header
footprint.  How bad is it to do #2?



See this incremental patch set.


Wow, 41 files requiring varatt.h is a lot fewer than I would have guessed.
I think that bears out my feeling that fmgr.h wasn't a great location:
I count 117 #includes of that, many of which are in .h files themselves
so that many more .c files would be required to read them.


committed


SET_VARSIZE alone appears in 74 pgxn distributions, so I predict extension
breakage en masse.  I would revert this.


Well, that was sort of my thinking, but people seemed to like this.  I'm 
happy to consider alternatives.






Re: Add SHELL_EXIT_CODE to psql

2023-01-10 Thread Maxim Orlov
On Mon, 9 Jan 2023 at 21:36, Corey Huinker  wrote:

>
> I chose a name that would avoid collisions with anything a user might
> potentially throw into their environment, so if the var "OS" is fairly
> standard is a reason to avoid using it. Also, going with our own env var
> allows us to stay in perfect synchronization with the build's #ifdef WIN32
> ... and whatever #ifdefs may come in the future for new OSes. If there is
> already an environment variable that does that for us, I would rather use
> that, but I haven't found it.
>
Perhaps, I didn't make myself clear. Your solution is perfectly adapted to
our needs.
But all Windows versions since 2000 already have an environment variable
OS=Windows_NT. So, if env OS is defined and equal to Windows_NT, this has to
be Windows.
May we use it in our case? I don't insist, just asking.

-- 
Best regards,
Maxim Orlov.


Re: Exposing the lock manager's WaitForLockers() to SQL

2023-01-10 Thread Marco Slot
On Fri, Dec 23, 2022 at 11:43 AM Will Mortensen  wrote:
> We'd like to be able to call the lock manager's WaitForLockers() and
> WaitForLockersMultiple() from SQL. Below I describe our use case, but
> basically I'm wondering if this:
>
> 1. Seems like a reasonable thing to do
>
> 2. Would be of interest upstream
>
> 3. Should be done with a new pg_foo() function (taking an
>oid?), or a whole new SQL command, or something else

Definitely +1 on adding a function/syntax to wait for lockers without
actually taking a lock. The get sequence value + lock-and-release
approach is still the only scheme I've found for reliably and
efficiently processing new inserts in PostgreSQL. I'm wondering
whether it could be an option of the LOCK command. (LOCK WAIT ONLY?)
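Purely as a sketch of that idea, a hypothetical syntax (not something PostgreSQL
accepts today; table name assumed):

LOCK TABLE events IN SHARE ROW EXCLUSIVE MODE WAIT ONLY;
-- i.e. wait until all current holders of conflicting locks have finished,
-- without acquiring (or keeping) the lock itself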

Marco




Re: Collation version tracking for macOS

2023-01-10 Thread Peter Eisentraut

On 05.12.22 22:33, Thomas Munro wrote:

On Tue, Dec 6, 2022 at 6:45 AM Joe Conway  wrote:

On 12/5/22 12:41, Jeff Davis wrote:

On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:

1.  I think we should seriously consider provider = ICU63.  I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work.  Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while
still
making it super clear what is going on.


As I understand it, this is #2 in your previous list?

Can we put the naming of the provider into the hands of the user, e.g.:

CREATE COLLATION PROVIDER icu63 TYPE icu
  AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';

In this model, icu would be a "provider kind" and icu63 would be the
specific provider, which is named by the user.

That seems like the least magical approach, to me. We need an ICU
library; the administrator gives us one that looks like ICU; and we're
happy.


+1

I like this. The provider kind defines which path we take in our code,
and the specific library unambiguously defines a specific collation
behavior (I think, ignoring bugs?)


OK, I'm going to see what happens if I try to wrangle that stuff into
a new catalogue table.


I'm reviewing the commit fest entry 
https://commitfest.postgresql.org/41/3956/, which points to this thread. 
 It appears that the above patch did not come about in time.  The patch 
of record is now Jeff's refactoring patch, which is also tracked in 
another commit fest entry (https://commitfest.postgresql.org/41/4058/). 
So as a matter of procedure, we should probably close this commit fest 
entry for now.  (Maybe we should also use a different thread subject in 
the future.)


I have a few quick comments on the above syntax example:

There is currently a bunch of locale-using code that selects different 
code paths by "collation provider", i.e., a libc-based code path and an 
ICU-based code path (and sometimes also a default provider path).  The 
above proposal would shift the terminology and would probably require 
some churn at those sites, in that they would now have to select by 
"collation provider type".  We could probably avoid that by shifting the 
terms a bit, so instead of the suggested


provider type -> provider

we could use

provider -> version of that provider

(or some other actual term), which would leave the meaning of "provider" 
unchanged as far as locale-using code is concerned.  At least that's my 
expectation, since no code for this has been seen yet.  We should keep 
this in mind in any case.


Also, the above example exposes a lot of operating system level details. 
 This creates issues with dump/restore, which some of the earlier 
patches avoided by using a path-based approach, and it would also 
require some thoughts about permissions.  We probably want 
non-superusers to be able to interact with this system somehow, for 
upgrading (for some meaning of that action) indexes etc. without 
superuser access.  The more stuff from the OS we expose, the more stuff 
we have to be able to lock down again in a usable manner.


(The search-by-collversion approach can probably avoid those issues better.)




Change xl_hash_vacuum_one_page.ntuples from int to uint16

2023-01-10 Thread Drouvot, Bertrand

Hi hackers,

While working on [1], I noticed that xl_hash_vacuum_one_page.ntuples is an int.

Unless I'm missing something, it seems to me that it would make more sense to
use a uint16 (like is done for gistxlogDelete.ntodelete, for example).

Please find attached a patch proposal to do so.

While that does not currently change the struct size:

No patch:

(gdb) ptype /o struct xl_hash_vacuum_one_page
/* offset  |size */  type = struct xl_hash_vacuum_one_page {
/*  0  |   4 */TransactionId snapshotConflictHorizon;
/*  4  |   4 */int ntuples;

   /* total size (bytes):8 */
 }
With patch:

(gdb) ptype /o struct xl_hash_vacuum_one_page
/* offset  |size */  type = struct xl_hash_vacuum_one_page {
/*  0  |   4 */TransactionId snapshotConflictHorizon;
/*  4  |   2 */uint16 ntuples;
/* XXX  2-byte padding   */

   /* total size (bytes):8 */
 }

It could reduce it when adding new fields (like is done in [1]).

We would get:

No patch:

(gdb) ptype /o struct xl_hash_vacuum_one_page
/* offset  |size */  type = struct xl_hash_vacuum_one_page {
/*  0  |   4 */TransactionId snapshotConflictHorizon;
/*  4  |   4 */int ntuples;
/*  8  |   1 */_Bool isCatalogRel;
/* XXX  3-byte padding   */

   /* total size (bytes):   12 */
 }

With patch:

(gdb) ptype /o struct xl_hash_vacuum_one_page
/* offset  |size */  type = struct xl_hash_vacuum_one_page {
/*  0  |   4 */TransactionId snapshotConflictHorizon;
/*  4  |   2 */uint16 ntuples;
/*  6  |   1 */_Bool isCatalogRel;
/* XXX  1-byte padding   */

   /* total size (bytes):8 */
 }

Means saving 4 bytes in that case.

Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index a2f0f39213..9894ab9afe 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -251,13 +251,13 @@ typedef struct xl_hash_init_bitmap_page
 typedef struct xl_hash_vacuum_one_page
 {
TransactionId snapshotConflictHorizon;
-   int ntuples;
+   uint16  ntuples;
 
/* TARGET OFFSET NUMBERS FOLLOW AT THE END */
 } xl_hash_vacuum_one_page;
 
 #define SizeOfHashVacuumOnePage \
-   (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(int))
+   (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2023-01-10 Thread Dean Rasheed
While doing some random testing, I noticed that the following is broken in HEAD:

SELECT COUNT(DISTINCT random()) FROM generate_series(1,10);

ERROR:  ORDER/GROUP BY expression not found in targetlist

It appears to have been broken by 1349d279, though I haven't looked at
the details.

I'm somewhat surprised that a case as simple as this wasn't covered by
any pre-existing regression tests.

Regards,
Dean




Re: [PATCH] random_normal function

2023-01-10 Thread Dean Rasheed
On Tue, 10 Jan 2023 at 08:33, Dean Rasheed  wrote:
>
> The rest looks good to me, except there's a random non-ASCII character
> instead of a hyphen in "Kolmogorov-Smirnov" (because I copy-pasted the
> name from some random website).
>

Oh, never mind. I see you already fixed that.
I should finish reading all my mail before hitting reply.

Regards,
Dean




Some revises in adding sorting path

2023-01-10 Thread Richard Guo
While reviewing [1], I visited other places where sorting is needed, and
have some findings.

In add_paths_with_pathkeys_for_rel, we do not try incremental sort atop
the epq_path, which I think we can do.  I'm not sure how useful this
is in the real world since the epq_path is used only for EPQ checks, but it
seems doing that doesn't cost too much.

In create_ordered_paths, we are trying to sort the cheapest partial path
and incremental sort on any partial paths with presorted keys, and then
use Gather Merge.  If the cheapest partial path is not completely sorted
but happens to have presorted keys, we would create a full sort path and
an incremental sort path on it.  I think this is not what we want.  We
are supposed to only create an incremental sort path if there are
presorted keys.

In gather_grouping_paths, we have the same issue.  In addition, for the
incremental sort paths created atop partial paths, we neglect to
calculate 'total_groups' before we use it in create_gather_merge_path.

[1]
https://www.postgresql.org/message-id/flat/CAApHDvo8Lz2H%3D42urBbfP65LTcEUOh288MT7DsG2_EWtW1AXHQ%40mail.gmail.com

Thanks
Richard


v1-0003-Revise-how-we-sort-partial-paths-in-gather_grouping_paths.patch
Description: Binary data


v1-0002-Revise-how-we-sort-partial-paths-in-create_ordered_paths.patch
Description: Binary data


v1-0001-postgres_fdw-Allow-incremental-sort-atop-of-the-epq_path.patch
Description: Binary data


Re: Support for dumping extended statistics

2023-01-10 Thread Hari krishna Maddileti
Thanks Team for showing interest.

Please find the attached patch, which uses the same approach as mentioned in the
previous email to implement input functions to parse pg_ndistinct, pg_dependencies,
and pg_mcv_list strings.


Regards,
Hari
From: Bruce Momjian 
Date: Saturday, 7 January 2023 at 8:10 AM
To: Hari krishna Maddileti 
Cc: PostgreSQL Hackers 
Subject: Re: Support for dumping extended statistics

On Thu, Jan  5, 2023 at 06:29:03PM +, Hari krishna Maddileti wrote:
> Hi Team,
> In order to restore dumped extended statistics (stxdndistinct,
> stxddependencies, stxdmcv) we need to provide input functions to parse
> pg_distinct/pg_dependency/pg_mcv_list strings.
>
> Today we get the ERROR "cannot accept a value of type pg_ndistinct/
> pg_dependencies/pg_mcv_list" when we try to do an insert of any type.
>
> Approach tried:
>
> - Using yacc grammar file (statistics_gram.y) to parse the input string to its
> internal format for the types pg_distinct and pg_dependencies
>
> - We are just calling byteain() for serialized input text of type pg_mcv_list.
>
> Currently the changes are working locally. I would like to push the
> changes upstream if there is any use case for postgres. Would like to know if
> there is any interest from the postgres side.

There is certainly interest in allowing the optimizer statistics to be
dumped and reloaded.  This could be used by pg_restore and pg_upgrade.

--
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.



v13-0001-Implement-input-functions-for-extended-statistic.patch
Description:  v13-0001-Implement-input-functions-for-extended-statistic.patch


Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-10 Thread Dilip Kumar
On Tue, Jan 10, 2023 at 10:26 AM houzj.f...@fujitsu.com
 wrote:
>
> On Monday, January 9, 2023 4:51 PM Amit Kapila  
> wrote:
> >
> > On Sun, Jan 8, 2023 at 11:32 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Sunday, January 8, 2023 11:59 AM houzj.f...@fujitsu.com
> >  wrote:
> > > > Attach the updated patch set.
> > >
> > > Sorry, the commit message of 0001 was accidentally deleted, just
> > > attach the same patch set again with commit message.
> > >
> >
> > Pushed the first (0001) patch.
>
> Thanks for pushing, here are the remaining patches.
> I reordered the patch number to put patches that are easier to
> commit in the front of others.

I was looking into 0001. IMHO the pid should continue to represent the
main apply worker. So the pid will always show the main apply worker,
which is actually receiving all the changes for the subscription (in
short, working as the logical receiver), and if it is applying changes
through a parallel worker then it should put the parallel worker's pid
in a new column called 'parallel_worker_pid' or
'parallel_apply_worker_pid', otherwise NULL.  Thoughts?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-10 Thread Masahiko Sawada
On Mon, Jan 9, 2023 at 5:59 PM John Naylor  wrote:
>
> > [working on templating]
>
> In the end, I decided to base my effort on v8, and not v12 (based on one of 
> my less-well-thought-out ideas). The latter was a good experiment, but it did 
> not lead to an increase in readability as I had hoped. The attached v17 is 
> still rough, but it's in good enough shape to evaluate a mostly-complete 
> templating implementation.

I really appreciate your work!

>
> v13-0007 had some changes to the regression tests, but I haven't included 
> those. The tests from v13-0003 do pass, both locally and shared. I quickly 
> hacked together changing shared/local tests by hand (need to recompile), but 
> it would be good for maintainability if tests could run once each with local 
> and shmem, but use the same "expected" test output.

Agreed.

> Also, I didn't look to see if there were any changes in v14/15 that didn't 
> have to do with precise memory accounting.
>
> At this point, Masahiko, I'd appreciate your feedback on whether this is an 
> improvement at all (or at least a good base for improvement), especially for 
> integrating with the TID store. I think there are some advantages to the 
> template approach. One possible disadvantage is needing separate functions 
> for each local and shared memory.
>
> If we go this route, I do think the TID store should invoke the template as 
> static functions. I'm not quite comfortable with a global function that may 
> not fit well with future use cases.

It looks like there is no problem in terms of vacuum integration, although I've
not fully tested it yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have a union for the
radix tree, and the structure would be as follows:

/* Per-backend state for a TidStore */
struct TidStore
{
/*
 * Control object. This is allocated in DSA area 'area' in the shared
 * case, otherwise in backend-local memory.
 */
TidStoreControl *control;

/* Storage for Tids */
union
{
    local_radix_tree   *local;
    shared_radix_tree  *shared;
} tree;

/* DSA area for TidStore if used */
dsa_area    *area;
};

In the functions of TID store, we need to call either local or shared
radix tree functions depending on whether TID store is shared or not.
We need an if-branch for each key-value pair insertion, but I think it
would not be a big performance problem in TID store use cases, since
vacuum is an I/O-intensive operation in many cases. Overall, I think
there is no problem and I'll investigate it in depth.

Apart from that, I've been considering the lock support for shared
radix tree. As we discussed before, the current usage (i.e, only
parallel index vacuum) doesn't require locking support at all, so it
would be enough to have a single lock for simplicity. If we want to
use the shared radix tree for other use cases such as the parallel
heap vacuum or the replacement of the hash table for shared buffers,
we would need better lock support. For example, if we want to support
Optimistic Lock Coupling [1], we would need to change not only the node
structure but also the logic, which would probably widen the gap
between the code for the non-shared and shared radix trees. In that case,
once we have a better radix tree optimized for the shared case, perhaps we
can replace the templated shared radix tree with it. I'd like to hear
your opinion on this.

>
> One review point I'll mention: Somehow I didn't notice there is no use for 
> the "chunk" field in the rt_node type -- it's only set to zero and copied 
> when growing. What is the purpose? Removing it would allow the smallest node 
> to take up only 32 bytes with a fanout of 3, by eliminating padding.

Oh, I didn't notice that. The chunk field was originally used when
redirecting the child pointer in the parent node from old to new
(grown) node. When redirecting the pointer, since the corresponding
chunk surely exists on the parent we can skip existence checks.
Currently we use RT_NODE_UPDATE_INNER() for that (see
RT_REPLACE_NODE()) but having a dedicated function to update the
existing chunk and child pointer might improve the performance. Or
reducing the node size by getting rid of the chunk field might be
better.

> Also, v17-0005 has an optimization/simplification for growing into node125 
> (my version needs an assertion or fallback, but works well now), found by 
another reading of Andres' prototype. There is a lot of good engineering
there; we should try to preserve it.

Agreed.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Allow +group in pg_ident.conf

2023-01-10 Thread Jelte Fennema
Having looked closer now, I'm pretty sure you should base this patch
on top of my patch: https://commitfest.postgresql.org/41/4081/
Mainly because you also need the token version of pg_role, which is
one of the things my patch adds.

> if (regexp_pgrole[0] == '+')

For these lines you'll need to check if the original token was quoted.
If it's quoted it shouldn't use the group behaviour, and instead
compare the + character as part of the literal role.

> if (is_member(roleid, regexp_pgrole +1))
> if (is_member(roleid, ++map_role))

You use these two checks to do the same, so it's best if they are
written consistently.

> if (regexp_pgrole[0] == '+')

This check can be moved before the following line and do an early
return (like I do for "all" in my patch). Since if the first character
is a + we know that it's not \1 and thus we don't have to worry about
getting the regex match.

> if ((ofs = strstr(identLine->pg_role->string, "\\1")) != NULL)




Use windows VMs instead of windows containers on the CI

2023-01-10 Thread Nazir Bilal Yavuz

Hi,

I propose using windows VMs instead of containers; the patch is
attached. Currently, windows containers are used on the CI, but these
container images need to be pulled on every CI run, and they are also
slow to run.


These VM images are created in the same way the container images are
created [1].


The comparison between VMs and containers is as follows (based on d952373a98 and
with the same number of CPUs and the same amount of memory):

Scheduling step:

              | VS 2019 | MinGW64
VM [2]        | 00:17m  | 00:16m
Container [3] | 03:51m  | 04:28m

Execution step:

              | VS 2019 | MinGW64
VM [2]        | 12:16m  | 07:55m
Container [3] | 26:02m  | 16:34m

There is more than 2x speed gain when VMs are used.

[1] 
https://github.com/anarazel/pg-vm-images/blob/main/packer/windows.pkr.hcl

[2] https://cirrus-ci.com/build/4720774045499392
[3] https://cirrus-ci.com/build/5468256027279360

Regards,
Nazir Bilal Yavuz
Microsoft
From 6981319d054f0736e474bc315ab094094d159979 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 10 Jan 2023 13:58:39 +0300
Subject: [PATCH] Use windows VMs instead of windows containers

---
 .cirrus.yml | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 69837bcd5a..5700b8cd66 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -549,8 +549,10 @@ task:
   depends_on: SanityCheck
   only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || 
$CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*windows.*'
 
-  windows_container:
-image: $CONTAINER_REPO/windows_ci_vs_2019:latest
+  compute_engine_instance:
+image_project: $IMAGE_PROJECT
+image: family/pg-ci-windows-ci-vs-2019
+platform: windows
 cpu: $CPUS
 memory: 4G
 
@@ -589,8 +591,10 @@ task:
   # otherwise it'll be sorted before other tasks
   depends_on: SanityCheck
 
-  windows_container:
-image: $CONTAINER_REPO/windows_ci_mingw64:latest
+  compute_engine_instance:
+image_project: $IMAGE_PROJECT
+image: family/pg-ci-windows-ci-mingw64
+platform: windows
 cpu: $CPUS
 memory: 4G
 
-- 
2.25.1



Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-10 Thread Jakub Wartak
On Wed, Jan 4, 2023 at 6:38 PM Robert Haas  wrote:
>
> On Wed, Jan 4, 2023 at 11:36 AM Tom Lane  wrote:
> > As you well know, psql's FETCH_COUNT mechanism is far older than
> > single-row mode.  I don't think anyone's tried to transpose it
> > onto that.  I agree that it seems like a good idea to try.
> > There will be more per-row overhead, but the increase in flexibility
> > is likely to justify that.
>
> Yeah, I was vaguely worried that there might be more per-row overhead,
> not that I know a lot about this topic. I wonder if there's a way to
> mitigate that. I'm a bit suspicious that what we want here is really
> more of an incremental mode than a single-row mode i.e. yeah, you want
> to fetch rows without materializing the whole result, but maybe not in
> batches of exactly size one.

Given the low importance and very low priority of this, how about
adding it as a TODO wiki item then and maybe adding just some warning
instead? I've intentionally avoided parsing grammar and regexp so it's
not perfect (not that I do care about this too much either, as web
crawlers already have indexed this $thread). BTW I've found two earlier
threads, if you know what you are looking for [1][2]
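For anyone skimming, a minimal illustration of the behaviour in question
(psql session, no extra setup needed):

\set FETCH_COUNT 100
SELECT generate_series(1, 1000000);     -- runs via a cursor, 100 rows at a time
WITH s AS (SELECT generate_series(1, 1000000) AS i)
SELECT i FROM s;                        -- not recognized as a plain SELECT, so the
                                        -- whole result set is fetched in one go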

-Jakub Wartak.

[1] - 
https://www.postgresql.org/message-id/flat/a0a854b6-563c-4a11-bf1c-d6c6f924004d%40manitou-mail.org
[2] - 
https://www.postgresql.org/message-id/flat/1274761885.4261.233.camel%40minidragon


0001-psql-warn-about-CTE-queries-to-be-executed-without-u.patch
Description: Binary data


Re: [PATCH]Feature improvement for MERGE tab completion

2023-01-10 Thread Dean Rasheed
On Tue, 3 Jan 2023 at 12:30, vignesh C  wrote:
>
> The patch does not apply on top of HEAD as in [1], please post a rebased 
> patch:
>

This is because 0001 has been committed.
Re-uploading 0002, to keep the CF-bot happy.

Reviewing 0002...

I'm not entirely convinced that the PartialMatches() changes are
really necessary. As far as I can see USING followed by ON doesn't
appear anywhere else in the grammar, and the later completions
involving WHEN [NOT] MATCHED are definitely unique to MERGE.
Nonetheless, I guess it's not a bad thing to check that it really is a
MERGE. Also the new matching function might prove useful for other
cases.

Some more detailed code comments:

I find the name PartialMatches() a little off, since "partial" doesn't
really accurately describe what it's doing. HeadMatches() and
TailMatches() are also partial matches (matching the head and tail
parts). So perhaps MidMatches() would be a better name.

I also found the comment description of PartialMatchesImpl() misleading:

/*
 * Implementation of PartialMatches and PartialMatchesCS macros: do parts of
 * the words in previous_words match the variadic arguments?
 */

I think a more accurate description would be:

/*
 * Implementation of MidMatches and MidMatchesCS macros: do any N consecutive
 * words in previous_words match the variadic arguments?
 */

Similarly, instead of:

/* Match N words on the line partially, case-insensitively. */

how about:

/* Match N consecutive words anywhere on the line, case-insensitively. */

In PartialMatchesImpl()'s main loop:

if (previous_words_count - startpos < narg)
{
va_end(args);
return false;
}

couldn't that just be built into the loop's termination clause (i.e.,
loop while startpos <= previous_words_count - narg)?

For the first block of changes using the new function:

else if (PartialMatches("MERGE", "INTO", MatchAny, "USING") ||
 PartialMatches("MERGE", "INTO", MatchAny, "AS", MatchAny,
"USING") ||
 PartialMatches("MERGE", "INTO", MatchAny, MatchAny, "USING"))
{
/* Complete MERGE INTO ... ON with target table attributes */
if (TailMatches("INTO", MatchAny, "USING", MatchAny, "ON"))
COMPLETE_WITH_ATTR(prev4_wd);
else if (TailMatches("INTO", MatchAny, "AS", MatchAny,
"USING", MatchAny, "AS", MatchAny, "ON"))
COMPLETE_WITH_ATTR(prev8_wd);
else if (TailMatches("INTO", MatchAny, MatchAny, "USING",
MatchAny, MatchAny, "ON"))
COMPLETE_WITH_ATTR(prev6_wd);

wouldn't it be simpler to just include "MERGE" in the TailMatches()
arguments, and leave these 3 cases outside the new code block. I.e.:

/* Complete MERGE INTO ... ON with target table attributes */
else if (TailMatches("MERGE", "INTO", MatchAny, "USING", MatchAny, "ON"))
COMPLETE_WITH_ATTR(prev4_wd);
else if (TailMatches("MERGE", "INTO", MatchAny, "AS", MatchAny,
"USING", MatchAny, "AS", MatchAny, "ON"))
COMPLETE_WITH_ATTR(prev8_wd);
else if (TailMatches("MERGE", "INTO", MatchAny, MatchAny, "USING",
MatchAny, MatchAny, "ON"))
COMPLETE_WITH_ATTR(prev6_wd);

Regards,
Dean
From 8c7e25e4a2d939a32751bd9a0d487c510ec66191 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Wed, 21 Sep 2022 13:58:50 +0900
Subject: [PATCH 2/2] psql: Add PartialMatches() macro for better
 tab-completion.

---
 src/bin/psql/tab-complete.c | 152 +++-
 1 file changed, 113 insertions(+), 39 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 820f47d23a..0b8c252615 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1582,6 +1582,65 @@ HeadMatchesImpl(bool case_sensitive,
 	return true;
 }
 
+/*
+ * Implementation of PartialMatches and PartialMatchesCS macros: do parts of
+ * the words in previous_words match the variadic arguments?
+ */
+static bool
+PartialMatchesImpl(bool case_sensitive,
+   int previous_words_count, char **previous_words,
+   int narg,...)
+{
+	va_list		args;
+	char	   *firstarg = NULL;
+
+	if (previous_words_count < narg)
+		return false;
+
+	for (int startpos = 0; startpos < previous_words_count; startpos++)
+	{
+		int			argno;
+
+		if (firstarg == NULL)
+		{
+			va_start(args, narg);
+			firstarg = va_arg(args, char *);
+		}
+
+		if (!word_matches(firstarg,
+		  previous_words[previous_words_count - startpos - 1],
+		  case_sensitive))
+			continue;
+
+		if (previous_words_count - startpos < narg)
+		{
+			va_end(args);
+			return false;
+		}
+
+		for (argno = 1; argno < narg; argno++)
+		{
+			const char *arg = va_arg(args, const char *);
+
+			if (!word_matches(arg,
+			  previous_words[previous_words_count - argno - startpos - 1],
+			  case_sensitive))
+break;
+		}
+
+		va_end(args);
+		firstarg = NULL;
+
+		if (argno == narg)
+			return true;
+	}
+
+	if (firstarg != NULL)
+		va_end(args);
+
+	return false;
+}
+
 /*
  * Check if th

Re: Use windows VMs instead of windows containers on the CI

2023-01-10 Thread Nazir Bilal Yavuz

Hi,

Tables didn't render nicely in the web interface. Re-sending with correct
formatting.


Scheduling step:

          | VS 2019 | MinGW64
VM        | 00:17m  | 00:16m
Container | 03:51m  | 04:28m

Execution step:

          | VS 2019 | MinGW64
VM        | 12:16m  | 07:55m
Container | 26:02m  | 16:34m


Regards,
Nazir Bilal Yavuz
Microsoft


Re: Resolve UNKNOWN type to relevant type instead of text type while bulk update using values

2023-01-10 Thread Ashutosh Bapat
On Fri, Jan 6, 2023 at 8:28 PM Tom Lane  wrote:
>
> Ashutosh Bapat  writes:
> > we cast a quoted value to UNKNOWN type, but this is a special value
> > null which can be casted to any SQL data type. Probably we could add a
> > ANYNULLTYPE or some such generic null type which can be casted to any
> > data type. Then a null value without any type is labeled as
> > ANYNULLTYPE if specific type information is not available.
>
> And ... how does that differ from the existing behavior of UNKNOWN?
>

From the below comment:
 /*
  * If all the inputs were UNKNOWN type --- ie, unknown-type literals ---
  * then resolve as type TEXT.  This situation comes up with constructs
  * like SELECT (CASE WHEN foo THEN 'bar' ELSE 'baz' END); SELECT 'foo'
  * UNION SELECT 'bar'; It might seem desirable to leave the construct's
  * output type as UNKNOWN, but that really doesn't work, because we'd
  * probably end up needing a runtime coercion from UNKNOWN to something
  * else, and we usually won't have it.  We need to coerce the unknown
  * literals while they are still literals, so a decision has to be made
  * now.
  */

A constant null can be coerced to be a null of any data type. So it
doesn't need to be coerced to text or anything else for the reason
mentioned in the comment. With the UNKNOWN type, we have the problem of
not being able to coerce it to another type. But ANYNULLVALUE can be
coerced to anything and thus can continue to be used till the point
where we know the data type it needs to be coerced to.
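For context, a minimal example of the problem the subject refers to (assuming a
table t with an integer column num):

UPDATE t SET num = v.num
FROM (VALUES (1, NULL), (2, NULL)) AS v(id, num)
WHERE t.id = v.id;
-- fails with a type-mismatch error, because the all-NULL column of the VALUES
-- list is resolved as text instead of the target column's type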

-- 
Best Wishes,
Ashutosh Bapat




Re: SQL/JSON revisited

2023-01-10 Thread Elena Indrupskaya

Hi,

The Postgres Pro documentation team prepared another SQL/JSON 
documentation patch (attached), to apply on top of 
v1-0009-Documentation-for-SQL-JSON-features.patch.

The new patch:
- Fixes minor typos
- Does some rewording agreed with Nikita Glukhov
- Updates Docbook markup to make tags consistent across the SQL/JSON
documentation and across func.sgml, and in particular consistent with
the XMLTABLE function, which closely resembles the SQL/JSON functions.


--
Elena Indrupskaya
Lead Technical Writer
Postgres Professional http://www.postgrespro.com

On 28.12.2022 10:28, Amit Langote wrote:

Hi,

Rebased the SQL/JSON patches over the latest HEAD.  I've decided to
keep the same division of code into individual commits as that
mentioned in the revert commit 2f2b18bd3f, squashing fixup commits in
that list into the appropriate feature commits.

The main difference from the patches as they were committed into v15
is that JsonExpr evaluation no longer needs to use sub-transactions,
thanks to the work done recently to handle type errors softly.  I've
made the new code pass an ErrorSaveContext into the type-conversion
related functions as needed and also added an ExecEvalExprSafe() to
evaluate sub-expressions of JsonExpr that might contain expressions
that call type-conversion functions, such as CoerceViaIO contained in
JsonCoercion nodes.  ExecExprEvalSafe() is based on one of the patches
that Nikita Glukhov had submitted in a previous discussion about
redesigning SQL/JSON expression evaluation [1].  Though, I think that
new interface will become unnecessary after I have finished rebasing
my patches to remove subsidiary ExprStates of JsonExprState that we
had also discussed back in [2].

Adding this to January CF.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index cc72b9c2f6d..8614a26fe95 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17675,18 +17675,18 @@ $.* ? (@ like_regex "^\\d+$")
   
 json constructor
   json (
-  expression
+  expression
FORMAT JSON  ENCODING UTF8 
{ WITH | WITHOUT } UNIQUE  KEYS 
RETURNING json_data_type )


-The expression can be any text type or a
+The expression can be any text type or a
 bytea in UTF8 encoding. If the
-expression is NULL, an
+expression is NULL, an
 SQL null value is returned.
 If WITH UNIQUE is specified, the
-expression must not contain any duplicate
+expression must not contain any duplicate
 object keys.


@@ -17701,12 +17701,12 @@ $.* ? (@ like_regex "^\\d+$")
  
   
 json_scalar
-json_scalar (expression
+json_scalar (expression
  RETURNING json_data_type )


 Returns a JSON scalar value representing
-expression.
+expression.
 If the input is NULL, an SQL NULL is returned. If the input is a number
 or a boolean value, a corresponding JSON number or boolean value is
 returned. For any other value a JSON string is returned.
@@ -17724,8 +17724,8 @@ $.* ? (@ like_regex "^\\d+$")
   
 json_object
 json_object (
- { key_expression { VALUE | ':' }
- value_expression  FORMAT JSON  ENCODING UTF8   }, ... 
+ { key_expression { VALUE | ':' }
+ value_expression  FORMAT JSON  ENCODING UTF8   }, ... 
  { NULL | ABSENT } ON NULL 
  { WITH | WITHOUT } UNIQUE  KEYS  
  RETURNING data_type  FORMAT JSON  ENCODING UTF8   )
@@ -17733,15 +17733,15 @@ $.* ? (@ like_regex "^\\d+$")

 Constructs a JSON object of all the key value pairs given,
 or an empty object if none are given.
-key_expression is a scalar expression
+key_expression is a scalar expression
 defining the JSON key, which is
 converted to the text type.
 It cannot be NULL nor can it
 belong to a type that has a cast to the json.
 If WITH UNIQUE is specified, there must not
-be any duplicate key_expression.
+be any duplicate key_expression.
 If ABSENT ON NULL is specified, the entire
-pair is omitted if the value_expression
+pair is omitted if the value_expression
 is NULL.


@@ -17753,7 +17753,7 @@ $.* ? (@ like_regex "^\\d+$")
   
 json_objectagg
 json_objectagg (
- { key_expression { VALUE | ':' } value_expression } 
+ { key_expression { VALUE | ':' } value_expression } 
  { NULL | ABSENT } ON NULL 
  { WITH | WITHOUT } UNIQUE  KEYS  
  RETURNING data_type  FORMAT JSON  ENCODING UTF8   )
@@ -17761,8 +17761,8 @@ $.* ? (@ like_regex "^\\d+$")

 Behaves like json_object above, but as an
 aggregate function, so it only takes one
-key_expression and one
-value_expression parameter.
+

Re: Use windows VMs instead of windows containers on the CI

2023-01-10 Thread Nazir Bilal Yavuz

Hi,

It didn't work again. Sending numbers until I figure out how to solve this.

Scheduling Step:

VM + VS 2019: 00.17m
Container + VS 2019: 03.51m

VM + MinGW64: 00.16m
Container + MinGW64: 04.28m


Execution step:

VM + VS 2019: 12.16m
Container + VS 2019: 26.02m

VM + MinGW64: 07.55m
Container + MinGW64: 16.34m


Sorry for the multiple mails.


Regards,
Nazir Bilal Yavuz
Microsoft


Re: [PATCH] support tab-completion for single quote input with equal sign

2023-01-10 Thread torikoshia
On Thursday, July 22, 2021 1:05 PM, tanghy(dot)fnst(at)fujitsu(dot)com 
 wrote

I found a problem when using tab-completion as follows:

CREATE SUBSCRIPTION my_subscription
CONNECTION 'host=localhost port=5432 dbname=postgres' [TAB]

The word 'PUBLICATION' couldn't be auto completed as expected.


I too wondered about this behavior.


v3-0001-support-tab-completion-for-CONNECTION-string-with.patch


I applied the patch and it succeeded in the above case, but failed in the
case below.


  =# CREATE SUBSCRIPTION s1 CONNECTION 'a=' PUBLICATION p1 

Before applying the patch, 'WITH (' was completed, but now it completes 
nothing since it matches the below condition:


   18 +   else if ((HeadMatches("CREATE", "SUBSCRIPTION", MatchAny, 
"CONNECTION", MatchAny)))

   19 +   {


I updated the patch going along with the v3 direction.

What do you think?


--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATIONFrom 0e2612a1c7762b64357c85ce04e62b5ba0cdb4f7 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Tue, 10 Jan 2023 09:51:30 +0900
Subject: [PATCH v4] Support tab-completion after conninfo using '='

Previously, tab-completion after CONNECTION 'conninfo' in CREATE
SUBSCRIPTION didn't work when conninfo contained '=' since conninfo
was considered multiple words.

---
 src/bin/psql/t/010_tab_completion.pl |  8 
 src/bin/psql/tab-complete.c  | 12 +---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/bin/psql/t/010_tab_completion.pl b/src/bin/psql/t/010_tab_completion.pl
index 7746c75e0c..4a1e8c8c32 100644
--- a/src/bin/psql/t/010_tab_completion.pl
+++ b/src/bin/psql/t/010_tab_completion.pl
@@ -288,6 +288,14 @@ check_completion(
 # broken versions of libedit require clear_line not clear_query here
 clear_line();
 
+# check tab-completion for CONNECTION string with equal sign.
+check_completion(
+   "CREATE SUBSCRIPTION my_sub CONNECTION 'host=localhost port=5432 dbname=postgres' \t",
+   qr|PUBLICATION|,
+   "tab-completion for CONNECTION string with equal sign");
+
+clear_line();
+
 # COPY requires quoting
 # note: broken versions of libedit want to backslash the closing quote;
 # not much we can do about that
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 23750ea5fb..dd66f1c7a2 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3235,10 +3235,16 @@ psql_completion(const char *text, int start, int end)
 /* CREATE SUBSCRIPTION */
 	else if (Matches("CREATE", "SUBSCRIPTION", MatchAny))
 		COMPLETE_WITH("CONNECTION");
-	else if (Matches("CREATE", "SUBSCRIPTION", MatchAny, "CONNECTION", MatchAny))
+	/*
+	 * conninfo is considered to be multiple words because it contains '=',
+	 * which makes it not match MatchAny.  Here we match conninfo by noting
+	 * that it ends with a single quote.
+	 */
+	else if ((HeadMatches("CREATE", "SUBSCRIPTION", MatchAny, "CONNECTION",
+	 MatchAny)) && ends_with(prev_wd, '\''))
 		COMPLETE_WITH("PUBLICATION");
-	else if (Matches("CREATE", "SUBSCRIPTION", MatchAny, "CONNECTION",
-	 MatchAny, "PUBLICATION"))
+	else if ((HeadMatches("CREATE", "SUBSCRIPTION", MatchAny, "CONNECTION",
+	 MatchAny)) && TailMatches("PUBLICATION"))
 	{
 		/* complete with nothing here as this refers to remote publications */
 	}

base-commit: 57d11ef028d126f95595c08c62ffb4c5147d0f86
-- 
2.25.1



Re: split TOAST support out of postgres.h

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 3:48 AM Peter Eisentraut
 wrote:
> >>> Wow, 41 files requiring varatt.h is a lot fewer than I would have guessed.
> >>> I think that bears out my feeling that fmgr.h wasn't a great location:
> >>> I count 117 #includes of that, many of which are in .h files themselves
> >>> so that many more .c files would be required to read them.
> >>
> >> committed
> >
> > SET_VARSIZE alone appears in 74 pgxn distributions, so I predict extension
> > breakage en masse.  I would revert this.
>
> Well, that was sort of my thinking, but people seemed to like this.  I'm
> happy to consider alternatives.

I don't think that the number of extensions that get broken is really
the right metric. It's not fantastic to break large numbers of
extensions, of course, but if the solution is merely to add an #if
PG_VERSION_NUM >= whatever #include "newstuff" #endif then I don't
think it's really an issue. If an extension doesn't have an author who
can do at least that much updating when a new PG release comes out,
then it's basically unmaintained, and I just don't feel that bad about
breaking unmaintained extensions now and then, even annually.

Of course, if we go and remove something that's very widely used and
for which there's no simple workaround, that sucks. Say, removing
LWLocks entirely. But we don't usually do that sort of thing unless
there's a good reason and significant benefits.

I don't think it would be very nice to do something like this in a
minor release. But in a new major release, I think it's fine. I've
been on the hook to maintain extensions in the face of these kinds of
changes at various times over the years, and it's never taken me much
time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgbench - adding pl/pgsql versions of tests

2023-01-10 Thread Fabien COELHO



Hello,


The attached patch adds pl/pgsql versions of "tpcb-like" and
"simple-update" internal test scripts


Why not, it makes sense because it is relevant to some usage patterns.
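For readers unfamiliar with what such a script would contain, a rough sketch of a
pl/pgsql encapsulation of the tpcb-like transaction (illustrative only, not
necessarily the patch's actual function; it assumes the standard pgbench tables):

CREATE FUNCTION pgbench_tpcb_like(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int AS $$
DECLARE
    v_abalance int;
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + p_delta WHERE aid = p_aid;
    SELECT abalance INTO v_abalance FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + p_delta WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        VALUES (p_tid, p_bid, p_aid, p_delta, CURRENT_TIMESTAMP);
    RETURN v_abalance;
END;
$$ LANGUAGE plpgsql;

The per-transaction script would then reduce to a single
"SELECT pgbench_tpcb_like(:aid, :bid, :tid, :delta);".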

Why not have a pl/pgsql version of the select-only script as well?

If we are going to follow this road, we could also consider
"combined" queries with \; as well?


$ pgbench -b list
Available builtin scripts:
 tpcb-like: 
 plpgsql-tpcb-like: 
 simple-update: 
 plpgsql-simple-update: 
   select-only: 

which one can run  using the -b / --builtin= option


ISTM that -b had a fast selection so that only a prefix was enough to
select a script (-b se = -b select-only). Maybe such a convenient shortcut
should be preserved; it seems that the long name will be needed for the pl
versions.



And a flag --no-functions which lets you skip creating the functions at init


Hmmm. Not so sure.


there are also character flags for -I / --init:
-- Y to drop the functions and
-- y to create the functions. Creating is the default behaviour, but can
be disabled via the long flag --no-functions)


Ok.


I selected Yy as they were unused and can be thought of as "inverted
lambda symbol" :)


:-)


If there are no strong objections, I'll add it to the commitfest as well


Please do that.

--
Fabien Coelho.




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-10 Thread Matthias van de Meent
On Mon, 9 Jan 2023 at 20:34, Andres Freund  wrote:
> On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote:
> > Wouldn't it be enough to only fix the constructions in
> > FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
> > not occur with the patch), and (optionally) bump the first XID
> > available for any cluster to (FirstNormalXid + 1) to retain the 'older
> > than any running transaction' property?
>
> It's not too hard to fix in individual places, but I suspect that we'll
> introduce the bug in future places without some more fundamental protection.
>
> Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable value
> in ComputeXidHorizons() and GetSnapshotData().

I don't think that clamping the value with oldestXid (as seen in patch
0001, in GetSnapshotData) is right.
It would clamp the value relative to the oldest frozen xid of all
databases, which can be millions of transactions behind oldestXmin,
and thus severely skew the amount of transactions' changes you keep on
disk (that is, until oldestXid moves past 1_000_000).
A similar case can be made for the changes in ComputeXidHorizons - for
the described behaviour of vacuum_defer_cleanup_age we would need to
clamp the used offset separately for each of the fields in the horizon
result to retain all transaction data for the first 1 million
transactions, and the ones that may still see these transactions.

> Fixing it in FullXidRelativeTo() doesn't seem quite right, while it's ok to
> just return FirstNormalTransactionId in the case of vacuum_defer_cleanup_age,
> it doesn't seem necessarily correct for other cases.

Understood.

Kind regards,

Matthias van de Meent




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-10 Thread Takamichi Osumi (Fujitsu)
On Tuesday, January 3, 2023 4:01 PM vignesh C  wrote:
Hi, thanks for your review !


> 1) This global variable can be removed as it is used only in send_feedback 
> which
> is called from maybe_delay_apply so we could pass it as a function argument:
> + * delay, avoid having positions of the flushed and apply LSN
> +overwritten by
> + * the latest LSN.
> + */
> +static bool in_delaying_apply = false;
> +static XLogRecPtr last_received = InvalidXLogRecPtr;
> +
I have removed the first variable and made it one of the arguments for
send_feedback().

> 2) -1 gets converted to -1000
> 
> +int64
> +interval2ms(const Interval *interval)
> +{
> +   int64   days;
> +   int64   ms;
> +   int64   result;
> +
> +   days = interval->month * INT64CONST(30);
> +   days += interval->day;
> +
> +   /* Detect whether the value of interval can cause an overflow. */
> +   if (pg_mul_s64_overflow(days, MSECS_PER_DAY, &result))
> +   ereport(ERROR,
> +
> (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
> +errmsg("bigint out of range")));
> +
> +   /* Adds portion time (in ms) to the previous result. */
> +   ms = interval->time / INT64CONST(1000);
> +   if (pg_add_s64_overflow(result, ms, &result))
> +   ereport(ERROR,
> +
> (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
> +errmsg("bigint out of range")));
> 
> create subscription sub7 connection 'dbname=regression host=localhost
> port=5432' publication pub1 with (min_apply_delay = '-1');
> ERROR:  -1000 ms is outside the valid range for parameter "min_apply_delay"
Good catch! Fixed in order to make the input '-1' interpreted as -1 ms.

> 3) This can be slightly reworded:
> +  
> +   The length of time (ms) to delay the application of changes.
> +  
> to:
> Delay applying the changes by a specified amount of time(ms).
This has been suggested in [1] by Peter Smith, so I'd like to keep the current
patch's description. Hence, I didn't change this.

> 4)  maybe_delay_apply can be moved from apply_handle_stream_prepare to
> apply_spooled_messages so that it is consistent with
> maybe_start_skipping_changes:
> @@ -1120,6 +1240,19 @@ apply_handle_stream_prepare(StringInfo s)
> 
> elog(DEBUG1, "received prepare for streamed transaction %u",
> prepare_data.xid);
> 
> +   /*
> +* Should we delay the current prepared transaction?
> +*
> +* Although the delay is applied in BEGIN PREPARE messages,
> streamed
> +* prepared transactions apply the delay in a STREAM PREPARE
> message.
> +* That's ok because no changes have been applied yet
> +* (apply_spooled_messages() will do it). The STREAM START message
> does
> +* not contain a prepare time (it will be available when the 
> in-progress
> +* prepared transaction finishes), hence, it was not possible to 
> apply a
> +* delay at that time.
> +*/
> +   maybe_delay_apply(prepare_data.prepare_time);
> 
> 
> That way the call from apply_handle_stream_commit can also be removed.
Sounds good. I moved the call of maybe_delay_apply() to
apply_spooled_messages().
Now it's aligned with maybe_start_skipping_changes().

> 5) typo transfering should be transferring
> +  publisher and the current time on the subscriber. Time
> spent in logical
> +  decoding and in transfering the transaction may reduce the
> actual wait
> +  time. If the system clocks on publisher and subscriber are
> + not
Fixed.

> 6) feedbacks can be changed to feedback messages
> + * it's necessary to keep sending feedbacks during the delay from the
> + worker
> + * process. Meanwhile, the feature delays the apply before starting the
Fixed.

> 7)
> + /*
> + * Suppress overwrites of flushed and writtten positions by the lastest
> + * LSN in send_feedback().
> + */
> 
> 7a) typo writtten should be written
> 
> 7b) lastest should latest
I have removed this sentence. So, those typos are removed.

Please have a look at the updated patch.

[1] - 
https://www.postgresql.org/message-id/CAHut%2BPttQdFMNM2c6WqKt2c9G6r3ZKYRGHm04RR-4p4fyA4WRg%40mail.gmail.com


Best Regards,
Takamichi Osumi



v13-0001-Time-delayed-logical-replication-subscriber.patch
Description:  v13-0001-Time-delayed-logical-replication-subscriber.patch


Re: Bug in check for unreachable MERGE WHEN clauses

2023-01-10 Thread Dean Rasheed
On Mon, 2 Jan 2023 at 12:13, Dean Rasheed  wrote:
>
> Re-reading my latest MERGE patch, I realised there is a trivial,
> pre-existing bug in the check for unreachable WHEN clauses, which
> means it won't spot an unreachable WHEN clause if it doesn't have an
> AND condition.
>
> So the checks need to be re-ordered, as in the attached.
>

Pushed and back-patched.

Regards,
Dean




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-10 Thread Takamichi Osumi (Fujitsu)
On Tuesday, January 3, 2023 8:22 PM shveta malik  wrote:
> Please find a few minor comments.
Thanks for your review !

> 1.
> +  diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
> +
> 
>TimestampTzPlusMilliseconds(ts, MySubscription->minapplydelay));  on
> unix, above code looks unaligned (copied from unix)
> 
> 2. same with:
> +  interval = DatumGetIntervalP(DirectFunctionCall3(interval_in,
> +
> 
>CStringGetDatum(val),
> +
> 
>ObjectIdGetDatum(InvalidOid),
> +
> 
>Int32GetDatum(-1)));
> perhaps due to tabs?
The indentation of those hunks looks OK. I checked them with pgindent and the
less command described in [1], so I didn't change them.


> 2. comment not clear:
> +  * During the time delayed replication, avoid reporting the suspended
> +  * latest LSN are already flushed and written, to the publisher.
You are right. I fixed this part to make it clearer.
Could you please check?

> 3.
> +  * Call send_feedback() to prevent the publisher from exiting by
> +  * timeout during the delay, when wal_receiver_status_interval is
> +  * available. The WALs for this delayed transaction is neither
> +  * written nor flushed yet, Thus, we don't make the latest LSN
> +  * overwrite those positions of the update message for this delay.
> 
>  yet, Thus, we -->  yet, thus, we/   yet. Thus, we
Yeah, you are right. However, I have removed the last sentence, because it
explains internals of send_feedback() and it would be awkward to describe those
in maybe_delay_apply(). This part is now more concise.

> 4.
> +  /* Adds portion time (in ms) to the previous result. */
> +  ms = interval->time / INT64CONST(1000);
> Is interval->time always in micro-seconds here?
Yeah, it seems so. Several internal code paths indicate it; have a look at
functions such as make_interval() and interval2itm().
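
For reference, here is a minimal, self-contained sketch (not the patch itself) of
what the hunk quoted above computes: the total delay in milliseconds from an
Interval, with overflow checks. It assumes the usual Interval layout ("month" and
"day" are int32 counts, "time" holds int64 microseconds); interval_to_ms() is a
name invented for this sketch.

#include "postgres.h"

#include "common/int.h"
#include "datatype/timestamp.h"
#include "utils/timestamp.h"

static int64
interval_to_ms(const Interval *interval)
{
	int64		days;
	int64		result;
	int64		ms;

	/* months -> days (an Interval month is treated as 30 days) ... */
	if (pg_mul_s64_overflow(interval->month, DAYS_PER_MONTH, &days) ||
		pg_add_s64_overflow(days, interval->day, &days) ||
		/* ... then days -> milliseconds */
		pg_mul_s64_overflow(days, SECS_PER_DAY * INT64CONST(1000), &result))
		ereport(ERROR,
				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
				 errmsg("bigint out of range")));

	/* add the sub-day portion, converting microseconds to milliseconds */
	ms = interval->time / INT64CONST(1000);
	if (pg_add_s64_overflow(result, ms, &result))
		ereport(ERROR,
				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
				 errmsg("bigint out of range")));

	return result;
}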

Please have a look at the latest patch v12 in [2].

[1] - https://www.postgresql.org/docs/current/source-format.html
[2] - 
https://www.postgresql.org/message-id/TYCPR01MB837340F78F4A16F542589195EDFF9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



Re: [PATCH] random_normal function

2023-01-10 Thread Tom Lane
Dean Rasheed  writes:
> I double-checked the one-in-a-billion claim, and it looks about right
> for each test.

Thanks for double-checking my arithmetic.

> The rest looks good to me, except there's a random non-ASCII character
> instead of a hyphen in "Kolmogorov-Smirnov" (because I copy-pasted the
> name from some random website).

Yeah, I caught that before committing.

The AIX buildfarm members were still showing low-order diffs in the
random_normal results at extra_float_digits = 0, but they seem OK
after reducing it to -1.

regards, tom lane




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-10 Thread Takamichi Osumi (Fujitsu)
On Tuesday, January 10, 2023 11:28 AM I wrote:
> On Tuesday, December 27, 2022 6:29 PM wrote:
> > Thanks for reviewing our patch! PSA new version patch set.
Now, the patches fail to apply to HEAD because of recent commits
> (c6e1f62e2c and 216a784829c) as reported in [1].
> 
> I'll rebase the patch with other changes when I post a new version.
This is done in the patch in [1].
Please note that because of the commit c6e1f62e2c,
we don't need the 1st patch we borrowed from another thread in [2] any more.

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB837340F78F4A16F542589195EDFF9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
[2] - 
https://www.postgresql.org/message-id/flat/20221122004119.GA132961%40nathanxps13



Best Regards,
Takamichi Osumi



Re: Allow +group in pg_ident.conf

2023-01-10 Thread Andrew Dunstan


On 2023-01-10 Tu 07:09, Jelte Fennema wrote:
> Having looked closer now, I'm pretty sure you should base this patch
> on top of my patch: https://commitfest.postgresql.org/41/4081/
> Mainly because you also need the token version of pg_role, which is
> one of the things my patch adds.


Ok, that sounds reasonable, but the cfbot doesn't like patches that
depend on other patches in a different email. Maybe you can roll this up
as an extra patch in your next version? It's pretty small.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: MERGE ... WHEN NOT MATCHED BY SOURCE

2023-01-10 Thread Dean Rasheed
Rebased version attached.

Regards,
Dean
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
new file mode 100644
index b87ad5c..1482ede
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -396,8 +396,8 @@
 originally matched appears later in the list of actions.
 On the other hand, if the row is concurrently updated or deleted so
 that the join condition fails, then MERGE will
-evaluate the condition's NOT MATCHED actions next,
-and execute the first one that succeeds.
+evaluate the condition's NOT MATCHED [BY TARGET]
+actions next, and execute the first one that succeeds.
 If MERGE attempts an INSERT
 and a unique index is present and a duplicate row is concurrently
 inserted, then a uniqueness violation error is raised;
diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml
new file mode 100644
index 0995fe0..8ef121a
--- a/doc/src/sgml/ref/merge.sgml
+++ b/doc/src/sgml/ref/merge.sgml
@@ -33,7 +33,8 @@ USING dat
 and when_clause is:
 
 { WHEN MATCHED [ AND condition ] THEN { merge_update | merge_delete | DO NOTHING } |
-  WHEN NOT MATCHED [ AND condition ] THEN { merge_insert | DO NOTHING } }
+  WHEN NOT MATCHED BY SOURCE [ AND condition ] THEN { merge_update | merge_delete | DO NOTHING } |
+  WHEN NOT MATCHED [BY TARGET] [ AND condition ] THEN { merge_insert | DO NOTHING } }
 
 and merge_insert is:
 
@@ -70,7 +71,9 @@ DELETE
from data_source to
target_table_name
producing zero or more candidate change rows.  For each candidate change
-   row, the status of MATCHED or NOT MATCHED
+   row, the status of MATCHED,
+   NOT MATCHED BY SOURCE,
+   or NOT MATCHED [BY TARGET]
is set just once, after which WHEN clauses are evaluated
in the order specified.  For each candidate change row, the first clause to
evaluate as true is executed.  No more than one WHEN
@@ -226,16 +229,37 @@ DELETE
   At least one WHEN clause is required.
  
  
+  The WHEN clause may specify WHEN MATCHED,
+  WHEN NOT MATCHED BY SOURCE, or
+  WHEN NOT MATCHED [BY TARGET].
+  Note that the SQL standard only defines
+  WHEN MATCHED and WHEN NOT MATCHED
+  (which is defined to mean no matching target row).
+  WHEN NOT MATCHED BY SOURCE is an extension to the
+  SQL standard, as is the option to append
+  BY TARGET to WHEN NOT MATCHED, to
+  make its meaning more explicit.
+ 
+ 
   If the WHEN clause specifies WHEN MATCHED
-  and the candidate change row matches a row in the
+  and the candidate change row matches a row in the source to a row in the
   target_table_name,
   the WHEN clause is executed if the
   condition is
   absent or it evaluates to true.
  
  
-  Conversely, if the WHEN clause specifies
-  WHEN NOT MATCHED
+  If the WHEN clause specifies
+  WHEN NOT MATCHED BY SOURCE and the candidate change
+  row represents a row in the
+  target_table_name that does
+  not match a source row, the WHEN clause is executed
+  if the condition is
+  absent or it evaluates to true.
+ 
+ 
+  If the WHEN clause specifies
+  WHEN NOT MATCHED [BY TARGET]
   and the candidate change row does not match a row in the
   target_table_name,
   the WHEN clause is executed if the
@@ -257,7 +281,10 @@ DELETE
  
   A condition on a WHEN MATCHED clause can refer to columns
   in both the source and the target relations. A condition on a
-  WHEN NOT MATCHED clause can only refer to columns from
+  WHEN NOT MATCHED BY SOURCE clause can only refer to
+  columns from the target relation, since by definition there is no matching
+  source row. A condition on a WHEN NOT MATCHED [BY TARGET]
+  clause can only refer to columns from
   the source relation, since by definition there is no matching target row.
   Only the system attributes from the target table are accessible.
  
@@ -382,8 +409,10 @@ DELETE
   WHEN MATCHED clause, the expression can use values
   from the original row in the target table, and values from the
   data_source row.
-  If used in a WHEN NOT MATCHED clause, the
-  expression can use values from the data_source.
+  If used in a WHEN NOT MATCHED BY SOURCE clause, the
+  expression can only use values from the original row in the target table.
+  If used in a WHEN NOT MATCHED [BY TARGET] clause, the
+  expression can only use values from the data_source.
  
 

@@ -452,8 +481,9 @@ MERGE tot

 
  
-  Evaluate whether each row is MATCHED or
-  NOT MATCHED.
+  Evaluate whether each row is MATCHED,
+  NOT MATCHED BY SOURCE, or
+  NOT MATCHED [BY TARGET].
  
 
 
@@ -528,7 +558,8 @@ MERGE tot
   
If a WHEN clause omits an AND
sub-clause, it becomes the final reachable clause of that
-   kind (MATCHED or NOT MATCHE

Re: split TOAST support out of postgres.h

2023-01-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 10, 2023 at 3:48 AM Peter Eisentraut
>  wrote:
>>> SET_VARSIZE alone appears in 74 pgxn distributions, so I predict extension
>>> breakage en masse.  I would revert this.

>> Well, that was sort of my thinking, but people seemed to like this.  I'm
>> happy to consider alternatives.

> I don't think it would be very nice to do something like this in a
> minor release. But in a new major release, I think it's fine. I've
> been on the hook to maintain extensions in the face of these kinds of
> changes at various times over the years, and it's never taken me much
> time.

Yeah, that was my thinking.  We could never do any header refactoring
at all if the standard is "will some extension author need to add a #if".
In practice, we make bigger adjustments than this all the time,
both in header layout and in individual function APIs.

Now, there is a fair question whether splitting this code out of
postgres.h is worth any trouble at all.  TBH my initial reaction
had been "no".  But once we found that only 40-ish backend files
need to read this new header, I became a "yes" vote because it
seems clear that there will be a total-compilation-time benefit.

regards, tom lane




Re: SQL/JSON revisited

2023-01-10 Thread Andrew Dunstan


On 2023-01-10 Tu 07:51, Elena Indrupskaya wrote:
> Hi,
>
> The Postgres Pro documentation team prepared another SQL/JSON
> documentation patch (attached), to apply on top of
> v1-0009-Documentation-for-SQL-JSON-features.patch.
> The new patch:
> - Fixes minor typos
> - Does some rewording agreed with Nikita Glukhov
> - Updates Docbook markup to make tags consistent across SQL/JSON
> documentation and across func.sgml, and in particular, consistent with
> the XMLTABLE function, which resembles SQL/JSON functions pretty much.
>

That's nice, but please don't post incremental patches like this. It
upsets the cfbot. (I wish there were a way to tell the cfbot to ignore
patches)

Also, I'm fairly certain that a good many of your changes are not
according to project style. The rule as I understand it is that
 is used for things that are parameters and  is
only used for things that are not parameters. (I'm not sure where that's
documented other than the comment on commit 47046763c3, but it's what I
attempted to do with the earlier doc tidy up.)


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Handle infinite recursion in logical replication setup

2023-01-10 Thread Amit Kapila
On Tue, Jan 10, 2023 at 8:13 AM Jonathan S. Katz  wrote:
>
> On 9/12/22 1:23 AM, vignesh C wrote:
> > On Fri, 9 Sept 2022 at 11:12, Amit Kapila  wrote:
> >
> > Thanks for pushing the patch. I have closed this entry in commitfest.
> > I will wait for some more time and see the response regarding the
> > documentation patch and then start a new thread if required.
>
> I've been testing this patch in advancing of working on the
> documentation and came across a behavior I wanted to note. Specifically,
> I am hitting a deadlock while trying to synchronous replicate between
> the two instances at any `synchronous_commit` level above `local`.
>
> Here is my set up. I have two instances, "A" and "B".
>
> On A and B, run:
>
>CREATE TABLE sync (id int PRIMARY KEY, info float);
>CREATE PUBLICATION sync FOR TABLE sync;
>
> On A, run:
>
>CREATE SUBSCRIPTION sync
>CONNECTION 'connstr-to-B'
>PUBLICATION sync
>WITH (
>  streaming=true, copy_data=false,
>  origin=none, synchronous_commit='on');
>
> On B, run:
>
>CREATE SUBSCRIPTION sync
>CONNECTION 'connstr-to-A'
>PUBLICATION sync
>WITH (
>  streaming=true, copy_data=false,
>  origin=none, synchronous_commit='on');
>
> On A and B, run:
>
>ALTER SYSTEM SET synchronous_standby_names TO 'sync';
>SELECT pg_reload_conf();
>
> Verify on A and B that pg_stat_replication.sync_state is set to "sync"
>
>SELECT application_name, sync_state = 'sync' AS is_sync
>FROM pg_stat_replication
>WHERE application_name = 'sync';
>
> The next two commands should be run simultaneously on A and B:
>
> -- run this on A
> INSERT INTO sync
> SELECT x, random() FROM generate_series(1,200, 2) x;
>
> -- run this on B
> INSERT INTO sync
> SELECT x, random() FROM generate_series(2,200, 2) x;
>
> This consistently created the deadlock in my testing.
>
> Discussing with Masahiko off-list, this is due to a deadlock from 4
> processes: the walsenders on A and B, and the apply workers on A and B.
> The walsenders are waiting for feedback from the apply workers, and the
> apply workers are waiting for the walsenders to synchronize (I may be
> oversimplifying).
>
> He suggested I try the above example instead with `synchronous_commit`
> set to `local`. In this case, I verified that there is no more deadlock,
> but he informed me that we would not be able to use cascading
> synchronous replication when "origin=none".
>

This has nothing to do with the origin feature. I mean, this should
happen with origin=any or even in PG15 without using origin at all.
Am I missing something? One related point to note is that in physical
replication, cascading replication is asynchronous. See the docs [1]
("Cascading replication is currently asynchronous").

> If we decide that this is a documentation issue, I'd suggest we improve
> the guidance around using `synchronous_commit`[1] on the CREATE
> SUBSCRIPTION page, as the GUC page[2] warns against using `local`:
>

Yeah, but on the CREATE SUBSCRIPTION page we have mentioned that it is
safe to use off for logical replication. One can use local or higher
to reduce the latency of COMMIT when synchronous replication is used
on the publisher. Won't using 'local' while creating the subscription
suffice to replicate the data consistently? I mean, it is equivalent
to somebody using levels greater than local in the case of physical
replication. In physical replication, we won't wait for the standby to
replicate to another node before sending a response, so why wait in
the case of logical replication? If this understanding is correct,
then it is probably sufficient to support 'local' for a subscription.

[1] - https://www.postgresql.org/docs/devel/warm-standby.html

-- 
With Regards,
Amit Kapila.




Re: Announcing Release 15 of the PostgreSQL Buildfarm client

2023-01-10 Thread Andrew Dunstan


On 2022-12-31 Sa 10:02, Andrew Dunstan wrote:
> Changes
>
>
>   * check if a branch is up to date before trying to run it
> This only applies if the |branches_to_build| setting is a keyword
> rather than a list of branches. It reduces the number of useless
> calls to |git pull| to almost zero.


Occasionally things go wrong. It turns out this code was a bit too eager
and ignored the force_every settings in the config file.

There's a hot fix at

and I will push out a new release shortly.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Use windows VMs instead of windows containers on the CI

2023-01-10 Thread Justin Pryzby
On Tue, Jan 10, 2023 at 03:20:18PM +0300, Nazir Bilal Yavuz wrote:
> Hi,
> 
> I propose using windows VMs instead of containers, the patch is attached.
> Currently, windows containers are used on the CI, but these container images
> are needs to get pulled on every CI run, also they are slow to run.

+1

> There is more than 2x speed gain when VMs are used.

One consideration is that if windows runs twice as fast, we'll suddenly
start using twice as many resources at cirrus/google/amazon - the
windows task has been throttling everything else.  Not sure if we should
do anything beyond the limits that cfbot already uses.

-- 
Justin




Re: [PATCH] pgbench: add multiconnect option

2023-01-10 Thread Fabien COELHO



Hello Jelte,


This patch seems to have quite some use case overlap with my patch which
adds load balancing to libpq itself:
https://www.postgresql.org/message-id/flat/pr3pr83mb04768e2ff04818eeb2179949f7...@pr3pr83mb0476.eurprd83.prod.outlook.com


Thanks for the pointer.

The end purpose of the patch is to allow pgbench to follow a failover at 
some point, at the client level, AFAICR.



My patch is only able to add "random" load balancing though, not
"round-robin". So this patch still definitely seems useful, even when mine
gets merged.


Yep. I'm not sure the end purpose is the same, but possibly the pgbench 
patch could take advantage of the libpq extension.



I'm not sure that the support for the "working" connection is necessary
from a feature perspective though (usability/discoverability is another
question). It's already possible to achieve the same behaviour by simply
providing multiple host names in the connection string. You can even tell
libpq to connect to a primary or secondary by using the
target_session_attrs option.
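
For reference, a minimal sketch of that existing libpq behaviour (the host names
are made up): libpq tries the listed hosts in order, and target_session_attrs
filters for a writable server, so a client such as pgbench gets this for free
through PQconnectdb().

#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
	PGconn	   *conn = PQconnectdb("host=node1,node2,node3 "
								   "port=5432,5432,5432 dbname=bench "
								   "target_session_attrs=read-write");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}
	printf("connected to %s\n", PQhost(conn));
	PQfinish(conn);
	return 0;
}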


--
Fabien.




Re: allowing for control over SET ROLE

2023-01-10 Thread Robert Haas
On Sat, Jan 7, 2023 at 12:00 AM Noah Misch  wrote:
> The docs are silent on the SET / OWNER TO connection.  Hence,

Reviewing the documentation again today, I realized that the
documentation describes the rules for changing the ownership of an
object in a whole bunch of places which this patch failed to update.
Here's a patch to update all of the places I found.

I suspect that these changes will mean that we don't also need to
adjust the discussion of the SET option itself, but let me know what
you think.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Update-ALTER-OWNER-documentation-for-GRANT-.-WITH.patch
Description: Binary data


Re: [PATCH] random_normal function

2023-01-10 Thread Paul Ramsey
On Tue, Jan 10, 2023 at 6:34 AM Tom Lane  wrote:
>
> Dean Rasheed  writes:
> > I double-checked the one-in-a-billion claim, and it looks about right
> > for each test.
>
> Thanks for double-checking my arithmetic.
>
> > The rest looks good to me, except there's a random non-ASCII character
> > instead of a hyphen in "Kolmogorov-Smirnov" (because I copy-pasted the
> > name from some random website).
>
> Yeah, I caught that before committing.
>
> The AIX buildfarm members were still showing low-order diffs in the
> random_normal results at extra_float_digits = 0, but they seem OK
> after reducing it to -1.

I should leave the country more often... thanks for cleaning up my
patch and committing it Tom! It's a Christmas miracle (at least, for
me :)
P.




can while loop in ClockSweepTick function be kind of infinite loop in some cases?

2023-01-10 Thread 斯波隼斗
This question is about ClockSweepTick function and the code is below.
https://github.com/postgres/postgres/blob/24d2b2680a8d0e01b30ce8a41c4eb3b47aca5031/src/backend/storage/buffer/freelist.c#L146-L165

The values of the expected, NBuffers, and wrapped variables are fixed in the
while loop, so when expected is not equal to
StrategyControl->nextVictimBuffer, the CAS operation fails and the while loop
can run more or less infinitely.
This can happen when ClockSweepTick is called concurrently and nextVictimBuffer
is incremented by another process before the CAS operation in the loop (for
example, expected is NBuffers+1 while nextVictimBuffer is NBuffers+2, so the
CAS fails).
I think the `expected = originalVictim + 1;` line should be inside the while
loop (before acquiring the spinlock) so that, even in the case above, expected
is recomputed on each iteration and the CAS operation will eventually succeed.
Is my understanding correct? If so, I will send a patch for fixing this issue.

Thank you in advance
Hayato Shiba


RE: [Proposal] Add foreign-server health checks infrastructure

2023-01-10 Thread Hayato Kuroda (Fujitsu)
Dear Tom,

> I think that it's a really bad idea to require postgres_fdw.sql
> to have two expected-files: that will be a maintenance nightmare.
> Please put whatever it is that needs a variant expected-file
> into its own, hopefully very small and seldom-changed, test script.
> Or rethink whether you really need a test case that has
> platform-dependent output.

Thank you for the suggestion. I agree with it and have modified the patch accordingly.

I added new functions at the libpq and postgres_fdw layers that report whether the
connection check works on the current platform. In the test, the platform is checked
first, and the checking function is called only when it is supported.

An alternative approach is to combine PQCanConncheck() with PQConncheck().
This would mean one less libpq function, but we would have to define another
return value for it, such as -2. I was not sure which approach was better.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v23-0001-Add-PQConncheck-to-libpq.patch
Description: v23-0001-Add-PQConncheck-to-libpq.patch


v23-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
Description:  v23-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch


v23-0003-add-test.patch
Description: v23-0003-add-test.patch


Re: Strengthen pg_waldump's --save-fullpage tests

2023-01-10 Thread Drouvot, Bertrand

Hi,

On 1/10/23 2:22 AM, Michael Paquier wrote:

On Mon, Jan 09, 2023 at 08:30:00AM +0530, Bharath Rupireddy wrote:

A recent commit [1] added --save-fullpage option to pg_waldump to
extract full page images (FPI) from WAL records and save them into
files (one file per FPI) under a specified directory. While it added
tests to check the LSN from the FPI file name and the FPI file
contents, it missed to further check the FPI contents like the tuples
on the page. I'm attaching a patch that basically reads the FPI file
(saved by pg_waldump) contents and raw page from the table file (using
pageinspect extension) and compares the tuples from both of them. This
test ensures that the pg_waldump outputs the correct FPI. This idea is
also discussed elsewhere [2].

Thoughts?


I am not sure that it is necessary to expand this set of tests to have
dependencies on heap and pageinspect (if we do so, what of index AMs)
and spend more cycles on that, while we already have something in
place to cross-check ReadRecPtr with what's stored in the page header
written on top of the block size.


I like the idea of comparing the full page (and not just the LSN) but
I'm not sure that adding the pageinspect dependency is a good thing.

What about extracting the block directly from the relation file and
comparing it with the one extracted from the WAL? (We'd need to skip the
first 8 bytes to skip the LSN though).
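
Something as small as the following would do for the comparison itself (a sketch,
assuming both BLCKSZ-sized pages have already been read from the saved FPI file
and from the relation file at offset blkno * BLCKSZ; the page LSN occupies the
first 8 bytes of the page header, so it is skipped):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LSN_LEN 8				/* sizeof(PageXLogRecPtr) at the page start */

static bool
pages_match_ignoring_lsn(const char *fpi_page, const char *rel_page,
						 size_t blcksz)
{
	return memcmp(fpi_page + LSN_LEN, rel_page + LSN_LEN,
				  blcksz - LSN_LEN) == 0;
}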

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [Proposal] Add foreign-server health checks infrastructure

2023-01-10 Thread Ted Yu
On Tue, Jan 10, 2023 at 8:26 AM Hayato Kuroda (Fujitsu) <
kuroda.hay...@fujitsu.com> wrote:

> Dear tom,
>
> > I think that it's a really bad idea to require postgres_fdw.sql
> > to have two expected-files: that will be a maintenance nightmare.
> > Please put whatever it is that needs a variant expected-file
> > into its own, hopefully very small and seldom-changed, test script.
> > Or rethink whether you really need a test case that has
> > platform-dependent output.
>
> Thank you for giving the suggestion. I agreed your saying and modifed that.
>
> I added new functions on the libpq and postgres-fdw layer that check
> whether the
> checking works well or not. In the test, at first, the platform is checked
> and
> the checking function is called only when it is supported.
>
> An alternative approach is that PQCanConncheck() can be combined with
> PQConncheck().
> This can reduce the libpq function, but we must define another returned
> value to
> the function like -2. I was not sure which approach was better.
>
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED
>

Hi,

+   /* quick exit if connection cache has been not initialized yet. */

been not initialized -> not been initialized

+
(errcode(ERRCODE_CONNECTION_FAILURE),
+errmsg("could not connect
to server \"%s\"",

Currently each server which is not connected would log a warning.
Would it be better to concatenate the names of such servers and log one line? This
would be cleaner when there are multiple such servers.

Cheers


Re: allowing for control over SET ROLE

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 2:28 AM Jeff Davis  wrote:
> The risks of SECURITY INVOKER are more serious. It inherently means
> that one user is writing code, and another is executing it. And in the
> SQL world of triggers, views, expression indexes and logical
> replication, the invoker often doesn't know what they are invoking.
> There are search path risks, risks associated with resolving the right
> function/operator/cast, risks of concurrent DDL (i.e. changing a
> function definition right before a superuser executes it), etc. It
> severely limits the kinds of trust models you can use in logical
> replication. And SECURITY INVOKER weirdly inverts the trust
> relationship of a GRANT: if A grants to B, then B must *completely*
> trust A in order to exercise that new privilege because A can inject
> arbitrary SECURITY INVOKER code in front of the object.

Yes. I think it's extremely difficult to operate a PostgreSQL database
with mutually untrusting users. If the high-privilege users don't
trust the regular users, they must also make very little use of the
database and only in carefully circumscribed ways. If not, the whole
security model unravels really fast. It would certainly be nice to do
better here.

> UNIX basically operates on a SECURITY INVOKER model, so I guess that
> means that it can work. But then again, grepping a file doesn't execute
> arbitrary code from inside that file (though there are bugs
> sometimes... see [1]). It just seems like the wrong model for SQL.

I often think about the UNIX model to better understand the problems
we have in PostgreSQL. I don't think that there's any real theoretical
difference between the cases, but there are practical differences
nearly all of which are unfavorable to PostgreSQL. For example, when
you log into your UNIX account, you have a home directory which is
pre-created. Your path is likely configured to contain only root-owned
directories and perhaps directories within your home directory that
are controlled by you, and the permissions on the root-owned
directories are locked down tight. That's because people figured out
in the 1970s and 1980s that if other people could write executable
code into a path you were likely to search, your account was probably
going to get compromised.

Now, in PostgreSQL, the equivalent of a home directory is a user
schema. We set things up to search those by default if they exist, but
we don't create them by default. We also put the public schema in the
default search path and, up until very recently, it was writeable by
default. In practice, many users probably put back write permission on
that schema, partly because if they don't, unprivileged users can't
create database objects anywhere at all. The practical effect of this
is that, when you log into a UNIX system, you're strongly encouraged
to access only things that are owned by you or root, and any new stuff
you create will be in a location where nobody but you is likely to
touch it. On the other hand, when you log into a PostgreSQL system,
you're set up by default to access objects created by other
unprivileged users and you may have nowhere to put your own objects
where those users won't also be accessing your stuff.

So the risks, which in theory are all very similar, are in practice
far greater in the PostgreSQL context, basically because our default
setup is about 40 years behind the times in terms of implementing best
practices. At least we've locked down write permission on pg_catalog.
I think some early UNIX systems didn't even do that, or not well. But
that's about the end of the good things that I have to say about what
we're doing in this area.

To be fair, I think many security people also consider it wise to
assume that a local unprivileged UNIX user can probably find a way to
escalate to root. There are a lot of setuid binaries on a
normally-configured UNIX system, and you only need to find one of them
that has an exploitable vulnerability. Those are the equivalent of
SECURITY DEFINER privileges, and I don't think we ship any of those in
a default configuration. In that regard, we're perhaps better-secured
than UNIX. Unfortunately, I think it is probably still wise to assume
that an unprivileged PostgreSQL user can find some way of getting
superuser if they want -- not only because of Trojan horse attacks
based on leaving security-invoker functions or procedures or operators
lying around, but also because I strongly suspect there are more
escalate-to-superuser bugs in the code than we've found yet. Those
we've not found, or have found but have not fixed, may still be known
to bad actors.

> [ Aside: that probably explains why the SQL spec defaults to SECURITY
> DEFINER. ]

I doubt that SECURITY DEFINER is safer in general than SECURITY
INVOKER. That'd be the equivalent of having binaries installed setuid
by default, which would be insane. I think it is right to regard
SECURITY DEFINER as the bigger threat by far. The reason it doesn't
always seem that wa

Re: split TOAST support out of postgres.h

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 9:46 AM Tom Lane  wrote:
> Now, there is a fair question whether splitting this code out of
> postgres.h is worth any trouble at all.  TBH my initial reaction
> had been "no".  But once we found that only 40-ish backend files
> need to read this new header, I became a "yes" vote because it
> seems clear that there will be a total-compilation-time benefit.

I wasn't totally sure about this, either, but I think on balance it's
probably a smart thing to do. I always found it kind of weird that we
put that stuff in postgres.h. It seems to privilege the TOAST
mechanism to an undue degree; what's the argument, for example, that
TOAST macros are more generally relevant than CHECK_FOR_INTERRUPTS()
or LWLockAcquire or HeapTuple? It felt to me like we'd just decided
that one subsystem gets to go into the main header file and everybody
else just had to have their own headers.

One thing that's particularly awkward about that is that if you want
to write some front-end code that knows about how varlenas are stored
on disk, it was very awkward with the old structure. You're not
supposed to include "postgres.h" into frontend code, but if the stuff
you need is defined there then what else can you do? So I generally
think that anything that's in postgres.h should have a strong claim to
be not only widely-needed in the backend, but also never needed at all
in any other executable.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-10 Thread Ted Yu
Hi,
I was reading src/backend/replication/logical/applyparallelworker.c .
In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I
think the `ParallelApplyTxnHash` should be released.

Please see the patch.

Thanks


Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-10 Thread Ted Yu
On Tue, Jan 10, 2023 at 9:25 AM Ted Yu  wrote:

> Hi,
> I was reading src/backend/replication/logical/applyparallelworker.c .
> In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I
> think the `ParallelApplyTxnHash` should be released.
>
> Please see the patch.
>
> Thanks
>
Here is the patch :-)


destroy-parallel-apply-txn-hash-when-worker-not-launched.patch
Description: Binary data


Re: Transparent column encryption

2023-01-10 Thread Mark Dilger



> On Dec 31, 2022, at 6:17 AM, Peter Eisentraut 
>  wrote:
> 
> Another update, with some merge conflicts resolved. 

Hi Peter, thanks for the patch!

I wonder if logical replication could be made to work more easily with this 
feature.  Specifically, subscribers of encrypted columns will need the 
encrypted column encryption key (CEK) and the name of the column master key 
(CMK) as they exist on the publisher, but getting access to those is not automated 
as far as I can see. It doesn't come through automatically as part of a 
subscription, and publishers can't publish the pg_catalog tables where the 
keys are kept (because publishing system tables is not supported.)  Is it 
reasonable to make available the CEK and CMK to subscribers in an automated 
fashion, to facilitate setting up logical replication with less manual 
distribution of key information?  Is this already done, and I'm just not 
recognizing that you've done it?


Can we do anything about the attack vector wherein a malicious DBA simply 
copies the encrypted datum from one row to another?  Imagine the DBA Alice 
wants to murder a hospital patient Bob by removing the fact that Bob is deathly 
allergic to latex.  She cannot modify Bob's encrypted and authenticated 
record, but she can easily update Bob's record with the encrypted record of a 
different patient, Charlie.  Likewise, if she wants Bob to pay Charlie's bill, 
she can replace Charlie's encrypted credit card number with Bob's, and once Bob 
is dead, he won't dispute the charges.

An encrypted-and-authenticated column value should be connected with its row in 
some way that Alice cannot circumvent.  In the patch as you have it written, 
the client application can include row information in the patient record 
(specifically, the patient's name, ssn, etc) and verify when any patient record 
is retrieved that this information matches.  But that's hardly "transparent" to 
the client.  It's something all clients will have to do, and easy to forget to 
do in some code path.  Also, for encrypted fixed-width columns, it is not an 
option.  So it seems the client needs to "salt" (maybe not the right term for 
what I have in mind) the encryption with some relevant other columns, and 
that's something the libpq client would need to understand, and something the 
patch's syntax needs to support.  Something like:

CREATE TABLE patient_records (
-- Cryptographically connected to the encrypted record
patient_id  BIGINT NOT NULL,
patient_ssn CHAR(11),

-- The encrypted record
patient_record TEXT ENCRYPTED WITH (column_encryption_key = cek1,
column_encryption_salt = (patient_id, 
patient_ssn)),

-- Extra stuff, not cryptographically connected to anything
next_of_kin TEXT,
phone_number BIGINT,
...
);

I have not selected any algorithms that include such "salt"ing (again, maybe 
the wrong word) because I'm just trying to discuss the general feature, not get 
into the weeds about which cryptographic algorithm to select.
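
For what it's worth, the "salting" described above is essentially what AEAD
ciphers call associated data: bytes that are authenticated together with the
ciphertext but not encrypted, so a datum copied onto a different row fails
authentication on decryption. Purely as an illustration (this is not the patch's
scheme, and the function name is made up), a sketch using OpenSSL's AES-256-GCM:

#include <openssl/evp.h>

/*
 * Encrypt "plaintext" while binding it to "row_binding" (e.g. patient_id and
 * ssn) as GCM associated data.  Returns the ciphertext length, or -1 on
 * failure; "tag" must have room for the 16-byte authentication tag.
 */
static int
encrypt_with_row_binding(const unsigned char *key, const unsigned char *iv,
						 const unsigned char *plaintext, int pt_len,
						 const unsigned char *row_binding, int rb_len,
						 unsigned char *ciphertext, unsigned char *tag)
{
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
	int			len;
	int			ct_len = -1;

	if (ctx == NULL)
		return -1;

	if (EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv) == 1 &&
		/* associated data pass: authenticated, but not encrypted */
		EVP_EncryptUpdate(ctx, NULL, &len, row_binding, rb_len) == 1 &&
		EVP_EncryptUpdate(ctx, ciphertext, &len, plaintext, pt_len) == 1)
	{
		ct_len = len;
		if (EVP_EncryptFinal_ex(ctx, ciphertext + len, &len) == 1 &&
			EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag) == 1)
			ct_len += len;
		else
			ct_len = -1;
	}

	EVP_CIPHER_CTX_free(ctx);
	return ct_len;
}

On decryption, the same associated data has to be supplied (and the tag set with
EVP_CTRL_GCM_SET_TAG before the final call), so a ciphertext copied onto another
row fails verification against that row's identity.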

Thoughts?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: can while loop in ClockSweepTick function be kind of infinite loop in some cases?

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-11 01:25:06 +0900, 斯波隼斗 wrote:
> This question is about ClockSweepTick function and the code is below.
> https://github.com/postgres/postgres/blob/24d2b2680a8d0e01b30ce8a41c4eb3b47aca5031/src/backend/storage/buffer/freelist.c#L146-L165
> 
>  The value of expected, NBuffers, wrapped variable is fixed in the while
> loop, so that when the value of expected variable is not equal to
> StrategyControl->nextVictimBuffer, CAS operation fails and the while loop
> will be run kind-of infinitely.
> It is possible for this problem to occur when ClockSweepTick function is
> concurrently called and nextVictimBuffer is incremented by other process
> before CAS operation in the loop (ex: in this case, the value of expected
> variable is NBuffers+1 while the value of nextVictimBuffer variable is
> NBuffers+2. so CAS operation fails)
> I think. `expected = originalVictim + 1;` line should be in while loop
> (before acquiring spin lock) so that, even in the case above, expected
> variable is incremented for each loop and CAS operation will be successful
> at some point.
> Is my understanding correct? If so, I will send PR for fixing this issue.

Yes, I think your understanding might be correct. Interesting that this
apparently has never occurred.

Yes, please send a patch.

Thanks,

Andres




Re: Add 64-bit XIDs into PostgreSQL 15

2023-01-10 Thread Aleksander Alekseev
Hi Maxim,

> Anyway. Let's discuss on-disk page format, shall we?

Here are my two cents.

> AFAICS, we have a following options:
> [...]
> 2. Put special in every page where base for XIDs are stored. This is what we 
> have done in the current patch set.

The approach of using special space IMO is fine. I'm still a bit
sceptical about the need to introduce a new entity "64-bit base XID"
while we already have 32-bit XID epochs that will do the job. I
suspect that having fewer entities helps to reason about the code and
that it is important in the long run, but maybe it's just me. In any
case, I don't have a strong opinion here.

Additionally, I think we should be clear about the long term goals. As
Peter G. pointed out above:

> I think that we'll be able to get rid of freezing in a few years time.

IMO eventually getting rid of freezing and "magic" XIDs will simplify
the maintenance of the project and also make the behaviour of the
system much more predictable. The user will have to worry only about
the disk space reclamation.

If we have a consensus that this is the final goal then we should
definitely be moving toward 64-bit XIDs and perhaps even include a
corresponding PoC to the patchset. If we want to keep freezing
indefinitely then, as Chris Travers argued, 64-bit XIDs don't bring
that much value and maybe the community should be focusing on
something else.

-- 
Best regards,
Aleksander Alekseev




Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-10 Thread Nathan Bossart
On Tue, Jan 10, 2023 at 10:59:14AM +0530, Amit Kapila wrote:
> I haven't looked in detail but isn't it better to explain somewhere in
> the comments that it achieves to rate limit the restart of workers in
> case of error and allows them to restart immediately in case of
> subscription parameter change?

I expanded one of the existing comments to make this clear.

> Another minor point: Don't we need to set the launcher's latch after
> removing the entry from the hash table to avoid the launcher waiting
> on the latch for a bit longer?

The launcher's latch should be set when the apply worker exits.  The apply
worker's notify_pid is set to the launcher's PID, which means the launcher
will be sent SIGUSR1 when the worker exits.  The launcher's SIGUSR1 handler sets its
latch.

Of course, if the launcher restarts, then the notify_pid will no longer be
accurate.  However, I see that workers also register a before_shmem_exit
callback that will send SIGUSR1 to the launcher_pid currently stored in
shared memory.  (I wonder if there is a memory ordering bug here.)

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..3cf8dfa11e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2006,6 +2006,16 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   Waiting to read or update information
about heavyweight locks.
  
+ 
+  LogicalRepLauncherDSA
+  Waiting for logical replication launcher dynamic shared memory
+  allocator access
+ 
+ 
+  LogicalRepLauncherHash
+  Waiting for logical replication launcher shared memory hash table
+  access
+ 
  
   LogicalRepWorker
   Waiting to read or update the state of logical replication
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index baff00dd74..468e5e0bf1 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1504,6 +1504,14 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	}
 	list_free(subworkers);
 
+	/*
+	 * Clear the last-start time for the apply worker to free up space.  If
+	 * this transaction rolls back, the launcher might restart the apply worker
+	 * before wal_retrieve_retry_interval milliseconds have elapsed, but that's
+	 * probably okay.
+	 */
+	logicalrep_launcher_delete_last_start_time(subid);
+
 	/*
 	 * Cleanup of tablesync replication origins.
 	 *
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index afb7acddaa..ecf239e4c2 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -25,6 +25,7 @@
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
 #include "funcapi.h"
+#include "lib/dshash.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -57,6 +58,25 @@ int			max_logical_replication_workers = 4;
 int			max_sync_workers_per_subscription = 2;
 int			max_parallel_apply_workers_per_subscription = 2;
 
+/* an entry in the last-start times hash table */
+typedef struct LauncherLastStartTimesEntry
+{
+	Oid		subid;
+	TimestampTz last_start_time;
+} LauncherLastStartTimesEntry;
+
+/* parameters for the last-start times hash table */
+static const dshash_parameters dsh_params = {
+	sizeof(Oid),
+	sizeof(LauncherLastStartTimesEntry),
+	dshash_memcmp,
+	dshash_memhash,
+	LWTRANCHE_LAUNCHER_HASH
+};
+
+static dsa_area *last_start_times_dsa = NULL;
+static dshash_table *last_start_times = NULL;
+
 LogicalRepWorker *MyLogicalRepWorker = NULL;
 
 typedef struct LogicalRepCtxStruct
@@ -64,6 +84,10 @@ typedef struct LogicalRepCtxStruct
 	/* Supervisor process. */
 	pid_t		launcher_pid;
 
+	/* hash table for last-start times */
+	dsa_handle	last_start_dsa;
+	dshash_table_handle last_start_dsh;
+
 	/* Background workers. */
 	LogicalRepWorker workers[FLEXIBLE_ARRAY_MEMBER];
 } LogicalRepCtxStruct;
@@ -76,6 +100,9 @@ static void logicalrep_worker_onexit(int code, Datum arg);
 static void logicalrep_worker_detach(void);
 static void logicalrep_worker_cleanup(LogicalRepWorker *worker);
 static int	logicalrep_pa_worker_count(Oid subid);
+static void logicalrep_launcher_attach_dshmem(void);
+static void logicalrep_launcher_set_last_start_time(Oid subid, TimestampTz start_time);
+static TimestampTz logicalrep_launcher_get_last_start_time(Oid subid);
 
 static bool on_commit_launcher_wakeup = false;
 
@@ -902,6 +929,9 @@ ApplyLauncherShmemInit(void)
 			memset(worker, 0, sizeof(LogicalRepWorker));
 			SpinLockInit(&worker->relmutex);
 		}
+
+		LogicalRepCtx->last_start_dsa = DSM_HANDLE_INVALID;
+		LogicalRepCtx->last_start_dsh = DSM_HANDLE_INVALID;
 	}
 }
 
@@ -947,8 +977,6 @@ ApplyLauncherWakeup(void)
 void
 ApplyLauncherMain(Datum main_arg)
 {
-	TimestampTz last_start_time = 0;
-
 	ereport(DEBUG1,
 			(e

Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-10 Thread Ted Yu
On Tue, Jan 10, 2023 at 9:26 AM Ted Yu  wrote:

>
>
> On Tue, Jan 10, 2023 at 9:25 AM Ted Yu  wrote:
>
>> Hi,
>> I was reading src/backend/replication/logical/applyparallelworker.c .
>> In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I
>> think the `ParallelApplyTxnHash` should be released.
>>
>> Please see the patch.
>>
>> Thanks
>>
> Here is the patch :-)
>

In `pa_process_spooled_messages_if_required`, the `pa_unlock_stream` call
immediately follows `pa_lock_stream`.
I assume the following is the intended sequence of calls. If this is the
case, I can add it to the patch.

Cheers

diff --git a/src/backend/replication/logical/applyparallelworker.c
b/src/backend/replication/logical/applyparallelworker.c
index 2e5914d5d9..9879b3fff2 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -684,9 +684,9 @@ pa_process_spooled_messages_if_required(void)
 if (fileset_state == FS_SERIALIZE_IN_PROGRESS)
 {
 pa_lock_stream(MyParallelShared->xid, AccessShareLock);
-pa_unlock_stream(MyParallelShared->xid, AccessShareLock);

 fileset_state = pa_get_fileset_state();
+pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
 }

 /*


Re: fixing CREATEROLE

2023-01-10 Thread Robert Haas
On Thu, Jan 5, 2023 at 2:53 PM Robert Haas  wrote:
> On Tue, Jan 3, 2023 at 3:11 PM Robert Haas  wrote:
> > Committed and back-patched 0001 with fixes for the issues that you pointed 
> > out.
> >
> > Here's a trivial rebase of the rest of the patch set.
>
> I committed 0001 and 0002 after improving the commit messages a bit.
> Here's the remaining two patches back. I've done a bit more polishing
> of these as well, specifically in terms of fleshing out the regression
> tests. I'd like to move forward with these soon, if nobody's too
> vehemently opposed to that.

Done now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)

2023-01-10 Thread Nathan Bossart
On Tue, Jan 10, 2023 at 02:57:43AM -0500, Tom Lane wrote:
> Michael Paquier  writes:
>> Any objections about getting 947789f applied to REL_13_STABLE and
>> REL_12_STABLE and see this issue completely gone from all the versions
>> supported?
> 
> No objections to back-patching the fix...

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Handle infinite recursion in logical replication setup

2023-01-10 Thread Jonathan S. Katz

On 1/10/23 10:17 AM, Amit Kapila wrote:

On Tue, Jan 10, 2023 at 8:13 AM Jonathan S. Katz  wrote:



This consistently created the deadlock in my testing.

Discussing with Masahiko off-list, this is due to a deadlock from 4
processes: the walsenders on A and B, and the apply workers on A and B.
The walsenders are waiting for feedback from the apply workers, and the
apply workers are waiting for the walsenders to synchronize (I may be
oversimplifying).

He suggested I try the above example instead with `synchronous_commit`
set to `local`. In this case, I verified that there is no more deadlock,
but he informed me that we would not be able to use cascading
synchronous replication when "origin=none".



This has nothing to do with the origin feature. I mean this should
happen with origin=any or even in PG15 without using origin at all.
Am, I missing something? One related point to note is that in physical
replication cascading replication is asynchronous. See docs [1]
(Cascading replication is currently asynchronous)


This is not directly related to the origin feature, but the origin 
feature makes it apparent. There is a common use-case where people want 
to replicate data synchronously between two tables, and this is enabled 
by the "origin" feature.


To be clear, my bigger concern is that it's fairly easy for users to 
create a deadlock situation based on how they set "synchronous_commit" 
with the origin feature -- this is the main reason why I brought it up. 
I'm less concerned about the "cascading" case, though I want to try out 
sync rep between 3 instances to see what happens.



If we decide that this is a documentation issue, I'd suggest we improve
the guidance around using `synchronous_commit`[1] on the CREATE
SUBSCRIPTION page, as the GUC page[2] warns against using `local`:



Yeah, but on Create Subscription page, we have mentioned that it is
safe to use off for logical replication.


While I think you can infer that it's "safe" for synchronous 
replication, I don't think it's clear.


We say it's "safe to use `off` for logical replication", but provide a 
lengthy explanation around synchronous logical replication that 
concludes that it's "advantageous to set synchronous_commit to local or 
higher" but does not address safety. The first line in the explanation 
of the parameter links to the `synchronous_commit` GUC which 
specifically advises against using "local" for synchronous replication.


Additionally, because we say "local" or higher, we increase the risk of 
the aforementioned deadlock in HEAD with the origin feature.


I know I'm hammering on this point a bit, but it feels like this is 
relatively easy to misconfigure with the upcoming "origin" change (I did 
so myself from reading the devel docs) and we should ensure we guide our 
users appropriately.



One can use local or higher
to reduce the latency of COMMIT when synchronous replication is
used on the publisher. Won't using 'local' while creating the subscription
suffice to replicate the data consistently? I mean, it is equivalent
to somebody using levels greater than local in the case of
physical replication. In physical replication, we
won't wait for the standby to replicate to another node before sending a
response, so why wait in the case of logical replication? If this
understanding is correct, then it is probably sufficient to support
'local' for a subscription.


I think this is a good explanation. We should incorporate some version 
of this into the docs, and add some clarity around the use of 
`synchronous_commit` option in `CREATE SUBSCRIPTION` in particular with 
the origin feature. I can make an attempt at this.


Perhaps another question: based on this, should we disallow setting 
values of `synchronous_commit` greater than `local` when using 
"origin=none"? That might be too strict, but maybe we should warn around 
the risk of deadlock either in the logs or in the docs.


Thanks,

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature


Re: can while loop in ClockSweepTick function be kind of infinite loop in some cases?

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 12:40 PM Andres Freund  wrote:
> > I think. `expected = originalVictim + 1;` line should be in while loop
> > (before acquiring spin lock) so that, even in the case above, expected
> > variable is incremented for each loop and CAS operation will be successful
> > at some point.
> > Is my understanding correct? If so, I will send PR for fixing this issue.
>
> Yes, I think your understanding might be correct. Interesting that this
> apparently has never occurred.

Doesn't pg_atomic_compare_exchange_u32 set expected if it fails?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-10 Thread Ted Yu
On Tue, Jan 10, 2023 at 9:43 AM Ted Yu  wrote:

>
>
> On Tue, Jan 10, 2023 at 9:26 AM Ted Yu  wrote:
>
>>
>>
>> On Tue, Jan 10, 2023 at 9:25 AM Ted Yu  wrote:
>>
>>> Hi,
>>> I was reading src/backend/replication/logical/applyparallelworker.c .
>>> In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I
>>> think the `ParallelApplyTxnHash` should be released.
>>>
>>> Please see the patch.
>>>
>>> Thanks
>>>
>> Here is the patch :-)
>>
>
> In `pa_process_spooled_messages_if_required`, the `pa_unlock_stream` call
> immediately follows `pa_lock_stream`.
> I assume the following is the intended sequence of calls. If this is the
> case, I can add it to the patch.
>
> Cheers
>
> diff --git a/src/backend/replication/logical/applyparallelworker.c
> b/src/backend/replication/logical/applyparallelworker.c
> index 2e5914d5d9..9879b3fff2 100644
> --- a/src/backend/replication/logical/applyparallelworker.c
> +++ b/src/backend/replication/logical/applyparallelworker.c
> @@ -684,9 +684,9 @@ pa_process_spooled_messages_if_required(void)
>  if (fileset_state == FS_SERIALIZE_IN_PROGRESS)
>  {
>  pa_lock_stream(MyParallelShared->xid, AccessShareLock);
> -pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
>
>  fileset_state = pa_get_fileset_state();
> +pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
>  }
>
>  /*
>
Looking closer at the comment above this code and other parts of the file,
it seems the order is intentional.

Please disregard my email about `pa_process_spooled_messages_if_required`.


Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Robert Haas
On Mon, Jan 9, 2023 at 5:59 PM Peter Geoghegan  wrote:
> On Mon, Jan 9, 2023 at 12:51 PM Robert Haas  wrote:
> > I feel that you should at least have a reproducer for these problems
> > posted to the thread, and ideally a regression test, before committing
> > things. I think it's very hard to understand what the problems are
> > right now.
>
> Hard to understand relative to what, exactly? We're talking about a
> very subtle race condition here.
>
> I'll try to come up with a reproducer, but I *utterly* reject your
> assertion that it's a hard requirement, sight unseen. Why should those
> be the parameters of the discussion?
>
> For one thing I'm quite confident that I'm right, with or without a
> reproducer. And my argument isn't all that hard to follow, if you have
> relevant expertise, and actually take the time.

Look, I don't want to spend time arguing about what seem to me to be
basic principles of good software engineering. When I don't put test
cases into my patches, people complain at me and tell me that I'm a
bad software engineer because I didn't include test cases. Your
argument here seems to be that you're such a good software engineer
that you don't need any test cases to know what the bug is or that
you've fixed it correctly. That seems like a surprising argument, but
even if it's true, test cases can have considerable value to future
code authors, because it allows them to avoid reintroducing bugs that
have previously been fixed. In my opinion, it's not worth trying to
have automated test cases for absolutely every bug we fix, because
many of them would be really hard to develop and executing all of them
every time we do anything would be unduly time-consuming. But I can't
remember the last time before this that someone wanted to commit a
patch for a data corruption issue without even providing a test case
that other people can run manually. If you think that is or ought to
be standard practice, I can only say that I disagree.

I don't particularly appreciate the implication that I either lack
relevant or expertise or don't actually take time, either. I spent an
hour yesterday looking at your patches yesterday and didn't feel I was
very close to understanding 0002 in that time. I feel that if the
patches were better-written, with relevant comments and test cases and
really good commit messages and a lack of extraneous changes, I
believe I probably would have gotten a lot further in the same amount
of time. There is certainly an alternate explanation, which is that I
am stupid. I'm inclined to think that's not the correct explanation,
but most stupid people believe that they aren't, so that doesn't
really prove anything.

> But even this is
> unlikely to matter much. Even if I somehow turn out to have been
> completely wrong about the race condition, it is still self-evident
> that the problem of uselessly WAL logging non-changes to the VM
> exists. That doesn't require any concurrent access at all. It's a
> natural consequence of calling visibilitymap_set() with
> VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code
> for 2 minutes to see it.

Apparently not, because I spent more time than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: can while loop in ClockSweepTick function be kind of infinite loop in some cases?

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-10 13:11:35 -0500, Robert Haas wrote:
> On Tue, Jan 10, 2023 at 12:40 PM Andres Freund  wrote:
> > > I think. `expected = originalVictim + 1;` line should be in while loop
> > > (before acquiring spin lock) so that, even in the case above, expected
> > > variable is incremented for each loop and CAS operation will be successful
> > > at some point.
> > > Is my understanding correct? If so, I will send PR for fixing this issue.
> >
> > Yes, I think your understanding might be correct. Interesting that this
> > apparently has never occurred.
>
> Doesn't pg_atomic_compare_exchange_u32 set expected if it fails?

Indeed, so there's no problem.
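
For anyone else reading along: the loop in ClockSweepTick() can't spin forever
because of the compare-and-exchange contract itself -- on failure the current
value is written back into "expected". A minimal illustration of the pattern
(not the actual freelist.c code):

static void
cas_retry_example(pg_atomic_uint32 *var, uint32 newval)
{
	uint32		expected = pg_atomic_read_u32(var);

	/*
	 * On failure, pg_atomic_compare_exchange_u32() overwrites "expected"
	 * with the value it found, so each retry uses an up-to-date expectation.
	 */
	while (!pg_atomic_compare_exchange_u32(var, &expected, newval))
		;
}

Each failed attempt therefore observes progress made by some other backend,
which is why the retry loop is harmless.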

I wonder if we should make ->nextVictimBuffer a 64bit atomic. At the time the
changes went in we didn't (or rather, couldn't) rely on it, but these days we
could.  I think with a 64bit number we could get rid of ->completePasses and
just infer it from ->nextVictimBuffer/NBuffers, removing the need for the
spinlock.  It's very unlikely that a 64bit counter would ever wrap, and even
if it did, it'd just be a small inaccuracy in the assumed number of passes.
OTOH, it's doubtful the overflow handling / the spinlock matters
performance-wise.
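
For what it's worth, the arithmetic with a 64bit counter would be trivial,
something like this (nextVictimBuffer64 is hypothetical, nothing like it
exists today):

static void
clock_sweep_position(uint32 *victim, uint32 *complete_passes)
{
	/* a single atomic read would replace the spinlock-protected bookkeeping */
	uint64		counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer64);

	*victim = (uint32) (counter % (uint64) NBuffers);
	*complete_passes = (uint32) (counter / (uint64) NBuffers);
}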

Greetings,

Andres Freund




Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-10 02:57:43 -0500, Tom Lane wrote:
> No objections to back-patching the fix, but I wonder if we can find
> some mechanical way to prevent this sort of error in future.

What about a define that forces external toasting very aggressively for
catalog tables, iff they have a toast table? I suspect doing so for
non-catalog tables as well would trigger test changes. Running a buildfarm
animal with that would at least make issues like this much easier to discover.
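
Purely as illustration, I'm thinking of something as small as this, right
after maxDataLen is computed in heap_toast_insert_or_update()
(FORCE_CATALOG_EXTERNAL_TOAST is a made-up name):

#ifdef FORCE_CATALOG_EXTERNAL_TOAST
	/* toast nearly every varlena in catalogs that have a toast table */
	if (IsCatalogRelation(rel) && OidIsValid(rel->rd_rel->reltoastrelid))
		maxDataLen = 32;
#endif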

Greetings,

Andres Freund




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
> On Mon, 9 Jan 2023 at 20:34, Andres Freund  wrote:
> > On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote:
> > > Wouldn't it be enough to only fix the constructions in
> > > FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
> > > not occur with the patch), and (optionally) bump the first XID
> > > available for any cluster to (FirstNormalXid + 1) to retain the 'older
> > > than any running transaction' property?
> >
> > It's not too hard to fix in individual places, but I suspect that we'll
> > introduce the bug in future places without some more fundamental protection.
> >
> > Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable 
> > value
> > in ComputeXidHorizons() and GetSnapshotData().
> 
> I don't think that clamping the value with oldestXid (as seen in patch
> 0001, in GetSnapshotData) is right.

I agree that using oldestXid to clamp is problematic.


> It would clamp the value relative to the oldest frozen xid of all
> databases, which can be millions of transactions behind oldestXmin,
> and thus severely skew the amount of transaction's changes you keep on
> disk (that is, until oldestXid moves past 1000_000).

What precisely do you mean with "skew" here? Do you just mean that it'd take a
long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
you might mean more than that?


I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
seems like a mighty invasive change to backpatch.


Greetings,

Andres Freund




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-10 Thread Ankit Kumar Pandey




On 10/01/23 10:53, David Rowley wrote:



> the total cost is the same for both of these, but the execution time
> seems to vary quite a bit.

This is really weird. I tried it different ways (to rule out any issues
due to caching) and execution time varied in spite of having the same cost.


> Maybe we should try and do this for DISTINCT queries if the
> distinct_pathkeys match the orderby_pathkeys. That seems a little less
> copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
> likely that the ORDER BY might opt to use the Unique path for DISTINCT
> since it'll already have the correct pathkeys.
>
> However, if the ORDER BY has fewer columns then it might be cheaper to
> Hash Aggregate and then sort all over again, especially so when the
> DISTINCT removes a large proportion of the rows.


Aren't the ORDER BY pathkeys always fewer than the DISTINCT pathkeys?

distinct pathkeys = order by pathkeys + window function pathkeys

Again, I take your point that it is okay to push down the ORDER BY clause
if DISTINCT is doing a Unique sort. But the problem (at least from what I
am seeing) is that DISTINCT does not care about the pushed-down sort keys;
it chooses HashAggregate or Unique based on some other logic (mentioned
below).

Consider the following (with the distinct-clause restriction removed):

if (parse->distinctClause)
{
	List	   *distinct_pathkeys = make_pathkeys_for_sortclauses(root,
											parse->distinctClause,
											root->processed_tlist);

	/* PATHKEYS_BETTER1: distinct keys are stronger than the order by keys */
	if (compare_pathkeys(distinct_pathkeys, orderby_pathkeys) != PATHKEYS_BETTER1)
		skip = true;	/* used to skip the order by pushdown */
}

CASE #1:

explain (costs off) select distinct a,b, min(a) over (partition by a), sum (a) 
over (partition by a) from abcd order by a,b;
QUERY PLAN
---
 Sort
   Sort Key: a, b
   ->  HashAggregate
 Group Key: a, b, min(a) OVER (?), sum(a) OVER (?)
 ->  WindowAgg
   ->  Sort
 Sort Key: a, b
 ->  Seq Scan on abcd
(8 rows)

explain (costs off) select distinct a,b,c, min(a) over (partition by a), sum 
(a) over (partition by a) from abcd order by a,b,c;
  QUERY PLAN
--
 Sort
   Sort Key: a, b, c
   ->  HashAggregate
 Group Key: a, b, c, min(a) OVER (?), sum(a) OVER (?)
 ->  WindowAgg
   ->  Sort
 Sort Key: a, b, c
 ->  Seq Scan on abcd
(8 rows)

No matter how many columns are pushed down, it does hashagg.

On the other hand:

CASE #2:

EXPLAIN (costs off) SELECT DISTINCT depname, empno, min(salary) OVER (PARTITION 
BY depname) depminsalary,sum(salary) OVER (PARTITION BY depname) depsalary
FROM empsalary
ORDER BY depname, empno;
QUERY PLAN
--
 Unique
   ->  Sort
 Sort Key: depname, empno, (min(salary) OVER (?)), (sum(salary) OVER 
(?))
 ->  WindowAgg
   ->  Sort
 Sort Key: depname, empno
 ->  Seq Scan on empsalary
(7 rows)

EXPLAIN (costs off) SELECT DISTINCT depname, empno, enroll_date, min(salary) 
OVER (PARTITION BY depname) depminsalary,sum(salary) OVER (PARTITION BY 
depname) depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date;
  QUERY PLAN
---
 Unique
   ->  Sort
 Sort Key: depname, empno, enroll_date, (min(salary) OVER (?)), 
(sum(salary) OVER (?))
 ->  WindowAgg
   ->  Sort
 Sort Key: depname, empno, enroll_date
 ->  Seq Scan on empsalary
(7 rows)

It keeps doing Unique.

In both of these cases, compare_pathkeys(distinct_pathkeys,
orderby_pathkeys) returns 1.

Looking a bit further, the planner is choosing things correctly:
I could see the cost of Unique being higher in the 1st case and lower in
the 2nd case.

But the point is, if the sort for the ORDER BY is pushed down, shouldn't
there be some discount on the cost of the Unique sort (so that there is a
better chance of it being favored over HashAggregate in certain cases)?

Again, the cost of the Unique node is taken to be the cost of the Sort node
as-is, whereas for HashAggregate a new cost is computed. If we do an
incremental sort here (for the Unique node), since we have pushed down the
ORDER BY, the Unique cost could be reduced and our optimization could be
made worthwhile (I assume this is what you intended here) in the DISTINCT
case.


Eg:

EXPLAIN SELECT DISTINCT depname, empno, enroll_date, min(salary) OVER 
(PARTITION BY depname) depminsalary,sum(salary) OVER (PARTITION BY depname) 
depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date;
  QUERY PLAN
--

Re: psql: Add role's membership options to the \du+ command

2023-01-10 Thread Pavel Luzanov

Added the patch to the open commitfest:
https://commitfest.postgresql.org/42/4116/

Feel free to reject if it's not interesting.

--
Pavel Luzanov


Re: Use windows VMs instead of windows containers on the CI

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-10 09:22:12 -0600, Justin Pryzby wrote:
> > There is more than 2x speed gain when VMs are used.
> 
> One consideration is that if windows runs twice as fast, we'll suddenly
> start using twice as many resources at cirrus/google/amazon - the
> windows task has been throttling everything else.  Not sure if we should
> to do anything beyond the limits that cfbot already uses.

I'm not sure we would. cfbot has a time-based limit for how often it tries to
rebuild entries, and I think we were just about keeping up with that. In which
case we shouldn't, on average, schedule more jobs than we currently
do. Although peak "job throughput" would be higher.

Thomas?

Greetings,

Andres Freund




Re: Show various offset arrays for heap WAL records

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-09 19:59:42 -0800, Peter Geoghegan wrote:
> On Mon, Jan 9, 2023 at 1:58 PM Andres Freund  wrote:
> > A couple times when investigating data corruption issues, the last time just
> > yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
> > records. As that's probably not just me, I think we should make that change
> > in-tree.
> 
> I remember how useful this was when we were investigating that early
> bug in 14, that turned out to be in parallel VACUUM. So I'm all in
> favor of it.

Cool.


> > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> > XLOG_HEAP2_FREEZE_PAGE.
> 
> I'm bound to end up doing the same in index access methods. Might make
> sense for the utility routines to live somewhere more centralized, at
> least when code reuse is likely. Practically every index AM has WAL
> records that include a sorted page offset number array, just like
> these ones. It's a very standard thing, obviously.

Hm, there doesn't seem to be a great location for them today. I guess we could
add something like src/include/access/rmgrdesc_utils.h? And put the
implementation in src/backend/access/rmgrdesc/rmgrdesc_utils.c? I first was
thinking of just rmgrdesc.[ch], but custom rmgrs added
src/bin/pg_waldump/rmgrdesc.[ch] ...
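
The helpers themselves would be tiny, roughly something like this (name and
location are just a guess at this point):

/* e.g. in src/backend/access/rmgrdesc/rmgrdesc_utils.c */
void
offset_array_desc(StringInfo buf, OffsetNumber *offsets, int noffsets)
{
	appendStringInfo(buf, "noffsets: %d, offsets:", noffsets);
	for (int i = 0; i < noffsets; i++)
		appendStringInfo(buf, " %u", (unsigned int) offsets[i]);
}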


> > I chose to include infomask[2] for the different freeze plans mainly because
> > it looks odd to see different plans without a visible reason. But I'm not 
> > sure
> > that's the right choice.
> 
> I don't think that it is particularly necessary to do so in order for
> the output to make sense -- pg_waldump is inherently a tool for
> experts. What it comes down to for me is whether or not this
> information is sufficiently useful to display, and/or can be (or needs
> to be) controlled via some kind of verbosity knob.

It seemed useful enough to me, but I likely also stare more at this stuff than
most. Compared to the list of offsets it's not that much content.


> How hard would it be to invent a general mechanism to control the verbosity
> of what we'll show for each WAL record?

Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
void(*rm_desc) (StringInfo buf, XLogReaderState *record);

so we'd need to patch all of them. That might be worth doing at some point,
but I don't want to tackle it right now.

Greetings,

Andres Freund




Re: RFC: logical publication via inheritance root?

2023-01-10 Thread Jacob Champion
On Mon, Jan 9, 2023 at 12:41 AM Aleksander Alekseev
 wrote:
> I would like to point out that we shouldn't necessarily support
> multiple inheritance in all the possible cases, at least not in the
> first implementation. Supporting simple cases of inheritance would be
> already a valuable feature even if it will have certain limitations.

I agree. What I'm trying to avoid is the case where replication works
nicely for a table until someone performs an ALTER TABLE ... [NO]
INHERIT, and then Something Bad happens because we can't support the
new edge case. If every inheritance tree is automatically opted into
this new publication behavior, I think it'll be easier to hit that by
accident, making the whole thing feel brittle.

By contrast, if we have to opt specific tables into this feature by
marking them in the catalog, then not only will it be harder to hit by
accident (because we can document the requirements for the marker
function, and then it's up to the callers/extension authors/DBAs to
maintain those requirements), but we even have the chance to bail out
during an inheritance change if we see that the table is marked in
this way.

Two general pieces of progress to report:

1) I'm playing around with a marker in pg_inherits, where the inhseqno
is set to a sentinel value (0) for an inheritance relationship that
has been marked for logical publication. The intent is that the
pg_inherits helpers will prevent further inheritance relationships
when they see that marker, and reusing inhseqno means we can make use
of the existing index to do the lookups. An example:

=# CREATE TABLE root (a int);
=# CREATE TABLE root_p1 () INHERITS (root);
=# SELECT pg_set_logical_root('root_p1', 'root');

and then any data written to root_p1 gets replicated via root instead,
if publish_via_partition_root = true. If root_p1 is set up with extra
columns, they'll be omitted from replication.

2) While this strategy works well for ongoing replication, it's not
enough to get the initial synchronization correct. The subscriber
still does a COPY of the root table directly, missing out on all the
logical descendant data. The publisher will have to tell the
subscriber about the relationship somehow, and older subscriber
versions won't understand how to use that (similar to how old
subscribers can't correctly handle row filters).
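
Going back to (1) for a moment, here's a rough sketch of how a pg_inherits
helper could detect the proposed marker. The function name is made up, but
the catalog accesses are the standard ones:

static bool
has_logical_root_marker(Oid relid)
{
	Relation	catalog = table_open(InheritsRelationId, AccessShareLock);
	ScanKeyData key;
	SysScanDesc scan;
	HeapTuple	tuple;
	bool		result = false;

	ScanKeyInit(&key, Anum_pg_inherits_inhrelid,
				BTEqualStrategyNumber, F_OIDEQ, ObjectIdGetDatum(relid));
	scan = systable_beginscan(catalog, InheritsRelidSeqnoIndexId, true,
							  NULL, 1, &key);

	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
	{
		Form_pg_inherits form = (Form_pg_inherits) GETSTRUCT(tuple);

		/* inhseqno == 0 is the proposed "logical root" sentinel */
		if (form->inhseqno == 0)
		{
			result = true;
			break;
		}
	}

	systable_endscan(scan);
	table_close(catalog, AccessShareLock);

	return result;
}

The lookup pattern mirrors what the existing pg_inherits code already does,
which is why reusing inhseqno (and its index) is attractive.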

--Jacob




Re: Transparent column encryption

2023-01-10 Thread Mark Dilger



> On Jan 10, 2023, at 9:26 AM, Mark Dilger  wrote:
> 
>-- Cryptographically connected to the encrypted record
>patient_id  BIGINT NOT NULL,
>patient_ssn CHAR(11),
> 
>-- The encrypted record
>patient_record TEXT ENCRYPTED WITH (column_encryption_key = cek1,
>column_encryption_salt = (patient_id, 
> patient_ssn)),

As you mention upthread, tying columns together creates problems for statements 
that only operate on a subset of columns.  Allowing schema designers a choice 
about tying the encrypted column to zero or more other columns allows them to 
choose which works best for their security needs.

The example above would make a statement like "UPDATE patient_record SET 
patient_record = $1 \bind '{some json whatever}'" raise an exception at the 
libpq client level, but maybe that's what schema designers want it to do.  If 
not, they should omit the column_encryption_salt option in the create table 
statement; but if so, they should expect to have to specify the other columns 
as part of the update statement, possibly as part of the where clause, like

UPDATE patient_record
SET patient_record = $1
WHERE patient_id = 12345
  AND patient_ssn = '111-11-' 
\bind '{some json record}'

and have libpq get the salt column values from the where clause (which may 
be tricky to implement), or perhaps use some new bind syntax like

UPDATE patient_record
SET patient_record = ($1:$2,$3)   -- new, wonky syntax
WHERE patient_id = $2
  AND patient_ssn = $3 
\bind '{some json record}' 12345 '111-11-'

which would be error prone, since the sql statement could specify the 
($1:$2,$3) inconsistently with the where clause, or perhaps specify the "new" 
salt columns even when not changed, like

UPDATE patient_record
SET patient_record = $1, patient_id = 12345, patient_ssn = 
"111-11-"
WHERE patient_id = 12345
  AND patient_ssn = "111-11-"
\bind '{some json record}'

which looks kind of nuts at first glance, but is grammatically consistent with 
cases where one or both of the patient_id or patient_ssn are also being 
changed, like

UPDATE patient_record
SET patient_record = $1, patient_id = 98765, patient_ssn = 
"999-99-"
WHERE patient_id = 12345
  AND patient_ssn = "111-11-"
\bind '{some json record}'

Or, of course, you can ignore these suggestions or punt them to some future 
patch that extends the current work, rather than trying to get it all done in 
the first column encryption commit.  But it seems useful to think about what 
future directions would be, to avoid coding ourselves into a corner, making 
such future work harder.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 10:50 AM Robert Haas  wrote:
> Look, I don't want to spend time arguing about what seem to me to be
> basic principles of good software engineering. When I don't put test
> cases into my patches, people complain at me and tell me that I'm a
> bad software engineer because I didn't include test cases. Your
> argument here seems to be that you're such a good software engineer
> that you don't need any test cases to know what the bug is or that
> you've fixed it correctly.

That's not what I said. This is a straw man.

What I actually said was that there is no reason to declare up front
that the only circumstances under which a fix could be committed is
when a clean repro is available. I never said that a test case has
little or no value, and I certainly didn't assert that we definitely
don't need a test case to proceed with a commit -- since I am not in
the habit of presumptuously attaching conditions to such things well
in advance.

> I don't particularly appreciate the implication that I either lack
> relevant or expertise or don't actually take time, either.

The implication was only that you didn't take the time. Clearly you
have the expertise. Obviously you're very far from stupid.

I have been unable to reproduce the problem, and think it's possible
that the issue cannot be triggered in practice. Though only through
sheer luck. Here's why that is:

While pruning will remove aborted dead tuples, freezing will not
remove the xmax of an aborted update unless the XID happens to be <
OldestXmin. With my problem scenario, the page will be all_visible in
prunestate, but not all_frozen -- so it dodges the relevant
visibilitymap_set() call site.

That just leaves inserts that abort, I think. An aborted insert will
be totally undone by pruning, but that does still leave behind an
LP_DEAD item that needs to be vacuumed in the second heap pass. This
means that we can only set the page all-visible/all-frozen in the VM
in the second heap pass, which also dodges the relevant
visibilitymap_set() call site.

In summary, I think that there is currently no way that we can have
the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
the page all_frozen. It can happen and leave the page all_visible, but
not all_frozen, due to these very fine details. (Assuming I haven't
missed another path to the problem with aborted Multis or something,
but looks like I haven't.)

-- 
Peter Geoghegan




Re: logical decoding and replication of sequences, take 2

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra
 wrote:
> 0001 is a fix for the pre-existing issue in logicalmsg_decode,
> attempting to build a snapshot before getting into a consistent state.
> AFAICS this only affects assert-enabled builds and is otherwise
> harmless, because we are not actually using the snapshot (apply gets a
> valid snapshot from the transaction).
>
> This is mostly the fix I shared in November, except that I kept the call
> in decode.c (per comment from Andres). I haven't added any argument to
> SnapBuildProcessChange because we may need to backpatch this (and it
> didn't seem much simpler, IMHO).

I tend to associate transactional behavior with snapshots, so it looks
odd to see code that builds a snapshot only when the message is
non-transactional. I think that a more detailed comment spelling out
the reasoning would be useful here.

> This however brings me to the original question what's the purpose of
> this patch - and that's essentially keeping sequences up to date to make
> them usable after a failover. We can't generate values from the sequence
> on the subscriber, because it'd just get overwritten. And from this
> point of view, it's also fine that the sequence is slightly ahead,
> because that's what happens after crash recovery anyway. And we're not
> guaranteeing the sequences to be gap-less.

I agree that it's fine for the sequence to be slightly ahead, but I
think that it can't be too far ahead without causing problems. Suppose
for example that transaction #1 creates a sequence. Transaction #2
does nextval on the sequence a bunch of times and inserts rows into a
table using the sequence values as the PK. It's fine if the nextval
operations are replicated ahead of the commit of transaction #2 -- in
fact I'd say it's necessary for correctness -- but they can't precede
the commit of transaction #1, since then the sequence won't exist yet.
Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode,
I think that needs to act as a barrier: non-transactional changes that
happened before that transaction must also be replicated before that
transaction is replicated, and those that happened after that
transaction is replicated must be replayed after that transaction is
replicated. Otherwise, at the very least, there will be states visible
on the standby that were never visible on the origin server, and maybe
we'll just straight up get the wrong answer. For instance:

1. nextval, setting last_value to 3
2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19
3. nextval, setting last_value to 20

If 3 happens before 2, the sequence ends up in the wrong state.

Maybe you've already got this and similar cases totally correctly
handled, I'm not sure, just throwing it out there.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 11:47 AM Peter Geoghegan  wrote:
> In summary, I think that there is currently no way that we can have
> the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
> the page all_frozen. It can happen and leave the page all_visible, but
> not all_frozen, due to these very fine details. (Assuming I haven't
> missed another path to the problem with aborted Multis or something,
> but looks like I haven't.)

Actually, FreezeMultiXactId() can fully remove an xmax that has some
member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
possible, at least when no individual member is still running. Doesn't
have to involve transaction aborts at all.

Let me go try to break it that way...

-- 
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 2:48 PM Peter Geoghegan  wrote:
> What I actually said was that there is no reason to declare up front
> that the only circumstances under which a fix could be committed is
> when a clean repro is available. I never said that a test case has
> little or no value, and I certainly didn't assert that we definitely
> don't need a test case to proceed with a commit -- since I am not in
> the habit of presumptuously attaching conditions to such things well
> in advance.

I don't understand what distinction you're making. It seems like
hair-splitting to me. We should be able to reproduce problems like
this reliably, at least with the aid of a debugger and some
breakpoints, before we go changing the code. The risk of being wrong
is quite high because the code is subtle, and the consequences of
being wrong are potentially very bad because the code is critical to
data integrity. If the reproducer doesn't require a debugger or other
extreme contortions, then we should consider reducing it to a TAP test
that can be committed. If you agree with that, then I'm not sure what
your last email was complaining about. If you disagree, then I don't
know why.

> I have been unable to reproduce the problem, and think it's possible
> that the issue cannot be triggered in practice. Though only through
> sheer luck. Here's why that is:
>
> While pruning will remove aborted dead tuples, freezing will not
> remove the xmax of an aborted update unless the XID happens to be <
> OldestXmin. With my problem scenario, the page will be all_visible in
> prunestate, but not all_frozen -- so it dodges the relevant
> visibilitymap_set() call site.
>
> That just leaves inserts that abort, I think. An aborted insert will
> be totally undone by pruning, but that does still leave behind an
> LP_DEAD item that needs to be vacuumed in the second heap pass. This
> means that we can only set the page all-visible/all-frozen in the VM
> in the second heap pass, which also dodges the relevant
> visibilitymap_set() call site.

I guess I'm not very sure that this is sheer luck. It seems like we
could equally well suppose that the people who wrote the code
correctly understood the circumstances under which we needed to avoid
calling visibilitymap_set(), and wrote the code in a way that
accomplished that purpose. Maybe there's contrary evidence or maybe it
is actually broken somehow, but that's not currently clear to me.

For the purposes of clarifying my understanding, is this the code
you're principally worried about?

/*
 * If the all-visible page is all-frozen but not marked as such yet,
 * mark it as all-frozen.  Note that all_frozen is only valid if
 * all_visible is true, so we must check both prunestate fields.
 */
else if (all_visible_according_to_vm && prunestate.all_visible &&
 prunestate.all_frozen &&
 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
{
/*
 * We can pass InvalidTransactionId as the cutoff XID here,
 * because setting the all-frozen bit doesn't cause recovery
 * conflicts.
 */
visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
  vmbuffer, InvalidTransactionId,
  VISIBILITYMAP_ALL_FROZEN);
}

Or maybe this one?

if (PageIsAllVisible(page))
{
uint8   flags = 0;
uint8   vm_status = visibilitymap_get_status(vacrel->rel,
 blkno, vmbuffer);

/* Set the VM all-frozen bit to flag, if needed */
if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
flags |= VISIBILITYMAP_ALL_VISIBLE;
if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
flags |= VISIBILITYMAP_ALL_FROZEN;

Assert(BufferIsValid(*vmbuffer));
if (flags != 0)
visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
  *vmbuffer, visibility_cutoff_xid, flags);
}

These are the only two call sites in vacuumlazy.c where I can see
there being a theoretical risk of the kind of problem that you're
describing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: heapgettup refactoring

2023-01-10 Thread Melanie Plageman
On Thu, Jan 5, 2023 at 8:52 AM Peter Eisentraut
 wrote:
>
> Ok, let's look through these patches starting from the top then.
>
> v4-0001-Add-no-movement-scan-helper.patch
>
> This makes sense overall; there is clearly some duplicate code that can
> be unified.
>
> It appears that during your rebasing you have effectively reverted your
> earlier changes that have been committed as
> 8e1db29cdbbd218ab6ba53eea56624553c3bef8c.  You should undo that.

Thanks. I think I have addressed this.
I've attached a rebased v5.

> I don't understand the purpose of the noinline marker.  If it's not
> necessary to inline, we can just leave it off, but there is no need to
> outright prevent inlining AFAICT.
>

I have removed it.

> I don't know why you changed the if/else sequences.  Before, the
> sequence was effectively
>
> if (forward)
> {
>  ...
> }
> else if (backward)
> {
>  ...
> }
> else
> {
>  /* it's no movement */
> }
>
> Now it's changed to
>
> if (no movement)
> {
>  ...
>  return;
> }
>
> if (forward)
> {
>  ...
> }
> else
> {
>  Assert(backward);
>  ...
> }
>
> Sure, that's the same thing, but it looks less elegant to me.

In this commit, you could keep the original ordering of if statements. I
preferred no movement scan first because then backwards and forwards
scans' code is physically closer to the rest of the code without the
intrusion of the no movement scan code.

Ultimately, the refactor (in later patches) flips the ordering of if
statements at the top from
if (scan direction)
to
if (initial or continue)
and this isn't a very interesting distinction for no movement scans. By
dealing with no movement scan at the top, I didn't have to handle no
movement scans in the initial and continue branches in the new structure.

Also, I will note that patches 4-6 at least and perhaps 4-7 do not make
sense as separate commits. I separated them for ease of review. Each is
correct and passes tests but is not really an improvement without the
others.

- Melanie
From 9b51959c483812c30ca848b66a0e6e3a807ab03f Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 9 Jan 2023 17:33:43 -0500
Subject: [PATCH v5 3/7] Add heapgettup_initial_block() helper

---
 src/backend/access/heap/heapam.c | 212 ---
 1 file changed, 82 insertions(+), 130 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c80b547809..b9a1aac3ca 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -521,6 +521,57 @@ heapgettup_no_movement(HeapScanDesc scan)
 }
 
 
+static inline BlockNumber
+heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
+{
+	Assert(!ScanDirectionIsNoMovement(dir));
+	Assert(!scan->rs_inited);
+
+	/* return null immediately if relation is empty */
+	if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
+		return InvalidBlockNumber;
+
+	scan->rs_inited = true;
+
+	/* forward and serial */
+	if (ScanDirectionIsForward(dir) && scan->rs_base.rs_parallel == NULL)
+		return scan->rs_startblock;
+
+	/* forward and parallel */
+	if (ScanDirectionIsForward(dir))
+	{
+		table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+ scan->rs_parallelworkerdata,
+ (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+		return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+ scan->rs_parallelworkerdata,
+ (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+	}
+
+	/* backward parallel scan not supported */
+	Assert(scan->rs_base.rs_parallel == NULL);
+
+	/*
+	 * Disable reporting to syncscan logic in a backwards scan; it's not very
+	 * likely anyone else is doing the same thing at the same time, and much
+	 * more likely that we'll just bollix things for forward scanners.
+	 */
+	scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
+
+	/*
+	 * Start from last page of the scan.  Ensure we take into account
+	 * rs_numblocks if it's been adjusted by heap_setscanlimits().
+	 */
+	if (scan->rs_numblocks != InvalidBlockNumber)
+		return (scan->rs_startblock + scan->rs_numblocks - 1) % scan->rs_nblocks;
+
+	if (scan->rs_startblock > 0)
+		return scan->rs_startblock - 1;
+
+	return scan->rs_nblocks - 1;
+}
+
 /* 
  *		heapgettup - fetch next heap tuple
  *
@@ -574,41 +625,16 @@ heapgettup(HeapScanDesc scan,
 	{
 		if (!scan->rs_inited)
 		{
-			/*
-			 * return null immediately if relation is empty
-			 */
-			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
+			block = heapgettup_initial_block(scan, dir);
+
+			if (block == InvalidBlockNumber)
 			{
 Assert(!BufferIsValid(scan->rs_cbuf));
 tuple->t_data = NULL;
 return;
 			}
-			if (scan->rs_base.rs_parallel != NULL)
-			{
-ParallelBlockTableScanDesc pbscan =
-(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
-ParallelBlockTableScanWorker pbscanwork =
-scan->rs_parallelworkerdata;
-
-table_block_parallelscan_sta

Re: Add SHELL_EXIT_CODE to psql

2023-01-10 Thread Corey Huinker
On Tue, Jan 10, 2023 at 3:54 AM Maxim Orlov  wrote:

>
>
> On Mon, 9 Jan 2023 at 21:36, Corey Huinker 
> wrote:
>
>>
>> I chose a name that would avoid collisions with anything a user might
>> potentially throw into their environment, so if the var "OS" is fairly
>> standard is a reason to avoid using it. Also, going with our own env var
>> allows us to stay in perfect synchronization with the build's #ifdef WIN32
>> ... and whatever #ifdefs may come in the future for new OSes. If there is
>> already an environment variable that does that for us, I would rather use
>> that, but I haven't found it.
>>
>> Perhaps, I didn't make myself clear. Your solution is perfectly adapted
> to our needs.
> But all Windows since 2000 already have an environment variable
> OS=Windows_NT. So, if the env var OS is defined and equal to Windows_NT, it
> has to be Windows.
> May we use it in our case? I don't insist, just asking.
>

Ah, that makes more sense. I don't have a strong opinion on which we should
use, and I would probably defer to someone who does windows (and possibly
cygwin) builds more often than I do.


Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-10 Thread Matthias van de Meent
On Tue, 10 Jan 2023 at 20:14, Andres Freund  wrote:
>
> Hi,
>
> On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
> > On Mon, 9 Jan 2023 at 20:34, Andres Freund  wrote:
> > > It's not too hard to fix in individual places, but I suspect that we'll
> > > introduce the bug in future places without some more fundamental 
> > > protection.
> > >
> > > Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable 
> > > value
> > > in ComputeXidHorizons() and GetSnapshotData().
> >
> > I don't think that clamping the value with oldestXid (as seen in patch
> > 0001, in GetSnapshotData) is right.
>
> I agree that using oldestXid to clamp is problematic.
>
>
> > It would clamp the value relative to the oldest frozen xid of all
> > databases, which can be millions of transactions behind oldestXmin,
> > and thus severely skew the amount of transaction's changes you keep on
> > disk (that is, until oldestXid moves past 1000_000).
>
> What precisely do you mean with "skew" here? Do you just mean that it'd take a
> long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
> you might mean more than that?

h->oldest_considered_running can be extremely old due to the global
nature of the value and the potential existence of a snapshot in
another database that started in parallel to a very old running
transaction.

Example: With vacuum_defer_cleanup_age set to 100, it is possible
that a snapshot in another database (thus another backend) would
result in a local intermediate status result of h->o_c_r = 20,
h->s_o_n = 20, h->d_o_n = 10030. The clamped offset would then be 20
(clamped using h->o_c_r), which updates h->data_oldest_nonremovable to
10010. The obvious result is that all but the last 20 transactions
from this database's data files are available for cleanup, which
contradicts the intention of the vacuum_defer_cleanup_age GUC.
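
For illustration, an underflow-safe retreat only needs to clamp against the
start of the normal XID range (sketch only, not an actual patch):

static TransactionId
RetreatClampedXid(TransactionId xid, uint32 amount)
{
	Assert(TransactionIdIsNormal(xid));

	/* never retreat past the first normal XID, to avoid underflow */
	if (amount >= xid - FirstNormalTransactionId)
		return FirstNormalTransactionId;

	return xid - amount;
}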

> I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
> seems like a mighty invasive change to backpatch.

I'm not sure either. Protecting against underflow by halving the
effective valid value space is quite the intervention, but if it is
necessary to make this work in a performant manner, it would be worth
it. Maybe someone else with more experience can provide their opinion
here.

Kind regards,

Matthias van de Meent




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 12:19 PM Robert Haas  wrote:
> I don't understand what distinction you're making. It seems like
> hair-splitting to me. We should be able to reproduce problems like
> this reliably, at least with the aid of a debugger and some
> breakpoints, before we go changing the code.

So we can *never* change something defensively, on the basis of a
suspected or theoretical hazard, either in backbranches or just on
HEAD? Not under any circumstances, ever?

> The risk of being wrong
> is quite high because the code is subtle, and the consequences of
> being wrong are potentially very bad because the code is critical to
> data integrity. If the reproducer doesn't require a debugger or other
> extreme contortions, then we should consider reducing it to a TAP test
> that can be committed. If you agree with that, then I'm not sure what
> your last email was complaining about.

I was complaining about your prescribing conditions on proceeding with
a commit, based on an understanding of things that you yourself
acknowledged as incomplete. I cannot imagine how you read that as an
unwillingness to test the issue, especially given that I agreed to
work on that before you chimed in.

> > I have been unable to reproduce the problem, and think it's possible
> > that the issue cannot be triggered in practice. Though only through
> > sheer luck. Here's why that is:

> I guess I'm not very sure that this is sheer luck.

That's just my characterization. Other people can make up their own minds.

> For the purposes of clarifying my understanding, is this the code
> you're principally worried about?

> visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
>   vmbuffer, InvalidTransactionId,
>   VISIBILITYMAP_ALL_FROZEN);

Obviously I meant this call site, since it's the only one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE -- in vacuumlazy.c, and in general.

The other visibilitymap_set() callsite that you quoted is from the
second heap pass, where LP_DEAD items are vacuumed and become
LP_UNUSED items. That isn't buggy, but it is a silly approach, in that
it cares about what the visibility map says about the page being
all-visible, as if it might take a dissenting view that needs to be
taken into consideration (obviously we know what's going on with the
page because we just scanned it ourselves, and determined that it was
at least all-visible).

-- 
Peter Geoghegan




Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-01-10 Thread Sandro Santilli
On Mon, Jan 09, 2023 at 05:51:49PM -0500, Tom Lane wrote:

> Have you considered the idea of instead inventing a "\include" facility

[...]

> cases, you still need one script file for each supported upgrade step

That's exactly the problem we're trying to solve here.
The include support is nice on itself, but won't solve our problem.

--strk;




Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-01-10 Thread Jacob Champion
On Mon, Jan 9, 2023 at 7:40 AM Andrew Dunstan  wrote:
> I'm confused. A client cert might not have a hostname at all, and isn't
> used to verify the connecting address, but to verify the username. It
> needs to have a CN/DN equal to the user name of the connection, or that
> maps to that name via pg_ident.conf.

Right. But I don't know anything about the security model for using a
publicly issued server certificate as a client certificate. So if you
tell me that your only requirement is that the hostname/CN matches an
entry in your ident file, and that you don't need to verify that the
certificate identifying example.org is actually coming from
example.org, or do any sort of online revocation processing to help
mitigate the risks from that, or even handle wildcards or SANs in the
cert -- fine, but I don't know the right questions to ask to review
that case for safety or correctness. It'd be better to ask someone who
is already comfortable with it.

From my perspective, ssl_ca_file=system sure *looks* like something
reasonable for me to choose as a DBA, but I'm willing to guess it's
not actually reasonable for 99% of people. (If you get your pg_ident
rule wrong, for example, the number of people who can attack you is
scoped by the certificates issued by your CA... which for 'system'
would be the entire world.) By contrast I would have no problem
recommending sslrootcert=system as a default: a certificate can still
be misissued, but a would-be attacker would still have to get you to
connect to them. That model and its risks are, I think, generally well
understood by the community.

--Jacob




Re: allowing for control over SET ROLE

2023-01-10 Thread Jeff Davis
On Tue, 2023-01-10 at 11:45 -0500, Robert Haas wrote:
> So the risks, which in theory are all very similar, are in practice
> far greater in the PostgreSQL context, basically because our default
> setup is about 40 years behind the times in terms of implementing
> best
> practices.

I agree that huge improvements could be made with improvements to best
practices/defaults.

But there are some differences that are harder to fix that way. In
postgres, one can attach arbitrary code to pretty much anything, so you
need to trust everything you touch. There is no safe postgres
equivalent to grepping an untrusted file.


> It might be best to repost some of these ideas on a new thread with a
> relevant subject line, but I agree that there's some potential here.
> Your first idea reminds me a lot of the proposal Tom made in
> https://www.postgresql.org/message-id/19327.1533748...@sss.pgh.pa.us
> -- except that his mechanism is more general, since you can say whose
> code you trust and whose code you don't trust. Noah had a competing
> version of that patch, too. But we never settled on an approach. I
> still think something like this would be a good idea, and the fact
> that you've apparently-independently come up with a similar notion
> just reinforces that.

Will do, thank you for the reference.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Allow DISTINCT to use Incremental Sort

2023-01-10 Thread David Rowley
On Tue, 10 Jan 2023 at 16:07, Richard Guo  wrote:
> Sorry I didn't make myself clear.  I mean currently on HEAD in planner.c
> from line 4847 to line 4857, we have the code to make sure we always use
> the more rigorous clause for explicit-sort case.  I think this code is
> not necessary, because we have already done that in line 4791 to 4796,
> for both DISTINCT ON and standard DISTINCT.

Thanks for explaining.  I'm unsure if that code ever did anything
useful, but I agree that it does nothing now.

>> I've attached an updated patch
>
>
> The patch looks good to me.

Thanks for having another look. I've now pushed the patch.

David




Wasted Vacuum cycles when OldestXmin is not moving

2023-01-10 Thread sirisha chamarthi
Hi Hackers,

Vacuum is not able to clean up dead tuples when OldestXmin is not moving
(because of a long-running transaction or when hot_standby_feedback is
behind). Even though OldestXmin has not moved since the last time it checked,
vacuum keeps retrying every autovacuum_naptime, wasting CPU cycles and I/O
when pages are not in memory. Can we not bypass the dead tuple collection
and cleanup step until OldestXmin has advanced? The log below shows vacuum
running every minute.

2023-01-09 08:13:01.364 UTC [727219] LOG:  automatic vacuum of table
"postgres.public.t1": index scans: 0
pages: 0 removed, 6960 remain, 6960 scanned (100.00% of total)
tuples: 0 removed, 1572864 remain, 786432 are dead but not yet
removable
removable cutoff: 852, which was 2 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 13939 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.15 s, system: 0.00 s, elapsed: 0.29 s
2023-01-09 08:14:01.363 UTC [727289] LOG:  automatic vacuum of table
"postgres.public.t1": index scans: 0
pages: 0 removed, 6960 remain, 6960 scanned (100.00% of total)
tuples: 0 removed, 1572864 remain, 786432 are dead but not yet
removable
removable cutoff: 852, which was 2 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 13939 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.29 s
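
The kind of bypass I have in mind would be something like this near the start
of heap_vacuum_rel() (last_removable_cutoff is hypothetical; nothing tracks it
today):

	if (!aggressive &&
		TransactionIdIsValid(last_removable_cutoff) &&
		!TransactionIdFollows(OldestXmin, last_removable_cutoff))
	{
		/* the horizon has not advanced; nothing new can be removable yet */
		return;
	}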

Thanks,
Sirisha


Re: ATTACH PARTITION seems to ignore column generation status

2023-01-10 Thread Tom Lane
I wrote:
> Amit Langote  writes:
>> Thanks for the patch.  It looks good, though I guess you said that we
>> should also change the error message that CREATE TABLE ... PARTITION
>> OF emits to match the other cases while we're here.  I've attached a
>> delta patch.

> Thanks.  I hadn't touched that issue because I wasn't entirely sure
> which error message(s) you were unhappy with.  These changes look
> fine offhand.

So, after playing with that a bit ... removing the block in
parse_utilcmd.c allows you to do

regression=# CREATE TABLE gtest_parent (f1 date NOT NULL, f2 bigint, f3 bigint 
GENERATED ALWAYS AS (f2 * 2) STORED) PARTITION BY RANGE (f1);
CREATE TABLE
regression=# CREATE TABLE gtest_child PARTITION OF gtest_parent (
regression(#f3 WITH OPTIONS GENERATED ALWAYS AS (f2 * 3) STORED
regression(# ) FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');
CREATE TABLE
regression=# \d gtest_child
  Table "public.gtest_child"
 Column |  Type  | Collation | Nullable |   Default   
++---+--+-
 f1 | date   |   | not null | 
 f2 | bigint |   |  | 
 f3 | bigint |   |  | generated always as (f2 * 3) stored
Partition of: gtest_parent FOR VALUES FROM ('2016-07-01') TO ('2016-08-01')

regression=# insert into gtest_parent values('2016-07-01', 42);
INSERT 0 1
regression=# table gtest_parent;
 f1 | f2 | f3  
++-
 2016-07-01 | 42 | 126
(1 row)

That is, you can make a partition with a different generated expression
than the parent has.  Do we really want to allow that?  I think it works
as far as the backend is concerned, but it breaks pg_dump, which tries
to dump this state of affairs as

CREATE TABLE public.gtest_parent (
f1 date NOT NULL,
f2 bigint,
f3 bigint GENERATED ALWAYS AS ((f2 * 2)) STORED
)
PARTITION BY RANGE (f1);

CREATE TABLE public.gtest_child (
f1 date NOT NULL,
f2 bigint,
f3 bigint GENERATED ALWAYS AS ((f2 * 3)) STORED
);

ALTER TABLE ONLY public.gtest_parent ATTACH PARTITION public.gtest_child FOR 
VALUES FROM ('2016-07-01') TO ('2016-08-01');

and that fails at reload because the ATTACH PARTITION code path
checks for equivalence of the generation expressions.

This different-generated-expression situation isn't really morally
different from different ordinary DEFAULT expressions, which we
do endeavor to support.  So maybe we should deem this a supported
case and remove ATTACH PARTITION's insistence that the generation
expressions match ... which I think would be a good thing anyway,
because that check-for-same-reverse-compiled-expression business
is pretty cheesy in itself.  AFAIK, 3f7836ff6 took care of the
only problem that the backend would have with this, and pg_dump
looks like it will work as long as the backend will take the
ATTACH command.

I also looked into making CREATE TABLE ... PARTITION OF reject
this case; but that's much harder than it sounds, because what we
have at the relevant point is a raw (unanalyzed) expression for
the child's generation expression but a cooked one for the
parent's, so there is no easy way to match the two.

In short, it's seeming like the rule for both partitioning and
traditional inheritance ought to be "a child column must have
the same generated-or-not property as its parent, but their
generation expressions need not be the same".  Thoughts?

regards, tom lane




Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-01-10 Thread Jacob Champion
On Mon, Jan 9, 2023 at 7:07 AM Jelte Fennema  wrote:
> I also took a closer look at the code, and the only comment I have is:
>
> > appendPQExpBuffer(&conn->errorMessage,
>
> These calls can all be replaced by the recently added libpq_append_conn_error

Argh, thanks for the catch. Fixed.

> Finally I tested this against a Postgres server I created on Azure and
> the new value works as expected. The only thing that I think would be
> good to change is the error message when sslmode=verify-full, and
> sslrootcert is not provided, but ~/.postgresql/root.crt is also not available.
> I think it would be good for the error to mention sslrootcert=system

Good idea. The wording I chose in v6 is

Either provide the file, use the system's trusted roots with
sslrootcert=system, or change sslmode to disable server certificate
verification.

What do you think?

Thanks!
--Jacob
1:  7618037c86 ! 1:  e7e2d43b18 libpq: add sslrootcert=system to use default CAs
@@ src/interfaces/libpq/fe-secure-openssl.c: initialize_SSL(PGconn *conn)
 +  {
 +  char   *err = SSLerrmessage(ERR_get_error());
 +
-+  appendPQExpBuffer(&conn->errorMessage,
-+libpq_gettext("could 
not load system root certificate paths: %s\n"),
-+err);
++  libpq_append_conn_error(conn, "could not load system 
root certificate paths: %s",
++  err);
 +  SSLerrfree(err);
 +  SSL_CTX_free(SSL_context);
 +  return -1;
@@ src/interfaces/libpq/fe-secure-openssl.c: initialize_SSL(PGconn *conn)
{
X509_STORE *cvstore;
  
+@@ src/interfaces/libpq/fe-secure-openssl.c: initialize_SSL(PGconn *conn)
+*/
+   if (fnbuf[0] == '\0')
+   libpq_append_conn_error(conn, "could not get 
home directory to locate root certificate file\n"
+- "Either 
provide the file or change sslmode to disable server certificate 
verification.");
++ "Either 
provide the file, use the system's trusted roots with sslrootcert=system, or 
change sslmode to disable server certificate verification.");
+   else
+   libpq_append_conn_error(conn, "root certificate 
file \"%s\" does not exist\n"
+- "Either 
provide the file or change sslmode to disable server certificate 
verification.", fnbuf);
++ "Either 
provide the file, use the system's trusted roots with sslrootcert=system, or 
change sslmode to disable server certificate verification.", fnbuf);
+   SSL_CTX_free(SSL_context);
+   return -1;
+   }
 
  ## src/test/ssl/ssl/server-cn-only+server_ca.crt (new) ##
 @@
2:  c392e1796e ! 2:  e4d9731e1e libpq: force sslmode=verify-full for system CAs
@@ src/interfaces/libpq/fe-connect.c: connectOptions2(PGconn *conn)
 +  && strcmp(conn->sslrootcert, "system") == 0)
 +  {
 +  conn->status = CONNECTION_BAD;
-+  appendPQExpBuffer(&conn->errorMessage,
-+libpq_gettext("sslrootcert 
value \"%s\" invalid when SSL support is not compiled in\n"),
-+conn->sslrootcert);
++  libpq_append_conn_error(conn, "sslrootcert value \"%s\" invalid 
when SSL support is not compiled in",
++  
conn->sslrootcert);
 +  return false;
 +  }
 +#endif
@@ src/interfaces/libpq/fe-connect.c: connectOptions2(PGconn *conn)
 +  && strcmp(conn->sslmode, "verify-full") != 0)
 +  {
 +  conn->status = CONNECTION_BAD;
-+  appendPQExpBuffer(&conn->errorMessage,
-+libpq_gettext("weak sslmode 
\"%s\" may not be used with sslrootcert=system\n"),
-+conn->sslmode);
++  libpq_append_conn_error(conn, "weak sslmode \"%s\" may not be 
used with sslrootcert=system",
++  conn->sslmode);
 +  return false;
 +  }
 +#endif
From e7e2d43b186a32a414783baa2ca55d88a5d7fe9e Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 24 Oct 2022 15:30:25 -0700
Subject: [PATCH v6 1/2] libpq: add sslrootcert=system to use default CAs

Based on a patch by Thomas Habets.

Note the workaround in .cirrus.yml for a strange interaction between
brew a

wal_compression = method:level

2023-01-10 Thread Justin Pryzby
Is it desirable to support specifying a level?

Maybe there's a concern about using high compression levels, but 
I'll start by asking if the feature is wanted at all.

Previous discussion at: 20210614012412.ga31...@telsasoft.com
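
For reference, the GUC value just gets split on the colon; roughly like this
(a sketch, not the exact code in the patch below, and get_compression_method()
is a stand-in name):

static int
parse_wal_compression(const char *spec, int *level)
{
	char	   *copy = pstrdup(spec);
	char	   *sep = strchr(copy, ':');

	*level = 1;					/* default level */
	if (sep != NULL)
	{
		*sep = '\0';
		*level = atoi(sep + 1);
	}

	/* hypothetical helper resolving "off"/"pglz"/"lz4"/"zstd" to an enum */
	return get_compression_method(copy);
}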
From cb30e17cf19fffa370a887d28d6d7e683d588b71 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 14 Mar 2021 17:12:07 -0500
Subject: [PATCH] Use GUC hooks to support compression:level..

..which is useful for zstd, but less so for lz4.

TODO:

windows, pglz, zlib case
---
 doc/src/sgml/config.sgml|   6 +-
 src/backend/access/transam/xlog.c   |  20 
 src/backend/access/transam/xloginsert.c |   6 +-
 src/backend/access/transam/xlogreader.c | 124 
 src/backend/utils/misc/guc_tables.c |  39 ++
 src/common/compression.c|   3 -
 src/include/access/xlog.h   |   6 +
 src/include/access/xlogreader.h |   2 +
 src/include/utils/guc_hooks.h   |   3 +
 src/test/regress/expected/compression.out   |  19 +++
 src/test/regress/expected/compression_1.out |  19 +++
 src/test/regress/sql/compression.sql|  10 ++
 12 files changed, 220 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77574e2d4ec..7b34ed630d5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3161,7 +3161,7 @@ include_dir 'conf.d'
  
 
  
-  wal_compression (enum)
+  wal_compression (string)
   
wal_compression configuration parameter
   
@@ -3169,7 +3169,7 @@ include_dir 'conf.d'
   

 This parameter enables compression of WAL using the specified
-compression method.
+compression method and optional compression level.
 When enabled, the PostgreSQL
 server compresses full page images written to WAL when
  is on or during a base backup.
@@ -3179,6 +3179,8 @@ include_dir 'conf.d'
 was compiled with --with-lz4) and
 zstd (if PostgreSQL
 was compiled with --with-zstd).
+A compression level may optionally be specified, by appending the level
+number after a colon (:)
 The default value is off.
 Only superusers and users with the appropriate SET
 privilege can change this setting.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 74bd1f5fbe2..e2d81129549 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -125,6 +125,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 int			wal_compression = WAL_COMPRESSION_NONE;
+char		*wal_compression_string = "no"; /* Overwritten by GUC */
+int			wal_compression_level = 1;
 char	   *wal_consistency_checking_string = NULL;
 bool	   *wal_consistency_checking = NULL;
 bool		wal_init_zero = true;
@@ -8118,6 +8120,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 	}
 }
 
+bool
+check_wal_compression(char **newval, void **extra, GucSource source)
+{
+	int tmp;
+	if (get_compression_level(*newval, &tmp) != -1)
+		return true;
+
+	return false;
+}
+
+/* Parse the GUC into integers for wal_compression and wal_compression_level */
+void
+assign_wal_compression(const char *newval, void *extra)
+{
+	wal_compression = get_compression_level(newval, &wal_compression_level);
+	Assert(wal_compression >= 0);
+}
+
 
 /*
  * Issue appropriate kind of fsync (if any) for an XLOG output file.
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 47b6d16eaef..93003744ed0 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -906,8 +906,8 @@ XLogCompressBackupBlock(char *page, uint16 hole_offset, uint16 hole_length,
 
 		case WAL_COMPRESSION_LZ4:
 #ifdef USE_LZ4
-			len = LZ4_compress_default(source, dest, orig_len,
-	   COMPRESS_BUFSIZE);
+			len = LZ4_compress_fast(source, dest, orig_len,
+	   COMPRESS_BUFSIZE, wal_compression_level);
 			if (len <= 0)
 len = -1;		/* failure */
 #else
@@ -918,7 +918,7 @@ XLogCompressBackupBlock(char *page, uint16 hole_offset, uint16 hole_length,
 		case WAL_COMPRESSION_ZSTD:
 #ifdef USE_ZSTD
 			len = ZSTD_compress(dest, COMPRESS_BUFSIZE, source, orig_len,
-ZSTD_CLEVEL_DEFAULT);
+wal_compression_level);
 			if (ZSTD_isError(len))
 len = -1;		/* failure */
 #else
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a61b4a99219..728ca9f18c4 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,7 +17,9 @@
  */
 #include "postgres.h"
 
+#include 
 #include 
+
 #ifdef USE_LZ4
 #include 
 #endif
@@ -30,6 +32,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
 #include "catalog/pg_control.h"
+#include "common/compression.h"
 #include "common/pg_l

Re: verbose mode for pg_input_error_message?

2023-01-10 Thread Nathan Bossart
On Wed, Jan 04, 2023 at 04:18:59PM -0500, Andrew Dunstan wrote:
> On 2023-01-02 Mo 10:44, Tom Lane wrote:
>> I don't think that just concatenating those strings would make for a
>> pleasant API.  More sensible, perhaps, to have a separate function
>> that returns a record.  Or we could redefine the existing function
>> that way, but I suspect that "just the primary error" will be a
>> principal use-case.
>>
>> Being able to get the SQLSTATE is likely to be interesting too.
> 
> OK, here's a patch along those lines.

My vote would be to redefine the existing pg_input_error_message() function
to return a record, but I recognize that this would inflate the patch quite
a bit due to all the existing uses in the tests.  If this is the only
argument against this approach, I'm happy to help with the patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-01-10 Thread Tom Lane
Sandro Santilli  writes:
> On Mon, Jan 09, 2023 at 05:51:49PM -0500, Tom Lane wrote:
>> ... you still need one script file for each supported upgrade step

> That's exactly the problem we're trying to solve here.
> The include support is nice on itself, but won't solve our problem.

The script-file-per-upgrade-path aspect solves a problem that you
have, whether you admit it or not; I think you simply aren't realizing
that because you have not had to deal with the consequences of
your proposed feature.  Namely that you won't have any control
over what the backend will try to do in terms of upgrade paths.

As an example, suppose that a database has foo 4.0 installed, and
the DBA decides to try to downgrade to 3.0.  With the system as it
stands, if you've provided foo--4.0--3.0.sql then the conversion
will go through, and presumably it will work because you tested that
that script does what it is intended to.  If you haven't provided
any such downgrade script, then ALTER EXTENSION UPDATE will say
"Sorry Dave, I'm afraid I can't do that" and no harm is done.

With the proposed % feature, if foo--%--3.0.sql exists then the
system will invoke it and expect the end result to be a valid
3.0 installation, whether or not the script actually has any
ability to do a downgrade.  Moreover, there isn't any very
good way to detect or prevent unsupported version transitions.
(I suppose you could add code to look at pg_extension.extversion,
but I'm not sure if that works: it looks to me like we update that
before we run the extension script.  Besides which, if you have
to add such code is that really better than having a number of
one-liner scripts implementing the same check declaratively?)
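
For concreteness, the kind of in-script guard being alluded to would
presumably look something like the sketch below (the script name and
version list are made up, and per the above it may not even work, since
extversion may already have been updated by the time the script runs):

-- hypothetical prologue of foo--%--3.0.sql
DO $$
BEGIN
    IF (SELECT extversion FROM pg_extension WHERE extname = 'foo')
       NOT IN ('3.1', '3.2', '4.0')
    THEN
        RAISE EXCEPTION 'unsupported downgrade path to foo 3.0';
    END IF;
END
$$;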

It gets worse though, because above I'm supposing that 4.0 at
least existed when this copy of foo--%--3.0.sql was made.
Suppose that somebody fat-fingered a package upgrade, such that
the extension fileset available to a database containing foo 4.0
now corresponds to foo 3.0, and there's no knowledge of 4.0 at all
in the extension scripts.  The DBA trustingly issues ALTER EXTENSION
UPDATE, which will conclude from foo.control that it should update to
3.0, and invoke foo--%--3.0.sql to do it.  Maybe the odds of success
are higher than zero, but not by much; almost certainly you are
going to end up with an extension containing some leftover 4.0
objects, some 3.0 objects, and maybe some objects with properties
that don't exactly match either 3.0 or 4.0.  Even if that state
of affairs manages not to cause immediate problems, it'll surely
be a mess whenever somebody tries to re-update to 4.0 or later.

So I really think this is a case of "be careful what you ask
for, you might get it".  Even if PostGIS is willing to put in
the amount of infrastructure legwork needed to make such a
design bulletproof, I'm quite sure nobody else will manage
to use such a thing successfully.  I'd rather spend our
development effort on a feature that has more than one use-case.

regards, tom lane




Can we let extensions change their dumped catalog schemas?

2023-01-10 Thread Jacob Champion
Hi all,

I've been talking to other Timescale devs about a requested change to
pg_dump, and there's been quite a bit of back-and-forth to figure out
what, exactly, we want. Any mistakes here are mine, but I think we've
been able to distill it down to the following request:

We'd like to be allowed to change the schema for a table that's been
marked in the past with pg_extension_config_dump().

Unless I'm missing something obvious (please, let it be that) there's no
way to do this safely. Once you've marked an internal table as dumpable,
its schema is effectively frozen if you want your dumps to work across
versions, because otherwise you'll try to restore that "catalog" data
into a table that has different columns. And while sometimes you can
make that work, it doesn't in the general case.
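
For anyone less familiar with the mechanism, the marking happens in the
extension's own script, roughly like this (the table is made up, and
pg_extension_config_dump() can only be called from within an extension
script):

-- in my_ext--1.0.sql
CREATE TABLE my_ext_catalog (
    id     int PRIMARY KEY,
    config jsonb
);
SELECT pg_catalog.pg_extension_config_dump('my_ext_catalog', '');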

We (Timescale) do already change the schemas today, but we pay the
associated costs in that dump/restore doesn't work without manual
version bookkeeping and user fiddling -- and in the worst cases, it
appears to "work" across versions but leaves our catalog tables in an
inconsistent state. So the request is to come up with a way to support
this case.

Some options that have been proposed so far:

1) Don't ask for a new feature, and instead try to ensure infinite
backwards compatibility for those tables.

For extension authors who have already done this -- and have likely done
some heavy architectural lifting to make it work -- this is probably the
first thing that will come to mind, and it was the first thing I said,
too.

But the more I say it, the less acceptable it feels. Not even Postgres
is expected to maintain infinite catalog compatibility into the future.
We need to evolve our catalogs, too -- and we already provide the
standard update scripts to perform migrations of those tables, but a
dump/restore doesn't have any way to use them today.

2) Provide a way to record the exact version of an extension in a dump.

Brute-force, but pretty much guaranteed to fix the cross-version
problem, because the dump can't be accidentally restored to an extension
version with a different catalog schema. Users then manually ALTER
EXTENSION ... UPDATE (or we could even include that in the dump itself,
as the final action). Doing this by default would punish extensions that
don't have this problem, so it would have to be opt-in in some way.

It's also unnecessarily strict IMO -- even if we don't have a config
table change in a new version, we'll still require the old extension
version to be available alongside the new version during a restore.
Maybe a tweak on this idea would be to introduce a catversion for
extensions.
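
To make the idea concrete, the restore side of a version-pinned dump might
boil down to something like this (today pg_dump emits CREATE EXTENSION with
no VERSION clause, so the extension name, version, and shape below are a
sketch of the proposal, not current behaviour):

CREATE EXTENSION timescaledb VERSION '2.9.0';  -- version the dump was taken from
-- ... COPY of the tables marked with pg_extension_config_dump() ...
ALTER EXTENSION timescaledb UPDATE;            -- final action: run update scripts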

3) Provide a way to record the entire internal state of an extension in
a dump.

Every extension is already expected to handle the case where the
internal state is at version X but the installed extension is at version
X+N, and the update scripts we provide will perform the necessary
migrations. But there's no way to reproduce this case using
dump/restore, because dumping an extension omits its internals.

If a dump could instead include the entire internal state of an
extension, then we'd be guaranteed to reproduce the exact situation that
we already have to support for an in-place upgrade. After a restore, the
SQL is at version X, the installed extension is some equal or later
version, and all that remains is to run the update scripts, either
manually or within the dump itself.

Like (2), I think there's no way you'd all accept this cost for every
extension. It'd have to be opt-in.

--

Hopefully that makes a certain amount of sense. Does it seem like a
reasonable thing to ask?

I'm happy to clarify anything above, and if you know of an obvious
solution I'm missing, I would love to be corrected. :D

Thanks,
--Jacob




Re: Use windows VMs instead of windows containers on the CI

2023-01-10 Thread Thomas Munro
On Wed, Jan 11, 2023 at 8:20 AM Andres Freund  wrote:
> On 2023-01-10 09:22:12 -0600, Justin Pryzby wrote:
> > > There is more than 2x speed gain when VMs are used.
> >
> > One consideration is that if windows runs twice as fast, we'll suddenly
> > start using twice as many resources at cirrus/google/amazon - the
> > windows task has been throttling everything else.  Not sure if we should
> > do anything beyond the limits that cfbot already uses.
>
> I'm not sure we would. cfbot has a time based limit for how often it tries to
> rebuild entries, and I think we were just about keeping up with that. In which
> case we shouldn't, on average, schedule more jobs than we currently
> do. Although peak "job throughput" would be higher.
>
> Thomas?

It currently tries to re-test each patch every 24 hours, but doesn't
achieve that.  It looks like it's currently re-testing every ~30
hours.  Justin's right, we'll consume more non-Windows resources if
Windows speeds up, but not 2x, more like 1.25x when cfbot's own
throttling kicks in.  Or I could change the cycle target to 36 or 48
hours, to spread the work out more.

Back-of-a-napkin maths:

 * there are currently 240 entries in a testable status
 * it takes ~0.5 hours to test (because that's the slow Windows time)
 * therefore it takes ~120 hours to test them all
 * but we can do 4 at a time, so that's ~30 hours to get through them
all and start again
 * that matches what we see:

cfbot=> select created - lag(created) over (order by created) from
branch where submission_id = 4068;
   ?column?
---

 1 day 06:30:00.265047
 1 day 05:43:59.978949
 1 day 04:13:59.754048
 1 day 05:28:59.811916
 1 day 07:00:00.651655
(6 rows)

If, with this change, we can test in only ~0.25 hours, then we'll only
need 60 hours of Cirrus time to test them all.  With a target of
re-testing every 24 hours, it should now only have to run ~2.5 jobs at
all times.  Having free slots would be kind to Cirrus, and also lower
the latency when a new patch is posted (which currently has to wait
for a free slot before it can begin).  Great news.
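
The same napkin maths in SQL, for anyone who wants to sanity-check the
numbers above:

SELECT (240 * interval '30 minutes') / 4 AS full_cycle_today,     -- 30:00:00
       240 * interval '15 minutes'       AS cirrus_time_after,    -- 60:00:00
       (240 * 0.25) / 24                 AS jobs_needed_for_24h;  -- ~2.5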




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan  wrote:
> Actually, FreezeMultiXactId() can fully remove an xmax that has some
> member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
> possible, at least when no individual member is still running. Doesn't
> have to involve transaction aborts at all.
>
> Let me go try to break it that way...

Attached patch shows how this could break.

It adds an assertion that checks that the expected
PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions hold at the right point. It
also comments out FreezeMultiXactId()'s FRM_NOOP handling.

The FRM_NOOP case is really just an optimization, and shouldn't be
needed for correctness. This is amply demonstrated by running "meson
test" with the patch applied, which will pass without incident.

I can get the PD_ALL_VISIBLE assertion to fail by following this
procedure with the patch applied:

* Run a plain VACUUM to set all the pages from a table all-visible,
but not all-frozen.

* Set a breakpoint that will hit after all_visible_according_to_vm is
set to true, for an interesting blkno.

* Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).

Now all_visible_according_to_vm is set to true, but we don't have a
lock/pin on the same heap page just yet.

* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.

* End every session whose XID is stored in our multi (commit/abort).

* Within GDB, continue from before -- observe assertion failure.

Obviously this scenario doesn't demonstrate the presence of a bug --
not quite. But it does prove that we rely on FRM_NOOP to not allow the
VM to become corrupt, which just doesn't make any sense, and can't
have been intended. At a minimum, it strongly suggests that the
current approach is very fragile.
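
In SQL terms, the non-debugger parts of that recipe look roughly like the
following (table and data are made up, the GDB steps still have to be
driven by hand, and the extra HOT-update ingredient described in the
follow-up message is omitted):

-- one-time setup
CREATE TABLE vmtest (id int PRIMARY KEY, val int);
INSERT INTO vmtest SELECT g, 0 FROM generate_series(1, 1000) g;
VACUUM vmtest;                  -- pages become all-visible, not all-frozen

-- under GDB, with the breakpoint set as described:
VACUUM (FREEZE) vmtest;

-- in several other sessions, while VACUUM FREEZE sits at the breakpoint:
BEGIN;
SELECT * FROM vmtest WHERE id = 1 FOR SHARE;    -- >1 locker => new MultiXactId
-- ...then COMMIT/ROLLBACK those sessions and continue in GDB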

-- 
Peter Geoghegan


0001-Debug-freeze-map-issue.patch
Description: Binary data


Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-10 Thread David Rowley
On Wed, 11 Jan 2023 at 08:17, Ankit Kumar Pandey  wrote:
>
>
> > On 10/01/23 10:53, David Rowley wrote:
>
> > the total cost is the same for both of these, but the execution time
> > seems to vary quite a bit.
>
> This is really weird, I tried it different ways (to rule out any issues
> due to caching) and execution time varied in spite of having same cost.
>
> > Maybe we should try and do this for DISTINCT queries if the
> > distinct_pathkeys match the orderby_pathkeys. That seems a little less
> > copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
> > likely that the ORDER BY might opt to use the Unique path for DISTINCT
> > since it'll already have the correct pathkeys.
>
> > However, if the ORDER BY has fewer columns then it might be cheaper to
> > Hash Aggregate and then sort all over again, especially so when the
> > DISTINCT removes a large proportion of the rows.
>
> Isn't order by pathkeys are always fewer than distinct pathkeys?

Just thinking about this again, I remembered why I thought DISTINCT
was uninteresting to start with.  The problem is that if the query has
WindowFuncs and also has a DISTINCT clause, then the WindowFunc
results *must* be in the DISTINCT clause and, optionally also in the
ORDER BY clause.  There's no other place to write WindowFuncs IIRC.
Since we cannot push down the sort when the stricter version of the
pathkeys contains WindowFuncs, we must still perform the additional
sort if the planner chooses to do a non-hashed DISTINCT.  The aim of
this patch is to reduce the total number of sorts, and I don't think
that's possible in this case as you can't have WindowFuncs in the
ORDER BY when they're not in the DISTINCT clause:

postgres=# select distinct relname from pg_Class order by row_number()
over (order by oid);
ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 1: select distinct relname from pg_Class order by row_number() ...

Another type of query which is still suboptimal is when there's a
DISTINCT and WindowClause but no ORDER BY. We'll reorder the DISTINCT
clause so that the leading columns of the ORDER BY come first in
transformDistinctClause(), but we've nothing to do the same for
WindowClauses.  It can't happen when transformDistinctClause() is
called, as we've yet to decide the WindowClause evaluation order at
that point, so if we were to try to make that better it would maybe
have to be done in the upper planner somewhere. It's possible it's too
late by that time to adjust the DISTINCT clause.

Here's an example of it.

# set enable_hashagg=0;
# explain select distinct relname,relkind,count(*) over (partition by
relkind) from pg_Class;
 QUERY PLAN

 Unique  (cost=61.12..65.24 rows=412 width=73)
   ->  Sort  (cost=61.12..62.15 rows=412 width=73)
 Sort Key: relname, relkind, (count(*) OVER (?))
 ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
   ->  Sort  (cost=36.01..37.04 rows=412 width=65)
 Sort Key: relkind
 ->  Seq Scan on pg_class  (cost=0.00..18.12
rows=412 width=65)
(7 rows)

We can simulate the optimisation by swapping the column order in the
targetlist. Note the planner can use Incremental Sort (at least since
3c6fc5820, from about 2 hours ago)

# explain select distinct relkind,relname,count(*) over (partition by
relkind) from pg_Class;
 QUERY PLAN

 Unique  (cost=41.26..65.32 rows=412 width=73)
   ->  Incremental Sort  (cost=41.26..62.23 rows=412 width=73)
 Sort Key: relkind, relname, (count(*) OVER (?))
 Presorted Key: relkind
 ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
   ->  Sort  (cost=36.01..37.04 rows=412 width=65)
 Sort Key: relkind
 ->  Seq Scan on pg_class  (cost=0.00..18.12
rows=412 width=65)
(8 rows)

Not sure if we should be trying to improve that in this patch. I just
wanted to identify it as something else that perhaps could be done.
I'm not really all that sure the above query shape makes much sense in
the real world. Would anyone ever want to use DISTINCT on some results
containing WindowFuncs?

David




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-10 Thread David Rowley
On Tue, 10 Jan 2023 at 18:36, Tom Lane  wrote:
>
> David Rowley  writes:
> > Ideally, our sort costing would just be better, but I think that
> > raises the bar a little too high to start thinking of making
> > improvements to that for this patch.
>
> It's trickier than it looks, cf f4c7c410e.  But if you just want
> to add a small correction based on number of columns being sorted
> by, that seems within reach.  See the comment for cost_sort though.
> Also, I suppose for incremental sorts we'd want to consider only
> the number of newly-sorted columns, but I'm not sure if that info
> is readily at hand either.

Yeah, I had exactly that in mind when I mentioned setting the
bar higher. It seems like a worthy enough goal to improve the sort
costs separately from this work. I'm starting to consider if we might
need to revisit cost_sort() anyway. There's been quite a number of
performance improvements made to sort in the past few years and I
don't recall if anything has been done to check if the sort costs are
still realistic. I'm aware that it's a difficult problem as the number
of comparisons is highly dependent on the order of the input rows.

David




Re: [PATCH] support tab-completion for single quote input with equal sign

2023-01-10 Thread Tom Lane
torikoshia  writes:
> I updated the patch going along with the v3 direction.

I think this adds about as many failure modes as it removes,
if not more.

* The connection string doesn't necessarily end with "'"; it could
be a dollar-quoted string.

* If it is a dollar-quoted string, there could be "'" marks internal
to it, allowing PUBLICATION to be falsely offered when we're really
still within the connection string.

* The following WITH options could contain "'", allowing PUBLICATION
to be falsely offered within that clause.
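
All three of those are easy to hit with perfectly legal commands, e.g.
(names made up):

-- dollar-quoted connection string, with a stray ' inside it
CREATE SUBSCRIPTION sub
    CONNECTION $conn$host=example.com dbname=app password=o'brien$conn$
    PUBLICATION pub;

-- single quotes inside the WITH options, after PUBLICATION was already typed
CREATE SUBSCRIPTION sub2
    CONNECTION 'host=example.com dbname=app'
    PUBLICATION pub
    WITH (slot_name = 'sub2_slot');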

I've spent some effort previously on getting tab-completion to deal
sanely with single-quoted strings, but everything I've tried has
crashed and burned :-(, mainly because it's not clear when to take
the whole literal as one "word" and when not.  A relevant example
here is that somebody might wish that we could tab-complete within
the connection string, e.g. that

CREATE SUBSCRIPTION sub CONNECTION 'db

would complete with "name=".  We have the info available from libpq
to do that, if only we could figure out when to apply it.  I think
we need some pretty fundamental design work to figure out what we
want to do in this area, and that in the meantime putting band-aids
on specific cases is probably not very productive.

regards, tom lane




Re: Strengthen pg_waldump's --save-fullpage tests

2023-01-10 Thread Michael Paquier
On Tue, Jan 10, 2023 at 05:25:44PM +0100, Drouvot, Bertrand wrote:
> I like the idea of comparing the full page (and not just the LSN) but
> I'm not sure that adding the pageinspect dependency is a good thing.
> 
> What about extracting the block directly from the relation file and
> comparing it with the one extracted from the WAL? (We'd need to skip the
> first 8 bytes to skip the LSN though).

Byte-by-byte counting for the page hole?  The page checksum would
matter as well, FWIW, as it is not set in WAL and a FPW logged in WAL
means that the page got modified.  It means that it could have been
flushed, updating its pd_lsn and its pd_checksum on the way.
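
As a rough sketch of what such a comparison could look like (the table,
block number, and saved-FPI file name are placeholders, and the substring
skips pd_lsn plus pd_checksum, i.e. the first 10 bytes):

SELECT substring(pg_read_binary_file(pg_relation_filepath('test_table'),
                                     0, 8192) from 11)
     = substring(pg_read_binary_file('/tmp/fpi_dir/saved_block_main.page',
                                     0, 8192) from 11)
       AS same_apart_from_lsn_and_checksum;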
--
Michael


signature.asc
Description: PGP signature


delay starting WAL receiver

2023-01-10 Thread Nathan Bossart
I discussed this a bit in a different thread [0], but I thought it deserved
its own thread.

After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed
that the recovery tests consistently take much longer.  Upon further
inspection, it looks like a similar race condition to the one described in
e5d494d's commit message.  With some added debug logs, I see that all of
the callers of MaybeStartWalReceiver() complete before SIGCHLD is
processed, so ServerLoop() waits for a minute before starting the WAL
receiver.

The attached patch fixes this by adjusting DetermineSleepTime() to limit
the sleep to at most 100ms when WalReceiverRequested is set, similar to how
the sleep is limited when background workers must be restarted.

[0] https://postgr.es/m/20221215224721.GA694065%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 6f32238f119236322dfb16262a88c3a9c5141413 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 15 Dec 2022 14:11:43 -0800
Subject: [PATCH v1 1/1] handle race condition when restarting wal receivers

---
 src/backend/postmaster/postmaster.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eac3450774..9d6f58e0b3 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1601,9 +1601,9 @@ checkControlFile(void)
  *
  * In normal conditions we wait at most one minute, to ensure that the other
  * background tasks handled by ServerLoop get done even when no requests are
- * arriving.  However, if there are background workers waiting to be started,
- * we don't actually sleep so that they are quickly serviced.  Other exception
- * cases are as shown in the code.
+ * arriving.  However, if there are background workers or a WAL receiver
+ * waiting to be started, we make sure they are quickly serviced.  Other
+ * exception cases are as shown in the code.
  */
 static void
 DetermineSleepTime(struct timeval *timeout)
@@ -1611,11 +1611,12 @@ DetermineSleepTime(struct timeval *timeout)
 	TimestampTz next_wakeup = 0;
 
 	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * Normal case: either there are no background workers and no WAL receiver
+	 * at all, or we're in a shutdown sequence (during which we ignore
+	 * bgworkers and the WAL receiver altogether).
 	 */
 	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+		(!StartWorkerNeeded && !HaveCrashedWorker && !WalReceiverRequested))
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1674,6 +1675,21 @@ DetermineSleepTime(struct timeval *timeout)
 		}
 	}
 
+	/*
+	 * If WalReceiverRequested is set, we're probably waiting on a SIGCHLD to
+	 * arrive to clear WalReceiverPID before starting the new WAL receiver.  We
+	 * don't expect that to take long, so limit the sleep to 100ms so that we
+	 * start the new WAL receiver promptly.
+	 */
+	if (WalReceiverRequested)
+	{
+		TimestampTz walrcv_wakeup;
+
+		walrcv_wakeup = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), 100);
+		if (next_wakeup == 0 || walrcv_wakeup < next_wakeup)
+			next_wakeup = walrcv_wakeup;
+	}
+
 	if (next_wakeup != 0)
 	{
 		long		secs;
-- 
2.25.1



Re: Allow +group in pg_ident.conf

2023-01-10 Thread Michael Paquier
On Tue, Jan 10, 2023 at 09:42:19AM -0500, Andrew Dunstan wrote:
> Ok, that sounds reasonable, but the cfbot doesn't like patches that
> depend on other patches in a different email. Maybe you can roll this up
> as an extra patch in your next version? It's pretty small.

This can go two ways if both of you agree, by sending an updated patch
on this thread based on the other one..  And actually, Jelte's patch
has less C code than this thread's proposal, even though it lacks
tests.
--
Michael


signature.asc
Description: PGP signature


Re: Infinite Interval

2023-01-10 Thread Joseph Koshakow
On Sun, Jan 8, 2023 at 11:17 PM jian he  wrote:
>
>
>
> On Sun, Jan 8, 2023 at 4:22 AM Joseph Koshakow  wrote:
>>
>> On Sat, Jan 7, 2023 at 3:05 PM Joseph Koshakow  wrote:
>> >
>> > On Sat, Jan 7, 2023 at 3:04 PM Joseph Koshakow  wrote:
>> > >
>> > > I think this patch is just about ready for review, except for the
>> > > following two questions:
>> > >   1. Should finite checks on intervals only look at months or all three
>> > >   fields?
>> > >   2. Should we make the error messages for adding/subtracting infinite
>> > >   values more generic or leave them as is?
>> > >
>> > > My opinions are
>> > >   1. We should only look at months.
>> > >   2. We should make the errors more generic.
>> > >
>> > > Anyone else have any thoughts?
>>
>> Here's a patch with the more generic error messages.
>>
>> - Joe
>
>
> HI.
>
> I just found out another problem.
>
> select * from  generate_series(timestamp'-infinity', timestamp 'infinity', 
> interval 'infinity');
> ERROR:  timestamp out of range
>
> select * from  generate_series(timestamp'-infinity',timestamp 'infinity', 
> interval '-infinity'); --return following
>
>  generate_series
> -
> (0 rows)
>
>
> select * from generate_series(timestamp 'infinity',timestamp 'infinity', 
> interval 'infinity');
> --will run all the time.
>
> select * from  generate_series(timestamp 'infinity',timestamp 'infinity', 
> interval '-infinity');
> ERROR:  timestamp out of range
>
>  select * from  generate_series(timestamp'-infinity',timestamp'-infinity', 
> interval 'infinity');
> ERROR:  timestamp out of range
>
> select * from  generate_series(timestamp'-infinity',timestamp'-infinity', 
> interval '-infinity');
> --will run all the time.

Good catch, I didn't think to check non-date/time functions.
Unfortunately, I think you may have opened Pandora's box. I went through
pg_proc.dat and found the following functions that all involve
intervals. We should probably investigate all of them and make sure
that they handle infinite intervals properly.

{ oid => '1026', descr => 'adjust timestamp to new time zone',
proname => 'timezone', prorettype => 'timestamp',
proargtypes => 'interval timestamptz', prosrc => 'timestamptz_izone' },

{ oid => '4133', descr => 'window RANGE support',
proname => 'in_range', prorettype => 'bool',
proargtypes => 'date date interval bool bool',
prosrc => 'in_range_date_interval' },

{ oid => '1305', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
provolatile => 's', prorettype => 'bool',
proargtypes => 'timestamptz interval timestamptz interval',
prosrc => 'see system_functions.sql' },
{ oid => '1306', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
provolatile => 's', prorettype => 'bool',
proargtypes => 'timestamptz timestamptz timestamptz interval',
prosrc => 'see system_functions.sql' },
{ oid => '1307', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
provolatile => 's', prorettype => 'bool',
proargtypes => 'timestamptz interval timestamptz timestamptz',
prosrc => 'see system_functions.sql' },

{ oid => '1308', descr => 'intervals overlap?',
proname => 'overlaps', proisstrict => 'f', prorettype => 'bool',
proargtypes => 'time time time time', prosrc => 'overlaps_time' },
{ oid => '1309', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'time interval time interval',
prosrc => 'see system_functions.sql' },
{ oid => '1310', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'time time time interval',
prosrc => 'see system_functions.sql' },
{ oid => '1311', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'time interval time time',
prosrc => 'see system_functions.sql' },

{ oid => '1386',
descr => 'date difference from today preserving months and years',
proname => 'age', prolang => 'sql', provolatile => 's',
prorettype => 'interval', proargtypes => 'timestamptz',
prosrc => 'see system_functions.sql' },

{ oid => '2042', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'timestamp interval timestamp interval',
prosrc => 'see system_functions.sql' },
{ oid => '2043', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'timestamp timestamp timestamp interval',
prosrc => 'see system_functions.sql' },
{ oid => '2044', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 
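
For a few of the entries above, quick spot checks against the patched
build could look like this (what these *should* return for infinite
intervals is exactly the open question):

SELECT timezone(interval 'infinity', timestamptz '2023-01-10');
SELECT (timestamptz '2023-01-01', interval 'infinity')
       OVERLAPS (timestamptz '2023-06-01', interval '1 day');
SELECT age(timestamptz '-infinity');
SELECT count(*) OVER (ORDER BY d RANGE BETWEEN interval 'infinity' PRECEDING
                                           AND CURRENT ROW)
FROM (VALUES (date '2023-01-01'), (date '2023-01-10')) AS v(d);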

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 4:39 PM Peter Geoghegan  wrote:
> * Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
> relevant visibilitymap_set() call site (the one that passes
> VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
> VISIBILITYMAP_ALL_VISIBLE).
>
> Now all_visible_according_to_vm is set to true, but we don't have a
> lock/pin on the same heap page just yet.
>
> * Acquire several non-conflicting row locks on a row on the block in
> question, so that a new multi is allocated.

Forgot to mention that there needs to be a HOT update mixed in with
these SELECT ... FOR SHARE row lockers, too, which must abort once its
XID has been added to a multi. Obviously heap_lock_tuple() won't ever
unset VISIBILITYMAP_ALL_VISIBLE or PD_ALL_VISIBLE (it only ever clears
VISIBILITYMAP_ALL_FROZEN) -- so we need a heap_update() to clear all
of these status bits.

This enables the assertion to fail because:

* Pruning can get rid of the aborted successor heap-only tuple right
away, so it is not going to block us from setting the page all_visible
(that just leaves the original tuple to consider).

* The original tuple's xmax is a Multi, so it won't automatically be
ineligible for freezing because it's > OldestXmin in this scenario.

* FreezeMultiXactId() processing will completely remove xmax, without
caring too much about cutoffs like OldestXmin -- it only cares about
whether each individual XID needs to be kept or not.

(Granted, FreezeMultiXactId() will only remove xmax like this because
I deliberately removed its FRM_NOOP handling, but that is a very
delicate thing to rely on, especially from such a great distance. I
can't imagine that it doesn't fail on HEAD for any reason beyond sheer
luck.)

--
Peter Geoghegan




Spinlock is missing when updating two_phase of ReplicationSlot

2023-01-10 Thread Masahiko Sawada
Hi,

I realized that in CreateDecodingContext() function, we update both
slot->data.two_phase and two_phase_at without acquiring the spinlock:

/* Mark slot to allow two_phase decoding if not already marked */
if (ctx->twophase && !slot->data.two_phase)
{
slot->data.two_phase = true;
slot->data.two_phase_at = start_lsn;
ReplicationSlotMarkDirty();
ReplicationSlotSave();
SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
}

I think we should acquire the spinlock when updating fields of the
replication slot, even when the slot's owner is doing the update.
Otherwise readers could see inconsistent results. Looking at another
place where we update
two_phase_at, we acquire the spinlock:

SpinLockAcquire(&slot->mutex);
slot->data.confirmed_flush = ctx->reader->EndRecPtr;
if (slot->data.two_phase)
slot->data.two_phase_at = ctx->reader->EndRecPtr;
SpinLockRelease(&slot->mutex);

It seems to me an oversight of commit a8fd13cab0b. I've attached the
small patch to fix it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_spinlock.patch
Description: Binary data

