Re: Teach pg_receivewal to use lz4 compression

2021-11-01 Thread Michael Paquier
On Fri, Oct 29, 2021 at 08:38:33PM +0900, Michael Paquier wrote:
> Why would the header size change between the moment the segment is
> begun and it is finished?  We could store it in memory and write it
> again when the segment is closed instead, even if it means to fseek()
> back to the beginning of the file once the segment is completed.
> Storing WalSegSz from the moment a segment is opened makes the code
> more fragile against SIGINTs and the like, so this does not fix the
> problem I mentioned previously :/

I got to think more about this one, and another argument against storing
an incorrect contentSize while the segment is not completed is that it
would break the case of partial segments with --synchronous, where we
should still be able to recover as much of the flushed data as possible.
As with zlib, where one has to pad the partial segment with zeros after
decompression until the WAL segment size is reached, we should be able
to support that with LZ4.  (I have saved some customer data in the
past thanks to this property, btw.)

If it proves to be too fancy to rewrite the header with a correct
contentSize once the segment is completed, another way would be to
enforce a decompression of each segment in-memory.  The advantage of
this method is that we would be maximally portable.  For example, if
one begins to use pg_receivewal on an archive directory previously
populated by an archive_command, we would still be able to grab the
starting LSN.  That's more costly of course, but the LZ4 format does
not make that easy either with its chunk protocol.  By the way, you are
right that we should worry about the variability in size of the header,
as we only have the guarantee that it falls within a given window.  I
missed that, and lz4frame.h mentions it around LZ4F_headerSize :/

It would be good to test with many segments, but could we think about
just relying on LZ4F_decompress() with a frame and computing the
decompressed size ourselves?  At least that will never break, and it
would work in all the cases pg_receivewal aims at.
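
For reference, a rough sketch of that approach using the lz4frame API
(just an illustration of the idea, not the actual pg_receivewal code)
could look like this:

#include <stdint.h>
#include <stdio.h>
#include <lz4frame.h>

/*
 * Stream a compressed segment through LZ4F_decompress() and add up the
 * decompressed bytes, so the usable WAL size can be computed without
 * trusting the frame's contentSize field.
 */
static int64_t
lz4_decompressed_size(FILE *fp)
{
	LZ4F_dctx  *dctx;
	char		inbuf[64 * 1024];
	char		outbuf[64 * 1024];
	int64_t		total = 0;

	if (LZ4F_isError(LZ4F_createDecompressionContext(&dctx, LZ4F_VERSION)))
		return -1;

	for (;;)
	{
		size_t		nread = fread(inbuf, 1, sizeof(inbuf), fp);
		const char *inp = inbuf;
		size_t		remaining = nread;

		if (nread == 0)
			break;				/* EOF (or read error) */

		while (remaining > 0)
		{
			size_t		outlen = sizeof(outbuf);
			size_t		inlen = remaining;
			size_t		status = LZ4F_decompress(dctx, outbuf, &outlen,
												 inp, &inlen, NULL);

			if (LZ4F_isError(status))
			{
				LZ4F_freeDecompressionContext(dctx);
				return -1;		/* corrupted or truncated frame */
			}
			total += outlen;
			inp += inlen;
			remaining -= inlen;
		}
	}

	LZ4F_freeDecompressionContext(dctx);
	return total;
}
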
--
Michael




Re: Teach pg_receivewal to use lz4 compression

2021-11-01 Thread gkokolatos



‐‐‐ Original Message ‐‐‐

On Monday, November 1st, 2021 at 9:09 AM, Michael Paquier  
wrote:

> On Fri, Oct 29, 2021 at 08:38:33PM +0900, Michael Paquier wrote:
>
> It would be good to test with many segments, but could we think about
> just relying on LZ4F_decompress() with a frame and computing the
> decompressed size ourselves? At least that will never break, and it
> would work in all the cases pg_receivewal aims at.

Agreed.

I have already started on v8 of the patch with that technique. I should
be able to update the thread soon.

>
> Michael




Re: wait event and archive_command

2021-11-01 Thread Fujii Masao



On 2021/10/21 23:55, Bharath Rupireddy wrote:

Also how about adding wait events for other commands like
archive_cleanup_command, restore_command and recovery_end_command?


+1 for the wait event.


Thanks!
I added the wait events also for restore_command, etc. into the patch.
I have attached that updated version of the patch.



The following activitymsg strings that are set in the ps display in
XLogFileRead and pgarch_archiveXlog came up in one of our internal
team discussions recently:

 snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
  xlogfname);
 set_ps_display(activitymsg);

 snprintf(activitymsg, sizeof(activitymsg), "recovering %s",
  xlogfname);
 set_ps_display(activitymsg);

 snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
 set_ps_display(activitymsg);

The ps display info might be useful if we run postgres on a standalone
box and someone is monitoring the ps output, but it doesn't help with
debugging after an issue has occurred. How about adding the following
statements, which would let someone look at the server logs and know
what was/is happening during recovery and archiving?


If an issue occurs while the command is executing,
the error message is logged, isn't it? Isn't that enough for your case?



IMO, we should also have the elog statement.

elog(LOG, "waiting for %s", xlogfname);
elog(LOG, "recovering %s"", xlogfname);
elog(LOG, "archiving %s", xlog);


I'm afraid that some people think that it's noisy to always log those messages.



Another idea could be to have a hook emitting the above info to
outside components, but a hook just for this purpose isn't a great
idea IMO.


Yes, this idea sounds overkill to me.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3173ec2566..af6914872b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1569,7 +1569,17 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   Waiting for subplan nodes of an Append plan
node to be ready.
  
-
+ 
+  ArchiveCleanupCommand
+  Waiting for  to
+   complete.
+ 
+ 
+  ArchiveCommand
+  Waiting for  to
+   complete.
+ 
+ 
   BackendTermination
   Waiting for the termination of another backend.
  
@@ -1747,6 +1757,11 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   Waiting for recovery conflict resolution for dropping a
tablespace.
  
+ 
+  RecoveryEndCommand
+  Waiting for  to
+   complete.
+ 
  
   RecoveryPause
   Waiting for recovery to be resumed.
@@ -1761,6 +1776,11 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   Waiting for a replication slot to become inactive so it can be
dropped.
  
+ 
+  RestoreCommand
+  Waiting for  to
+   complete.
+ 
  
   SafeSnapshot
   Waiting to obtain a valid snapshot for a READ ONLY
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index f547efd294..c23229a0ac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5747,7 +5747,8 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, 
XLogRecPtr EndOfLog)
if (recoveryEndCommand && strcmp(recoveryEndCommand, "") != 0)
ExecuteRecoveryCommand(recoveryEndCommand,
   
"recovery_end_command",
-  true);
+  true,
+  
WAIT_EVENT_RECOVERY_END_COMMAND);
 
/*
 * We switched to a new timeline. Clean up segments on the old timeline.
@@ -9863,7 +9864,8 @@ CreateRestartPoint(int flags)
if (archiveCleanupCommand && strcmp(archiveCleanupCommand, "") != 0)
ExecuteRecoveryCommand(archiveCleanupCommand,
   
"archive_cleanup_command",
-  false);
+  false,
+  
WAIT_EVENT_ARCHIVE_CLEANUP_COMMAND);
 
return true;
 }
diff --git a/src/backend/access/transam/xlogarchive.c 
b/src/backend/access/transam/xlogarchive.c
index 26b023e754..47a287dd2c 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -24,6 +24,7 @@
 #include "access/xlogarchive.h"
 #include "common/archive.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "postmaster/startup.

Re: wait event and archive_command

2021-11-01 Thread Fujii Masao




On 2021/10/22 18:32, Michael Paquier wrote:

On Thu, Oct 21, 2021 at 10:57:50PM +0900, Fujii Masao wrote:

Also how about adding wait events for other commands like
archive_cleanup_command, restore_command and recovery_end_command?


+1 to add something for all of them as we track the startup process in
pg_stat_activity.  Thinking with a larger picture, this comes down to
the usage of system().  We could introduce a small wrapper of system()
that takes as argument a wait event for the backend.


That's an idea, but as far as I can tell from implementing the patch, introducing
such a wrapper function doesn't seem to simplify or improve the source code.
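
For reference, the kind of wrapper being discussed would be something
like this minimal sketch (the name and placement are hypothetical):

#include "postgres.h"

#include <stdlib.h>

#include "pgstat.h"

/*
 * Run a shell command while advertising the given wait event, so that
 * pg_stat_activity shows what the process is waiting on.
 */
static int
RunShellCommand(const char *command, uint32 wait_event_info)
{
	int			rc;

	pgstat_report_wait_start(wait_event_info);
	rc = system(command);
	pgstat_report_wait_end();

	return rc;
}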

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Added schema level support for publication.

2021-11-01 Thread Greg Nancarrow
On Mon, Nov 1, 2021 at 5:07 PM Masahiko Sawada  wrote:
>
> I haven't followed the discussion on pg_publication_objects view but
> what is the primary use case of this view? If it's to list all tables
> published in a publication (e.g, "select * from pg_publication_objects
> where pubname = 'pub1'), pg_publication_objects view lacks the
> information of FOR ALL TABLES publications. And probably we can use
> pg_publication_tables instead. On the other hand, if it's to list all
> tables published in FOR ALL TABLES IN SCHEMA publications (e.g.,
> "select * from pg_publication_object where objtype = 'schema'), the
> view doesn't show tables published in such publications.
>

I think that Amit originally suggested having a view that provides
information about the objects in each publication (like table, tables
in schema, sequence ...).
So it is currently at the granularity of the objects that are actually
added to the publication (TABLE t, ALL TABLES IN SCHEMA s).
I agree that the view is currently missing ALL TABLES publication
information, but I think that could be easily added.
Also, for the "objtype" column, the type "schema" does not seem
specific enough; maybe it should instead be named
"all-tables-in-schema" or similar.


Regards,
Greg Nancarrow
Fujitsu Australia




Fix C4819 warning in MSVC

2021-11-01 Thread Daniel Gustafsson
Reading 1051867.1635720...@sss.pgh.pa.us I noticed that hamerkop raises a C4819
warning on brin_bloom.c, which is defined as:

   "The file contains a character that cannot be represented in the current
code page (number).  Save the file in Unicode format to prevent data loss."

The warning message is mostly gibberish, but AFAICT the issue is that the
header comment includes a citation of the paper used for the hashing scheme, in
which a footnote marker was copied from the paper (without the footnote
itself).  Since the footnote is irrelevant for our documentation, I propose
removing this offending character to silence the warning in the build.

--
Daniel Gustafsson   https://vmware.com/



brin_citation_footnote.diff
Description: Binary data


Re: Partial aggregates pushdown

2021-11-01 Thread Peter Eisentraut



On 21.10.21 12:55, Alexander Pyhalov wrote:
Now aggregates with internal states can be pushed down, if they are 
marked as pushdown safe (this flag is set to true for min/max/sum),
have internal states and associated converters. Converters are called 
locally, they transform aggregate result to serialized internal 
representation.
As converters don't have access to internal aggregate state, partial 
aggregates like avg() are still not pushable.


It seems to me that the system should be able to determine from the 
existing aggregate catalog entry whether an aggregate can be pushed 
down.  For example, it could check aggtranstype != internal and similar. 
 A separate boolean flag should not be necessary.  Or if it is, the 
patch should provide some guidance about how an aggregate function 
author should set it.





Re: Added schema level support for publication.

2021-11-01 Thread Amit Kapila
On Mon, Nov 1, 2021 at 2:48 AM Tomas Vondra
 wrote:
>
> On 10/28/21 04:41, Amit Kapila wrote:
> > On Mon, Oct 25, 2021 at 3:09 PM Amit Kapila  wrote:
> >>
> >> On Mon, Oct 25, 2021 at 1:11 PM vignesh C  wrote:
> >>>
> >>> I have fixed this in the v47 version attached.
> >>>
> >>
> >> Thanks, the first patch in the series "Allow publishing the tables of
> >> schema." looks good to me. Unless there are more
> >> comments/bugs/objections, I am planning to commit it in a day or so.
> >>
> >
> > Yesterday, I have pushed the first patch. Feel free to submit the
> > remaining patches.
> >
>
> I haven't been following this thread recently, but while rebasing the
> sequence decoding patch I noticed this adds
>
>  PUBLICATIONOBJ_TABLE,/* Table type */
>  PUBLICATIONOBJ_REL_IN_SCHEMA,/* Relations in schema type */
>
> Shouldn't it be PUBLICATIONOBJ_TABLE_IN_SCHEMA, or why does it use rel
> instead of table?
>

Yeah, it should be PUBLICATIONOBJ_TABLE_IN_SCHEMA considering we have
to add other objects like sequence.

> I'm asking because the sequence decoding patch mimics ALTER PUBLICATION
> options for sequences, including ALL SEQUENCES IN SCHEMA etc. and this
> seems ambiguous. The same issue applies to PUBLICATIONOBJ_CURRSCHEMA,
> which does not specify the object type.
>

I think we should name it PUBLICATIONOBJ_TABLE_CURRSCHEMA. Does that make sense?

> I wonder if it'd be better to just separate the schema and object type
> specification, instead of mashing it into a single constant.
>

Do you mean syntax along the lines of Create Publication For Table t1,
t2 Schema s1, s2;? If so, the patch originally had syntax along those
lines, but Tom pointed out that the meaning of such syntax can change
over time and that can break apps [1]. I think the current syntax gives
a lot of flexibility to users, and we have some precedent for it as
well.

[1] - https://www.postgresql.org/message-id/155565.1628954580%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.




Re: Added schema level support for publication.

2021-11-01 Thread Amit Kapila
On Mon, Nov 1, 2021 at 2:38 PM Greg Nancarrow  wrote:
>
> On Mon, Nov 1, 2021 at 5:07 PM Masahiko Sawada  wrote:
> >
> > I haven't followed the discussion on pg_publication_objects view but
> > what is the primary use case of this view? If it's to list all tables
> > published in a publication (e.g, "select * from pg_publication_objects
> > where pubname = 'pub1'), pg_publication_objects view lacks the
> > information of FOR ALL TABLES publications. And probably we can use
> > pg_publication_tables instead. On the other hand, if it's to list all
> > tables published in FOR ALL TABLES IN SCHEMA publications (e.g.,
> > "select * from pg_publication_object where objtype = 'schema'), the
> > view doesn't show tables published in such publications.
> >

Both the problems mentioned can be fixed if we follow the change
suggested by me in one of the emails above [1].

>
> I think that Amit originally suggested to have a view that provides
> information about the objects in each publication (like table, tables
> in schema, sequence ...).
>

Right, and I think, as you also mentioned in your previous email, it
can save users effort when they want to know all the objects published
via a publication. I am just not sure whether it is worth adding such a
view, or whether we should leave it to users to find that information by
querying the individual views or system tables for each object type.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BL2-6JQ174sVfE3_K%3DmjTKJ2A8-z%2B_pExDHhqdBJvb5Q%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Partial aggregates pushdown

2021-11-01 Thread Alexander Pyhalov

Peter Eisentraut wrote on 2021-11-01 12:47:

On 21.10.21 12:55, Alexander Pyhalov wrote:
Now aggregates with internal states can be pushed down, if they are 
marked as pushdown safe (this flag is set to true for min/max/sum),
have internal states and associated converters. Converters are called 
locally, they transform aggregate result to serialized internal 
representation.
As converters don't have access to internal aggregate state, partial 
aggregates like avg() are still not pushable.


It seems to me that the system should be able to determine from the
existing aggregate catalog entry whether an aggregate can be pushed
down.  For example, it could check aggtranstype != internal and
similar.  A separate boolean flag should not be necessary.


Hi.
I don't think we can infer this property from existing flags. For example,
if I have avg() with bigint[] aggtranstype, it doesn't mean we can push
it down. We also couldn't decide whether a partial aggregate is safe to push
down based on the presence of aggfinalfn (for example, it is defined for
sum(numeric), but we can still push that down).



Or if it
is, the patch should provide some guidance about how an aggregate
function author should set it.


Where should it be provided?

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Missing include in be-secure-openssl.c?

2021-11-01 Thread Daniel Gustafsson
> On 1 Nov 2021, at 06:27, Michael Paquier  wrote:
> On Sun, Oct 31, 2021 at 06:45:47PM -0400, Tom Lane wrote:

>> Anyway, I propose adding that #include.
> 
> openssl/ssl.h includes openssl/x509.h if OPENSSL_NO_DEPRECATED_1_1_0
> is not defined, but agreed that adding the header makes sense here.

It does make sense, but it's a bit worrisome that the indirect inclusion no
longer works as there is no obvious explanation as to why.  Looking at the
headers in supported versions, the only real difference would be that 3.0.0 now
defines #pragma once.  For that to matter though it would mean it was included
in the compilation unit before OPENSSL_API_COMPAT is defined from pg_config.h
(or something entirely else as #pragma once is problematic and compiler
dependent).

Knowing the version used in hamerkop before and after (assuming it changed)
would be quite interesting.

> x509v3.h includes x509.h, so fe-secure-openssl.h would not need an
> update.  Now could it be a better practice to include both there?

Judging by OpenSSL, including both is common practice unless the module only
deals with v3 extensions. Following that lead seems reasonable.

--
Daniel Gustafsson   https://vmware.com/





Re: Added schema level support for publication.

2021-11-01 Thread Tomas Vondra




On 11/1/21 11:18, Amit Kapila wrote:

On Mon, Nov 1, 2021 at 2:48 AM Tomas Vondra
 wrote:


On 10/28/21 04:41, Amit Kapila wrote:

On Mon, Oct 25, 2021 at 3:09 PM Amit Kapila  wrote:


On Mon, Oct 25, 2021 at 1:11 PM vignesh C  wrote:


I have fixed this in the v47 version attached.



Thanks, the first patch in the series "Allow publishing the tables of
schema." looks good to me. Unless there are more
comments/bugs/objections, I am planning to commit it in a day or so.



Yesterday, I have pushed the first patch. Feel free to submit the
remaining patches.



I haven't been following this thread recently, but while rebasing the
sequence decoding patch I noticed this adds

  PUBLICATIONOBJ_TABLE,/* Table type */
  PUBLICATIONOBJ_REL_IN_SCHEMA,/* Relations in schema type */

Shouldn't it be PUBLICATIONOBJ_TABLE_IN_SCHEMA, or why does it use rel
instead of table?



Yeah, it should be PUBLICATIONOBJ_TABLE_IN_SCHEMA considering we have
to add other objects like sequence.


I'm asking because the sequence decoding patch mimics ALTER PUBLICATION
options for sequences, including ALL SEQUENCES IN SCHEMA etc. and this
seems ambiguous. The same issue applies to PUBLICATIONOBJ_CURRSCHEMA,
which does not specify the object type.



I think we should name it PUBLICATIONOBJ_TABLE_CURRSCHEMA. Does that make sense?


I wonder if it'd be better to just separate the schema and object type
specification, instead of mashing it into a single constant.



Do you mean to say the syntax on the lines of Create Publication For
Table t1, t2 Schema s1, s2;? If so, then originally the patch had the
syntax on those lines but Tom pointed out that the meaning of such a
syntax can change over a period of time and that can break apps [1]. I
think the current syntax gives a lot of flexibility to users and we
have some precedent for it as well.



No, I'm not talking about the syntax at all - I'm talking about how we 
represent it. PUBLICATIONOBJ_TABLE_CURRSCHEMA mixes the object type and 
schema in the same constant, so I am wondering if we should just split 
that into two pieces - one determining the schema, one determining the 
object type. So PublicationObjSpec would have two fields instead of just 
pubobjtype.


The advantage would be we wouldn't need a whole lot of new constants for 
each object type - adding sequences pretty much means adding


PUBLICATIONOBJ_SEQUENCE
PUBLICATIONOBJ_SEQUENCE_IN_SCHEMA
PUBLICATIONOBJ_SEQUENCE_CURRSCHEMA

and after splitting we'd need just the first one. But maybe it's not
that bad; we don't expect all that many object types in publications, I
guess.
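
For illustration, a minimal sketch of that split representation (all
names here are hypothetical, not what the patch actually uses) might
look like:

/* what kind of object a publication entry refers to */
typedef enum PublicationObjKind
{
	PUBLICATIONOBJ_KIND_TABLE,		/* a table */
	PUBLICATIONOBJ_KIND_SEQUENCE	/* a sequence */
} PublicationObjKind;

/* how the object is selected */
typedef enum PublicationObjScope
{
	PUBLICATIONOBJ_SCOPE_OBJECT,	/* a single named object */
	PUBLICATIONOBJ_SCOPE_SCHEMA,	/* ALL <objects> IN SCHEMA name */
	PUBLICATIONOBJ_SCOPE_CURRSCHEMA /* ALL <objects> IN SCHEMA CURRENT_SCHEMA */
} PublicationObjScope;

typedef struct PublicationObjSpec
{
	NodeTag		type;
	PublicationObjKind kind;	/* object type */
	PublicationObjScope scope;	/* single object vs. schema-level */
	/* ... name fields as today ... */
	int			location;		/* token location, or -1 if unknown */
} PublicationObjSpec;

Adding sequences would then mean adding only PUBLICATIONOBJ_KIND_SEQUENCE.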



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Feature request for adoptive indexes

2021-11-01 Thread Hayk Manukyan
I agree with the above.
The only concern I have is that we are comparing slightly the wrong things.
For reads we should compare
 (job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) AND (job, nlp,
year, issue_flag) VS (job, nlp, year, sequence, Scan_ID, issue_flag)
OR (job, nlp, year) INCLUDE (sequence, Scan_ID, issue_flag),
because the proposed index for reading should be closer to a combination of
those three, and the current alternatives are an index on all columns or one
with an INCLUDE clause.
We should try to find the gap between these cases.
For DML queries we should compare
 (job, nlp, year, sequence, Scan_ID, issue_flag) OR (job, nlp, year)
INCLUDE (sequence, Scan_ID, issue_flag) VS (job, nlp, year, sequence) AND
(job, nlp, year, Scan_ID) AND (job, nlp, year, issue_flag),
because again the proposed index should be a single index covering all three
separate ones.

If you agree with these cases, I will try to find a bigger block of time to
compare the two approaches more deeply.
The issue is not high priority, but I strongly believe it can help and could
be a nice feature for even more complicated cases.

Best regards.




вс, 31 окт. 2021 г. в 21:33, Tomas Vondra :

>
>
> On 10/31/21 16:48, Pavel Borisov wrote:
> >4 columns: 106 ms
> >6 columns: 109 ms
> >
> > So there's like 3% difference between the two cases, and even that
> > might
> > be just noise. This is consistent with the two indexes being about
> the
> > same size.
> >
> > I also don't think we can get great speedup in the mentioned case, so it
> > is not urgently needed of course. My point is that it is just nice to
> > have a multicolumn index constructed on stacked trees constructed on
> > separate columns, not on the index tuples as a whole thing.
>
> Well, I'd say "nice to have" features are pointless unless they actually
> give tangible benefits (like speedup) to users. I'd bet no one is going
> to implement and maintain something unless it has such a benefit, because
> they have to weigh it against other beneficial features.
>
> Maybe there are use cases where this would be beneficial, but so far we
> haven't seen one. Usually it's the OP who presents such a case, and a
> plausible way to improve it - but it seems this thread presents a
> solution and now we're looking for an issue it might solve.
>
> > At least there is a benefit of sparing shared memory if we don't need
> > to cache index tuples of several similar indexes, instead caching one
> > "compound index". So if someone wants to propose this thing I'd
> > support it provided problems with concurrency, which were mentioned
> > by Peter are solved.
> >
>
> The problem with this is that it assumes the new index would use (significantly)
> less space than three separate indexes. I find that rather unlikely, but
> maybe there is a smart way to achieve that (certainly not in detail).
>
> I don't want to sound overly pessimistic and if you have an idea how to
> do this, I'd like to hear it. But it seems pretty tricky, particularly
> if we assume the suffix columns are more variable (which limits the
> "compression" ratio etc.).
>
> These problems could appear easy though, as we have index tuples
> constructed in a similar way to heap tuples. Maybe it could be easier if
> > we had another heap am, which stored separate attributes (if so it could
> > be useful as a better JSON storage method than we have today).
> >
>
> IMO this just moved the goalposts somewhere outside the solar system.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: parallel vacuum comments

2021-11-01 Thread Masahiko Sawada
On Mon, Nov 1, 2021 at 10:44 AM Masahiko Sawada  wrote:
>
> On Sun, Oct 31, 2021 at 6:21 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > Due to bug #17245: [1] I spent a considerably amount of time looking at 
> > vacuum
> > related code. And I found a few things that I think could stand improvement:
> >
> > - There's pretty much no tests. This is way way too complicated feature for
> >   that. If there had been tests for the obvious edge case of some indexes
> >   being too small to be handled in parallel, but others needing parallelism,
> >   the mistake leading to #17245 would have been caught during development.
>
> Yes. We should have tests at least for such cases.

For discussion, I've written a patch that only adds some tests for
parallel vacuum. The tests include the reported case where small
indexes are not processed by the leader process, as well as cases where
different kinds of indexes (i.e., different amparallelvacuumoptions)
are vacuumed or cleaned up.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


regression_tests_for_parallel_vacuum.patch
Description: Binary data


Re: should we enable log_checkpoints out of the box?

2021-11-01 Thread Magnus Hagander
On Sun, Oct 31, 2021 at 9:05 PM Tomas Vondra 
wrote:

> On 10/31/21 21:16, Andres Freund wrote:
> > Hi,
> >
> > On 2021-10-31 15:43:57 -0400, Tom Lane wrote:
> >> Andres Freund  writes:
> >>> On 2021-10-31 10:59:19 -0400, Tom Lane wrote:
>  No DBA would be likely to consider it as anything but log spam.
> >>
> >>> I don't agree at all. No postgres instance should be run without
> >>> log_checkpoints enabled. Performance is poor if checkpoints are
> >>> triggered by anything but time, and that can only be diagnosed if
> >>> log_checkpoints is on.
> >>
> >> This is complete nonsense.
> >
> > Shrug. It's based on many years of doing or being around people doing
> > postgres support escalation shifts. And it's not like log_checkpoints
> > incurs meaningful overhead or causes that much log volume.
> >
>
> Yeah. In tuned instances the checkpoints happen fairly infrequently most
> of the time (occasional batch loads being an exception, etc.), so the
> extra log traffic should be fairly negligible. And it's not like we can
> make checkpointer infinitely smart - sometimes the cause is a change in
> the workload etc.
>
> OTOH most of this data (# of timed/xlog checkpoints, buffers written by
> checkpointer etc.) is available in the pg_stat_bgwriter view, and people
> generally have monitoring these days.
>

Yeah, I think you can get much of the data you need from pg_stat_bgwriter.
There is still some data that log_checkpoints gives you that the statistics
don't -- but maybe we should instead look at exposing that information in
pg_stat_bgwriter, rather than changing the default.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


RE: Failed transaction statistics to measure the logical replication progress

2021-11-01 Thread osumi.takami...@fujitsu.com
On Thursday, October 28, 2021 11:19 PM I wrote:
> I've created a new patch that extends pg_stat_subscription_workers to include
> other transaction statistics.
> 
> Note that this patch depends on v18 patch-set in [1]...
Rebased based on the v19 in [1].
Also, fixed documentation a little and made tests tidy.
FYI, the newly included TAP test (027_worker_xact_stats.pl) is stable:
I ran it 100 times in a tight loop and all runs passed.


[1] - 
https://www.postgresql.org/message-id/CAD21AoDY-9_x819F_m1_wfCVXXFJrGiSmR2MfC9Nw4nW8Om0qA%40mail.gmail.com


Best Regards,
Takamichi Osumi



extend_xact_stats_of_subscription_worker_v7.patch
Description: extend_xact_stats_of_subscription_worker_v7.patch


Re: Missing include in be-secure-openssl.c?

2021-11-01 Thread Tom Lane
Daniel Gustafsson  writes:
> It does make sense, but it's a bit worrisome that the indirect inclusion no
> longer works as there is no obvious explanation as to why.

Yeah.  Just to make things even more confusing, hamerkop is not failing
in the back branches.  v14 at least has exactly the same contents of
be-secure-openssl.c, so how's that happening?

>> x509v3.h includes x509.h, so fe-secure-openssl.h would not need an
>> update.  Now could it be a better practice to include both there?

> Judging by OpenSSL, including both is common practice unless the module only
> deals with v3 extensions. Following that lead seems reasonable.

I see that fe-secure-openssl.c includes only x509v3.h, and it builds
successfully on hamerkop.  So I'm now inclined to make be-secure-openssl.c
match that.  But it'd still be a good thing to trace the real cause.

regards, tom lane




Commitfest 2021-11

2021-11-01 Thread Daniel Gustafsson
It is now November 1st AoE, and thus the 2021-11 commitfest is in progress.
I've switched the status and opened 2022-01 for new patches, but no one has so
far raised their hand to run it, AFAICT.  Do we have a volunteer Commitfest
Manager keen to help the community make progress on closing patches?

--
Daniel Gustafsson   https://vmware.com/





Re: remove internal support in pgcrypto?

2021-11-01 Thread Andrew Dunstan


On 8/24/21 08:38, Daniel Gustafsson wrote:
>> On 24 Aug 2021, at 11:13, Peter Eisentraut 
>>  wrote:
>> So I'm tempted to suggest that we remove the built-in, non-OpenSSL cipher 
>> and hash implementations in pgcrypto (basically INT_SRCS in 
>> pgcrypto/Makefile), and then also pursue the simplifications in the OpenSSL 
>> code paths described in [0].
> +1
>
>> Thoughts?
> With src/common/cryptohash_*.c and contrib/pgcrypto we have two abstractions
> for hashing ciphers, should we perhaps retire hashing from pgcrypto altogether
> and pull across what we feel is useful to core (AES and 3DES and..)?  There is
> already significant overlap, and allowing core to only support certain ciphers
> when compiled with OpenSSL isn’t any different from doing it in pgcrypto
> really.
>
>> (Some thoughts from those pursuing NSS support would also be useful.)
> Blowfish and CAST5 are not available in NSS.  I've used the internal Blowfish
> implementation as a fallback in the NSS patch and left CAST5 as not supported.
> This proposal would mean that Blowfish too wasn’t supported in NSS builds, but
> I personally don’t see that as a dealbreaker.
>

Maybe it would be worth creating a non-core extension for things like
this that we are ripping out? I have no idea how many people might be
using them.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: storing an explicit nonce

2021-11-01 Thread Antonin Houska
Sasasu  wrote:

> On 2021/10/6 23:01, Robert Haas wrote:
> > This seems wrong to me. CTR requires that you not reuse the IV. If you
> > re-encrypt the page with a different IV, torn pages are a problem. If
> > you re-encrypt it with the same IV, then it's not secure any more.

> for CBC, if the IV is predictable it will cause a "dictionary attack".

The following sounds like IV *uniqueness* is needed to defend against "known
plaintext attack" ...

> and for CBC and GCM, reusing the IV will cause a "known plaintext attack".

... but here you seem to say that *randomness* is also necessary:

> XTS works like CBC but adds a tweak step.  The tweak step does not add
> randomness. It means XTS is still subject to a "known plaintext attack",

(I suppose you mean "XTS with incorrect (e.g. non-random) IV", rather than XTS
as such.)

> due to the same reason from CBC.

According to the Appendix C of

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf

CBC requires *unpredictability* of the IV, but that does not necessarily mean
randomness: the unpredictable IV can be obtained by applying the forward
cipher function to a unique value.


Can you please try to explain once again what you consider a requirement
(uniqueness, randomness, etc.) on the IV for the XTS mode? Thanks.


-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Fix C4819 warning in MSVC

2021-11-01 Thread Tom Lane
Daniel Gustafsson  writes:
> Reading 1051867.1635720...@sss.pgh.pa.us I noticed that hamerkop raise a C4819
> warning on brin_bloom.c, which is defined as:
>"The file contains a character that cannot be represented in the current
> code page (number).  Save the file in Unicode format to prevent data 
> loss."
> The warning message is mostly gibberish but AFAICT the issue is that the
> headercomment includes a citation to the paper used for the hashing scheme, in
> which a footnote character has been copied from the paper (without the 
> footnote
> copied).  Since the footnote is irrelevant for our documentation, I propose to
> remove this offending character to remove warnings from the build.

+1, but there are also C4819 warnings in fe_utils/print.c.  Can we get
rid of that too?  That one's a bit more integral to the code, since
(I think) it's complaining about the comments in the unicode_style table.
But maybe we could replace those with Unicode codes + symbol names?

regards, tom lane




Re: pgbench bug candidate: negative "initial connection time"

2021-11-01 Thread Fujii Masao



On 2021/09/29 22:11, Fujii Masao wrote:



On 2021/09/24 11:26, Fujii Masao wrote:



On 2021/09/24 7:26, Yugo NAGATA wrote:

That makes sense. Failures of the setup connection or initial connections don't
seem like 'static problems'. I rewrote this description to explain that exit
status 1 also indicates internal errors and early errors.

   Exit status 1 indicates static problems such as invalid command-line options
   or internal errors which are supposed to never occur.  Early errors that 
occur
   when starting benchmark such as initial connection failures also exit with
   status 1.


LGTM. Barring any objection, I will commit the patch.


I extracted two changes from the patch and pushed (also back-patched) them.

What remains are the changes to the handling of initial connection or
logfile open failures. I agree with pushing them at least to master,
but I'm not sure whether they should be back-patched. Without these changes,
even when those failures happen, pgbench proceeds with the benchmark and
reports the result. With the changes, pgbench exits immediately in
that case. I'm not sure whether there are people who expect the old behavior,
but if there are, maybe we should not change it, at least in the stable
branches. Thoughts?


The current behavior should be improved, but it's not a bug.
So I don't think that the patch needs to be back-patched.
Barring any objection, I will push the attached patch to the master.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 0f432767c2..c71dab644c 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -904,10 +904,12 @@ pgbench  options 
 d
 
   
A successful run will exit with status 0.  Exit status 1 indicates static
-   problems such as invalid command-line options.  Errors during the run such
-   as database errors or problems in the script will result in exit status 2.
-   In the latter case, pgbench will print partial
-   results.
+   problems such as invalid command-line options or internal errors which
+   are supposed to never occur.  Early errors that occur when starting
+   benchmark such as initial connection failures also exit with status 1.
+   Errors during the run such as database errors or problems in the script
+   will result in exit status 2. In the latter case,
+   pgbench will print partial results.
   
  
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d17f69333f..f8331bbb60 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3181,6 +3181,7 @@ advanceConnectionState(TState *thread, CState *st, 
StatsData *agg)
 
if ((st->con = doConnect()) == NULL)
{
+   /* as the bench is already 
running, we do not abort the process */
pg_log_error("client %d aborted 
while establishing connection", st->id);
st->state = CSTATE_ABORTED;
break;
@@ -4456,7 +4457,10 @@ runInitSteps(const char *initialize_steps)
initPQExpBuffer(&stats);
 
if ((con = doConnect()) == NULL)
+   {
+   pg_log_fatal("could not create connection for initialization");
exit(1);
+   }
 
setup_cancel_handler(NULL);
SetCancelConn(con);
@@ -6399,7 +6403,10 @@ main(int argc, char **argv)
/* opening connection... */
con = doConnect();
if (con == NULL)
+   {
+   pg_log_fatal("could not create connection for setup");
exit(1);
+   }
 
/* report pgbench and server versions */
printVersion(con);
@@ -6625,7 +6632,7 @@ threadRun(void *arg)
if (thread->logfile == NULL)
{
pg_log_fatal("could not open logfile \"%s\": %m", 
logpath);
-   goto done;
+   exit(1);
}
}
 
@@ -6650,16 +6657,10 @@ threadRun(void *arg)
{
if ((state[i].con = doConnect()) == NULL)
{
-   /*
-* On connection failure, we meet the barrier 
here in place of
-* GO before proceeding to the "done" path 
which will cleanup,
-* so as to avoid locking the process.
-*
-* It is unclear whether it is worth doing 
anything rather
-* than coldly exiting with an error message.
-*/
-   THREAD_BARRIER_WAIT(&barrier);
-   goto done;
+   /

Re: pgbench bug candidate: negative "initial connection time"

2021-11-01 Thread Fujii Masao




On 2021/10/09 0:41, Fujii Masao wrote:



On 2021/10/01 15:27, Michael Paquier wrote:

On Wed, Sep 29, 2021 at 10:11:53PM +0900, Fujii Masao wrote:

BTW, when logfile fails to be opened, pgbench gets stuck due to commit
aeb57af8e6. So even if we decided not to back-patch those changes,
we should improve the handling of logfile open failure, to fix the issue.


There is an entry in the CF for this thread:
https://commitfest.postgresql.org/34/3219/

I have moved that to the next one as some pieces are missing.  If you
are planning to handle the rest, could you register your name as a
committer?


Thanks for letting me know that!
I registered myself as a committer of the patch again.


 pg_time_usec_t conn_duration;    /* cumulated connection and deconnection
  * delays */

BTW, while reading the patch, I found the above comment in pgbench.c.
"deconnection" seems a valid word in French (?), but isn't it better to
replace it with "disconnection"? Patch attached.


Barring any objection, I will push this patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Non-superuser subscription owners

2021-11-01 Thread Andrew Dunstan


On 10/20/21 14:40, Mark Dilger wrote:
> These patches have been split off the now deprecated monolithic "Delegating 
> superuser tasks to new security roles" thread at [1].
>
> The purpose of these patches is to allow non-superuser subscription owners 
> without risk of them overwriting tables they lack privilege to write 
> directly. This both allows subscriptions to be managed by non-superusers, and 
> protects servers with subscriptions from malicious activity on the publisher 
> side.
>
> [1] 
> https://www.postgresql.org/message-id/flat/F9408A5A-B20B-42D2-9E7F-49CD3D1547BC%40enterprisedb.com


These patches look good on their face. The code changes are very
straightforward.


w.r.t. this:

+   On the subscriber, the subscription owner's privileges are
re-checked for
+   each change record when applied, but beware that a change of
ownership for a
+   subscription may not be noticed immediately by the replication workers.
+   Changes made on the publisher may be applied on the subscriber as
+   the old owner.  In such cases, the old owner's privileges will be
the ones
+   that matter.  Worse still, it may be hard to predict when replication
+   workers will notice the new ownership.  Subscriptions created
disabled and
+   only enabled after ownership has been changed will not be subject to
this
+   race condition.


maybe we should disable the subscription before making such a change and
then re-enable it?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Feature request for adoptive indexes

2021-11-01 Thread Tomas Vondra
On 11/1/21 1:24 PM, Hayk Manukyan wrote:
> I agree with the above.
> The only concern I have is that we are comparing slightly the wrong things.
> For reads we should compare
>  (job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) AND (job, nlp,
> year, issue_flag) VS (job, nlp, year, sequence, Scan_ID, issue_flag)
> OR (job, nlp, year) INCLUDE (sequence, Scan_ID, issue_flag),
> because the proposed index for reading should be closer to a combination
> of those three, and the current alternatives are an index on all columns
> or one with an INCLUDE clause.

I don't follow.

The whole point of the experiment was to show the gap between a "best
case" and "worst case" alternatives, with the assumption the gap would
be substantial and the new index type might get close to the best case.

Are you suggesting those are not the actual best/worst cases and we
should use some other indexes? If yes, which ones?


IMHO those best/worst cases are fine because:

1) best case (job, nlp, year, sequence)

I don't see how we could get anything better for queries on "sequence"
than this index, because that's literally one of the indexes that would
be included in the whole index.

Yes, if you need to support queries on additional columns, you might
need more indexes, but that's irrelevant - why would anyone define those
indexes, when the "worst case" btree index with all the columns is so
close to the best case?


2) worst case (job, nlp, year, scan_id, issue_flag, sequence)

I think an index with INCLUDE is entirely irrelevant here. The reason to
use INCLUDE is to define a UNIQUE index on a subset of columns, but that's
not what we need here. I repeated the benchmark with such an index, and the
timing is ~150ms, so about 50% slower than the simple index. Sorting on
all columns is clearly beneficial even for the last column.


So I still think those best/worst cases are sensible, and the proposed
index would need to beat the worst case. Which seems challenging,
considering how close it is to the best case. Or it might break the best
case, if there's some sort of revolutionary way to store the small
indexes or something like that.

The fact that there's no size difference between the two cases is mostly
a coincidence, due to the columns being just 2B each; with wider values
the difference might be substantial, making the gap larger. But then the
new index would have to improve on this, and there's no proposal for how
to do that.


> We should try to find a gap between these three cases.
> For DML queries 
>  (job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year
> INCLUDE(sequence, Scan_ID, issue_flag) ) VS  (job, nlp, year, sequence)
> AND (job, nlp, year, Scan_ID) and (job, nlp, year,  issue_flag  )
> Because again the proposed index should be just one and cover all 3
> separate ones. 
> 
> If you agree with these cases I will try to find a bigger time frame to
> compare these two cases deeper. 
>
> The issue is not high prio but I strongly believe it can help and can be
> nice feature for even more complicated cases.
> 

You don't need my approval to run benchmarks etc. If you believe this is
beneficial then just do the tests and you'll see if it makes sense ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Added TRANSFORM FOR for COMMENT tab completion

2021-11-01 Thread Fujii Masao




On 2021/10/27 15:54, Shinya Kato wrote:

On 2021-10-27 14:45, Michael Paquier wrote:

On Tue, Oct 26, 2021 at 05:04:24PM +0900, Shinya Kato wrote:

Barring any objection, I will change status to Ready for Committer.


+   else if (Matches("COMMENT", "ON", "PROCEDURAL"))
+   COMPLETE_WITH("LANGUAGE");
+   else if (Matches("COMMENT", "ON", "PROCEDURAL", "LANGUAGE"))
+   COMPLETE_WITH_QUERY(Query_for_list_of_languages);
I don't think that there is much point in being this picky either with
the usage of PROCEDURAL, as we already complete a similar and simpler
grammar with LANGUAGE.  I would just remove this part of the patch.

In my opinion, since it is written in the documentation, tab-completion of
"PROCEDURAL" is good.
How about completing with both "LANGUAGE" and "PROCEDURAL LANGUAGE", like
"PASSWORD" and "ENCRYPTED PASSWORD" in CREATE ROLE?


+   else if (Matches("COMMENT", "ON", "OPERATOR"))
+   COMPLETE_WITH("CLASS", "FAMILY");
Isn't this one wrong?  Operators can have comments, and we'd miss
them.  This is mentioned upthread, but it seems to me that we'd better
drop this part of the patch if the operator naming part cannot be
solved easily.

As you said, it may be misleading.
I agree to drop it.


So I changed the status of the patch to Waiting on Author in CF.


+static const SchemaQuery Query_for_list_of_text_search_configurations = {

We already have Query_for_list_of_ts_configurations in tab-complete.c.
Do we really need both queries? Or we can drop either of them?


+#define Query_for_list_of_operator_class_index_methods \
+"SELECT pg_catalog.quote_ident(amname)"\
+"  FROM pg_catalog.pg_am"\
+" WHERE (%d = pg_catalog.length('%s'))"\
+"   AND oid IN "\
+"   (SELECT opcmethod FROM pg_catalog.pg_opclass "\
+" WHERE pg_catalog.quote_ident(opcname)='%s')"

Isn't it overkill to tab-complete this? I think so because I'm not
sure the COMMENT command for OPERATOR CLASS or FAMILY is usually
executed interactively via psql; I'd guess it's usually run from a
script. Also, there is no tab-completion support for the ALTER
OPERATOR CLASS or FAMILY commands, so it seems a bit strange to
support tab-completion for COMMENT on OPERATOR CLASS
or FAMILY first.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




enabling FOO=bar arguments to vcregress.pl

2021-11-01 Thread Andrew Dunstan

As I mentioned recently at
, 
I want to get USE_MODULE_DB working for vcregress.pl. I started out
writing code to strip this from the command line or get it from the
environment, but then it struck me that it would be better to implement
a general Makefile-like mechanism for handling FOO=bar type arguments on
the command line, along the lines of the attached.


Thoughts?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
index fc826da3ff..168ea7f82a 100644
--- a/src/tools/msvc/vcregress.pl
+++ b/src/tools/msvc/vcregress.pl
@@ -39,6 +39,14 @@ if (-e "src/tools/msvc/buildenv.pl")
 	do "./src/tools/msvc/buildenv.pl";
 }
 
+my %settings;
+
+while (@ARGV && $ARGV[0] =~ /([A-Za-z]\w*)=(.*)/)
+{
+	$settings{$1} = $2;
+	shift;
+}
+
 my $what = shift || "";
 if ($what =~
 	/^(check|installcheck|plcheck|contribcheck|modulescheck|ecpgcheck|isolationcheck|upgradecheck|bincheck|recoverycheck|taptest)$/i


Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Sun, 31 Oct 2021 16:55:01 -0400, Tom Lane  wrote in 
>> I looked at the remaining list_delete_first callers.
>> 
>> 1. Attached is a proposed patch to get rid of the calls in trgm_regexp.c.
>> I'm not certain that those lists could get long enough to be a problem,
>> given the existing complexity limits in that file (MAX_EXPANDED_STATES
>> etc).  But I'm not certain they can't, either, and it's easy enough to
>> fix along the same lines as in StandbyReleaseLockList.

> I may be missing something, but at the added list_free() there's a
> case where keysQueue has some elements.  I think the remaining
> elements are useless, but AFAICS all the memory allocated there is
> freed after createTrgmNFAInternal returns, before the "next cycle"
> comes. Do we need to add that list_free there?

I was mainly trying to preserve the memory allocation behavior of the
current code, which will free the List when its last element is removed.
I agree that it *probably* doesn't matter --- but processState is
invoked multiple times during one createTrgmNFAInternal call, so I'm
not quite sure that leaking all those lists couldn't amount to anything.
It seemed prudent to ensure that things couldn't be made worse by this
patch.

>> 3. Is agg_refill_hash_table a problem?  Probably; we've seen cases
>> with lots of batches.

> I excluded it since, at a glance, I'm not sure it fits the pattern. I
> would want to leave it alone, since changing the logic there seems to
> make things a bit complex and the gain from removing list_delete_first
> doesn't look so large.

It looks to me like nodeAgg.c uses the hash_batches list as a stack:
it's pushing things on with lcons, and later popping them off with
list_delete_first.  This is exactly the pattern that 1cff1b95a
recommended reversing.  Now, it's possible that that stack never gets
deep enough for the O(N^2) cost to matter, but ...
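
For illustration, a minimal sketch of that reversal (illustrative only,
not the actual nodeAgg.c code) would treat the tail of the List as the
top of the stack, so both push and pop stay O(1):

#include "postgres.h"
#include "nodes/pg_list.h"

/* push: append to the tail instead of lcons() onto the head */
static void
stack_push(List **stack, void *item)
{
	*stack = lappend(*stack, item);
}

/* pop: take the last cell instead of the first */
static void *
stack_pop(List **stack)
{
	void	   *top = llast(*stack);

	*stack = list_delete_last(*stack);
	return top;
}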

>> The logic in gistFindPath looks like a mess
>> anyway since it's appending to both ends of the "fifo" list in different
>> places (is that really necessary?).

> From the other side, the elemnts are inserted by lcons, then removed
> by list_delete_first.  It is the worst usage of the current list
> implementation as a FIFO. Couldn't we construct and iterate over a
> list in the reverse order?

Yeah; at the very least, the use of both lcons and lappend falsifies
the variable name "fifo".  I wonder though if that was intentional
or just somebody's sloppy coding.

regards, tom lane




Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Bossart, Nathan
On 10/31/21, 12:39 PM, "Tom Lane"  wrote:
> Yeah, there's no expectation that this data structure needs to be kept
> consistent after an error; and I'm not real sure that the existing
> code could claim to satisfy such a requirement if we did need it.
> (There's at least a short window where the caller's hash table entry
> will point at an already-freed List.)

Right.

> Pushed the patch as given.  I've not yet reviewed other list_delete_first
> callers, but I'll take a look.  (I seem to remember that I did survey
> them while writing 1cff1b95a, but I evidently missed that this code
> could be dealing with a list long enough to be problematic.)

Thanks!

Nathan



Re: Fix C4819 warning in MSVC

2021-11-01 Thread Daniel Gustafsson
> On 1 Nov 2021, at 14:56, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> Reading 1051867.1635720...@sss.pgh.pa.us I noticed that hamerkop raise a 
>> C4819
>> warning on brin_bloom.c, which is defined as:
>>   "The file contains a character that cannot be represented in the current
>>code page (number).  Save the file in Unicode format to prevent data 
>> loss."
>> The warning message is mostly gibberish but AFAICT the issue is that the
>> headercomment includes a citation to the paper used for the hashing scheme, 
>> in
>> which a footnote character has been copied from the paper (without the 
>> footnote
>> copied).  Since the footnote is irrelevant for our documentation, I propose 
>> to
>> remove this offending character to remove warnings from the build.
> 
> +1, but there are also C4819 warnings in fe_utils/print.c.  Can we get
> rid of that too?  That one's a bit more integral to the code, since
> (I think) it's complaining about the comments in the unicode_style table.
> But maybe we could replace those with Unicode codes + symbol names?

Aha, I missed that one when skimming the (quite chatty) log.  The attached
addresses that file as well, replacing the comments with codepoints and names.
It does make the section of the code more verbose, but also more readable IMO.

--
Daniel Gustafsson   https://vmware.com/



msvc_c4819.diff
Description: Binary data


Re: Fix C4819 warning in MSVC

2021-11-01 Thread Tom Lane
Daniel Gustafsson  writes:
> On 1 Nov 2021, at 14:56, Tom Lane  wrote:
>> +1, but there are also C4819 warnings in fe_utils/print.c.  Can we get
>> rid of that too?  That one's a bit more integral to the code, since
>> (I think) it's complaining about the comments in the unicode_style table.
>> But maybe we could replace those with Unicode codes + symbol names?

> Aha, I missed that one when skimming the (quite chatty) log.  The attached
> addresses that file as well, replacing the comments with codepoints and names.
> It does make the section of the code more verbose, but also more readable IMO.

Generally +1, but I think you'll need to add some dashes to keep pgindent
from re-flowing those comment blocks (which would be a big hit to the
readability).

regards, tom lane




Re: Improve logging when using Huge Pages

2021-11-01 Thread Fujii Masao




On 2021/10/29 7:05, Justin Pryzby wrote:

Hi,

On Wed, Oct 27, 2021 at 06:39:46AM +, Shinoda, Noriyoshi (PN Japan FSIP) 
wrote:

Thank you for your comment.
The attached patch stops message splitting.
This patch also limits the timing of message output when huge_pages = try and 
HugePages is not used.


Should the log message be reported even when huge_pages=off (and huge pages
are not used)? Otherwise we cannot determine whether huge pages are actually
used when no such log message is found in the server log.

Or would it be simpler and more intuitive to log the message "Anonymous shared
memory was allocated with huge pages" only when huge pages are successfully
requested? If that message is logged, we can determine that huge pages are
used whatever the setting is. OTOH, if there is no such message, we can
determine that huge pages are not used for some reason, e.g., the OS doesn't
support huge pages, shared_memory_type is not set to mmap, etc.



+   } else if (!with_hugepages && huge_pages == HUGE_PAGES_TRY)
+   ereport(LOG, (errmsg("Anonymous shared memory was allocated without 
huge pages.")));

I would write this as a separate "if".  The preceding block is a terminal FATAL
and it seems more intuitive that way.


Agreed.



Should it include an errcode() ?
ERRCODE_INSUFFICIENT_RESOURCES ?
ERRCODE_OUT_OF_MEMORY ?


Probably errcode is not necessary here because it's a log message,
not an error one?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Improve logging when using Huge Pages

2021-11-01 Thread Fujii Masao




On 2021/10/29 16:00, Masahiko Sawada wrote:

Which is noisy. Perhaps it's better to log it only when
IsPostmasterEnvironment is true.


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Predefined role pg_maintenance for VACUUM, ANALYZE, CHECKPOINT.

2021-11-01 Thread Stephen Frost
Greetings,

* Bossart, Nathan (bossa...@amazon.com) wrote:
> On 10/30/21, 11:14 AM, "Jeff Davis"  wrote:
> > On Sat, 2021-10-30 at 13:24 +0530, Bharath Rupireddy wrote:
> >> IMHO, moving away from SQL command "CHECKPOINT" to function
> >> "pg_checkpoint()" isn't nice as the SQL command has been there for a
> >> long time and all the applications or services that were/are being
> >> built around the postgres ecosystem would have to adapt someday to
> >> the
> >> new function (if at all we deprecate the command and onboard the
> >> function). This isn't good at all given the CHECKPOINT is one of the
> >> mostly used commands in the apps or services layer. Moreover, if we
> >> go
> >> with the function pg_checkpoint(), we might see patches coming in for
> >> pg_vacuum(), pg_reindex(), pg_cluster() and so on.
> >
> > I tend to agree with all of this. The CHECKPOINT command is already
> > there and people already use it. If we are already chipping away at the
> > need for superuser elsewhere, we should offer a way to use CHECKPOINT
> > without being superuser.
> 
> I think Bharath brings up some good points.  The simple fact is that
> CHECKPOINT has been around for a while, and creating functions for
> maintenance tasks would add just as much or more clutter than adding a
> predefined role for each one.  I do wonder what we would've done if
> CHECKPOINT didn't already exist.  Based on the goal of this thread, I
> get the feeling that we might've seriously considered introducing it
> as a function so that you can just GRANT EXECUTE as needed.  

I don't really buy off on the "because it's been around a long time" as
a reason to invent a predefined role for an individual command that
doesn't take any options and could certainly just be a function.
Applications developed to run as a superuser aren't likely to magically
start working just because they were GRANT'd this one additional predefined
role either; they would likely need other changes anyway.

All that said, I wonder if we can have our cake and eat it too.  I
haven't looked into this at all yet and perhaps it's foolish on its
face, but, could we make CHECKPOINT; basically turn around and just run
select pg_checkpoint(); with the regular privilege checking happening?
Then we'd keep the existing syntax working, but whether the user is
allowed to run the command would depend on whether they've been GRANT'd
EXECUTE rights on the function.
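
For illustration, a rough sketch of what that could look like in
standard_ProcessUtility() (assuming a pg_checkpoint() SQL function
exists; the details here are hypothetical, not a worked-out proposal):

case T_CheckPointStmt:
	{
		/*
		 * Instead of requiring superuser, look up the pg_checkpoint()
		 * function and require EXECUTE on it, then request the
		 * checkpoint just as the command does today.
		 */
		Oid			funcoid;

		funcoid = LookupFuncName(list_make1(makeString("pg_checkpoint")),
								 0, NULL, false);
		if (pg_proc_aclcheck(funcoid, GetUserId(), ACL_EXECUTE) != ACLCHECK_OK)
			aclcheck_error(ACLCHECK_NO_PRIV, OBJECT_FUNCTION,
						   "pg_checkpoint");

		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
						  (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));
	}
	break;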

Thanks,

Stephen




Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Bossart, Nathan
On 10/31/21, 1:55 PM, "Tom Lane"  wrote:
> 1. Attached is a proposed patch to get rid of the calls in trgm_regexp.c.
> I'm not certain that those lists could get long enough to be a problem,
> given the existing complexity limits in that file (MAX_EXPANDED_STATES
> etc).  But I'm not certain they can't, either, and it's easy enough to
> fix along the same lines as in StandbyReleaseLockList.

Should there be a list_free(trgmNFA->queue) at the end of
transformGraph()?

> 2. I think we almost certainly have a problem in SyncPostCheckpoint.

This one doesn't look as straightforward.  It looks like we might need
a list_delete_first_n() to delete the first N entries all at once to
improve this one.
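
For illustration, a rough sketch of what such a helper could look like
against the array-based List representation (the name and body are
hypothetical, not an existing API):

    static List *
    list_delete_first_n(List *list, int n)
    {
        if (list == NIL || n <= 0)
            return list;
        if (n >= list->length)
        {
            list_free(list);
            return NIL;
        }
        /* Shift the survivors down in one memmove() instead of n passes. */
        memmove(&list->elements[0], &list->elements[n],
                (list->length - n) * sizeof(ListCell));
        list->length -= n;
        return list;
    }

SyncPostCheckpoint could then drop all the entries it has processed in a
single call once it decides to stop early.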

> 3. Is agg_refill_hash_table a problem?  Probably; we've seen cases
> with lots of batches.

IIUC this one can be improved by pushing/popping from the end of the
list instead of the beginning.

> 4. I'm a bit worried about the uses in access/gist/, but I don't know
> that code well enough to want to mess with it.  It's possible the
> list lengths are bounded by the index tree height, in which case it
> likely doesn't matter.  The logic in gistFindPath looks like a mess
> anyway since it's appending to both ends of the "fifo" list in different
> places (is that really necessary?).
>
> 5. Not sure about CopyMultiInsertInfoFlush ... how many buffers
> could we have there?

I haven't looked too closely at these.

> 6. llvm_release_context may not have a long enough list to be a
> problem, but on the other hand, it looks easy to fix.

Yeah, I reviewed this one earlier.  I didn't see any reason this one
couldn't be changed to foreach().

> 7. The list lengths in the parser and dependency.c, ruleutils.c,
> explain.c are bounded by subquery nesting depth or plan tree depth,
> so I doubt it's worth worrying about.

+1

> 8. The uses in namespace.c don't seem like an issue either -- for
> instance, GetOverrideSearchPath can't iterate more than twice,
> and the overrideStack list shouldn't get very deep.

+1

Nathan



Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Tom Lane
"Bossart, Nathan"  writes:
> On 10/31/21, 1:55 PM, "Tom Lane"  wrote:
>> 1. Attached is a proposed patch to get rid of the calls in trgm_regexp.c.

> Should there be a list_free(trgmNFA->queue) at the end of
> transformGraph()?

There could be, but that's visibly invoked only once per 
createTrgmNFAInternal call, so I didn't think it was worthwhile
to do so (unlike the case for processState).  If we were concerned
about leakage in that function, the hash table would be a far
bigger issue.

regards, tom lane




Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Bossart, Nathan
On 11/1/21, 9:58 AM, "Tom Lane"  wrote:
> "Bossart, Nathan"  writes:
>> Should there be a list_free(trgmNFA->queue) at the end of
>> transformGraph()?
>
> There could be, but that's visibly invoked only once per
> createTrgmNFAInternal call, so I didn't think it was worthwhile
> to do so (unlike the case for processState).  If we were concerned
> about leakage in that function, the hash table would be a far
> bigger issue.

Ah, I see it now.  The patch looks good to me, then.

Nathan



Re: Predefined role pg_maintenance for VACUUM, ANALYZE, CHECKPOINT.

2021-11-01 Thread Bossart, Nathan
On 11/1/21, 9:51 AM, "Stephen Frost"  wrote:
> I don't really buy off on the "because it's been around a long time" as
> a reason to invent a predefined role for an individual command that
> doesn't take any options and could certainly just be a function.
> Applications developed to run as a superuser aren't likely to magically
> start working because they were GRANT'd this one additional predefined
> role either but likely would need other changes anyway.

I suspect the CHECKPOINT command wouldn't be removed anytime soon,
either.  I definitely understand the desire to avoid changing
something that's been around a long time, but I think a function fits
better in this case.

> All that said, I wonder if we can have our cake and eat it too.  I
> haven't looked into this at all yet and perhaps it's foolish on its
> face, but, could we make CHECKPOINT; basically turn around and just run
> select pg_checkpoint(); with the regular privilege checking happening?
> Then we'd keep the existing syntax working, but if the user is allowed
> to run the command would depend on if they've been GRANT'd EXECUTE
> rights on the function or not.

I'd be worried about the behavior of CHECKPOINT changing because
someone messed with the function.

Nathan



Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Tom Lane
"Bossart, Nathan"  writes:
> Ah, I see it now.  The patch looks good to me, then.

Thanks for looking!  Here's an expanded patch that also takes care
of the other two easy-to-fix cases, nodeAgg.c and llvmjit.c.
AFAICS, llvm_release_context is like StandbyReleaseLockList
in that we don't need to worry about whether the data structure
is valid after an error partway through.  (Maybe we should be
worrying, but I think the callers would need work as well if
that's to be the standard.)

regards, tom lane

diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 64816dd370..1887a39161 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -907,6 +907,7 @@ transformGraph(TrgmNFA *trgmNFA)
 	HASHCTL		hashCtl;
 	TrgmStateKey initkey;
 	TrgmState  *initstate;
+	ListCell   *lc;
 
 	/* Initialize this stage's workspace in trgmNFA struct */
 	trgmNFA->queue = NIL;
@@ -937,12 +938,13 @@ transformGraph(TrgmNFA *trgmNFA)
 	/*
 	 * Recursively build the expanded graph by processing queue of states
 	 * (breadth-first search).  getState already put initstate in the queue.
+	 * Note that getState will append new states to the queue within the loop,
+	 * too; this works as long as we don't do repeat fetches using the "lc"
+	 * pointer.
 	 */
-	while (trgmNFA->queue != NIL)
+	foreach(lc, trgmNFA->queue)
 	{
-		TrgmState  *state = (TrgmState *) linitial(trgmNFA->queue);
-
-		trgmNFA->queue = list_delete_first(trgmNFA->queue);
+		TrgmState  *state = (TrgmState *) lfirst(lc);
 
 		/*
 		 * If we overflowed then just mark state as final.  Otherwise do
@@ -966,22 +968,29 @@ transformGraph(TrgmNFA *trgmNFA)
 static void
 processState(TrgmNFA *trgmNFA, TrgmState *state)
 {
+	ListCell   *lc;
+
 	/* keysQueue should be NIL already, but make sure */
 	trgmNFA->keysQueue = NIL;
 
 	/*
 	 * Add state's own key, and then process all keys added to keysQueue until
-	 * queue is empty.  But we can quit if the state gets marked final.
+	 * queue is finished.  But we can quit if the state gets marked final.
 	 */
 	addKey(trgmNFA, state, &state->stateKey);
-	while (trgmNFA->keysQueue != NIL && !(state->flags & TSTATE_FIN))
+	foreach(lc, trgmNFA->keysQueue)
 	{
-		TrgmStateKey *key = (TrgmStateKey *) linitial(trgmNFA->keysQueue);
+		TrgmStateKey *key = (TrgmStateKey *) lfirst(lc);
 
-		trgmNFA->keysQueue = list_delete_first(trgmNFA->keysQueue);
+		if (state->flags & TSTATE_FIN)
+			break;
 		addKey(trgmNFA, state, key);
 	}
 
+	/* Release keysQueue to clean up for next cycle */
+	list_free(trgmNFA->keysQueue);
+	trgmNFA->keysQueue = NIL;
+
 	/*
 	 * Add outgoing arcs only if state isn't final (we have no interest in
 	 * outgoing arcs if we already match)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index c99a0de4dd..f5a187cae3 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2584,8 +2584,9 @@ agg_refill_hash_table(AggState *aggstate)
 	if (aggstate->hash_batches == NIL)
 		return false;
 
-	batch = linitial(aggstate->hash_batches);
-	aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+	/* hash_batches is a stack, with the top item at the end of the list */
+	batch = llast(aggstate->hash_batches);
+	aggstate->hash_batches = list_delete_last(aggstate->hash_batches);
 
 	hash_agg_set_limits(aggstate->hashentrysize, batch->input_card,
 		batch->used_bits, &aggstate->hash_mem_limit,
@@ -3098,7 +3099,7 @@ hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
 		new_batch = hashagg_batch_new(tape, setno,
 	  spill->ntuples[i], cardinality,
 	  used_bits);
-		aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
+		aggstate->hash_batches = lappend(aggstate->hash_batches, new_batch);
 		aggstate->hash_batches_used++;
 	}
 
@@ -3113,8 +3114,6 @@ hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
 static void
 hashagg_reset_spill_state(AggState *aggstate)
 {
-	ListCell   *lc;
-
 	/* free spills from initial pass */
 	if (aggstate->hash_spills != NULL)
 	{
@@ -3132,13 +3131,7 @@ hashagg_reset_spill_state(AggState *aggstate)
 	}
 
 	/* free batches */
-	foreach(lc, aggstate->hash_batches)
-	{
-		HashAggBatch *batch = (HashAggBatch *) lfirst(lc);
-
-		pfree(batch);
-	}
-	list_free(aggstate->hash_batches);
+	list_free_deep(aggstate->hash_batches);
 	aggstate->hash_batches = NIL;
 
 	/* close tape set */
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 169dad96d7..fb29449573 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -171,6 +171,7 @@ static void
 llvm_release_context(JitContext *context)
 {
 	LLVMJitContext *llvm_context = (LLVMJitContext *) context;
+	ListCell   *lc;
 
 	/*
 	 * When this backend is exiting, don't clean up LLVM. As an error might
@@ -188,12 +189,9 @@ llvm_release_context(JitContext *context)
 		llvm_context->module = NULL;
 	}
 
-	while (l

Re: Predefined role pg_maintenance for VACUUM, ANALYZE, CHECKPOINT.

2021-11-01 Thread Stephen Frost
Greetings,

* Bossart, Nathan (bossa...@amazon.com) wrote:
> On 11/1/21, 9:51 AM, "Stephen Frost"  wrote:
> > All that said, I wonder if we can have our cake and eat it too.  I
> > haven't looked into this at all yet and perhaps it's foolish on its
> > face, but, could we make CHECKPOINT; basically turn around and just run
> > select pg_checkpoint(); with the regular privilege checking happening?
> > Then we'd keep the existing syntax working, but if the user is allowed
> > to run the command would depend on if they've been GRANT'd EXECUTE
> > rights on the function or not.
> 
> I'd be worried about the behavior of CHECKPOINT changing because
> someone messed with the function.

Folks playing around in the catalog can break lots of things, I don't
really see this as an argument against the idea.

I do wonder if we should put a bit more effort into preventing people
from messing with functions and such in pg_catalog.  Being able to do
something like:

create or replace function xpath ( text, xml ) returns xml[]
as $$ begin return 'xml'; end; $$ language plpgsql;

(or with much worse functions..) strikes me as just a bit too easy to
mistakenly cause problems as a superuser.  Still, that's really an
independent issue from this discussion.  It's not like someone breaking
CHECKPOINT; would actually impact normal checkpoints anyway.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: parallelizing the archiver

2021-11-01 Thread Stephen Frost
Greetings,

* Bossart, Nathan (bossa...@amazon.com) wrote:
> On 10/25/21, 1:41 PM, "Bossart, Nathan"  wrote:
> > Great.  Unless I see additional feedback on the basic design shortly,
> > I'll give the documentation updates a try.
> 
> Okay, here is a more complete patch with a first attempt at the
> documentation changes.  I tried to keep the changes to the existing
> docs as minimal as possible, and then I added a new chapter that
> describes what goes into creating an archive module.  Separately, I
> simplified the basic_archive module, moved it to src/test/modules,
> and added a simple test.  My goal is for this to serve as a basic
> example and to provide some test coverage on the new infrastructure.

Definitely interested and plan to look at this more shortly, and
generally this all sounds good, but maybe we should have it be posted
under a new thread as it's moved pretty far from the subject and folks
might not appreciate what this is about at this point..?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Non-superuser subscription owners

2021-11-01 Thread Mark Dilger


> On Nov 1, 2021, at 7:18 AM, Andrew Dunstan  wrote:
> 
> w.r.t. this:
> 
> +   On the subscriber, the subscription owner's privileges are
> re-checked for
> +   each change record when applied, but beware that a change of
> ownership for a
> +   subscription may not be noticed immediately by the replication workers.
> +   Changes made on the publisher may be applied on the subscriber as
> +   the old owner.  In such cases, the old owner's privileges will be
> the ones
> +   that matter.  Worse still, it may be hard to predict when replication
> +   workers will notice the new ownership.  Subscriptions created
> disabled and
> +   only enabled after ownership has been changed will not be subject to
> this
> +   race condition.
> 
> 
> maybe we should disable the subscription before making such a change and
> then re-enable it?

Right.  I commented the code that way because there is a clear concern, but I 
was uncertain which way around the problem was best.

ALTER SUBSCRIPTION..[ENABLE | DISABLE] do not synchronously start or stop 
subscription workers.  The ALTER command updates the catalog's subenabled 
field, but workers only lazily respond to that.  Disabling and enabling the 
subscription as part of the OWNER TO would not reliably accomplish anything.

The attached patch demonstrates the race condition.  It sets up a publisher and 
subscriber, and toggles the subscription on and off on the subscriber node, 
interleaved with inserts and deletes on the publisher node.  If the ALTER 
SUBSCRIPTION commands were synchronous, the test results would be 
deterministic, with only the inserts performed while the subscription is 
enabled being replicated, but because the ALTER commands are asynchronous, the 
results are nondeterministic.

It is unclear that I can make ALTER SUBSCRIPTION..OWNER TO synchronous without 
redesigning the way workers respond to pg_subscription catalog updates 
generally.  That may be a good project to eventually tackle, but I don't see 
that it is more important to close the race condition in an OWNER TO than for a 
DISABLE.

Thoughts?



alter_subscription_race.patch
Description: Binary data

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: XTS cipher mode for cluster file encryption

2021-11-01 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, Oct 26, 2021 at 11:11:39PM +0200, Tomas Vondra wrote:
> > BTW I'm not sure what the existing patches do, but I wonder if we should
> > calculate the checksum before or after encryption. I'd say it should be
> > after encryption, because checksums were meant as a protection against
> > issues at the storage level, so the checksum should be on what's written to
> > storage, and it'd also allow offline verification of checksums etc. (Of
> > course, that'd make the whole idea of relying on our checksums even more
> > futile.)
> > 
> > Note: Maybe there are reasons why the checksum needs to be calculated before
> > encryption, not sure.
> 
> Yes, these are the tradeoffs --- allowing offline checksum checking
> without requiring the key vs. giving _some_ integrity checking and
> requiring the key.

I'm in favor of calculating the checksum before encrypting as that will
still catch the storage level bit-flips that it was implemented to
address in the first place and will also make it so that we're very
likely to realize we have an incorrect key before trying to do anything
with the page.  That it might also serve as a deterrent against
attackers trying to randomly flip bits in a page has perhaps some value,
but without a cryptographic-level hash it isn't really enough to protect
against an attacker who has write access to the page.
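
Roughly, the write-path ordering being argued for looks like this
(encrypt_block() is a hypothetical stand-in for the XTS step, not existing
code):

    static void encrypt_block(Page page, BlockNumber blkno); /* hypothetical XTS step */

    static void
    write_page_encrypted(SMgrRelation reln, ForkNumber forknum,
                         BlockNumber blkno, Page page)
    {
        /* 1. Checksum the plaintext page, exactly as checksums work today. */
        PageSetChecksumInplace(page, blkno);

        /* 2. Encrypt the whole block, checksum included. */
        encrypt_block(page, blkno);

        /* 3. Only ciphertext ever reaches the storage layer. */
        smgrwrite(reln, forknum, blkno, (char *) page, false);
    }

Tools that verify checksums then just decrypt the block first and check it
as usual.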

Any tools which include checking the checksum on pages already have to
deal with clusters where checksums aren't enabled anyway and I wouldn't
expect it to generally be an issue for tools which want to validate
checksums on an encrypted cluster to be able to have the appropriate
key(s) necessary for doing so and to be able to perform the decryption
to do the check.  We can certainly make pg_checksums do this and I don't
see it as an issue for pgbackrest, as two examples that I've
specifically thought about.

> > > > > > - like XTS it doesn't need to change plain text format and doesn't 
> > > > > > need in
> > > > > >additional Nonce/Auth Code.
> > > > > 
> > > > > Sure, in which case it's something that could potentially be added 
> > > > > later
> > > > > as another option in the future.  I don't think we'll always have just
> > > > > one encryption method and it's good to generally think about what it
> > > > > might look like to have others but I don't think it makes sense to try
> > > > > and get everything in all at once.
> > > > 
> > > > And among others Adiantum looks best: it is fast even without hardware
> > > > acceleration, it provides whole block encryption (ie every bit depends
> > > > on every bit) and it doesn't bound to plain-text format.
> > > 
> > > And it could still be added later as another option if folks really want
> > > it to be.  I've outlined why it makes sense to go with XTS first but I
> > > don't mean that to imply that we'll only ever have that.  Indeed, once
> > > we've actually got something, adding other methods will almost certainly
> > > be simpler.  Trying to do everything from the start will make this very
> > > difficult to accomplish though.
> ...
> > So maybe the best thing is simply to roll with both - design the whole
> > feature in a way that allows selecting the encryption scheme, with two
> > options. That's generally a good engineering practice, as it ensures things
> > are not coupled too much. And it's not like the encryption methods are
> > expected to be super difficult.
> 
> I am not in favor of adding additional options to this feature unless we
> can explain why users should choose one over the other.  There is also
> the problem of OpenSSL not supporting Adiantum.

I can understand the general idea that we should be sure to engineer
this in a way that multiple methods can be used, as surely one day folks
will say that AES128 isn't acceptable any more.  In terms of what we'll
do from the start, I would think providing the options of AES128 and
AES256 would be good to ensure that we have the bits covered to support
multiple methods and I don't think that would put us into a situation of
having to really explain which to use to users (we don't for pgcrypto
anyway, as an example).  I agree that we shouldn't be looking at adding
in a whole new crypto library for this though, that's a large and
independent effort (see the work on NSS happening nearby).

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Bossart, Nathan
On 11/1/21, 10:34 AM, "Tom Lane"  wrote:
> Thanks for looking!  Here's an expanded patch that also takes care
> of the other two easy-to-fix cases, nodeAgg.c and llvmjit.c.
> AFAICS, llvm_release_context is like StandbyReleaseLockList
> in that we don't need to worry about whether the data structure
> is valid after an error partway through.  (Maybe we should be
> worrying, but I think the callers would need work as well if
> that's to be the standard.)

LGTM.

Nathan



Re: Predefined role pg_maintenance for VACUUM, ANALYZE, CHECKPOINT.

2021-11-01 Thread Bossart, Nathan
On 11/1/21, 10:43 AM, "Stephen Frost"  wrote:
> Folks playing around in the catalog can break lots of things, I don't
> really see this as an argument against the idea.
>
> I do wonder if we should put a bit more effort into preventing people
> from messing with functions and such in pg_catalog.  Being able to do
> something like:
>
> create or replace function xpath ( text, xml ) returns xml[]
> as $$ begin return 'xml'; end; $$ language plpgsql;
>
> (or with much worse functions..) strikes me as just a bit too easy to
> mistakenly cause problems as a superuser.  Still, that's really an
> independent issue from this discussion.  It's not like someone breaking
> CHECKPOINT; would actually impact normal checkpoints anyway.

Yeah, I see your point.

Nathan



Re: parallelizing the archiver

2021-11-01 Thread Bossart, Nathan
On 11/1/21, 10:57 AM, "Stephen Frost"  wrote:
> Definitely interested and plan to look at this more shortly, and
> generally this all sounds good, but maybe we should have it be posted
> under a new thread as it's moved pretty far from the subject and folks
> might not appreciate what this is about at this point..?

Done: https://postgr.es/m/668D2428-F73B-475E-87AE-F89D67942270%40amazon.com

Looking forward to your feedback.

Nathan



removing global variable ThisTimeLineID

2021-11-01 Thread Robert Haas
Hi,

The global variable ThisTimeLineID is rather confusing. During
recovery, in the startup process, when we're reading a WAL record, it
is the timeline of the WAL record we are trying to read or have just
read, except when we're trying to read the initial checkpoint record,
when it's zero. In other processes, it's typically 0 throughout
recovery, but sometimes it's not, because for example
CreateRestartPoint temporarily sets it to the timeline from which
we're replaying WAL, since there's no other way to get
RemoveOldXlogFiles() to recycle files onto the right timeline, or
PreallocXlogFiles() to preallocate onto the right timeline. Similarly,
walreceiver and walsender find it convenient to make ThisTimeLineID
the timeline from which WAL is being replayed or at least the one from
which it was being replayed at last check. logicalfuncs.c and
slotfuncs.c set the global variable sometimes too, apparently just for
fun, as the value doesn't seem to get used for anything. In normal
running, once the startup process exits, the next call to
RecoveryInProgress() sets the value to the timeline into which new WAL
can be inserted. Note also that all of this is different from
XLogCtl->ThisTimeLineID, which is set at the end of recovery and thus,
in the startup process, generally different from ThisTimeLineID, but
in other processes, generally the same as ThisTimeLineID.

At the risk of stating the obvious, using the same variable for
different purposes at different times is not a great programming
practice. Commits 2f5c4397c39dea49c5608ba583868e26d767fc32 and
902a2c280012557b85c7e0fce3f6f0e355cb2d69 show that the possibility of
programmer error is not zero, even though neither of those issues are
serious. Moreover, the global variable itself seems to do nothing but
invite programming errors. The name itself is a disaster. What is
"this" timeline ID? Well, uh, the right one, of course! We should be
more clear about what we mean: the timeline into which we are
inserting, the timeline from which we are replaying, the timeline from
which we are performing logical decoding. I suspect that having a
global variable here isn't even really saving us anything much, as a
value that does not change can be read from shared memory without a
lock.
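
As a concrete illustration of that last point, an accessor along the lines
of the 0002 patch title might be little more than this (a sketch, not the
patch itself):

    TimeLineID
    GetWALInsertionTimeLine(void)
    {
        /* Fixed once recovery ends, so no lock is needed to read it. */
        Assert(XLogCtl->SharedRecoveryState == RECOVERY_STATE_DONE);
        return XLogCtl->ThisTimeLineID;
    }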

I would like to clean this up. Attached is a series of patches which
try to do that. For ease of review, this is separated out into quite a
few separate patches, but most likely we'd combine some of them for
commit. Patches 0001 through 0004 eliminate all use of the global
variable ThisTimeLineID outside of xlog.c, and patch 0005 makes it
"static" so that it ceases to be visible outside of xlog.c. Patches
0006 to 0010 clean up uses of ThisTimeLineID itself, passing it around
as a function parameter, or otherwise arranging to fetch the relevant
timeline ID from someplace sensible rather than using the global
variable as a scratchpad. Finally, patch 0011 deletes the global
variable.

I have not made a serious attempt to clear up all of the
terminological confusion created by the term ThisTimeLineID, which
also appears as part of some structs, so you'll still see that name in
the source code after applying these patches, just not as the name of
a global variable. I have, however, used names like insertTLI and
replayTLI in many places changed by the patch, and I think that makes
it significantly more clear which timeline is being discussed in each
case. In some places I have endeavored to improve the comments. There
is doubtless more that could be done here, but I think this is a
fairly respectable start.

Opinions welcome.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com


0003-walsender.c-Don-t-rely-on-the-global-variable-ThisTi.patch
Description: Binary data


0004-walreceiver.c-Don-t-rely-on-the-global-variable-This.patch
Description: Binary data


0002-Add-GetWALInsertionTimeLine-also-return-it-via-GetFl.patch
Description: Binary data


0001-Don-t-set-ThisTimeLineID-when-there-s-no-reason-to-d.patch
Description: Binary data


0006-xlog.c-Change-assorted-functions-to-take-an-explicit.patch
Description: Binary data


0005-Change-ThisTimeLineID-to-be-static-rather-than-exter.patch
Description: Binary data


0009-xlog.c-Make-xlog_redo-not-depend-on-ThisTimeLineID-g.patch
Description: Binary data


0008-xlog.c-Use-XLogCtl-ThisTimeLineID-in-various-places.patch
Description: Binary data


0010-xlog.c-Adjust-some-more-functions-to-pass-the-TLI-ar.patch
Description: Binary data


0007-xlog.c-Invent-openLogTLI.patch
Description: Binary data


0011-Demote-ThisTimeLineID-to-a-local-variable.patch
Description: Binary data


Portability report: ninja

2021-11-01 Thread Tom Lane
Meson depends on ninja, so it behooves us to look into portability
of ninja if we're considering migrating to meson as the build system.
I tried it out quickly on some ancient platforms, and here's what
I found.

1. The documentation is kinda sucky, eg it fails to explain how
to run ninja's own regression tests.  I ended up with this
reverse-engineered recipe:

git clone git://github.com/ninja-build/ninja.git
cd ninja
git checkout release
./configure.py --bootstrap
./ninja ninja_test
./ninja_test
sudo cp ninja /usr/local/bin  # or location of choice

We'll probably have to include that in our own documentation.

2. You need Python to build it this way.  There is also a process
for bootstrapping using cmake, but that doesn't seem like a more
attractive dependency for old platforms.  Fortunately for my
purposes here, it seems to work with fairly old Python --- I built
successfully with python 2.6.2, though not with 2.4.1.

3. You also need a C++ compiler, but really old ones will work.
I did not see any problems with g++ as old as 4.0.1, other than
some compiler warnings about non-virtual destructors.

4. It built and passed self-test on macOS Leopard (10.5.8), which is
pretty old ... but not old enough for prairiedog, which is stuck on
10.4.x.  There, the build fails miserably for lack of <spawn.h>.
It looks like that was added to POSIX in 2001, so one could have
wished for it in a 2005-vintage OS; but nope, Apple took another
couple of years to get around to that.  I'm not even going to bother
trying on gaur's pre-turn-of-the-century OS.

5. It built and passed self-test on Solaris 11, but failed self-test
on Solaris 10.

6. While configure.py thinks it knows what to do on AIX, it fails
on AIX 7.1 and 7.2:

Traceback (most recent call last):
  File "./configure.py", line 544, in 
if platform.is_aix() and not platform.is_os400_pase():
  File "./configure.py", line 103, in is_os400_pase
return self._platform == 'os400' or os.uname().sysname.startswith('OS400')
AttributeError: 'tuple' object has no attribute 'sysname'

Possibly the ninja guys would take a patch for that (or maybe
this is a you-need-python-3 case?).  I do see /usr/include/spawn.h
on that platform, so there's room to hope it'd work.


So it's pretty clear that if we go this way, it'll be the end of
the line for support for some very old OS versions.  I can't,
however, argue with the idea that it's reasonable to require
POSIX 2001 support now.  Based on these results, I doubt that
ninja will give us trouble on any platform that isn't old
enough to get its driver's license.

regards, tom lane




Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-11-01 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Oct 29, 2021, at 4:46 PM, Jeff Davis  wrote:
> > But I don't think the concept of role ownership has zero potential
> > confusion, either. For instance, I could certainly imagine a user A
> > creating a role B (and therefore owning it), and then doing "GRANT A TO
> > B". Is there a reason to do that, or is the user confused about what
> > membership versus ownership mean?
> 
> In general, I think that would be the result of the user being confused.  But 
> it is hard to say that definitively, because perhaps users A and C want to 
> create a single user B with the union of both their roles, and have agreed to 
> perform:
> 
> user_a%  CREATE ROLE B;
> user_a%  GRANT A TO B;
> user_c%  GRANT C TO B;
> 
> The easiest way of thinking about role ownership is that a role's owner is 
> superuser in so far as that role is concerned.  It can drop them, modify 
> them, take their objects away from them, assign other objects to them, etc.  
> Anything a superuser could do to impoverish them, their owner can do to 
> impoverish them.  The difference is that an actual superuser can enrich them 
> with anything the superuser likes, whereas their owner can only enrich them 
> with objects and privileges that the owner itself has rights to assign.

I can generally get behind the idea that a user who has been allowed to
create other roles should be able to do various other things with that
role, but should also be limited by what rights they themselves have
(unlike how CREATEROLE works today).

That said, I have a hard time seeing why we're drawing this distinction
of 'ownership' as being ultimately different from simple 'admin' rights
on a role.  In other words, beyond the ability to actually create/drop
roles, having 'admin' rights on a role already conveys just about
everything 'ownership' would.  The things that are getting in the way
there are:

 - Ability to actually create/alter/drop roles, this needs to be
   addressed somehow but doesn't necessarily imply a need for
   'ownership' as a concept.

 - Restriction of a role from being able to implicitly have 'admin'
   rights on itself, as I started a discussion about elsewhere.

 - Some system for deciding who event triggers should fire for.  I don't
   think this should really be tied into the question of who has admin
   rights on whom except to the extent that maybe you can only cause
   event triggers to fire for roles you've got admin rights on (or maybe
   membership in).

One thing that comes to mind is that event triggers aren't the only
thing out there and I have to wonder if we should be thinking about
other things.  As a thought exercise- how is an event trigger really
different from a table-level trigger?  Anyone who has the ability to
create objects is able to create tables, create functions, create
operators, and a user logging in and running SQL can certainly end up
running those things with their privileges.  We've generally recognized
that that's not great and there's been work to get it so that the
'public' schema that everyone has in their search_path by default won't
be world-writable but that isn't exactly a cure-all for the general
problem.

One of the interesting bits is that there's two sides to this.  On the
one hand, as a user, maybe I don't want to run functions of people who I
don't trust.  As an admin/superuser/landlord, maybe I want to require
everyone who I have authority over to run these functions/event
triggers.  I'm not sure that we can find a solution to everything with
this but figure I'd share these thoughts.

Last thought I'll share is that I do believe we're going to want to
provide flexibility when it comes to defining who event triggers run
for, as a given admin may wish for that set to be different from the set
of roles that they ultimately have control over.  I dislike tying these
two things together at such a core level and therefore continue to feel
that CREATE EVENT TRIGGER should be extended in some fashion to allow
individuals who can create them to specify who they are to run for.
Open to different ideas as to how a user could express that, but it
feels to me like that should be a core part of the definition of a
user-defined event trigger (ie: could be "FOR ALL ROLES I OWN" or
whatever, and maybe that's the default, but having that be the only
option isn't appealing).

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Portability report: ninja

2021-11-01 Thread Jesper Pedersen

Hi,

On 11/1/21 15:25, Tom Lane wrote:

So it's pretty clear that if we go this way, it'll be the end of
the line for support for some very old OS versions.  I can't,
however, argue with the idea that it's reasonable to require
POSIX 2001 support now.  Based on these results, I doubt that
ninja will give us trouble on any platform that isn't old
enough to get its drivers license.



You can also look at it as:


If PostgreSQL chooses a newer build system, then it is up to the companies 
owning the "non-supported" operating systems to add support for the 
build system in question, not the PostgreSQL community.



+1 for POSIX.2001 and meson/ninja.


Best regards,

 Jesper






Re: Feature request for adoptive indexes

2021-11-01 Thread Robert Haas
On Tue, Oct 26, 2021 at 11:11 AM Tomas Vondra
 wrote:
> If I get what you propose, you want to have a "top" tree for (job, nlp,
> year), which "splits" the data set into subsets of ~5000-7000 rows. And
> then for each subset you want a separate "small" trees on each of the
> other columns, so in this case three trees.
>
> Well, the problem with this is pretty obvious - each of the small trees
> requires separate copies of the leaf pages. And remember, in a btree the
> internal pages are usually less than 1% of the index, so this pretty
> much triples the size of the index. And if you insert a row into the
> index, it has to insert the item pointer into each of the small trees,
> likely requiring a separate I/O for each.
>
> So I'd bet this is not any different from just having three separate
> indexes - it doesn't save space, doesn't save I/O, nothing.

I agree. In a lot of cases, it's actually useful to define the index
on fewer columns, like (job, nlp, year) or even just (job, nlp) or
even just (job) because it makes the index so much smaller and that's
pretty important. If you have enough duplicate entries in a (job, nlp,
year) index to justify creating a (job, nlp, year, sequence) index, the
number of distinct (job, nlp, year) tuples has to be small compared to
the number of (job, nlp, year, sequence) tuples - and that means that
you wouldn't actually save much by combining your (job, nlp, year,
sequence) index with a (job, nlp, year, other-stuff) index. As you
say, the internal pages aren't the problem.

I don't intend to say that there's no possible use for this kind of
technology. Peter G. says that people are writing research papers
about that and they probably wouldn't be doing that unless they'd
found some case where it's a big win. But this example seems extremely
unconvincing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-11-01 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Oct 25, 2021, at 11:30 AM, Stephen Frost  wrote:
> 
> > Consider instead:
> > 
> > CREATE ROLE X;
> > CREATE ROLE Y;
> > CREATE ROLE Z;
> > 
> > GRANT Y to X;
> > GRANT Z to X;
> > 
> > SET ROLE X;
> > CREATE EVENT TRIGGER FOR Y do_stuff();
> > 
> > Now, X has explicitly said that they want the event trigger to fire for
> > role Y and if the event trigger fires or not is no longer based on
> > membership in the role creating the trigger but instead depends on being
> > the role which the event trigger was explicitly defined to fire on.
> 
> I don't think your proposal quite works, because the set of users you'd like 
> to audit with an event trigger based auditing system may be both large and 
> dynamic:
> 
> CREATE ROLE batman;
> CREATE ROLE robin;
> 
> SET ROLE batman;
> CREATE ROLE gotham_residents NOLOGIN;
> CREATE ROLE riddler IN ROLE gotham_residents LOGIN;
> -- create millions of other Gotham residents
> CREATE EVENT TRIGGER FOR gotham_residents audit_criminal_activity();
> 
> Batman is not superuser, but he's pretty powerful, and he wants to audit all 
> the criminal activity in Gotham.  How should he expect this example to work?
> 
> Having the "FOR gotham_residents" clause mean anybody with membership in role 
> gotham_residents is problematic, because it means that being granted into a 
> role both increases and decreases your freedoms, rather than merely giving 
> you more freedoms.  If Batman covets privileges that Robin has, but wants not 
> to be subjected to any event triggers that fire for Robin, he both wants into 
> and out of role Robin.

The privileges afforded to 'robin' could be GRANT'd to another role
created for that purpose which is then GRANT'd to 'batman' though.
Indeed, that role could be used as the role which GRANT's 'robin' those
rights in the first place too.  This kind of permission management is
largely the point of the role-based system we have.

> Having "FOR gotham_residents" mean that only actions performed by role 
> "gotham_residents" should fire the trigger is useless, since Gotham residents 
> don't log in as that, but as themselves.  Batman won't catch anybody this way.

Naturally.  That doesn't mean that there couldn't be some other role
which all of those roles are made a member of though.  Either way,
there's a big list of "roles this event trigger runs for" and that has
to be managed.  Having that list be "roles owned by batman", as in your
suggested approach, rather than a role membership doesn't really 'fix'
that, because there'll be other roles in the system which 'batman' doesn't
own.  One nice thing about using roles for this is that you end up being
able to use the same role in multiple ways - consider this: I want to
audit all roles who log in to database mydb.  Ah-hah, now I can say:

CREATE DATABASE mydb;
CREATE EVENT TRIGGER FOR gotham_residents audit_stuff();
REVOKE CONNECT ON DATABASE mydb FROM PUBLIC;
GRANT CONNECT ON DATABASE mydb TO gotham_residents;

Now the two are connected- if you can connect to that database, then
you're going to get audited, and if you manage access to the 'mydb'
database using membership in that role then there's no way for a role to
be able to connect to that database without being audited (except for a
true superuser, but that's always going to be an exception).

> Having to list each new resident to the trigger is tedious and error-prone.  
> Batman may not be able to pass a compliance audit.

Agreed.  Also wouldn't be great since eventually the role list might
have to get TOAST'd and then you're doing an extra lookup to pull back
the list, yuck.

> Having Batman *own* all residents in Gotham city would work, if we can agree 
> on a role ownership system.  It has the downside that only a role's (direct 
> or indirect) owner can do the auditing, though.  That's more flexible than 
> what we have today, where only superuser can do it, but maybe some people 
> would want to argue for a different solution with even more flexibility?  A 
> grantable privilege perhaps?  But whatever it is, the reasoning about who 
> gets audited and who does not must be clear enough that Batman can pass a 
> compliance audit.

What about roles which Batman owns but which he *doesn't* want the event
trigger to fire for?

Note that event triggers are not strictly limited to the auditing case.
Viewing them through that lens masks other quite common use-cases which
are also important to consider (like preventing many users, but not all,
from being able to DROP objects as a clear example).

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-11-01 Thread Mark Dilger



> On Nov 1, 2021, at 12:44 PM, Stephen Frost  wrote:
> 
> I can generally get behind the idea that a user who has been allowed to
> create other roles should be able to do various other things with that
> role, but should also be limited by what rights they themselves have
> (unlike how CREATEROLE works today).

I intend to rearrange the role ownership patch set to have the 
0004-Restrict-power-granted-via-CREATEROLE.patch come before, and be 
independent of, the patches that introduce role ownership.  That would put the 
less controversial patch first, and might get the parts we all agree on committed.

> That said, I have a hard time seeing why we're drawing this distinction
> of 'ownership' as being ultimately different from simple 'admin' rights
> on a role.  In other words, beyond the ability to actually create/drop
> roles, having 'admin' rights on a role already conveys just about
> everything 'ownership' would.  The things that are getting in the way
> there are:
> 
> - Ability to actually create/alter/drop roles, this needs to be
>   addressed somehow but doesn't necessarily imply a need for
>   'ownership' as a concept.
> 
> - Restriction of a role from being able to implicitly have 'admin'
>   rights on itself, as I started a discussion about elsewhere.
> 
> - Some system for deciding who event triggers should fire for.  I don't
>   think this should really be tied into the question of who has admin
>   rights on whom except to the extent that maybe you can only cause
>   event triggers to fire for roles you've got admin rights on (or maybe
>   membership in).

You and I are not that far apart on this issue.  The reason I wanted to use 
"ownership" rather than ADMIN is that the spec has a concept of ADMIN and I 
don't know that we can fix everything we want to fix and still be within 
compliance with the spec.

> One thing that comes to mind is that event triggers aren't the only
> thing out there and I have to wonder if we should be thinking about
> other things.  As a thought exercise- how is an event trigger really
> different from a table-level trigger?  Anyone who has the ability to
> create objects is able to create tables, create functions, create
> operators, and a user logging in and running SQL can certainly end up
> running those things with their privileges.

The difference in my mind is that table triggers owned by non-superusers have 
been around for a long time and are in heavy use, so changing how that behaves 
is a huge backwards compatibility break.  Event triggers owned by 
non-superusers are only a fluke, not an intentional feature, and only occur 
when a superuser creates an event trigger and later has superuser privileges 
revoked.  We can expect that far fewer users are really depending on that to 
work compared with table triggers.

In a green field, I would not create table triggers to work as they do.

>  We've generally recognized
> that that's not great and there's been work to get it so that the
> 'public' schema that everyone has in their search_path by default won't
> be world-writable but that isn't exactly a cure-all for the general
> problem.

I agree.

> One of the interesting bits is that there's two sides to this.  On the
> one hand, as a user, maybe I don't want to run functions of people who I
> don't trust.  As an admin/superuser/landlord, maybe I want to require
> everyone who I have authority over to run these functions/event
> triggers.  I'm not sure that we can find a solution to everything with
> this but figure I'd share these thoughts.

If roles were not cluster-wide, I might not have such a problem with leaving 
things mostly as they are.  But there is something really objectionable about 
having two separate databases in a cluster intended for two separate purposes 
and with two separate sets of roles, where the set of roles in one database can 
mess with the roles intended for the other database.  I think some kind of 
partitioning is needed, and I saw role ownership as the cleanest solution to 
it.  I share your intuitions that perhaps the WITH ADMIN OPTION stuff could be 
used instead, but I don't see quite how.

> Last thought I'll share is that I do believe we're going to want to
> provide flexibility when it comes to defining who event triggers run
> for, as a given admin may wish for that set to be different from the set
> of roles that they ultimately have control over.  I dislike tying these
> two things together at such a core level and therefore continue to feel
> that CREATE EVENT TRIGGER should be extended in some fashion to allow
> individuals who can create them to specify who they are to run for.

Within reason, sure.  It is fine by me if we support CREATE EVENT 
TRIGGER...AUTHORIZATION... in order to accomplish that.  But the role running 
that command still needs to be limited to just a (proper or otherwise) subset 
of their own privileges.

I thought about this some when originally writing the event trigger patch.  The 
author of the even

Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-11-01 Thread Mark Dilger



> On Nov 1, 2021, at 1:13 PM, Stephen Frost  wrote:
> 
>> Having Batman *own* all residents in Gotham city would work, if we can agree 
>> on a role ownership system.  It has the downside that only a role's (direct 
>> or indirect) owner can do the auditing, though.  That's more flexible than 
>> what we have today, where only superuser can do it, but maybe some people 
>> would want to argue for a different solution with even more flexibility?  A 
>> grantable privilege perhaps?  But whatever it is, the reasoning about who 
>> gets audited and who does not must be clear enough that Batman can pass a 
>> compliance audit.
> 
> What about roles which Batman owns but which he *doesn't* want the event
> trigger to fire for?

I think Batman just has the event trigger exit early for that.  There is 
nothing we can hardcode for filtering users into and out of the trigger that 
will be as flexible as the logic that Batman can implement in the trigger 
itself.  We only need to worry about Batman overstepping his authority.  It's 
not our job to filter further than that.

> Note that event triggers are not strictly limited to the auditing case.
> Viewing them through that lens masks other quite common use-cases which
> are also important to consider (like preventing many users, but not all,
> from being able to DROP objects as a clear example).

Nothing in my proposal limits what superusers can do with event triggers they 
create.  The issue under discussion is entirely to do with what non-superusers 
are allowed to do with event triggers.  I see no reason why some ordinary role 
"joe" should be allowed to thwart DROP commands issued on a table that "joe" 
doesn't own by roles that "joe" doesn't own.  Maybe "own" here should be "have 
ADMIN on", but it has to be something.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Portability report: ninja

2021-11-01 Thread Thomas Munro
On Tue, Nov 2, 2021 at 8:25 AM Tom Lane  wrote:
> 6. While configure.py thinks it knows what to do on AIX, it fails
> on AIX 7.1 and 7.2:
>
> Traceback (most recent call last):
>   File "./configure.py", line 544, in 
> if platform.is_aix() and not platform.is_os400_pase():
>   File "./configure.py", line 103, in is_os400_pase
> return self._platform == 'os400' or os.uname().sysname.startswith('OS400')
> AttributeError: 'tuple' object has no attribute 'sysname'

Yeah, the result type changed: os.uname()[0] works on python2 AND python3.




Re: Portability report: ninja

2021-11-01 Thread Thomas Munro
On Tue, Nov 2, 2021 at 8:25 AM Tom Lane  wrote:
> 5. It built and passed self-test on Solaris 11, but failed self-test
> on Solaris 10.

I haven't had time to actually do anything with it yet, but I can
report that meson and ninja are, as of quite recently, available in
the package repository of OpenIndiana.

That's the distribution of illumos I found very convenient to work
with last time I turned the Solaris-family machines red in the build
farm.  The reason I like it is that there's a rolling "vagrant" image
(libvirt and virtualbox flavours), so it's possible to spin up a
working VM with minimum effort.  My current way of keeping track of
what commands are needed to install the various packages PostgreSQL
needs for niche OSes corresponding to our build farm is to record it
all in Vagrant files.

https://github.com/macdice/postgresql-dev-vagrant/blob/master/openindiana/Vagrantfile




Re: parallel vacuum comments

2021-11-01 Thread Peter Geoghegan
On Mon, Nov 1, 2021 at 5:47 AM Masahiko Sawada  wrote:
> For discussion, I've written a patch only for adding some tests to
> parallel vacuum. The test includes the reported case where small
> indexes are not processed by the leader process as well as cases where
> different kinds of indexes (i.e., different amparallelvacuumoptions)
> are vacuumed or cleaned up.

I started looking at this because I want to commit something like it
alongside a fix to bug #17245.

While I tend to favor relatively heavy assertions (e.g., the
heap_page_is_all_visible() related asserts I added to
lazy_scan_prune()), the idea of having a whole shared memory area just
for assertions seems a bit too much, even to me. I tried to simplify
it by just adding a new field to LVSharedIndStats, which seemed more
natural. It took me at least 15 minutes before I realized that I was
actually repeating exactly the same mistake that led to bug #17245 in
the first place. I somehow forgot that
parallel_stats_for_idx()/IndStatsIsNull() will return NULL for any
index that has already been deemed too small to be worth processing in
parallel. Even after all that drama!

Rather than inventing PARALLEL_VACUUM_KEY_INDVAC_CHECK (just for
assert-enabled builds), we should invent PARALLEL_VACUUM_STATS -- a
dedicated shmem area for the array of LVSharedIndStats (no more
storing LVSharedIndStats entries at the end of the LVShared space in
an ad-hoc, type unsafe way). There should be one array element for
each and every index -- even those indexes where parallel index
vacuuming is unsafe or not worthwhile (unsure if avoiding parallel
processing for "not worthwhile" indexes actually makes sense, BTW). We
can then get rid of the bitmap/IndStatsIsNull() stuff entirely. We'd
also add new per-index state fields to LVSharedIndStats itself. We
could directly record the status of each index (e.g., parallel unsafe,
amvacuumcleanup processing done, ambulkdelete processing done)
explicitly. All code could safely subscript the LVSharedIndStats array
directly, using idx style integers. That seems far more robust and
consistent.
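
To sketch what such a per-index array entry might look like (the field
names and status values here are illustrative assumptions, not a committed
layout):

    typedef enum LVIndVacStatus
    {
        INDVAC_STATUS_INITIAL,          /* nothing done yet */
        INDVAC_STATUS_PARALLEL_UNSAFE,  /* leader must handle this index */
        INDVAC_STATUS_BULKDELETE_DONE,  /* ambulkdelete has run */
        INDVAC_STATUS_CLEANUP_DONE      /* amvacuumcleanup has run */
    } LVIndVacStatus;

    typedef struct LVSharedIndStats
    {
        LVIndVacStatus status;          /* explicit per-index progress */
        bool        istat_updated;      /* has istat been filled in? */
        IndexBulkDeleteResult istat;    /* result from the index AM */
    } LVSharedIndStats;

With one such entry per index in PARALLEL_VACUUM_STATS, both the leader and
the workers can subscript the array directly and assert on status instead
of consulting a bitmap.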

I think that this PARALLEL_VACUUM_STATS refactoring is actually the
simplest way to comprehensively test parallel VACUUM. I will still
need to add tests for my fix to bug #17245, but they won't be truly
general tests. I'll have to make sure that one of the assertions in
nbtdedup.c fails when the tests are run without the fix in place, or
something like that.

-- 
Peter Geoghegan




Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-11-01 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Nov 1, 2021, at 12:44 PM, Stephen Frost  wrote:
> > I can generally get behind the idea that a user who has been allowed to
> > create other roles should be able to do various other things with that
> > role, but should also be limited by what rights they themselves have
> > (unlike how CREATEROLE works today).
> 
> I intend to rearrange the role ownership patch set to have the 
> 0004-Restrict-power-granted-via-CREATEROLE.patch come before, and be 
> independent of, the patches that introduce role ownership.  That would put 
> the less controversial patch first, and might get the parts we all agree on committed.

I've not directly looked at that patch, but I like it based on the name,
if we think we can actually get folks to agree to it, as it's quite a
change from the current situation.  Previously I've felt that we
wouldn't have support for breaking backwards compatibility in such a
manner but perhaps I'm wrong on that.  There's also something to be
said, in my view anyway, of having a predefined role be used for what we
want CREATEROLE to be rather than changing the existing CREATEROLE role
attribute.  Reason being that such a predefined role could then be
GRANT'd to some other role along with whatever other generally-relevant
privileges there are and then that GRANT'd to whomever should have those
rights.  That's not really possible with the current CREATEROLE role
attribute.

> > That said, I have a hard time seeing why we're drawing this distinction
> > of 'ownership' as being ultimately different from simple 'admin' rights
> > on a role.  In other words, beyond the ability to actually create/drop
> > roles, having 'admin' rights on a role already conveys just about
> > everything 'ownership' would.  The things that are getting in the way
> > there are:
> > 
> > - Ability to actually create/alter/drop roles, this needs to be
> >   addressed somehow but doesn't necessarily imply a need for
> >   'ownership' as a concept.
> > 
> > - Restriction of a role from being able to implicitly have 'admin'
> >   rights on itself, as I started a discussion about elsewhere.
> > 
> > - Some system for deciding who event triggers should fire for.  I don't
> >   think this should really be tied into the question of who has admin
> >   rights on whom except to the extent that maybe you can only cause
> >   event triggers to fire for roles you've got admin rights on (or maybe
> >   membership in).
> 
> You and I are not that far apart on this issue.  The reason I wanted to use 
> "ownership" rather than ADMIN is that the spec has a concept of ADMIN and I 
> don't know that we can fix everything we want to fix and still be within 
> compliance with the spec.

There's no concept in the spec of event triggers, I don't believe
anyway, so I'm not really buying this particular argument.  Seems like
we'd be more likely to run afoul of some future spec by creating a
concept of role ownership and creating a definition around what that
means than using something existing in the spec as controlling what some
other not-in-spec thing does.

> > One thing that comes to mind is that event triggers aren't the only
> > thing out there and I have to wonder if we should be thinking about
> > other things.  As a thought exercise- how is an event trigger really
> > different from a table-level trigger?  Anyone who has the ability to
> > create objects is able to create tables, create functions, create
> > operators, and a user logging in and running SQL can certainly end up
> > running those things with their privileges.
> 
> The difference in my mind is that table triggers owned by non-superusers have 
> been around for a long time and are in heavy use, so changing how that 
> behaves is a huge backwards compatibility break.  Event triggers owned by 
> non-superusers are only a fluke, not an intentional feature, and only occur 
> when a superuser creates an event trigger and later has superuser privileges 
> revoked.  We can expect that far fewer users are really depending on that to 
> work compared with table triggers.
> 
> In a green field, I would not create table triggers to work as they do.

I don't think we're entirely beholden to having table-level triggers
work the way they do today.  I agree that we can't simply stop having
them fire for some users while letting things continue to happen on the
table but throwing an error and rolling back a transaction with an error
saying "you were about to run this trigger which runs this function with
your privileges and you don't trust the person who wrote it" seems
entirely within reason, were we to have such a concept.

> >  We've generally recognized
> > that that's not great and there's been work to get it so that the
> > 'public' schema that everyone has in their search_path by default won't
> > be world-writable but that isn't exactly a cure-all for the general
> > problem.
> 
> I agree.
> 
> > One of the interesting bits i

Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-11-01 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Nov 1, 2021, at 1:13 PM, Stephen Frost  wrote:
> >> Having Batman *own* all residents in Gotham city would work, if we can 
> >> agree on a role ownership system.  It has the downside that only a role's 
> >> (direct or indirect) owner can do the auditing, though.  That's more 
> >> flexible than what we have today, where only superuser can do it, but 
> >> maybe some people would want to argue for a different solution with even 
> >> more flexibility?  A grantable privilege perhaps?  But whatever it is, the 
> >> reasoning about who gets audited and who does not must be clear enough 
> >> that Batman can pass a compliance audit.
> > 
> > What about roles which Batman owns but which he *doesn't* want the event
> > trigger to fire for?
> 
> I think Batman just has the event trigger exit early for that.  There is 
> nothing we can hardcode for filtering users into and out of the trigger that 
> will be as flexible as the logic that Batman can implement in the trigger 
> itself.  We only need to worry about Batman overstepping his authority.  
> It's not our job to filter further than that.
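
(For illustration only, a sketch of what that "exit early" filtering could
look like in plpgsql -- the excluded role names are hypothetical, and this
assumes Batman writes and maintains the trigger himself:)

CREATE FUNCTION gotham_audit_ddl() RETURNS event_trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- only act for roles Batman actually wants to audit;
    -- the excluded role names are purely illustrative
    IF session_user NOT IN ('alfred', 'robin') THEN
        RAISE NOTICE 'DDL by %: %', session_user, tg_tag;
    END IF;
END;
$$;

CREATE EVENT TRIGGER gotham_audit ON ddl_command_end
    EXECUTE FUNCTION gotham_audit_ddl();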

As noted in my other email you're likely currently reading, this
presumes that Batman is the author of the trigger and is able to make
such changes.  I'm also not thrilled with the presumption that, even if
Batman is the author and maintainer, Batman would then have to write in
such exclusions for what strikes me as a pretty commonly wished-for
use-case.

> > Note that event triggers are not strictly limited to the auditing case.
> > Viewing them through that lens masks other quite common use-cases which
> > are also important to consider (like preventing many users, but not all,
> > from being able to DROP objects as a clear example).
> 
> Nothing in my proposal limits what superusers can do with event triggers they 
> create.  The issue under discussion is entirely to do with what 
> non-superusers are allowed to do with event triggers.  I see no reason why 
> some ordinary role "joe" should be allowed to thwart DROP commands issued on 
> a table that "joe" doesn't own by roles that "joe" doesn't own.  Maybe "own" 
> here should be "have ADMIN on", but it has to be something.

I agree that we're talking about non-superuser event triggers.  I wasn't
suggesting that a non-superuser role 'joe' be able to create event
triggers that impact roles that 'joe' doesn't have rights of some kind
over.  I'm not quite following how your response here is addressing the
point that I brought up in what was quoted above it.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Feature request for adoptive indexes

2021-11-01 Thread Tomas Vondra




On 11/1/21 21:06, Robert Haas wrote:

On Tue, Oct 26, 2021 at 11:11 AM Tomas Vondra
 wrote:

If I get what you propose, you want to have a "top" tree for (job, nlp,
year), which "splits" the data set into subsets of ~5000-7000 rows. And
then for each subset you want separate "small" trees on each of the
other columns, so in this case three trees.

Well, the problem with this is pretty obvious - each of the small trees
requires separate copies of the leaf pages. And remember, in a btree the
internal pages are usually less than 1% of the index, so this pretty
much triples the size of the index. And if you insert a row into the
index, it has to insert the item pointer into each of the small trees,
likely requiring a separate I/O for each.

So I'd bet this is not any different from just having three separate
indexes - it doesn't save space, doesn't save I/O, nothing.


I agree. In a lot of cases, it's actually useful to define the index
on fewer columns, like (job, nlp, year) or even just (job, nlp) or
even just (job) because it makes the index so much smaller and that's
pretty important. If you have enough duplicate entries in a (job, nlp,
year) index to justify creating a (job, nlp, year, sequence) index, the
number of distinct (job, nlp, year) tuples has to be small compared to
the number of (job, nlp, year, sequence) tuples - and that means that
you wouldn't actually save much by combining your (job, nlp, year,
sequence) index with a (job, nlp, year, other-stuff) index. As you
say, the internal pages aren't the problem.

I don't intend to say that there's no possible use for this kind of
technology. Peter G. says that people are writing research papers
about that and they probably wouldn't be doing that unless they'd
found some case where it's a big win. But this example seems extremely
unconvincing.



I actually looked at the use case mentioned by Peter G, i.e. merged 
indexes with master-detail clustering (see e.g. [1]), but that seems 
like a rather different thing. The master-detail refers to storing rows 
from multiple tables, interleaved in a way that allows faster joins. So 
it's essentially a denormalization tool.


Perhaps there's something we could learn about efficient storage of the 
small trees, but I haven't found any papers describing that (I haven't 
spent much time on the search, though).


[1] Algorithms for merged indexes, Goetz Graefe
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.7709


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Partial aggregates pushdown

2021-11-01 Thread Ilya Gladyshev

Hi,

On 21.10.2021 13:55, Alexander Pyhalov wrote:

Hi. Updated patch.
Now aggregates with internal states can be pushed down, if they are 
marked as pushdown safe (this flag is set to true for min/max/sum),

have internal states and associated converters.


I don't quite understand why this is restricted only to aggregates that 
have 'internal' state, I feel like that should be possible for any 
aggregate that has a function to convert its final result back to 
aggregate state to be pushed down. While I couldn't come up with a 
useful example for this, except maybe for an aggregate whose aggfinalfn 
is used purely for cosmetic purposes (e.g. format the result into a 
string), I still feel that it is an unnecessary restriction.


A few minor review notes to the patch:


+static List *build_conv_list(RelOptInfo *foreignrel);

this should probably be up top among other declarations.


@@ -1433,6 +1453,48 @@ postgresGetForeignPlan(PlannerInfo *root,
                         outer_plan);
 }

+/*
+ * Generate attinmeta if there are some converters:
+ * they are expecxted to return BYTEA, but real input type is likely 
different.

+ */


typo in word "expec*x*ted".


@@ -139,10 +147,13 @@ typedef struct PgFdwScanState
                              * for a foreign join scan. */
 TupleDesc    tupdesc;        /* tuple descriptor of scan */
 AttInMetadata *attinmeta;    /* attribute datatype conversion 
metadata */
+    AttInMetadata *rcvd_attinmeta;    /* metadata for received tuples, 
NULL if

+                                     * there's no converters */


Looks like rcvd_attinmeta is redundant and you could use attinmeta for 
conversion metadata.


The last thing - the patch needs to be rebased, it doesn't apply cleanly 
on top of current master.


Thanks,

Ilya Gladyshev




Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-11-01 Thread Mark Dilger



> On Nov 1, 2021, at 2:00 PM, Stephen Frost  wrote:

> I've not directly looked at that patch, but I like it based on the name,
> if we think we can actually get folks to agree to is as it's quite a
> change from the current situation.  Previously I've felt that we
> wouldn't have support for breaking backwards compatibility in such a
> manner but perhaps I'm wrong on that.  

I am neutral on this.  I prefer not to break backward compatibility, but I also 
prefer to fix broken features rather than leave them as traps for the unwary.  
The CREATEROLE attribute exists and is defined in a way that is broadly viewed 
as a misfeature.  Fixing it has long term benefits, but short term 
compatibility concerns.

> There's also something to be
> said, in my view anyway, of having a predefined role be used for what we
> want CREATEROLE to be rather than changing the existing CREATEROLE role
> attribute.

I don't see an additional benefit beyond preserving compatibility with how 
CREATEROLE has historically worked.

>  Reason being that such a predefined role could then be
> GRANT'd to some other role along with whatever other generally-relevant
> privileges there are and then that GRANT'd to whomever should have those
> rights.  That's not really possible with the current CREATEROLE role
> attribute.

I'm not seeing that.  If you create a role "role_admin" and give it CREATEROLE 
and other stuff, maybe CREATEDB, pg_read_all_data, and so forth, then you grant 
"stephen" membership in "role_admin", doesn't that work?  Why would 
"role_admin" being a member of some new role, say "pg_can_create_roles", work 
differently than "role_admin" having the CREATEROLE attribute?
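
(Concretely, the setup I have in mind is roughly the following sketch; the
role names come from the example above, and the exact set of privileges
granted to role_admin is just for illustration:)

CREATE ROLE role_admin CREATEROLE CREATEDB;
GRANT pg_read_all_data TO role_admin;
CREATE ROLE stephen LOGIN;
GRANT role_admin TO stephen;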

>>> That said, I have a hard time seeing why we're drawing this distinction
>>> of 'ownership' as being ultimately different from simple 'admin' rights
>>> on a role.  In other words, beyond the ability to actually create/drop
>>> roles, having 'admin' rights on a role already conveys just about
>>> everything 'ownership' would.  The things that are getting in the way
>>> there are:
>>> 
>>> - Ability to actually create/alter/drop roles, this needs to be
>>>  addressed somehow but doesn't necessarily imply a need for
>>>  'ownership' as a concept.
>>> 
>>> - Restriction of a role from being able to implicitly have 'admin'
>>>  rights on itself, as I started a discussion about elsewhere.
>>> 
>>> - Some system for deciding who event triggers should fire for.  I don't
>>>  think this should really be tied into the question of who has admin
>>>  rights on whom except to the extent that maybe you can only cause
>>>  event triggers to fire for roles you've got admin rights on (or maybe
>>>  membership in).
>> 
>> You and I are not that far apart on this issue.  The reason I wanted to use 
>> "ownership" rather than ADMIN is that the spec has a concept of ADMIN and I 
>> don't know that we can fix everything we want to fix and still be within 
>> compliance with the spec.
> 
> There's no concept in the spec of event triggers, I don't believe
> anyway, so I'm not really buying this particular argument.  Seems like
> we'd be more likely to run afoul of some future spec by creating a
> concept of role ownership and creating a definition around what that
> means than using something existing in the spec as controlling what some
> other not-in-spec thing does.

The WITH ADMIN OPTION feature has a really well defined meaning.  If you have 
ADMIN on a role, you can grant and revoke that role to/from other roles.  
That's it.  If we start tying a bunch of other stuff to that, we're breaking 
reasonable expectations about how WITH ADMIN OPTION works, and since the spec 
defines how that works, we're then in violation of the spec.

CREATEROLE, on the other hand, has no defined meaning in the spec.  It's a 
postgres invention.  So if we change what it means, we're not breaking 
compatibility with the spec, only backward compatibility with older versions of 
postgres vis-a-vis the CREATEROLE misfeature that most people presumably don't 
use.  I find that far preferable to breaking spec compliance.  It is strange to 
me that you view changing how WITH ADMIN OPTION functions as being motivated by 
spec compliance, since I see it as going in the opposite direction.

As you say above, we'd have to tie the ability to create, alter, and drop roles 
to the ADMIN option.  That already sounds like a non-starter to me.  We'd 
further want to tie other stuff, like event triggers, to ADMIN option.  I don't 
see how this furthers spec compliance.

Tying this stuff to CREATEROLE seems perfectly fair.  Nobody coming from 
another database vendor to postgres should have any spec-compatibility-based 
expectations about how CREATEROLE works.

>>> One thing that comes to mind is that event triggers aren't the only
>>> thing out there and I have to wonder if we should be thinking about
>>> other things.  As a thought exercise- how is an event trigger really
>>> different from a table-level 

Re: Partial aggregates pushdown

2021-11-01 Thread Ilya Gladyshev



On 01.11.2021 13:30, Alexander Pyhalov wrote:

Peter Eisentraut писал 2021-11-01 12:47:

On 21.10.21 12:55, Alexander Pyhalov wrote:
Now aggregates with internal states can be pushed down, if they are 
marked as pushdown safe (this flag is set to true for min/max/sum),
have internal states and associated converters. Converters are 
called locally, they transform aggregate result to serialized 
internal representation.
As converters don't have access to internal aggregate state, partial 
aggregates like avg() are still not pushable.


It seems to me that the system should be able to determine from the
existing aggregate catalog entry whether an aggregate can be pushed
down.  For example, it could check aggtranstype != internal and
similar.  A separate boolean flag should not be necessary.


Hi.
I think we can't infer this property from existing flags. For example, 
if I have avg() with bigint[] aggtranstype, it doesn't mean we can 
push it down. Nor could we decide whether a partial aggregate is safe to 
push down based on the presence of aggfinalfn (for example, it is defined 
for sum(numeric), but we can push that down).


I think one potential way to do it would be to allow pushing down 
aggregates that EITHER have state of the same type as their return type, 
OR have a conversion function that converts their return value to the 
type of their state.
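
(For what it's worth, the first of those two conditions is easy to eyeball
from the catalogs; this is just an illustrative query, not something the
patch itself would need:)

SELECT a.aggfnoid AS aggregate, a.aggtranstype::regtype AS state_type
  FROM pg_aggregate a
  JOIN pg_proc p ON p.oid = a.aggfnoid
 WHERE a.aggtranstype = p.prorettype;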






Re: Partial aggregates pushdown

2021-11-01 Thread Tomas Vondra




On 11/1/21 22:31, Ilya Gladyshev wrote:

Hi,

On 21.10.2021 13:55, Alexander Pyhalov wrote:

Hi. Updated patch.
Now aggregates with internal states can be pushed down, if they are 
marked as pushdown safe (this flag is set to true for min/max/sum),

have internal states and associated converters.


I don't quite understand why this is restricted only to aggregates that 
have 'internal' state, I feel like that should be possible for any 
aggregate that has a function to convert its final result back to 
aggregate state to be pushed down. While I couldn't come up with a 
useful example for this, except maybe for an aggregate whose aggfinalfn 
is used purely for cosmetic purposes (e.g. format the result into a 
string), I still feel that it is an unnecessary restriction.




But it's *not* restricted to aggregates with internal state. The patch 
merely requires aggregates with "internal" state to have an extra 
"converter" function.


That being said, I don't think the approach used to deal with internal 
state is the right one. AFAICS it simply runs the aggregate on the 
remote node, finalizes it there, and then uses the converter function to 
"expand" the partial result back into the internal state.


Unfortunately that only works for aggregates like "sum" where the result 
is enough to rebuild the internal state, but it fails for anything more 
complex (like "avg" or "var").


Earlier in this thread I pointed to the serial/deserial functions, and 
I think we need to do something like that for the internal state. I.e. 
we need to call the "serial" function on the remote node, which dumps 
the whole internal state, and then "deserial" on the local node.
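
(Illustration only: the aggregates that already define such functions for
their internal state can be listed from the catalogs, so one can see what
machinery is already there:)

SELECT aggfnoid AS aggregate, aggserialfn, aggdeserialfn
  FROM pg_aggregate
 WHERE aggserialfn <> 0;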


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fix C4819 warning in MSVC

2021-11-01 Thread Daniel Gustafsson
> On 1 Nov 2021, at 17:12, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> On 1 Nov 2021, at 14:56, Tom Lane  wrote:
>>> +1, but there are also C4819 warnings in fe_utils/print.c.  Can we get
>>> rid of that too?  That one's a bit more integral to the code, since
>>> (I think) it's complaining about the comments in the unicode_style table.
>>> But maybe we could replace those with Unicode codes + symbol names?
> 
>> Aha, I missed that one when skimming the (quite chatty) log.  The attached
>> addresses that file as well, replacing the comments with codepoints and 
>> names.
>> It does make the section of the code more verbose, but also more readable 
>> IMO.
> 
> Generally +1, but I think you'll need to add some dashes to keep pgindent
> from re-flowing those comment blocks (which would be a big hit to the
> readability).

Agreed. Done with reflow-guards for pgindent.

--
Daniel Gustafsson   https://vmware.com/





Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Tom Lane
"Bossart, Nathan"  writes:
> On 10/31/21, 1:55 PM, "Tom Lane"  wrote:
>> 2. I think we almost certainly have a problem in SyncPostCheckpoint.

> This one doesn't look as straightforward.  It looks like we might need
> a list_delete_first_n() to delete the first N entries all at once to
> improve this one.

Yeah.  We don't absolutely need a new list primitive: we could use
list_copy_tail() and then free the old list.  But the extra palloc
traffic involved makes this sound like a bad idea.  It does have
the advantage that we could shorten the List's storage even when
it doesn't go to empty, but I'm not sure that's worth anything.
If the List isn't going to empty, that implies that we're getting
a steady stream of unlink requests, meaning we'd probably just
fill it up again.

The minimum-change patch would have us truncating the list before
each AbsorbSyncRequests call, so that the list state meets that
function's expectations.  However, as long as UNLINKS_PER_ABSORB
is only 10, I don't think that gets us out of the O(N^2) woods.
So what I did in the attached is add a "canceled" flag to
PendingUnlinkEntry, which lets us deal with canceled or finished
entries without having to delete them from the list right away.
Then we only need to physically clean up the list once per
SyncPostCheckpoint call.

regards, tom lane

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 94fb236daf..12847e35ea 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -906,6 +906,72 @@ list_delete_last(List *list)
 	return list_truncate(list, list_length(list) - 1);
 }
 
+/*
+ * Delete the first N cells of the list.
+ *
+ * The List is pfree'd if the request causes all cells to be deleted.
+ */
+List *
+list_delete_first_n(List *list, int n)
+{
+	check_list_invariants(list);
+
+	/* No-op request? */
+	if (n <= 0)
+		return list;
+
+	/* Delete whole list? */
+	if (n >= list_length(list))
+	{
+		list_free(list);
+		return NIL;
+	}
+
+	/*
+	 * Otherwise, we normally just collapse out the removed elements.  But for
+	 * debugging purposes, move the whole list contents someplace else.
+	 *
+	 * (Note that we *must* keep the contents in the same memory context.)
+	 */
+#ifndef DEBUG_LIST_MEMORY_USAGE
+	memmove(&list->elements[0], &list->elements[n],
+			(list->length - n) * sizeof(ListCell));
+	list->length -= n;
+#else
+	{
+		ListCell   *newelems;
+		int			newmaxlen = list->length - n;
+
+		newelems = (ListCell *)
+			MemoryContextAlloc(GetMemoryChunkContext(list),
+			   newmaxlen * sizeof(ListCell));
+		memcpy(newelems, &list->elements[n], newmaxlen * sizeof(ListCell));
+		if (list->elements != list->initial_elements)
+			pfree(list->elements);
+		else
+		{
+			/*
+			 * As in enlarge_list(), clear the initial_elements[] space and/or
+			 * mark it inaccessible.
+			 */
+#ifdef CLOBBER_FREED_MEMORY
+			wipe_mem(list->initial_elements,
+	 list->max_length * sizeof(ListCell));
+#else
+			VALGRIND_MAKE_MEM_NOACCESS(list->initial_elements,
+	   list->max_length * sizeof(ListCell));
+#endif
+		}
+		list->elements = newelems;
+		list->max_length = newmaxlen;
+		list->length = newmaxlen;
+		check_list_invariants(list);
+	}
+#endif
+
+	return list;
+}
+
 /*
  * Generate the union of two lists. This is calculated by copying
  * list1 via list_copy(), then adding to it all the members of list2
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 4a2ed414b0..023d7413e2 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -69,6 +69,7 @@ typedef struct
 {
 	FileTag		tag;			/* identifies handler and file */
 	CycleCtr	cycle_ctr;		/* checkpoint_cycle_ctr when request was made */
+	bool		canceled;		/* true if request has been canceled */
 } PendingUnlinkEntry;
 
 static HTAB *pendingOps = NULL;
@@ -195,11 +196,12 @@ void
 SyncPostCheckpoint(void)
 {
 	int			absorb_counter;
+	ListCell   *lc;
 
 	absorb_counter = UNLINKS_PER_ABSORB;
-	while (pendingUnlinks != NIL)
+	foreach(lc, pendingUnlinks)
 	{
-		PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+		PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(lc);
 		char		path[MAXPGPATH];
 
 		/*
@@ -214,6 +216,10 @@ SyncPostCheckpoint(void)
 		if (entry->cycle_ctr == checkpoint_cycle_ctr)
 			break;
 
+		/* Skip over any canceled entries */
+		if (entry->canceled)
+			continue;
+
 		/* Unlink the file */
 		if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
 		  path) < 0)
@@ -231,15 +237,13 @@ SyncPostCheckpoint(void)
 		 errmsg("could not remove file \"%s\": %m", path)));
 		}
 
-		/* And remove the list entry */
-		pendingUnlinks = list_delete_first(pendingUnlinks);
-		pfree(entry);
+		/* Mark the list entry as canceled, just in case */
+		entry->canceled = true;
 
 		/*
 		 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
 		 * requests for a long time when there are many deletions to be d

Re: Missing include in be-secure-openssl.c?

2021-11-01 Thread Daniel Gustafsson
> On 1 Nov 2021, at 14:33, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> It does make sense, but it's a bit worrisome that the indirect inclusion no
>> longer works as there is no obvious explanation as to why.
> 
> Yeah.  Just to make things even more confusing, hamerkop is not failing
> in the back branches.  v14 at least has exactly the same contents of
> be-secure-openssl.c, so how's that happening?

That to me has the smell of some form of environment tainting or pollution, as
opposed to a code problem.  v14 and HEAD are identical wrt building this, so
the answer is likely to lie elsewhere.

>>> x509v3.h includes x509.h, so fe-secure-openssl.h would not need an
>>> update.  Now could it be a better practice to include both there?
> 
>> Judging by OpenSSL, including both is common practice unless the module only
>> deals with v3 extensions. Following that lead seems reasonable.
> 
> I see that fe-secure-openssl.c includes only x509v3.h, and it builds
> successfully on hamerkop.  So I'm now inclined to make be-secure-openssl.c
> match that.

That is in and of itself not wrong; it shouldn't be required, but it's
definitely not wrong to do regardless of what's causing this.

>  But it'd still be a good thing to trace the real cause.

Agreed, I'm hoping the animal owner can shed some light on this.

--
Daniel Gustafsson   https://vmware.com/





Re: Missing include in be-secure-openssl.c?

2021-11-01 Thread Michael Paquier
On Mon, Nov 01, 2021 at 11:15:32PM +0100, Daniel Gustafsson wrote:
> On 1 Nov 2021, at 14:33, Tom Lane  wrote:
>>> Judging by OpenSSL, including both is common practice unless the module only
>>> deals with v3 extensions. Following that lead seems reasonable.
>> 
>> I see that fe-secure-openssl.c includes only x509v3.h, and it builds
>> successfully on hamerkop.  So I'm now inclined to make be-secure-openssl.c
>> match that.
> 
> That is in and of itself not wrong; it shouldn't be required, but it's
> definitely not wrong to do regardless of what's causing this.

If it were me, I would follow upstream's practice and include both, to be
consistent, but I'm also fine if you just add x509v3.h to
be-secure-openssl.c.
--
Michael


signature.asc
Description: PGP signature


Re: Teach pg_receivewal to use lz4 compression

2021-11-01 Thread Michael Paquier
On Mon, Nov 01, 2021 at 08:39:59AM +, gkokola...@pm.me wrote:
> Agreed.
> 
> I have already started on v8 of the patch with that technique. I should
> be able to update the thread soon.

Nice, thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Missing include in be-secure-openssl.c?

2021-11-01 Thread Tom Lane
Michael Paquier  writes:
> If it were me, I would follow upstream's practice and include both, to be
> consistent, but I'm also fine if you just add x509v3.h to
> be-secure-openssl.c.

Another thing that's potentially relevant here is that
be-secure-openssl.c and fe-secure-openssl.c are including
ssl-related headers in different orders.  Theoretically
that shouldn't make any difference, but ...

regards, tom lane




Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-11-01 Thread Mark Dilger



> On Nov 1, 2021, at 2:05 PM, Stephen Frost  wrote:
> 
> I'm not quite following how your response here is addressing the
> point that I brought up in what was quoted above it.

Could you clarify which question I didn't answer?  I fear I may have left 
something unanswered, but I don't know to what you are referring.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Non-superuser subscription owners

2021-11-01 Thread Mark Dilger



> On Nov 1, 2021, at 10:58 AM, Mark Dilger  wrote:
> 
> ALTER SUBSCRIPTION..[ENABLE | DISABLE] do not synchronously start or stop 
> subscription workers.  The ALTER command updates the catalog's subenabled 
> field, but workers only lazily respond to that.  Disabling and enabling the 
> subscription as part of the OWNER TO would not reliably accomplish anything.

Having discussed this with Andrew off-list, we've concluded that updating the 
documentation for logical replication to make this point more clear is probably 
sufficient, but I wonder if anyone thinks otherwise?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit

2021-11-01 Thread David Zhang

I evaluated the effectiveness of the patch using a simple
multi-statement transaction:

BEGIN;
SAVEPOINT s;
INSERT INTO ft1 VALUES (10, 10);
INSERT INTO ft2 VALUES (20, 20);
RELEASE SAVEPOINT s;
COMMIT;

where ft1 and ft2 are foreign tables created on different foreign
servers hosted on different machines.  I ran the transaction five
times using the patch with the option enabled/disabled, and measured
the latencies for the RELEASE and COMMIT commands in each run.  The
average latencies for these commands over the five runs are:

* RELEASE
   parallel_commit=0: 0.385 ms
   parallel_commit=1: 0.221 ms

* COMMIT
   parallel_commit=0: 1.660 ms
   parallel_commit=1: 0.861 ms

With the option enabled, the average latencies for both commands are
reduced significantly!
Following your instructions, I performed some basic tests to compare the 
performance before and after the patch. In my testing environment (two 
foreign servers on the same local machine), the performance varies; 
sometimes the time spent on RELEASE and COMMIT without the patch is close 
to the patched numbers, but the patched version always seems to perform 
better. I then ran a 1-million-tuple insert; the 5-run averages are 
roughly as below:


Before
    RELEASE 0.171 ms, COMMIT 1.861 ms

After
    RELEASE 0.147 ms, COMMIT 1.305 ms
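
(In case it helps anyone reproducing this: I assumed the setting is a
per-foreign-server option named parallel_commit, per the parallel_commit=0/1
labels above, so enabling it looked roughly like the lines below; the server
names are just placeholders.)

ALTER SERVER fdw_server_1 OPTIONS (ADD parallel_commit 'true');
ALTER SERVER fdw_server_2 OPTIONS (ADD parallel_commit 'true');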

Best regards,
--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca




Re: Partial aggregates pushdown

2021-11-01 Thread Tomas Vondra




On 11/1/21 22:53, Ilya Gladyshev wrote:


On 01.11.2021 13:30, Alexander Pyhalov wrote:

Peter Eisentraut писал 2021-11-01 12:47:

On 21.10.21 12:55, Alexander Pyhalov wrote:
Now aggregates with internal states can be pushed down, if they are 
marked as pushdown safe (this flag is set to true for min/max/sum),
have internal states and associated converters. Converters are 
called locally, they transform aggregate result to serialized 
internal representation.
As converters don't have access to internal aggregate state, partial 
aggregates like avg() are still not pushable.


It seems to me that the system should be able to determine from the
existing aggregate catalog entry whether an aggregate can be pushed
down.  For example, it could check aggtranstype != internal and
similar.  A separate boolean flag should not be necessary.


Hi.
I think we can't infer this property from existing flags. For example, 
if I have avg() with bigint[] aggtranstype, it doesn't mean we can 
push it down. Nor could we decide whether a partial aggregate is safe to 
push down based on the presence of aggfinalfn (for example, it is defined 
for sum(numeric), but we can push that down).


I think one potential way to do it would be to allow pushing down 
aggregates that EITHER have state of the same type as their return type, 
OR have a conversion function that converts their return value to the 
type of their state.




IMO just checking (aggtranstype == result type) entirely ignores the 
issue of portability - we've never required the aggregate state to be 
portable in any meaningful way (between architectures, minor/major 
versions, ...) and it seems foolish to just start relying on it here.


Imagine for example an aggregate using bytea state, storing some complex 
C struct in it. You can't just copy that between architectures.


It's a bit like why we don't simply copy data over the network as-is, but 
pass it through input/output or send/receive functions. The new flag is a 
way to mark aggregates where this is safe, and I don't think we can do 
without it.


The more I think about this, the more I'm convinced the proper way to do 
this would be adding export/import functions, similar to serial/deserial 
functions, with the extra portability guarantees. And we'd need to do 
that for all aggregates, not just those with (aggtranstype == internal).


I get it - the idea of the patch is that keeping the data types the same 
makes it much simpler to pass the aggregate state (compared to having to 
export/import it). But I'm not sure it's the right approach.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Kyotaro Horiguchi
At Mon, 01 Nov 2021 11:58:35 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > At Sun, 31 Oct 2021 16:55:01 -0400, Tom Lane  wrote in 
> >> I looked at the remaining list_delete_first callers.
> >> 
> >> 1. Attached is a proposed patch to get rid of the calls in trgm_regexp.c.
> >> I'm not certain that those lists could get long enough to be a problem,
> >> given the existing complexity limits in that file (MAX_EXPANDED_STATES
> >> etc).  But I'm not certain they can't, either, and it's easy enough to
> >> fix along the same lines as in StandbyReleaseLockList.
> 
> > I should be missing something, but at the added list_free() there's a
> > case where keysQueue has some elelments.  I think the remaining
> > elements are useless but AFAICS all the memory allocated there is
> > freed after createTrgmNFAInternal returnes, before the "next cycle"
> > comes. Do we need to add that list_free there?
> 
> I was mainly trying to preserve the memory allocation behavior of the
> current code, which will free the List when its last element is removed.
> I agree that it *probably* doesn't matter --- but processState is
> invoked multiple times during one createTrgmNFAInternal call, so I'm
> not quite sure that leaking all those lists couldn't amount to anything.
> It seemed prudent to ensure that things couldn't be made worse by this
> patch.

Hmm. Actually that's kind of convincing.  Considering it together with the
reason for not releasing the remaining elements, it's fine with me.

> >> 3. Is agg_refill_hash_table a problem?  Probably; we've seen cases
> >> with lots of batches.
> 
> > I excluded it since I'm not sure it is in the pattern at a glance. I
> > would want to leave it alone, since changing the logic there seems
> > making things a bit complex and the gain by removing list_delete_first
> > doesn't look so large..
> 
> It looks to me like nodeAgg.c uses the hash_batches list as a stack:
> it's pushing things on with lcons, and later popping them off with
> list_delete_first.  This is exactly the pattern that 1cff1b95a
> recommended reversing.  Now, it's possible that that stack never gets
> deep enough for the O(N^2) cost to matter, but ...

Right. And the patch in another branch looks good to me.

> >> The logic in gistFindPath looks like a mess
> >> anyway since it's appending to both ends of the "fifo" list in different
> >> places (is that really necessary?).
> 
> > From the other side, the elements are inserted by lcons, then removed
> > by list_delete_first.  It is the worst usage of the current list
> > implementation as a FIFO. Couldn't we construct and iterate over a
> > list in the reverse order?
> 
> Yeah; at the very least, the use of both lcons and lappend falsifies
> the variable name "fifo".  I wonder though if that was intentional
> or just somebody's sloppy coding.

It looks intentional.

> /* Append this child to the list of pages to visit later */

So we would replace the lappend with lcons to get the same effect with
the reversed list.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Kyotaro Horiguchi
At Mon, 1 Nov 2021 17:02:49 +, "Bossart, Nathan"  
wrote in 
> On 11/1/21, 9:58 AM, "Tom Lane"  wrote:
> > "Bossart, Nathan"  writes:
> >> Should there be a list_free(trgmNFA->queue) at the end of
> >> transformGraph()?
> >
> > There could be, but that's visibly invoked only once per
> > createTrgmNFAInternal call, so I didn't think it was worthwhile
> > to do so (unlike the case for processState).  If we were concerned
> > about leakage in that function, the hash table would be a far
> > bigger issue.
> 
> Ah, I see it now.  The patch looks good to me, then.

+1

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Added schema level support for publication.

2021-11-01 Thread Masahiko Sawada
On Mon, Nov 1, 2021 at 7:28 PM Amit Kapila  wrote:
>
> On Mon, Nov 1, 2021 at 2:38 PM Greg Nancarrow  wrote:
> >
> > On Mon, Nov 1, 2021 at 5:07 PM Masahiko Sawada  
> > wrote:
> > >
> > > I haven't followed the discussion on pg_publication_objects view but
> > > what is the primary use case of this view? If it's to list all tables
> > > published in a publication (e.g, "select * from pg_publication_objects
> > > where pubname = 'pub1'), pg_publication_objects view lacks the
> > > information of FOR ALL TABLES publications. And probably we can use
> > > pg_publication_tables instead. On the other hand, if it's to list all
> > > tables published in FOR ALL TABLES IN SCHEMA publications (e.g.,
> > > "select * from pg_publication_object where objtype = 'schema'), the
> > > view doesn't show tables published in such publications.
> > >
>
> Both the problems mentioned can be fixed if we follow the change
> suggested by me in one of the emails above [1].
>
> >
> > I think that Amit originally suggested to have a view that provides
> > information about the objects in each publication (like table, tables
> > in schema, sequence ...).
> >
>
> Right and I think as you also mentioned in your previous email it can
> save the effort of users if they want to know all the objects
> published via a publication.

Thank you for the explanation. Given we already have
pg_publication_tables view, if pg_publication_objects view also shows
all tables published in FOR ALL TABLES publications or FOR ALL TABLES
IN SCHEMA publications, there is essentially not much difference
between pg_publication_tables and pg_publication_objects except for
objtype column. Right? If so it'd be better to have one row for each
FOR ALL TABLES publication and FOR ALL TABLES IN SCHEMA publication
with objtype = 'database' or 'schema' etc, instead of individual
tables.

> I am just not sure if it is worth adding
> such a view or we leave it to users to find that information via
> querying individual views or system tables for objects.

I've not looked at the patch for logical replication of sequences, but
perhaps the view becomes more useful once we support the new type of
replication object? If so, we can consider this view again after the
patch gets committed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: enabling FOO=bar arguments to vcregress.pl

2021-11-01 Thread Michael Paquier
On Mon, Nov 01, 2021 at 11:33:21AM -0400, Andrew Dunstan wrote:
> As I mentioned recently at
> , 
> I want to get USE_MODULE_DB working for vcregress.pl. I started out
> writing code to strip this from the command line or get it from the
> environment, but then it struck me that if would be better to implement
> a general Makefile-like mechanism for handling FOO=bar type arguments on
> the command line, along the lines of the attached.

I am not sure I understand how that will work with USE_MODULE_DB,
which sets up the database names used by the regression tests.  Each
target's module has its own needs in terms of settings that can be
used, meaning that you would still need some boilerplate to do the
mapping between a variable and its command argument?
--
Michael


signature.asc
Description: PGP signature


Re: inefficient loop in StandbyReleaseLockList()

2021-11-01 Thread Kyotaro Horiguchi
At Mon, 01 Nov 2021 18:01:18 -0400, Tom Lane  wrote in 
> "Bossart, Nathan"  writes:
> > On 10/31/21, 1:55 PM, "Tom Lane"  wrote:
> >> 2. I think we almost certainly have a problem in SyncPostCheckpoint.
> 
> > This one doesn't look as straightforward.  It looks like we might need
> > a list_delete_first_n() to delete the first N entries all at once to
> > improve this one.
> 
> Yeah.  We don't absolutely need a new list primitive: we could use
> list_copy_tail() and then free the old list.  But the extra palloc
> traffic involved makes this sound like a bad idea.  It does have
> the advantage that we could shorten the List's storage even when
> it doesn't go to empty, but I'm not sure that's worth anything.
> If the List isn't going to empty, that implies that we're getting
> a steady stream of unlink requests, meaning we'd probably just
> fill it up again.

I agree with that.  In general I think it's better not to resize storage
on truncation (or shortening) of a list, except for a few special cases
(for example, a list that temporarily grows very large but won't
go empty).

> The minimum-change patch would have us truncating the list before
> each AbsorbSyncRequests call, so that the list state meets that
> function's expectations.  However, as long as UNLINKS_PER_ABSORB
> is only 10, I don't think that gets us out of the O(N^2) woods.

Agreed.

> So what I did in the attached is add a "canceled" flag to
> PendingUnlinkEntry, which lets us deal with canceled or finished
> entries without having to delete them from the list right away.
> Then we only need to physically clean up the list once per
> SyncPostCheckpoint call.

Usually we don't loop over many canceled elements, so I think it
works well. However, shouldn't we check canceled before checking
cycle_ctr?  A canceled element should be completely invisible to later
accesses.  I vaguely feel the name "cancel" might better be
"consumed" or some such, but I don't object to "cancel".

I feel that we might need to wipe_mem for the memmove case as well
(together with list_delete_nth_cell), but that is a separate matter even
if I'm right about it.

Otherwise it looks good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: removing global variable ThisTimeLineID

2021-11-01 Thread Michael Paquier
On Mon, Nov 01, 2021 at 03:16:27PM -0400, Robert Haas wrote:
> At the risk of stating the obvious, using the same variable for
> different purposes at different times is not a great programming
> practice. Commits 2f5c4397c39dea49c5608ba583868e26d767fc32 and
> 902a2c280012557b85c7e0fce3f6f0e355cb2d69 show that the possibility of
> programmer error is not zero, even though neither of those issues are
> serious.

This can lead to more serious issues; see 595b9cba as a recent example.
I agree that getting rid of it in the long term is appealing.

> Moreover, the global variable itself seems to do nothing but
> invite programming errors. The name itself is a disaster. What is
> "this" timeline ID? Well, uh, the right one, of course! We should be
> more clear about what we mean: the timeline into which we are
> inserting, the timeline from which we are replaying, the timeline from
> which we are performing logical decoding. I suspect that having a
> global variable here isn't even really saving us anything much, as a
> value that does not change can be read from shared memory without a
> lock.

Some callbacks related to the WAL reader, used to read a page, have
introduced a dependency on ThisTimeLineID as a means to support
logical decoding on standbys.  This is discussed a bit on this thread,
for example:
https://www.postgresql.org/message-id/ddc31aa9-8083-58b7-0b47-e371cd4c7...@oss.nttdata.com

Properly tracking the TLI in logical decoding was mentioned by Andres
multiple times on -hackers, and this has not really been done.  Using
ThisTimeLineID for this purpose is something I've found confusing
myself, FWIW, so this is a step in the right direction.

> I would like to clean this up. Attached is a series of patches which
> try to do that. For ease of review, this is separated out into quite a
> few separate patches, but most likely we'd combine some of them for
> commit. Patches 0001 through 0004 eliminate all use of the global
> variable ThisTimeLineID outside of xlog.c, and patch 0005 makes it
> "static" so that it ceases to be visible outside of xlog.c. Patches
> 0006 to 0010 clean up uses of ThisTimeLineID itself, passing it around
> as a function parameter, or otherwise arranging to fetch the relevant
> timeline ID from someplace sensible rather than using the global
> variable as a scratchpad. Finally, patch 0011 deletes the global
> variable.
> 
> I have not made a serious attempt to clear up all of the
> terminological confusion created by the term ThisTimeLineID, which
> also appears as part of some structs, so you'll still see that name in
> the source code after applying these patches, just not as the name of
> a global variable. I have, however, used names like insertTLI and
> replayTLI in many places changed by the patch, and I think that makes
> it significantly more clear which timeline is being discussed in each
> case. In some places I have endeavored to improve the comments. There
> is doubtless more that could be done here, but I think this is a
> fairly respectable start.

Thanks for splitting things this way.  This helps a lot.

0001 looks fair.

+   /*
+* If we're writing and flushing WAL, the time line can't be changing,
+* so no lock is required.
+*/
+   if (insertTLI)
+   *insertTLI = XLogCtl->ThisTimeLineID;
In 0002, there is no downside in putting that in the spinlock section
either, no?  It seems to me that we should be able to call this one
even in recovery, as flush LSNs exist while in recovery.

+TimeLineID
+GetWALInsertionTimeLine(void)
+{
+   Assert(XLogCtl->SharedRecoveryState == RECOVERY_STATE_DONE);
Okay, that's cheaper.

The series 0003-0006 seemed rather fine, at quick glance.

In 0007, XLogFileClose() should reset openLogTLI.  The same can be
said in BootStrapXLOG() before creating the control file.  It may be
cleaner here to introduce an InvalidTimelineId, as discussed a couple
of days ago.

Some comments at the top of recoveryTargetTLIRequested need a refresh
with their references to ThisTimelineID.

The handling of XLogCtl->ThisTimeLineID still seems a bit blurry when
it comes to timeline switches because we don't update it while in
recovery when replaying a end-of-recovery record.  This could at least
be documented better.
--
Michael


signature.asc
Description: PGP signature


Re: Add support for ALTER INDEX .. ALTER [COLUMN] col_num {SET,RESET}

2021-11-01 Thread Michael Paquier
On Sat, Oct 30, 2021 at 09:45:50AM +0900, Michael Paquier wrote:
> Another thing that could be done is to mark the index as invalid once
> its set of opclass parameters is updated.  That would be simpler,
> while allowing users to fire a concurrent or non-concurrent rebuild at
> will after an ALTER INDEX.

For the record, I have looked at this possibility.  And things
become very tricky when it comes to indexes marked as either
indisclustered or indisreplident.  REINDEX CONCURRENTLY has the
particularity of switching both parameters to false when swapping to the
new index, because the old index is not needed anymore.  Invalid
indexes can be created by CREATE INDEX CONCURRENTLY (a constraint failure
at creation, for example), but they won't have any of those flags set.
So we now assume that indisinvalid is linked to both of them.

This makes the problem mentioned upthread trickier than it looks, as
we'd need to decide what to do after an ALTER INDEX that just switches
an index to be invalid if any of these are set for the existing
index.
--
Michael


signature.asc
Description: PGP signature


Re: Skipping logical replication transactions on subscriber side

2021-11-01 Thread Greg Nancarrow
On Fri, Oct 29, 2021 at 4:24 PM Masahiko Sawada  wrote:
>
> I've attached a new version patch. Since the syntax of skipping
> transaction id is under the discussion I've attached only the error
> reporting patch for now.
>

I have some comments on the v19-0001 patch:

v19-0001

(1) doc/src/sgml/monitoring.sgml
Seems to be missing the word "information":

BEFORE:
+  At least one row per subscription, showing about errors that
+  occurred on subscription.
AFTER:
+  At least one row per subscription, showing information about
+  errors that occurred on subscription.


(2) pg_stat_reset_subscription_worker(subid Oid, relid Oid)
First of all, I think that the documentation for this function should
make it clear that a non-NULL "subid" parameter is required for both
reset cases (tablesync and apply).
Perhaps this could be done by simply changing the first sentence to say:
"Resets statistics of a single subscription worker error, for a worker
running on subscription with subid."
(and then can remove " running on the subscription with
subid" from the last sentence)

I think that the documentation for this function should say that it
should be used in conjunction with the "pg_stat_subscription_workers"
view in order to obtain the required subid/relid values for resetting.
(and should provide a link to the documentation for that view)
Also, I think that the function documentation should make it clear
that the tablesync error case is indicated by a NULL "command" in the
information returned from the "pg_stat_subscription_workers" view
(otherwise the user needs to look at the server log in order to
determine whether the error is for the apply/tablesync worker).
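
For example, the docs could show something along these lines (assuming the
view exposes subid and relid columns matching the function's parameters, and
that the apply worker's entry has a NULL relid; the table name is just a
placeholder):

-- reset the error reported by the tablesync worker for one relation
SELECT pg_stat_reset_subscription_worker(subid, relid)
  FROM pg_stat_subscription_workers
 WHERE relid = 'tab1'::regclass;

-- reset the error reported by the apply worker
SELECT pg_stat_reset_subscription_worker(subid, NULL)
  FROM pg_stat_subscription_workers
 WHERE relid IS NULL;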

Finally, there are currently no tests for this new function.

(3)  pg_stat_subscription_workers
In the documentation for this, the description for the "command"
column says: "This field is always NULL if the error was reported
during the initial data copy."
Some users may not realise that this refers to "tablesync", so perhaps
add " (tablesync)" to the end of this sentence, or similar.


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Portability report: ninja

2021-11-01 Thread Andres Freund
Hi,

On 2021-11-01 15:25:08 -0400, Tom Lane wrote:
> Meson depends on ninja, so it behooves us to look into portability
> of ninja if we're considering migrating to meson as the build system.
> I tried it out quickly on some ancient platforms, and here's what
> I found.

Thanks, that's helpful!


> 1. The documentation is kinda sucky, eg it fails to explain how
> to run ninja's own regression tests.  I ended up with this
> reverse-engineered recipe:
> 
> git clone git://github.com/ninja-build/ninja.git
> cd ninja
> git checkout release
> ./configure.py --bootstrap
> ./ninja ninja_test
> ./ninja_test
> sudo cp ninja /usr/local/bin  # or location of choice
> 
> We'll probably have to include that in our own documentation.

Or perhaps a helper script? I was wondering whether we'd eventually want a
wrapper ./configure / makefile. Perhaps not worth it though - there are enough
projects using meson and ninja that it might add more confusion than help.


> There is also a process for bootstrapping using cmake, but that doesn't seem
> like a more attractive dependency for old platforms.

Yea. Especially given that meson itself needs python anyway.


> Fortunately for my
> purposes here, it seems to work with fairly old Python --- I built
> successfully with python 2.6.2, though not with 2.4.1.

meson will need a newer python though...


> 4. It built and passed self-test on macOS Leopard (10.5.8), which is
> pretty old ... but not old enough for prairiedog, which is stuck on
> 10.4.x.  There, the build fails miserably for lack of .
> It looks like that was added to POSIX in 2001, so one could have
> wished for it in a 2005-vintage OS; but nope, Apple took another
> couple of years to get around to that.  I'm not even going to bother
> trying on gaur's pre-turn-of-the-century OS.

> 5. It built and passed self-test on Solaris 11, but failed self-test
> on Solaris 10.

I think we can live with those...


> 6. While configure.py thinks it knows what to do on AIX, it fails
> on AIX 7.1 and 7.2:
> 
> Traceback (most recent call last):
>   File "./configure.py", line 544, in 
> if platform.is_aix() and not platform.is_os400_pase():
>   File "./configure.py", line 103, in is_os400_pase
> return self._platform == 'os400' or os.uname().sysname.startswith('OS400')
> AttributeError: 'tuple' object has no attribute 'sysname'
> 
> Possibly the ninja guys would take a patch for that (or maybe
> this is a you-need-python-3 case?).  I do see /usr/include/spawn.h
> on that platform, so there's room to hope it'd work.

That does seem like it'd be an issue. Briefly trawling the ninja git log, it
does look like there's regular-ish maintenance stuff for AIX, so I'd hope we
could get it fixed. I do suspect it's just a python3 issue, as Thomas noted.


> Based on these results, I doubt that ninja will give us trouble on any
> platform that isn't old enough to get its drivers license.

Agreed.

There's also an alternative compatible ninja implementation in C99 ([1]). But I
think its minimum requirements aren't actually lower than ninja's (it says it
requires POSIX 2008).

Greetings,

Andres Freund


[1] https://github.com/michaelforney/samurai




Re: archive modules

2021-11-01 Thread Fujii Masao




On 2021/11/02 3:54, Bossart, Nathan wrote:

This thread is a continuation of the thread with the subject
"parallelizing the archiver" [0].  That thread had morphed into an
effort to allow creating archive modules, so I've created a new one to
ensure that this topic has the proper visibility.


What is the main motivation of this patch? I was thinking that
it's for parallelizing WAL archiving. But from my very brief read of
the patch, WAL file names are still passed to
the archive callback function one by one.

Are you planning to extend this mechanism to other WAL
archiving-related commands like restore_command? I can imagine
that those who use an archive library (rather than shell) would
like to use the same mechanism for WAL restore.



I've attached the latest patch from the previous thread.  This patch
does a few things.  First, it adds the archive_library GUC that
specifies a library to use in place of archive_command.  If
archive_library is set to "shell" (the default), archive_command is
still used.  The archive_library is preloaded, so its _PG_init() can
do anything that libraries loaded via shared_preload_libraries can do.
Like logical decoding output plugins, archive modules must define an
initialization function and some callbacks.  The patch also introduces
the basic_archive module to ensure test coverage on the new
infrastructure.
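
(For context, from a user's point of view the switch presumably looks
something like the following, with basic_archive's own settings, e.g. a
target directory, omitted; I'm assuming the GUCs can be set via ALTER SYSTEM
and take effect at the next restart.)

ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_library = 'basic_archive';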


I think that it's worth adding this module into core
rather than handling it as a test module. It provides a very basic
WAL archiving feature, but (I guess) it's enough for some users.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Portability report: ninja

2021-11-01 Thread Tom Lane
Andres Freund  writes:
> On 2021-11-01 15:25:08 -0400, Tom Lane wrote:
>> Fortunately for my
>> purposes here, it seems to work with fairly old Python --- I built
>> successfully with python 2.6.2, though not with 2.4.1.

> meson will need a newer python though...

Yup.  I was just trying to establish what the lower bound was for
ninja, since they (ahem) don't document it.

The meson docs say it needs Python >= 3.6, which I think is going
to be an issue, but I've not researched it yet.  What I was mainly
curious about today was whether anything using ninja as the backend
could meet our portability goals.  My mind's more or less at ease
on that, but I am afraid that requiring 3.6 is going to be a problem
on LTS distributions.  That's a topic for another thread though.

regards, tom lane




Re: Added schema level support for publication.

2021-11-01 Thread Amit Kapila
On Tue, Nov 2, 2021 at 6:43 AM Masahiko Sawada  wrote:
>
> > I am just not sure if it is worth adding
> > such a view or we leave it to users to find that information via
> > querying individual views or system tables for objects.
>
> I've not looked at the patch for logical replication of sequences but
> the view becomes more useful once we support the new type of
> replication object? If so, we can consider this view again after the
> patch gets committed.
>

+1.

-- 
With Regards,
Amit Kapila.




Re: archive modules

2021-11-01 Thread Michael Paquier
On Tue, Nov 02, 2021 at 01:43:54PM +0900, Fujii Masao wrote:
> On 2021/11/02 3:54, Bossart, Nathan wrote:
>> This thread is a continuation of the thread with the subject
>> "parallelizing the archiver" [0].  That thread had morphed into an
>> effort to allow creating archive modules, so I've created a new one to
>> ensure that this topic has the proper visibility.
> 
> What is the main motivation of this patch? I was thinking that
> it's for parallelizing WAL archiving. But as far as I read
> the patch very briefly, WAL file name is still passed to
> the archive callback function one by one.

It seems to me that this patch is not moving in the right direction
implementation-wise (I have read the arguments about
backward compatibility that led to the introduction of archive_library
and its shell mode): it looks like a duplicate of
shared_preload_libraries, but for an extra code path dedicated to the
archiver, where we could just have a hook instead.  We have been
talking for some time now about making the archiver process more
bgworker-ish, so that we finish with something closer to what the
logical replication launcher is.
--
Michael


signature.asc
Description: PGP signature


Re: Skipping logical replication transactions on subscriber side

2021-11-01 Thread Amit Kapila
On Mon, Nov 1, 2021 at 7:18 AM Masahiko Sawada  wrote:
>
> On Fri, Oct 29, 2021 at 8:20 PM Amit Kapila  wrote:
> >
> >
> > Fair enough. So statistics can be removed either by vacuum or drop
> > subscription. Also, if we go by this logic then there is no harm in
> > retaining the stat entries for tablesync errors. Why have different
> > behavior for apply and tablesync workers?
>
> My understanding is that the subscription worker statistics entry
> corresponds to workers (but not physical workers since the physical
> process is changed after restarting). So if the worker finishes its
> jobs, it is no longer necessary to show errors since further problems
> will not occur after that. Table sync worker’s job finishes when
> completing table copy (unless table sync is performed again by REFRESH
> PUBLICATION) whereas apply worker’s job finishes when the subscription
> is dropped.
>

Actually, I am not very sure how users can use the old error
information after we allowed skipping the conflicting xid. Say, if
they want to add/remove some constraints on the table based on
previous errors then they might want to refer to errors of both the
apply worker and table sync worker.

> Also, I’m concerned about a situation where a lot of table syncs fail.
> In that case, if we don’t drop table sync worker statistics after the
> worker completes its job, we end up with a lot of entries in the view
> unless the subscription is dropped.
>

True, but the same could be said for apply workers, whose errors can
accumulate over a period of time.

> >
> > I have another question in this regard. Currently, the reset function
> > seems to be resetting only the first stat entry for a subscription.
> > But can't we have multiple stat entries for a subscription considering
> > the view's cumulative nature?
>
> I might be missing your point, but I think that with the current
> patch the view already has multiple entries per subscription. That is,
> there is one apply worker entry and multiple table sync worker entries
> per subscription.
>

Can't we have multiple entries for one apply worker?

> And the pg_stat_reset_subscription() function can reset any of these
> stats by specifying the subscription OID and relation OID.
>

Say, if the user has supplied just the subscription OID, isn't it
better to reset all the error entries for that subscription?
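
To make that concrete, here is a small illustrative sketch.  The hash
table, the key layout, and the function name are assumptions for the
example, not taken from the patch; the worker entries are assumed to be
keyed by (subscription OID, relation OID), with InvalidOid standing for
the apply worker:

#include "postgres.h"

#include "utils/hsearch.h"

/* Assumed key shape; also assumed to be the first field of each entry. */
typedef struct PgStat_StatSubWorkerKey
{
	Oid			subid;			/* subscription OID */
	Oid			subrelid;		/* relation OID, InvalidOid for apply worker */
} PgStat_StatSubWorkerKey;

/*
 * Proposed reset semantics: with relid == InvalidOid, remove every
 * entry of the subscription; otherwise remove only the matching
 * table sync entry.
 */
static void
reset_subscription_worker_stats(HTAB *subworkers, Oid subid, Oid relid)
{
	HASH_SEQ_STATUS scan;
	PgStat_StatSubWorkerKey *key;

	hash_seq_init(&scan, subworkers);
	while ((key = (PgStat_StatSubWorkerKey *) hash_seq_search(&scan)) != NULL)
	{
		if (key->subid != subid)
			continue;
		if (OidIsValid(relid) && key->subrelid != relid)
			continue;
		/* Removing the entry just returned by hash_seq_search is safe. */
		(void) hash_search(subworkers, key, HASH_REMOVE, NULL);
	}
}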

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2021-11-01 Thread Amit Kapila
On Mon, Nov 1, 2021 at 7:25 AM Masahiko Sawada  wrote:
>
> On Sat, Oct 30, 2021 at 12:21 PM Amit Kapila  wrote:
> >
> > Don't we want these stats to be dealt with in the same way as tables
> > and functions, given that all the stats entries (subscription entries)
> > are specific to a particular database? If so, I think we should
> > write/read them to/from the db-specific stats file in the same way as
> > we do for tables or functions. With the current patch, we will
> > unnecessarily read, and probably write, subscription stats even when
> > those are not required.
>
> Good point! So probably we should have PgStat_StatDBEntry hold the
> hash table for subscription worker statistics, right?
>

Yes.
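
As a rough illustration of that direction (the names, fields, and sizes
below are assumptions, not the committed layout), the per-database entry
could simply carry one more hash table whose entries are keyed by the
subscription and relation OIDs:

#include "postgres.h"

#include "pgstat.h"

/* Illustration only: worker stats hanging off the per-database entry. */
typedef struct PgStat_StatSubWorkerEntry
{
	/* hash key: subrelid is InvalidOid for the apply worker */
	Oid			subid;
	Oid			subrelid;

	PgStat_Counter error_count;
	TimestampTz	last_error_time;
	char		last_error_message[256];	/* length is an arbitrary guess */
} PgStat_StatSubWorkerEntry;

/*
 * PgStat_StatDBEntry would then gain something like
 *
 *		HTAB	   *subworkers;
 *
 * created, written out, and read back alongside the existing "tables"
 * and "functions" hash tables of the db-specific stats file.
 */

That way the stats machinery only touches subscription-worker entries
for the databases it is actually writing or reading, addressing the
concern above.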

-- 
With Regards,
Amit Kapila.




Re: parallel vacuum comments

2021-11-01 Thread Masahiko Sawada
On Tue, Nov 2, 2021 at 5:57 AM Peter Geoghegan  wrote:
>
> On Mon, Nov 1, 2021 at 5:47 AM Masahiko Sawada  wrote:
> > For discussion, I've written a patch only for adding some tests to
> > parallel vacuum. The test includes the reported case where small
> > indexes are not processed by the leader process as well as cases where
> > different kinds of indexes (i.g., different amparallelvacuumoptions)
> > are vacuumed or cleaned up.
>
> I started looking at this because I want to commit something like it
> alongside a fix to bug #17245.
>
> While I tend to favor relatively heavy assertions (e.g., the
> heap_page_is_all_visible() related asserts I added to
> lazy_scan_prune()), the idea of having a whole shared memory area just
> for assertions seems a bit too much, even to me. I tried to simplify
> it by just adding a new field to LVSharedIndStats, which seemed more
> natural. It took me at least 15 minutes before I realized that I was
> actually repeating exactly the same mistake that led to bug #17245 in
> the first place. I somehow forgot that
> parallel_stats_for_idx()/IndStatsIsNull() will return NULL for any
> index that has already been deemed too small to be worth processing in
> parallel. Even after all that drama!

The idea of that patch was to have something for the back branches that
does not affect builds done without --enable-cassert. That said, I agree
that it's an overkill solution.

> Rather than inventing PARALLEL_VACUUM_KEY_INDVAC_CHECK (just for
> assert-enabled builds), we should invent PARALLEL_VACUUM_STATS -- a
> dedicated shmem area for the array of LVSharedIndStats (no more
> storing LVSharedIndStats entries at the end of the LVShared space in
> an ad-hoc, type unsafe way). There should be one array element for
> each and every index -- even those indexes where parallel index
> vacuuming is unsafe or not worthwhile (unsure if avoiding parallel
> processing for "not worthwhile" indexes actually makes sense, BTW). We
> can then get rid of the bitmap/IndStatsIsNull() stuff entirely. We'd
> also add new per-index state fields to LVSharedIndStats itself. We
> could directly record the status of each index (e.g., parallel unsafe,
> amvacuumcleanup processing done, ambulkdelete processing done)
> explicitly. All code could safely subscript the LVSharedIndStats array
> directly, using idx style integers. That seems far more robust and
> consistent.

Sounds good.

During development, I wrote the patch trying to use as little shared
memory as possible, but it seems that this brought complexity (and
therefore the bug). In practice it would not be harmful to allocate
index statistics in DSM even for unsafe and "not worthwhile" indexes.
Moreover, tracking bulkdelete and vacuumcleanup completion might let us
improve vacuum progress reporting, so that the progress stats view
shows how many indexes have been vacuumed (or cleaned up).
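
For illustration, the direction described above could end up looking
roughly like the following; the type and field names are assumptions for
this sketch, not the eventual patch:

#include "postgres.h"

#include "access/genam.h"

/* Illustration only: explicit per-index state in the shared array. */
typedef enum ParallelIndVacStatus
{
	PARALLEL_INDVAC_STATUS_INITIAL = 0,
	PARALLEL_INDVAC_STATUS_NEED_BULKDELETE,
	PARALLEL_INDVAC_STATUS_NEED_CLEANUP,
	PARALLEL_INDVAC_STATUS_COMPLETED
} ParallelIndVacStatus;

typedef struct LVSharedIndStats
{
	bool		parallel_safe;			/* may a worker process this index? */
	bool		parallel_worthwhile;	/* big enough to bother with? */
	ParallelIndVacStatus status;		/* what is pending or already done */
	bool		istat_updated;			/* has istat below been filled in? */
	IndexBulkDeleteResult istat;
} LVSharedIndStats;

/*
 * With a dedicated PARALLEL_VACUUM_STATS DSM chunk holding one such
 * entry per index, both the leader and the workers can simply do
 * "&indstats[idx]" -- no bitmap and no IndStatsIsNull() logic needed.
 */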

Anyway, I'll write a patch accordingly.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: XTS cipher mode for cluster file encryption

2021-11-01 Thread Sasasu

On 2021/11/2 02:24, Stephen Frost wrote:

I can understand the general idea that we should be sure to engineer
this in a way that multiple methods can be used, as surely one day folks
will say that AES128 isn't acceptable any more.

Cheers!




Re: lastOverflowedXid does not handle transaction ID wraparound

2021-11-01 Thread Nikolay Samokhvalov
On Mon, Oct 25, 2021 at 11:41 AM Nikolay Samokhvalov 
wrote:

> On Thu, Oct 21, 2021 at 07:21 Stan Hu  wrote:
>
>> On Wed, Oct 20, 2021 at 9:01 PM Kyotaro Horiguchi
>>  wrote:
>> >
>> > lastOverflowedXid is the smallest subxid that possibly exists but is
>> > possibly not known to the standby. So if all top-level transactions
>> > older than lastOverflowedXid end, that means that all the
>> > subtransactions in doubt are known to have ended.
>>
>> Thanks for the patch! I verified that it appears to reset
>> lastOverflowedXid properly.
>
> ...

> Any ideas in the direction of observability?
>

Perhaps, anything additional should be considered separately.

The behavior discussed here looks like a bug.

I have also tested the patch. It works fully as expected; details of
the testing are below.

I think this is a serious bug hitting heavily loaded Postgres setups with
hot standbys, and I propose fixing it in all supported major versions
ASAP, since the fix looks simple.
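
To make that concrete, here is a minimal sketch of the kind of reset
implied by the reasoning quoted above.  The function it would live in,
the locking, and the variable names in the actual patch may differ, so
treat this purely as an illustration:

/*
 * Illustration only.  Once the standby learns, from a running-xacts
 * record, that every top-level transaction older than
 * lastOverflowedXid has ended, no subtransaction can still be unknown
 * to it, so the overflow marker can be cleared.  Assumed to run in the
 * startup process while ProcArrayLock is already held exclusively;
 * oldestRunningXid is the oldest xid still running on the primary
 * according to that record.
 */
if (TransactionIdIsValid(procArray->lastOverflowedXid) &&
	TransactionIdPrecedes(procArray->lastOverflowedXid, oldestRunningXid))
	procArray->lastOverflowedXid = InvalidTransactionId;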

Any standby in a heavily loaded system (10k+ TPS) where subtransactions
are used may experience huge performance degradation [1]. This is what
happened recently with GitLab [2]. While a full solution to this problem
is something more complex, probably requiring changes in the SLRU [3],
the problem discussed here definitely feels like a serious bug: even if
we fully get rid of subtransactions, since the 32-bit lastOverflowedXid
is not reset, in a new XID epoch standbys start experiencing
SubtransControlLock/SubtransSLRU waits again – without any
subtransactions at all. This problem is extremely difficult to diagnose,
and it can make standbys completely unresponsive while a long-lasting
transaction runs on the primary ("long" here may be a matter of minutes
or even dozens of seconds – it depends on the TPS level). It is
especially hard to diagnose in PG 12 or older, because those versions
don't have pg_stat_slru yet, so one cannot easily notice the Subtrans
reads.

The only current solution to this problem is to restart standby Postgres.

How I tested the patch. First, I reproduced the problem:
- current 15devel Postgres, installed on 2 x c5ad.2xlarge on AWS (8 vCPUs,
16 GiB), working as
primary + standby
- follow the steps described in [3] to initiate SubtransSLRU on the standby
- at some point, stop using SAVEPOINTs on the primary – use regular
UPDATEs instead, and wait.

Using the following, observe procArray->lastOverflowedXid:

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bd3c7a47fe21949ba63da26f0d692b2ee618f885..ccf3274344d7ba52a6f28a10b08dbfc310cf97e9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2428,6 +2428,9 @@ GetSnapshotData(Snapshot snapshot)
 		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
 												  xmax);
 
+		if (random() % 10 == 0)
+			elog(WARNING, "procArray->lastOverflowedXid: %u", procArray->lastOverflowedXid);
+
 		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
 			suboverflowed = true;
 	}

Once we stop using SAVEPOINTs on the primary, the value of
procArray->lastOverflowedXid stops changing, as expected.

Without the patch applied, lastOverflowedXid remains constant forever –
until the server is restarted. And, as I mentioned, we start
experiencing SubtransSLRU waits and pg_subtrans reads.

With the patch, lastOverflowedXid is reset to 0, as expected, shortly
after the ongoing "long" transaction ends on the primary.

This solves the bug – we no longer see SubtransSLRU waits on the standby
when subtransactions are not actually used on the primary.

[1] https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful
[2] https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/
[3] https://www.postgresql.org/message-id/flat/494C5E7F-E410-48FA-A93E-F7723D859561%40yandex-team.ru#18c79477bf7fc44a3ac3d1ce55e4c169
[4] https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/21


Re: lastOverflowedXid does not handle transaction ID wraparound

2021-11-01 Thread Nikolay Samokhvalov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       tested, failed
Spec compliant:           not tested
Documentation:            not tested

The fix is trivial and works as expected, solving the problem.

Tested; the details of the testing are described in the email thread.