Re: A proposal for shared memory based backup infrastructure

2022-08-04 Thread Bharath Rupireddy
On Sat, Jul 30, 2022 at 12:23 PM mahendrakar s
 wrote:
>
> On Mon, 25 Jul 2022 at 12:00, Bharath Rupireddy 
>  wrote:
>>
>> On Mon, Jul 25, 2022 at 10:03 AM mahendrakar s
>>  wrote:
>> >
>> > Hi Bharath,
>>
>> Thanks Mahendrakar for taking a look at the design.
>>
>> > "Typically, step (3) takes a good amount of time in production
>> > environments with terabytes or petabytes scale of data and keeping the
>> > session alive from step (1) to (4) has overhead and it wastes the
>> > resources.  And the session can get closed for various reasons - idle
>> > in session timeout, tcp/ip keepalive timeout, network problems etc.
>> > All of these can render the backup useless."
>> >
>> > >> this could be a common scenario and needs to be addressed.
>>
>> Hm. Additionally, the problem of keeping the session that starts the
>> backup open until the entire data directory is backed up becomes more
>> worrisome if we were to run backups for a huge number of servers at
>> scale - the entity (control plane or whatever) that is responsible
>> for taking backups across a huge fleet of postgres production servers
>> will have a tremendous amount of resources wasted, and it's a problem
>> for that entity to keep the backup sessions active until the actual
>> backup is finished.
>>
>> > "What if the backup started by a session can also be closed by another
>> > session? This seems to be achievable, if we can place the
>> > backup_label, tablespace_map and other required session/backend level
>> > contents in shared memory with the key as backup_label name. It's a
>> > long way to go."
>> >
>> > >>   I think storing metadata about backup of a session in shared memory 
>> > >> may not work as it gets purged when the database goes for restart. We 
>> > >> might require a separate catalogue table to handle the backup session.
>>
>> Right now, the non-exclusive (and we don't have exclusive backups now
>> from postgres 15) backup will anyway become useless if the postgres
>> restarts, because there's no running backup state (backup_label,
>> tablespace_map contents) that's persisted.
>>
>> Following are few more thoughts with the shared memory based backups
>> as proposed in this thread:
>>
>> 1) How many max backups do we want to allow? Right now, there's no
>> limit, I believe, max_connections number of concurrent backups can be
>> taken - we have XLogCtlInsert->runningBackups but no limit. If we were
>> to use shared memory to track the backup state, we might or might not
>> have to decide on max backup limit to not preallocate and consume
>> shared memory unnecessarily, otherwise, we could use something like
>> dynamic shared memory hash table for storing backup state.
>>
>> 2) How to deal with the backups that are started but no one is coming
>> to stop them? Basically, when to declare that the backup is dead or
>> expired? Perhaps we can have a max time limit after which, if no stop
>> backup has been issued for a backup, it is marked as dead or
>> expired.
>>
>> We may or may not want to think on the above points for now until the
>> idea in general has some benefits over the current backup
>> infrastructure.
>
> Hi Bharath,
>
> There might be security concerns if the backup started by one user can be
> stopped by another user.
> This is because the user who stops the backup will get the backup_label or
> tablespace_map file contents of the other user.
> Isn't this a concern for non-exclusive backups?
>
> I think there should be role based control for backup related activity which 
> can prevent other unprivileged users from stopping the backup.
>
> Thoughts?

The pg_backup_start() and pg_backup_stop() functions are role based -
restricted to superusers by default, but other users can be granted
EXECUTE on them - so I think the existing behaviour would suffice.
However, with the new shared memory based backups, the responsibility
of not letting users stop backups started by other users (which would
be possible with just the label name) can lie with those who use these
functions: they have to ensure that whoever starts a backup is the one
who stops it. Perhaps we can call that out explicitly in the
documentation.
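
As an aside, if we do go the shared memory route sketched upthread, a
fixed-size registry keyed by the backup label might look roughly like the
following. This is only an illustration - the names, the
MAX_TRACKED_BACKUPS limit and the field layout are hypothetical, not from
any patch:

#define MAX_TRACKED_BACKUPS 64            /* hypothetical limit */

typedef struct BackupStateEntry
{
    bool        in_use;
    char        label[MAXPGPATH];         /* backup label name, used as key */
    TimestampTz start_time;               /* to expire abandoned backups */
    /* backup_label and tablespace_map contents would follow */
} BackupStateEntry;

typedef struct BackupStateRegistry
{
    LWLock      lock;                     /* protects the entries array */
    BackupStateEntry entries[MAX_TRACKED_BACKUPS];
} BackupStateRegistry;

A dshash table would avoid the fixed limit, at the cost of managing
dynamic shared memory.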

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-04 Thread John Naylor
On Thu, Aug 4, 2022 at 3:25 AM Nathan Bossart 
wrote:
> Here is a new patch set without the annotation.

Were you considering adding the new function to simd.h now that that's
committed? It's a bit up in the air what should go in there, but this new
function is low-level and generic enough to be a candidate...

I wonder if the "pg_" prefix is appropriate here, as that is most often
used for things that hide specific details *and* where the base name would
clash, like OS calls or C library functions. I'm not quite sure where the
line is drawn, but I mean that "linearsearch" is a generic algorithm and
not a specific API we are implementing, if that makes sense.

The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(),
pg_popcount32, etc). We'll likely want to search bytes sometime, so
something to keep in mind as far as naming ("_int" vs "_byte"?).

I'm not a fan of "its" as a variable name, and I'm curious what it's
intended to convey.

All the __m128i vars could technically be declared const, although I think
it doesn't matter -- it's just a hint to the reader.

Out of curiosity do we know how much we get by loading four registers
rather than two?
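
For reference, the four-register probe being discussed could look roughly
like the sketch below (illustrative only; the function name, the
assumption that nelem is a multiple of 16, and the use of PostgreSQL's
uint32 typedef are mine, not the patch's):

#include <emmintrin.h>                    /* SSE2 intrinsics */

/* Probe 16 uint32 elements per iteration; a real version needs a scalar tail. */
static bool
search_uint32_x4(const uint32 *base, int nelem, uint32 key)
{
    const __m128i keys = _mm_set1_epi32((int) key);
    int         i;

    for (i = 0; i < nelem; i += 16)
    {
        const __m128i v1 = _mm_loadu_si128((const __m128i *) &base[i]);
        const __m128i v2 = _mm_loadu_si128((const __m128i *) &base[i + 4]);
        const __m128i v3 = _mm_loadu_si128((const __m128i *) &base[i + 8]);
        const __m128i v4 = _mm_loadu_si128((const __m128i *) &base[i + 12]);
        __m128i     cmp = _mm_or_si128(_mm_cmpeq_epi32(keys, v1),
                                       _mm_cmpeq_epi32(keys, v2));

        cmp = _mm_or_si128(cmp, _mm_cmpeq_epi32(keys, v3));
        cmp = _mm_or_si128(cmp, _mm_cmpeq_epi32(keys, v4));
        if (_mm_movemask_epi8(cmp) != 0)
            return true;
    }
    return false;
}

The question is essentially whether the two extra loads and compares per
iteration pay for themselves over an 8-elements-per-iteration loop.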
--
John Naylor
EDB: http://www.enterprisedb.com


Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-04 Thread Amit Langote
Hi,

On Wed, Aug 3, 2022 at 12:00 AM Andres Freund  wrote:
> On 2022-08-02 12:05:55 +0900, Amit Langote wrote:
> > On Tue, Aug 2, 2022 at 9:39 AM Andres Freund  wrote:
> > > On 2022-07-27 17:01:13 +0900, Amit Langote wrote:
> > > > Here's an updated version of the patch, with mostly cosmetic changes.
> > > > In particular, I added comments describing the new llvm_compile_expr()
> > > > blobs.
> > >
> > > - I've asked a couple times before: Why do we need space for every 
> > > possible
> > >   datatype at once in JsonItemCoercions? Can there be multiple 
> > > "concurrent"
> > >   coercions in process?
> >
> > This topic has been a head-scratcher for me too from the beginning,
> > but I've since come to understand (convince myself) that we do need
> > the coercions for all possible types, because we don't know the type
> > of the JSON item that's going to pop out of the main JSON path
> > expression until we've run it through the JSON path executor that
> > ExecEvalJson() invokes.  So, it's not possible to statically assign
> > the coercion.
>
> Sure. But that doesn't mean we have to have memory for every possible type *at
> the same time*.
>
> > I am not really sure if different coercions may be used
> > in the same query over multiple evaluations of the same JSON path
> > expression, but maybe that's also possible.
>
> Even if the type can change, I don't think that means we need to have space
> for multiple types at the same time - there can't be multiple coercions
> happening at the same time, otherwise there could be two coercions of the same
> type as well. So we don't need memory for every coercion type.

Do you find it unnecessary to statically allocate memory for
JsonItemCoercionState for each possible coercion, as in the following
struct definition:

typedef struct JsonItemCoercionsState
{
JsonItemCoercionState   null;
JsonItemCoercionState   string;
JsonItemCoercionState   numeric;
JsonItemCoercionState   boolean;
JsonItemCoercionState   date;
JsonItemCoercionState   time;
JsonItemCoercionState   timetz;
JsonItemCoercionState   timestamp;
JsonItemCoercionState   timestamptz;
JsonItemCoercionState   composite;
} JsonItemCoercionsState;

A given JsonItemCoercionState (note singular Coercion) contains:

typedef struct JsonItemCoercionState
{
/* Expression used to evaluate the coercion */
JsonCoercion   *coercion;

/* ExprEvalStep to compute this coercion's expression */
int jump_eval_expr;
} JsonItemCoercionState;

jump_eval_expr above is the address in JsonItemCoercions'
ExprState.steps of the 1st ExprEvalStep corresponding to
coercion->expr.  IIUC, all ExprEvalSteps needed to evaluate an
expression and its children must be allocated statically in
ExecInitExprRec(), and none on-the-fly as needed.  So, this considers
all coercions and allocates states of all statically.

> > >   The whole coercion stuff just seems incredibly clunky (in a slightly
> > >   different shape before this patch). ExecEvalJsonExprItemCoercion() calls
> > >   ExecPrepareJsonItemCoercion(), which gets a pointer to one of the 
> > > per-type
> > >   elements in JsonItemCoercionsState, dispatching on the type of the json
> > >   object. Then we later call ExecGetJsonItemCoercion() (via a convoluted
> > >   path), which again will dispatch on the type (extracting the json object
> > >   again afaics!), to then somehow eventually get the coerced value.
> >
> > I think it might be possible to make this a bit simpler, by not
> > leaving anything coercion-related in ExecEvalJsonExpr().
>
> Honestly, this code seems like it should just be rewritten from scratch.

Based on what I wrote above, please let me know if I've misunderstood
your concerns about over-allocation of coercion state.  I can try to
rewrite one more time if I know what this should look like instead.

> > I left some pieces there, because I thought the error of not finding an
> > appropriate coercion must be thrown right away as the code in
> > ExecEvalJsonExpr() does after calling ExecGetJsonItemCoercion().
> >
> > ExecPrepareJsonItemCoercion() is called later when it's time to
> > actually evaluate the coercion.  If we move the error path to
> > ExecPrepareJsonItemCoercion(), both ExecGetJsonItemCoercion() and the
> > error path code in ExecEvalJsonExpr() will be unnecessary.  I will
> > give that a try.
>
> Why do we need the separation of prepare and then evaluation? They're executed
> straight after each other?

ExecPrepareJsonItemCoercion() is a helper routine to choose the
coercion and extract the Datum out of the JsonbValue produced by the
EEOP_JSONEXPR_PATH step to feed to the coercion expression's
ExprEvalStep.  The coercion evaluation will be done by jumping to said
step in ExecInterpExpr().

> > > - Looks like there's still some recursive expression states, namely
> > >   JsonExprState->{result_coercion, coercions}?
> >
> > So, the problem with inlining coercion evaluation into the main p

Re: Correct comment in RemoveNonParentXlogFiles()

2022-08-04 Thread Ashutosh Sharma
On Thu, Aug 4, 2022 at 11:30 AM Kyotaro Horiguchi 
wrote:

> At Wed, 3 Aug 2022 18:16:33 +0530, Ashutosh Sharma 
> wrote in
> > Following comment in RemoveNonParentXlogFiles() says that we are trying
> to
> > remove any WAL file whose segment number is >= the segment number of the
> > first WAL file on the new timeline. However, looking at the code, I can
> say
> > that we are trying to remove the WAL files from the previous timeline
> whose
> > segment number is just greater than (not equal to) the segment number of
> > the first WAL file in the new timeline. I think we should improve this
> > comment, thoughts?
> >
> > /*
> >  * Remove files that are on a timeline older than the new one
> we're
> >  * switching to, but with a segment number >= the first segment
> on
> > the
> >  * new timeline.
> >  */
> > if (strncmp(xlde->d_name, switchseg, 8) < 0 &&
> > strcmp(xlde->d_name + 8, switchseg + 8) > 0)
>
> I'm not sure I'm fully getting your point.  The current comment is
> correctly saying that it removes the segments "on a timeline older
> than the new one". I agree about segment comparison.
>

Yeah, my complaint is about the comment on the segment comparison used for removal.


>
> So, if I changed that comment, I would finish with the following change.
>
> -  * switching to, but with a segment number >= the first segment
> on
> +  * switching to, but with a segment number greater than the
> first segment on
>

which looks correct to me and will bring it in line with the code.


>
> That disagreement started at the time the code was introduced by
> b2a5545bd6.  Leaving the last segment in the old timeline is correct
> since it is renamed to .partial later.  If the timeline switch happened
> just at a segment boundary, that segment would not be created.
>

Yeah, that's why we keep the last (partially written) segment from the old
timeline, which means we're not deleting it here. So the comment should not
say that we also remove the old-timeline WAL segment whose segment number is
equal to that of the first segment on the new timeline.
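
To make the comparison concrete (the file names below are invented for
illustration): WAL file names are 24 hex digits, the first 8 being the
timeline ID and the remaining 16 the segment number. If the first segment
on new timeline 2 is 000000020000000000000005, then
000000010000000000000006 (older timeline, greater segment number) gets
removed, while 000000010000000000000005 is kept - it is the one later
renamed with the .partial suffix.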

--
With Regards,
Ashutosh Sharma.


Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-04 Thread Kyotaro Horiguchi
At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch  wrote in 
> > > I think in this case a HINT might be sufficient to at least keep people 
> > > from
> > > wasting time tracking down a problem that has already been fixed.
> 
> Here's a patch to add that HINT.  I figure it's better to do this before next
> week's minor releases.  In the absence of objections, I'll push this around
> 2022-08-05 14:00 UTC.

+1

> > > However, there is another issue [1] that might argue for a back patch if
> > > this patch (as I believe) would fix the issue.
> > 
> > > [1] 
> > > https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> > 
> > That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> > chance of failing.  Recovery adds >N files between restartpoints.  Hence, 
> > the
> > WAL directory grows without bound.  Is that roughly the theory in mind?
> 
> On further reflection, I don't expect it to happen that way.  Each failure
> message is LOG-level, so the remaining recycles still happen.  (The race
> condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> already done at that point.)

I agree with this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: fix typos

2022-08-04 Thread John Naylor
On Wed, Aug 3, 2022 at 11:41 PM Robert Haas  wrote:
>
> I think that it's talking about this (documented) syntax:
>
> ALTER ROUTINE name [ ( [ [ argmode ] [ argname ] argtype [, ...] ] ) ]
> [ NO ] DEPENDS ON EXTENSION extension_name
>
> So the change from "depends" to "depend" here is incorrect. Maybe we
> can say something like:
>
> the DEPENDS ON EXTENSION
> extension_name action
>
> (I haven't tested whether this markup works.)

Makes sense, I'll go make it happen.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-08-04 Thread Bharath Rupireddy
On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin  wrote:
>
> > On 25 July 2022, at 14:29, Bharath Rupireddy 
> >  wrote:
> >
> > Hm, after thinking for a while, I tend to agree with the above
> > approach - meaning, query cancel interrupt processing can completely
> > be disabled in SyncRepWaitForLSN() and process proc die interrupt
> > immediately, this approach requires no GUC as opposed to the proposed
> > v1 patch upthread.
> GUC was proposed here[0] to maintain compatibility with previous behaviour. 
> But I think that having no GUC here is fine too. If we do not allow 
> cancelation of unreplicated backends, of course.
>
> >>
> >> And yes, we need additional complexity - but in some other place. 
> >> A transaction can also be locally committed in the presence of a server crash. 
> >> But this is another difficult problem. A crashed server must not allow data 
> >> queries until the LSN of the timeline end is successfully replicated to 
> >> synchronous_standby_names.
> >
> > Hm, that needs to be done anyways. How about doing as proposed
> > initially upthread [1]? Also, quoting the idea here [2].
> >
> > Thoughts?
> >
> > [1] 
> > https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jf2cfayv1mr7fzd8tr4ym5+oweyg1s...@mail.gmail.com
> > [2] 2) Wait for sync standbys to catch up upon restart after the crash or
> > in the next txn after the old locally committed txn was canceled. One
> > way to achieve this is to let the backend, that's making the first
> > connection, wait for sync standbys to catch up in ClientAuthentication
> > right after successful authentication. However, I'm not sure this is
> > the best way to do it at this point.
>
>
> I think ideally the startup process should not allow read-only connections in 
> CheckRecoveryConsistency() until WAL is replicated to the quorum, at least up 
> to the new timeline's end LSN.

We can't do it in CheckRecoveryConsistency() unless I'm missing
something, because the walsenders (required for sending the remaining
WAL to sync standbys to achieve quorum) can only be started after the
server reaches a consistent state - after all, walsenders are
specialized backends.



-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: collate not support Unicode Variation Selector

2022-08-04 Thread Kyotaro Horiguchi
At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成  wrote in 
> Thank you for your reply.
> 
> About 60,000 characters are registered in the IPAmj Mincho font designated by 
> the national specifications. 
> It should be able to handle all characters.

Yeah, it is one of those fonts. But I didn't know that MS Word can
*display* ideographic variations. It is disappointing, though, that input
requires copy-pasting from the Web..  Maybe those characters can be
input smoothly by using ATOK or the like..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-08-04 Thread Bharath Rupireddy
On Thu, Aug 4, 2022 at 1:42 PM Bharath Rupireddy
 wrote:
>
> On Mon, Jul 25, 2022 at 4:20 PM Andrey Borodin  wrote:
> >
> > > On 25 July 2022, at 14:29, Bharath Rupireddy 
> > >  wrote:
> > >
> > > Hm, after thinking for a while, I tend to agree with the above
> > > approach - meaning, query cancel interrupt processing can completely
> > > be disabled in SyncRepWaitForLSN() and process proc die interrupt
> > > immediately, this approach requires no GUC as opposed to the proposed
> > > v1 patch upthread.
> > GUC was proposed here[0] to maintain compatibility with previous behaviour. 
> > But I think that having no GUC here is fine too. If we do not allow 
> > cancelation of unreplicated backends, of course.
> >
> > >>
> > >> And yes, we need additional complexity - but in some other place. 
> > >> A transaction can also be locally committed in the presence of a server crash. 
> > >> But this is another difficult problem. A crashed server must not allow data 
> > >> queries until the LSN of the timeline end is successfully replicated to 
> > >> synchronous_standby_names.
> > >
> > > Hm, that needs to be done anyways. How about doing as proposed
> > > initially upthread [1]? Also, quoting the idea here [2].
> > >
> > > Thoughts?
> > >
> > > [1] 
> > > https://www.postgresql.org/message-id/CALj2ACUrOB59QaE6=jf2cfayv1mr7fzd8tr4ym5+oweyg1s...@mail.gmail.com
> > > [2] 2) Wait for sync standbys to catch up upon restart after the crash or
> > > in the next txn after the old locally committed txn was canceled. One
> > > way to achieve this is to let the backend, that's making the first
> > > connection, wait for sync standbys to catch up in ClientAuthentication
> > > right after successful authentication. However, I'm not sure this is
> > > the best way to do it at this point.
> >
> >
> > I think ideally the startup process should not allow read-only connections in 
> > CheckRecoveryConsistency() until WAL is replicated to the quorum, at least 
> > up to the new timeline's end LSN.
>
> We can't do it in CheckRecoveryConsistency() unless I'm missing
> something, because the walsenders (required for sending the remaining
> WAL to sync standbys to achieve quorum) can only be started after the
> server reaches a consistent state - after all, walsenders are
> specialized backends.

Continuing the above thought (I inadvertently clicked the send
button previously): a simple approach would be to check for quorum in
PostgresMain(), before entering the query loop for (;;), for the
non-walsender case. A disadvantage is that, in the worst case, all
backends will be waiting here if achieving the sync quorum takes time
after a restart. Roughly, we could do the following in PostgresMain();
of course we need a locking mechanism so that all the backends that
reach here wait for the same LSN:

if (sync_replication_defined == true &&
    shmem->wait_for_sync_repl_upon_restart == true)
{
    SyncRepWaitForLSN(pg_current_wal_flush_lsn(), false);
    shmem->wait_for_sync_repl_upon_restart = false;
}

Thoughts?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: Fix obsoleted comments for function prototypes

2022-08-04 Thread Michael Paquier
On Tue, Aug 02, 2022 at 07:25:49PM +0900, Michael Paquier wrote:
> These declarations are linked to comments with their file paths, so
> making that automated looks rather complicated to me.  I have looked
> at the surroundings without noticing anything obvious, so what you
> have caught here sounds fine to me, good catches :)

Done as of 245e14e.
--
Michael




Re: Mingw task for Cirrus CI

2022-08-04 Thread Thomas Munro
On Thu, Aug 4, 2022 at 2:04 PM Thomas Munro  wrote:
> I noticed that this says:
>
> [01:01:45.657] sqlda.pgc: In function 'dump_sqlda':
> [01:01:45.657] sqlda.pgc:45:33: warning: format '%d' expects argument
> of type 'int', but argument 3 has type 'long long int' [-Wformat=]
> [01:01:45.657] 45 | "name sqlda descriptor: '%s' value %I64d\n",
> [01:01:45.657] | ^~~
> [01:01:45.657] ..
> [01:01:45.657] 49 | sqlda->sqlvar[i].sqlname.data, *(long long int
> *)sqlda->sqlvar[i].sqldata);
> [01:01:45.657] | ~~
> [01:01:45.657] | |
> [01:01:45.657] | long long int
>
> ... but fairywren doesn't.  Why would they disagree on recognising %I64d?

Oops, I was looking in the wrong place.  fairywren does also show the warning:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2022-08-02%2022%3A05%3A30&stg=ecpg-check

Something to fix, but not directly relevant to this patch.  Sorry for the noise.




Re: How is this possible "publication does not exist"

2022-08-04 Thread Marco Slot
On Wed, Aug 11, 2021 at 1:37 PM Amit Kapila  wrote:
> I think it is and the context is generated via
> output_plugin_error_callback. Is this reproducible for you and if so,
> can you share a test case or some steps to reproduce this? Does this
> work and suddenly start giving errors or it happens the very first
> time you tried to set up publication/subscription? I think some more
> details are required about your setup and steps to analyze this
> problem. You might want to check publication-side logs but not sure if
> get any better clue there.

This seems to regularly reproduce the issue on PostgreSQL 14.4:

drop subscription if exists local_sub;
drop publication if exists local_pub;
drop table if exists local;

select pg_create_logical_replication_slot('test','pgoutput');
create table local (x int, y int);
insert into local values (1,2);
create publication local_pub for table local;
create subscription local_sub connection 'host=localhost port=5432'
publication local_pub with (create_slot = false, slot_name = 'test',
copy_data = false);

The log on the publisher then repeatedly shows:
2022-08-04 10:46:56.140 CEST [12785] ERROR:  publication "local_pub"
does not exist
2022-08-04 10:46:56.140 CEST [12785] CONTEXT:  slot "test", output
plugin "pgoutput", in the change callback, associated LSN 8/6C01A270
2022-08-04 10:46:56.140 CEST [12785] STATEMENT:  START_REPLICATION
SLOT "test" LOGICAL 0/0 (proto_version '2', publication_names
'"local_pub"')

(fails in the same way when setting up the subscription on a different node)

The local_pub does appear in pg_publication, but it seems a bit like
the change_cb is using an old snapshot when reading from the catalog
in GetPublicationByName.

cheers,
Marco




Re: Reducing planning time on tables with many indexes

2022-08-04 Thread David Geier
Hi Tom,

On Wed, Jul 27, 2022 at 6:39 PM Tom Lane  wrote:

> David Geier  writes:
> > We tracked down the root cause of this slowdown to lock contention in
> > 'get_relation_info()'. The index lock of every single index of every
> single
> > table used in that query is acquired. We attempted a fix by pre-filtering
> > out all indexes that anyways cannot be used with a certain query, without
> > taking the index locks (credits to Luc Vlaming for idea and
> > implementation). The patch does so by caching the columns present in
> every
> > index, inside 'struct Relation', similarly to 'rd_indexlist'.
>
> I wonder how much thought you gave to the costs imposed by that extra
> cache space.  We have a lot of users who moan about relcache bloat
> already.


The current representation could be compacted (e.g. by storing the column
indexes as 16-bit integers, instead of using a Bitmapset). That should make
the additional space needed negligible compared to the current size of
RelationData.  On top of that, we could maybe reorder the members of
RelationData to save padding bytes. Currently, there is a lot of
interleaving of members of different sizes. Even when taking cache locality
into consideration, it looks like a fair amount could be saved, probably
offsetting the additional space needed to store the index column data.
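
As a rough illustration of that compaction idea (the struct below is
invented for illustration, not taken from the patch):

typedef struct RelIndexAttrs
{
    Oid         indexoid;                    /* OID of the index */
    int16       nkeyatts;                    /* number of key columns */
    AttrNumber  attnums[INDEX_MAX_KEYS];     /* 16-bit column numbers */
} RelIndexAttrs;

A per-index array of AttrNumber (int16) like this, or a variable-length
allocation sized by the actual number of columns, would replace the
per-index Bitmapset.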

> But more to the point, I do not buy the assumption that
> an index's set of columns is a good filter for which indexes are of
> interest.  A trivial counterexample from the regression database is
>
> regression=# explain select count(*) from tenk1;
>                                       QUERY PLAN
> ---------------------------------------------------------------------------------------------
>  Aggregate  (cost=219.28..219.29 rows=1 width=8)
>    ->  Index Only Scan using tenk1_hundred on tenk1  (cost=0.29..194.28 rows=10000 width=0)
> (2 rows)
>
Only for queries without index conditions does the current version of the
patch simply discard all indexes but the one with the fewest columns. This
is case (3) in s64_IsUnnecessaryIndex(). It is an over-simplification
with the goal of picking the index which yields the least I/O. For single
column indexes that works, but it can fall short for multi-column indexes
(e.g. an [INT, TEXT] index and an [INT, INT] index both have 2 columns, but
the latter would be better suited when there's no other index and we want
to read the first integer column). What we should do here instead is to
discard indexes based on storage size.


> It looks to me like the patch also makes unwarranted assumptions about
> being able to discard all but the smallest index having a given set
> of columns.  This would, for example, possibly lead to dropping the
> index that has the most useful sort order, or that has the operator
> class needed to support a specific WHERE clause.
>
Why would that be? If we keep all indexes that contain columns that are
used by the query, we also keep the indexes which provide a certain sort
order or operator class.

>
> In short, I'm not sure I buy this concept at all.  I think it might
> be more useful to attack the locking overhead more directly.  I kind
> of wonder why we need per-index locks at all during planning ---
> I think that we already interpret AccessShareLock on the parent table
> as being sufficient to block schema changes on existing indexes.
>
As I said in the reply to your other mail, there are huge savings also in
the serial case, where lock contention is not an issue. It seems like
pre-filtering saves work down the road. I'll do some perf runs to track
down where exactly the savings come from. One source I can think of is only
having to consider a subset of all indexes during path creation.


> Unfortunately, as things stand today, the planner needs more than the
> right to look at the indexes' schemas, because it makes physical accesses
> to btree indexes to find out their tree height (and I think there are some
> comparable behaviors in other AMs).  I've never particularly cared for
> that implementation, and would be glad to rip out that behavior if we can
> find another way.  Maybe we could persuade VACUUM or ANALYZE to store that
> info in the index's pg_index row, or some such, and then the planner
> could use it with no lock?
>
That's another interesting approach, but I would love to save the planner
cycles for the sequential case.

--
David Geier
(ServiceNow)


RE: Introduce wait_for_subscription_sync for TAP tests

2022-08-04 Thread shiy.f...@fujitsu.com
On Thu, Aug 4, 2022 2:28 PM Masahiko Sawada  wrote:
> 
> On Thu, Aug 4, 2022 at 10:37 AM Amit Kapila 
> wrote:
> >
> > On Wed, Aug 3, 2022 at 10:21 AM Amit Kapila 
> wrote:
> > >
> > > Pushed this one and now I'll look at your other patch.
> > >
> >
> > I have pushed the second patch as well after making minor changes in
> > the comments. Alvaro [1] and Tom [2] suggest to back-patch this and
> > they sound reasonable to me. Will you be able to produce back branch
> > patches?
> 
> Yes. I've attached patches for backbranches. The updates are
> straightforward on v11 - v15. However, on v10, we don't use
> wait_for_catchup() in some logical replication test cases. The commit
> bbd3363e128dae refactored the tests to use wait_for_catchup but it's
> not backpatched. So in the patch for v10, I didn't change the code
> that was changed by the commit. Also, since wait_for_catchup requires
> to specify $target_lsn, unlike the one in v11 or later, I changed
> wait_for_subscription_sync() accordingly.
> 

Thanks for your patches.

In the patches for pg11 ~ pg14, it looks like we need to add a "=pod" before the
current change in PostgresNode.pm. Right?

Regards,
Shi yu


Re: Add last failed connection error message to pg_stat_wal_receiver

2022-08-04 Thread Bharath Rupireddy
On Mon, Jul 25, 2022 at 2:40 PM Michael Paquier  wrote:
>
> On Mon, Jul 25, 2022 at 12:19:40PM +0530, Bharath Rupireddy wrote:
> > Thanks a lot Cary for reviewing. It will be great if you can add
> > yourself as a reviewer and set the status accordingly in the CF entry
> > here - https://commitfest.postgresql.org/38/3666/.
>
> Hmm.  This stands for the connection error, but there are other things
> that could cause a failure down the road, like an incorrect system
> ID or a TLI-related report, so that seems a bit limited to me?

Good point. The walreceiver can exit for any reason. We can either 1)
store all the error messages explicitly, or 2) think of using sigsetjmp,
but that only catches the ERROR kind, leaving out FATAL and PANIC
messages. Option (1) is simple but has problems - we may miss storing
future error messages (good commenting and reviewing may help here), and
all the error messages now need to be stored as strings, which is
complex. Option (2) seems reasonable but misses FATAL and PANIC messages
(we have many ERRORs, 2 FATALs, 3 PANICs). Maybe a combination of option
(1) for FATALs and PANICs, and option (2) for ERRORs, would help.

Thoughts?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




RE: collate not support Unicode Variation Selector

2022-08-04 Thread 荒井元成
Thank you for your reply.

SQL Server supports Unicode variation selectors, so I would like PostgreSQL to
support them as well.

Regards.

--
Motonari Arai

-Original Message-
From: Kyotaro Horiguchi 
Sent: Thursday, August 4, 2022 5:24 PM
To: n2...@ndensan.co.jp
Cc: thomas.mu...@gmail.com; t...@sss.pgh.pa.us;
pgsql-hackers@lists.postgresql.org
Subject: Re: collate not support Unicode Variation Selector

At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成  wrote in
> Thank you for your reply.
>
> About 60,000 characters are registered in the IPAmj Mincho font designated
by the national specifications.
> It should be able to handle all characters.

Yeah, it is one of those fonts. But I didn't know that MS Word can
*display* ideographic variations. It is disappointing, though, that input
requires copy-pasting from the Web..  Maybe those characters can be input
smoothly by using ATOK or the like..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center






Re: fix typos

2022-08-04 Thread John Naylor
On Tue, Aug 2, 2022 at 1:11 AM Justin Pryzby  wrote:
>
> On Mon, Aug 01, 2022 at 08:04:54PM +0200, Erik Rijkers wrote:
> > Recent typos...
>
> LGTM, thanks.
>
> Here are some others I've been sitting on, mostly in .c files.

I pushed Robert's suggestion, then pushed the rest of Erik's changes and
two of Justin's. For Justin's 0004:

--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -364,7 +364,7 @@ restart:
  if (nowait)
  ereport(ERROR,
  (errcode(ERRCODE_OBJECT_IN_USE),
- errmsg("could not drop replication origin with OID %d, in use by PID %d",
+ errmsg("could not drop replication origin with OID %u, in use by PID %d",

RepOriginId is a typedef for uint16, so this can't print the wrong answer,
but it is inconsistent with other uses. So it seems we don't need to
backpatch this one?

For patch 0002, the whitespace issue in the top comment in inval.c, I'm
inclined to just change all the out-of-place tabs in a single commit, so we
can add that to the list of whitespace commits.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-04 Thread David Steele

On 8/4/22 04:06, Kyotaro Horiguchi wrote:

At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch  wrote in

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.


Here's a patch to add that HINT.  I figure it's better to do this before next
week's minor releases.  In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.


+1


Looks good to me as well.


However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.



[1] 
https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com


That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
WAL directory grows without bound.  Is that roughly the theory in mind?


On further reflection, I don't expect it to happen that way.  Each failure
message is LOG-level, so the remaining recycles still happen.  (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)


I agree with this.


Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal 
growing without bound. Even if we are not sure what is causing it, how 
confident are we that the patches applied to v15 would fix it?


Regards,
-David




Re: Fix obsoleted comments for function prototypes

2022-08-04 Thread Richard Guo
On Thu, Aug 4, 2022 at 4:38 PM Michael Paquier  wrote:

> On Tue, Aug 02, 2022 at 07:25:49PM +0900, Michael Paquier wrote:
> > These declarations are linked to comments with their file paths, so
> > making that automated looks rather complicated to me.  I have looked
> > at the surroundings without noticing anything obvious, so what you
> > have caught here sounds fine to me, good catches :)
>
> Done as of 245e14e.


Thank you Michael!

Thanks
Richard


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Dilip Kumar
On Thu, Aug 4, 2022 at 9:41 AM Dilip Kumar  wrote:
>
> On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby  wrote:
> >
> > On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:
> > > Hm. This looks more like an issue of DROP DATABASE not being 
> > > interruptible. I
> > > suspect this isn't actually related to STRATEGY wal_log and could likely 
> > > be
> > > reproduced in older versions too.
> >
> > I couldn't reproduce it with file_copy, but my recipe isn't exactly 
> > reliable.
> > That may just mean that it's easier to hit now.
>
> I think this looks like a problem with drop database, but IMHO you are seeing
> this behavior only when a database is created using WAL_LOG, because in
> this strategy we are using buffers to write the destination database
> pages, and some of the dirty buffers and sync requests might still be
> pending.  Now, when we try to drop the database, it drops all the
> dirty buffers and all pending sync requests, and then, before it
> actually removes the directory, it gets interrupted; now you see the
> database directory on disk, which is partially corrupted.  See the below
> sequence of drop database:
>
>
> dropdb()
> {
> ...
> DropDatabaseBuffers(db_id);
> ...
> ForgetDatabaseSyncRequests(db_id);
> ...
> RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
>
> WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
>  -- Inside this it can process the cancel query and get interrupted
> remove_dbtablespaces(db_id);
> ..
> }
>
> I reproduced the same error by inducing error just before
> WaitForProcSignalBarrier.
>
> postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a;
> CREATE DATABASE
> ERROR:  XX000: test error
> LOCATION:  dropdb, dbcommands.c:1684
> postgres[14968]=# \c a
> connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC:
> could not open critical system index 2662
> Previous connection kept
> postgres[14968]=#

So basically, from this we can say it is entirely a problem with drop
database; I can produce any such behavior by interrupting drop database:
1. If we created some tables/inserted data and the drop database got
cancelled, the database directory might still exist but those objects
are lost.
2. Or the drop might remove the database directory and then get cancelled
before deleting the pg_database entry; then also you will end up with a
corrupted database, no matter whether it was created with WAL_LOG or
FILE_COPY.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Fix inconsistencies GUC categories

2022-08-04 Thread Shinya Kato

Hi,

It appears that config.sgml and pg_settings have not been updated, even 
though a new subcategory was added in 249d649. a55a984 may have been 
missed in the cleaning.

--
Category is 'CONNECTIONS AND AUTHENTICATION' and subcategory is 
'Connection Settings' at config.sgml.
Category is 'CONNECTIONS AND AUTHENTICATION' and subcategory is 
'Connection Settings' at pg_settings.
Category is 'CONNECTIONS AND AUTHENTICATION' and subcategory is 'TCP 
settings' at postgresql.conf.sample.

--

I would like to unify the following with config.sgml as in a55a984.
--
Category is 'REPORTING AND LOGGING' and subcategory is 'PROCESS TITLE' 
at config.sgml.
Category is 'REPORTING AND LOGGING' and subcategory is 'PROCESS TITLE' 
at pg_settings.
Category is 'PROCESS TITLE' and subcategory is none at 
postgresql.conf.sample.

--

Trivial changes were made to the following short_desc.
--
recovery_prefetch
enable_group_by_reordering
stats_fetch_consistency
--

I've attached a patch.
Thoughts?

--
Regards,

--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e2d728e0c4..2522f4c8c5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -896,6 +896,13 @@ include_dir 'conf.d'

   
  
+ 
+ 
+
+ 
+ TCP Settings
+
+ 
 
  
   tcp_keepalives_idle (integer)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af4a1c3068..5db5df6285 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -778,6 +778,8 @@ const char *const config_group_names[] =
 	gettext_noop("File Locations"),
 	/* CONN_AUTH_SETTINGS */
 	gettext_noop("Connections and Authentication / Connection Settings"),
+	/* CONN_AUTH_TCP */
+	gettext_noop("Connections and Authentication / TCP Settings"),
 	/* CONN_AUTH_AUTH */
 	gettext_noop("Connections and Authentication / Authentication"),
 	/* CONN_AUTH_SSL */
@@ -1216,7 +1218,7 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 	{
 		{"enable_group_by_reordering", PGC_USERSET, QUERY_TUNING_METHOD,
-			gettext_noop("enable reordering of GROUP BY key"),
+			gettext_noop("Enable reordering of GROUP BY key."),
 			NULL,
 			GUC_EXPLAIN
 		},
@@ -3460,7 +3462,7 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"tcp_keepalives_idle", PGC_USERSET, CONN_AUTH_SETTINGS,
+		{"tcp_keepalives_idle", PGC_USERSET, CONN_AUTH_TCP,
 			gettext_noop("Time between issuing TCP keepalives."),
 			gettext_noop("A value of 0 uses the system default."),
 			GUC_UNIT_S
@@ -3471,7 +3473,7 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"tcp_keepalives_interval", PGC_USERSET, CONN_AUTH_SETTINGS,
+		{"tcp_keepalives_interval", PGC_USERSET, CONN_AUTH_TCP,
 			gettext_noop("Time between TCP keepalive retransmits."),
 			gettext_noop("A value of 0 uses the system default."),
 			GUC_UNIT_S
@@ -3493,7 +3495,7 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"tcp_keepalives_count", PGC_USERSET, CONN_AUTH_SETTINGS,
+		{"tcp_keepalives_count", PGC_USERSET, CONN_AUTH_TCP,
 			gettext_noop("Maximum number of TCP keepalive retransmits."),
 			gettext_noop("This controls the number of consecutive keepalive retransmits that can be "
 		 "lost before a connection is considered dead. A value of 0 uses the "
@@ -3595,7 +3597,7 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"tcp_user_timeout", PGC_USERSET, CONN_AUTH_SETTINGS,
+		{"tcp_user_timeout", PGC_USERSET, CONN_AUTH_TCP,
 			gettext_noop("TCP user timeout."),
 			gettext_noop("A value of 0 uses the system default."),
 			GUC_UNIT_MS
@@ -3640,7 +3642,7 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"client_connection_check_interval", PGC_USERSET, CONN_AUTH_SETTINGS,
+		{"client_connection_check_interval", PGC_USERSET, CONN_AUTH_TCP,
 			gettext_noop("Sets the time interval between checks for disconnection while running queries."),
 			NULL,
 			GUC_UNIT_MS
@@ -4953,7 +4955,7 @@ static struct config_enum ConfigureNamesEnum[] =
 
 	{
 		{"stats_fetch_consistency", PGC_USERSET, STATS_CUMULATIVE,
-			gettext_noop("Sets the consistency of accesses to statistics data"),
+			gettext_noop("Sets the consistency of accesses to statistics data."),
 			NULL
 		},
 		&pgstat_fetch_consistency,
@@ -5044,7 +5046,7 @@ static struct config_enum ConfigureNamesEnum[] =
 
 	{
 		{"recovery_prefetch", PGC_SIGHUP, WAL_RECOVERY,
-			gettext_noop("Prefetch referenced blocks during recovery"),
+			gettext_noop("Prefetch referenced blocks during recovery."),
 			gettext_noop("Look ahead in the WAL to find references to uncached data.")
 		},
 		&recovery_prefetch,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b4bc06e5f5..90bec0502c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ 

RE: Data is copied twice when specifying both child and parent table in publication

2022-08-04 Thread houzj.f...@fujitsu.com
On Thursday, July 28, 2022 5:17 PM Peter Smith  wrote:
> Here are some review comments for the HEAD_v7-0001 patch:
> 
> ==
> 
> 1. 
> 
> I have a fundamental question about this patch.
> 
> IIUC the purpose of this patch is to ensure that (when
> publish_via_root = true) the copy of the partition data will happen
> only once (e.g. from one parent table on one of the publishers). But I
> think there is no guarantee that those 2 publishers even had the same
> data, right? Therefore it seems to me you could get different results
> if the data were copied from pub1 or from pub2. (I have not tried it -
> this is just my suspicion).
> 
> Am I correct or mistaken?

Since the subscribed publishers are combined with OR and are from the same
database, and we are trying to copy the data from the topmost parent table,
I think the results should be as expected.

Best regards,
Hou zj


Re: Question about user/database-level parameters

2022-08-04 Thread Daniel Verite
Japin Li wrote:

> However, if the database is in production, we cannot go into single-user
> mode, should we provide an option to change this behavior on the fly?

It already exists, through PGOPTIONS, which appears to work
for local_preload_libraries, in a quick test.

That is, you can log in by invoking psql with:
PGOPTIONS="-c local_preload_libraries="
and issue the ALTER USER to reset things back to normal
without stopping the instance.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: making relfilenodes 56 bits

2022-08-04 Thread Dilip Kumar
On Sat, Jul 30, 2022 at 1:59 AM Robert Haas  wrote:

> One solution to all this is to do as Dilip proposes here: for system
> relations, keep assigning the OID as the initial relfilenumber.
> Actually, we really only need to do this for pg_largeobject; all the
> other relfilenumber values could be assigned from a counter, as long
> as they're assigned from a range distinct from what we use for user
> relations.
>
> But I don't really like that, because I feel like the whole thing
> where we start out with relfilenumber=oid is a recipe for hidden bugs.
> I believe we'd be better off if we decouple those concepts more
> thoroughly. So here's another idea: what if we set the
> next-relfilenumber counter for the new cluster to the value from the
> old cluster, and then rewrote all the (thus-far-empty) system tables?

You mean, in a new cluster, start the next-relfilenumber counter from
the highest relfilenode/OID value in the old cluster, right?  Yeah, if
we start next-relfilenumber after the range of the old cluster, then we
can also avoid the logic of SetNextRelFileNumber() during upgrade.

My very initial idea around this was to start the next-relfilenumber
directly from 4 billion in the new cluster, so there cannot be any
conflict and we don't even need to identify the highest used relfilenode
value in the old cluster.  In fact, we don't need to rewrite the system
tables before upgrading, I think.  So what do we lose with this? Just 4
billion relfilenumbers - does that really matter, given the range we get
with the 56-bit relfilenumber?
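(For scale: 4 billion is 2^32 out of a 2^56 range, i.e. roughly one
sixteen-millionth of the available relfilenumbers.)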

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Smoothing the subtrans performance catastrophe

2022-08-04 Thread Simon Riggs
On Wed, 3 Aug 2022 at 20:18, Andres Freund  wrote:

> On 2022-08-01 17:42:49 +0100, Simon Riggs wrote:
> > The reason for the slowdown is clear: when we overflow we check every
> > xid against subtrans, producing a large stream of lookups. Some
> > previous hackers have tried to speed up subtrans - this patch takes a
> > different approach: remove as many subtrans lookups as possible. (So
> > is not competing with those other solutions).
> >
> > Attached patch improves on the situation, as also shown in the attached 
> > diagram.
>
> I think we should consider redesigning subtrans more substantially - even with
> the changes you propose here, there's still plenty ways to hit really bad
> performance. And there's only so much we can do about that without more
> fundamental design changes.

I completely agree - you will be glad to hear that I've been working
on a redesign of the subtrans module.

But we should be clear that redesigning subtrans has nothing to do
with this patch; they are separate ideas and this patch relates to
XidInMVCCSnapshot(), an important caller of subtrans.

I will post my patch, when complete, in a different thread.

> One way to fix a lot of the issues around pg_subtrans would be remove the
> pg_subtrans SLRU and replace it with a purely in-memory hashtable. IMO there's
> really no good reason to use an SLRU for it (anymore).
>
> In contrast to e.g. clog or multixact we don't need to access a lot of old
> entries, we don't need persistency etc. Nor is it a good use of memory and IO
> to have loads of pg_subtrans pages that don't point anywhere, because the xid
> is just a "normal" xid.
>
> While we can't put a useful hard cap on the number of potential subtrans
> entries (we can only throw subxid->parent mappings away once no existing
> snapshot might need them), saying that there can't be more subxids "considered
> running" at a time than can fit in memory doesn't seem like a particularly
> problematic restriction.

I do agree that sometimes it is easier to impose restrictions than to
try to provide unbounded resources.

Having said that, I can't see an easy way of making that work well in
practice for this case. Making write transactions just suddenly stop
working at some point doesn't sound like it would be good for
availability, especially when it happens sporadically and
unpredictably, as it would whenever long-running transactions appear
alongside users of subtransactions.

> So, why don't we use a dshash table with some amount of statically allocated
> memory for the mapping? In common cases that will *reduce* memory usage
> (because we don't need to reserve space for [as many] subxids in snapshots /
> procarray anymore) and IO (no mostly-zeroes pg_subtrans).

I considered this and have ruled it out, but as I said above, we can
discuss that on a different thread.

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: enable/disable broken for statement triggers on partitioned tables

2022-08-04 Thread Alvaro Herrera
Another point for backpatch: EnableDisableTrigger() changes API, which
is potentially not good.  In backbranches I'll keep the function
unchanged and add another function with the added argument,
EnableDisableTriggerNew().

So extensions that want to be compatible with both old and current
versions (assuming any users of that function exist out of core; I
didn't find any) could do something like

#if PG_VERSION_NUM < 160000
EnableDisableTriggerNew( all args )
#else
EnableDisableTrigger( all args )
#endif

and otherwise they're compatible as compiled today.

Since there are no known users of this interface, it doesn't seem to
warrant any more convenient treatment.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Those who use electric razors are infidels destined to burn in hell while
we drink from rivers of beer, download free vids and mingle with naked
well shaved babes." (http://slashdot.org/comments.pl?sid=44793&cid=4647152)




Re: Handle infinite recursion in logical replication setup

2022-08-04 Thread Masahiko Sawada
On Tue, Aug 2, 2022 at 8:59 PM Amit Kapila  wrote:
>
> On Tue, Jul 26, 2022 at 9:07 AM Amit Kapila  wrote:
> >
> > On Tue, Jul 26, 2022 at 7:13 AM Jonathan S. Katz  
> > wrote:
> > >
> > > Thanks for the example. I agree that it is fairly simple to reproduce.
> > >
> > > I understand that "copy_data = force" is meant to protect a user from
> > > hurting themself. I'm not convinced that this is the best way to do so.
> > >
> > > For example today I can subscribe to multiple publications that write to
> > > the same table. If I have a primary key on that table, and two of the
> > > subscriptions try to write an identical ID, we conflict. We don't have
> > > any special flags or modes to guard against that from happening, though
> > > we do have documentation on conflicts and managing them.
> > >
> > > AFAICT the same issue with "copy_data" also exists in the above scenario
> > > too, even without the "origin" attribute.
> > >
> >
> > That's true but there is no parameter like origin = NONE which
> > indicates that constraint violations or duplicate data problems won't
> > occur due to replication. In the current case, I think the situation
> > is different because a user has specifically asked not to replicate
> > any remote data by specifying origin = NONE, which should be dealt
> > differently. Note that current users or their setup won't see any
> > difference/change unless they specify the new parameter origin as
> > NONE.
> >
>
> Let me try to summarize the discussion so that it is easier for others
> to follow. The work in this thread is to avoid loops, and duplicate
> data in logical replication when the operations happened on the same
> table in multiple nodes. It has been explained in email [1] with an
> example of how a logical replication setup can lead to duplicate or
> inconsistent data.
>
> The idea to solve this problem is that we don't replicate data that is
> not generated locally which we can normally identify based on origin
> of data in WAL. The commit 366283961a achieves that for replication
> but still the problem can happen during initial sync which is
> performed internally via copy. We can't differentiate the data in heap
> based on origin. So, we decided to prohibit the subscription
> operations that can perform initial sync (ex. Create Subscription,
> Alter Subscription ... Refresh) by detecting that the publisher has
> subscribed to the same table from some other publisher.
>
> To prohibit the subscription operations, the currently proposed patch
> throws an error.  Then, it also provides a new copy_data option
> 'force' under which the user will still be able to perform the
> operation. This could be useful when the user intentionally wants to
> replicate the initial data even if it contains data from multiple
> nodes (for example, when in a multi-node setup, one decides to get the
> initial data from just one node and then allow replication of data to
> proceed from each of respective nodes).
>
> The other alternative discussed was to just give a warning for
> subscription operations and probably document the steps for users to
> avoid it. But the problem with that is once the user sees this
> warning, they won't be able to do anything except recreate the setup, so
> why not give an error in the first place?
>
> Thoughts?

Thank you for the summary!

I understand that this feature could help some cases, but I'm really
not sure adding the new value 'force' for them is worthwhile, and I'm
concerned it could reduce usability.

IIUC this feature would work only when origin = 'none' and copy_data =
'on', and the purpose is to prevent the data from being
duplicated/conflicted by the initial table sync. But there are cases
where duplication/conflict doesn't happen even if origin = 'none' and
copy_data = 'on'. For instance, the table on the publisher might be
empty. Also, even with origin = 'any', copy_data = 'on', there is a
possibility of data duplication/conflict. Why do we need to address
only the case where origin = 'none'? I think that using origin =
'none' doesn't necessarily mean using bi-directional (or N-way)
replication. Even when using uni-directional logical replication with
two nodes, they may use origin = 'none'. Therefore, it seems to me
that this feature works only for a narrow situation and has false
positives.

Since it has been the user's responsibility not to try to make the
data inconsistent by the initial table sync, I think that it might be
sufficient if we note the risk in the documentation.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Smoothing the subtrans performance catastrophe

2022-08-04 Thread Robert Haas
On Wed, Aug 3, 2022 at 4:14 PM Andres Freund  wrote:
> I don't think this scenario would fundamentally change - we already keep the
> set of subxids in backend local memory (e.g. either a dedicated
> TransactionStateData or an element in ->childXids) and in the locking table
> (via XactLockTableInsert()).

Sure

> The problematic part isn't keeping "actually" running subxids in memory, but
> keeping subxids that might be "considered running" in memory (i.e. subxids
> that are considered running by an old snapshot in another backend).
>
> A hashtable just containing child->parent mapping for subxids doesn't actually
> need that much memory. It'd be approximately (2 * 4 bytes) * subxids *
> (2-fillfactor) or such? So maybe ~10MB for 1 milllion subxids?  Allocating
> that on-demand doesn't strike me as prohibitive.

I mean the worst case is ~2 bn, no?
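(For scale, plugging ~2 billion subxids into the formula quoted above gives
roughly 8 bytes * 2e9 * 1.25, i.e. on the order of 20 GB.)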

-- 
Robert Haas
EDB: http://www.enterprisedb.com




ERREUR: cache lookup failed for function 0 with PostgreSQL 15 beta 2, no error with PostgreSQL 14.4

2022-08-04 Thread Phil Florent
Hi,

A DSS developer from my company, Julien Roze, reported to me an error I cannot
explain. Is it a new behavior or a bug?

The original query is much more complicated, but here is a simplified test case
with PostgreSQL 14 and 15 beta 2 on Debian 11, packages from pgdg:

Ver Cluster Port Status Owner    Data directory              Log file
14  main    5432 online postgres /var/lib/postgresql/14/main /var/log/postgresql/postgresql-14-main.log
15  main    5433 online postgres /var/lib/postgresql/15/main /var/log/postgresql/postgresql-15-main.log

psql -p 5432

select version();
                                                            version
------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 14.4 (Debian 14.4-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 ligne)


with fakedata as (
   select 'hello' word
   union all
   select 'world' word
)
select *
from (
   select word, count(*) over (partition by word) nb from fakedata
) t where nb = 1;
 
 word  | nb
-------+----
 hello |  1
 world |  1
(2 lignes)


with fakedata as (
   select 'hello' word
   union all
   select 'world' word
)
select *
from (
   select word, count(*) nb from fakedata group by word
) t where nb = 1;
 
 word  | nb
-------+----
 hello |  1
 world |  1
(2 lignes)

psql -p 5433

select version();
                                                              version
----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15beta2 (Debian 15~beta2-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 ligne)

with fakedata as (
   select 'hello' word
   union all
   select 'world' word
)
select *
from (
   select word, count(*) over (partition by word) nb from fakedata
) t where nb = 1;
ERREUR:  cache lookup failed for function 0

with fakedata as (
   select 'hello' word
   union all
   select 'world' word
)
select *
from (
   select word, count(*) nb from fakedata group by word
) t where nb = 1;
 
 word  | nb
-------+----
 hello |  1
 world |  1
(2 lignes)


Best regards,
Phil



Re: enable/disable broken for statement triggers on partitioned tables

2022-08-04 Thread Amit Langote
On Thu, Aug 4, 2022 at 9:56 PM Alvaro Herrera  wrote:
> Another point for backpatch: EnableDisableTrigger() changes API, which
> is potentially not good.  In backbranches I'll keep the function
> unchanged and add another function with the added argument,
> EnableDisableTriggerNew().

+1

> So extensions that want to be compatible with both old and current
> versions (assuming any users of that function exist out of core; I
> didn't find any) could do something like
>
> #if PG_VERSION_NUM < 160000
> EnableDisableTriggerNew( all args )
> #else
> EnableDisableTrigger( all args )
> #endif
>
> and otherwise they're compatible as compiled today.
>
> Since there are no known users of this interface, it doesn't seem to
> warrant any more convenient treatment.

Makes sense.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: ERREUR: cache lookup failed for function 0 with PostgreSQL 15 beta 2, no error with PostgreSQL 14.4

2022-08-04 Thread Justin Pryzby
On Thu, Aug 04, 2022 at 01:19:59PM +, Phil Florent wrote:
> A DSS developer from my company, Julien Roze, reported an error to me that I
> cannot explain. Is it a new behavior or a bug?
> 
> The original query is much more complicated, but here is a simplified test case
> with PostgreSQL 14 and 15 beta 2 on Debian 11, packages from pgdg:

Thanks for simplifying and reporting it.

It looks like an issue with window run conditions (commit 9d9c02ccd).

+David

(gdb) b pg_re_throw
(gdb) bt
#0  pg_re_throw () at elog.c:1795
#1  0x557c85645e69 in errfinish (filename=, 
filename@entry=0x557c858db7da "fmgr.c", lineno=lineno@entry=183, 
funcname=funcname@entry=0x557c858dc410 <__func__.24841> 
"fmgr_info_cxt_security") at elog.c:588
#2  0x557c85650e21 in fmgr_info_cxt_security 
(functionId=functionId@entry=0, finfo=finfo@entry=0x557c86a05ad0, 
mcxt=, ignore_security=ignore_security@entry=false) at fmgr.c:183
#3  0x557c85651284 in fmgr_info (functionId=functionId@entry=0, 
finfo=finfo@entry=0x557c86a05ad0) at fmgr.c:128
#4  0x557c84b32c73 in ExecInitFunc (scratch=scratch@entry=0x7ffc369a9cf0, 
node=node@entry=0x557c869f59b8, args=0x557c869f5a68, funcid=funcid@entry=0, 
inputcollid=inputcollid@entry=0, state=state@entry=0x557c86a05620)
at execExpr.c:2748
#5  0x557c84b27904 in ExecInitExprRec (node=node@entry=0x557c869f59b8, 
state=state@entry=0x557c86a05620, resv=resv@entry=0x557c86a05628, 
resnull=resnull@entry=0x557c86a05625) at execExpr.c:1147
#6  0x557c84b33a1d in ExecInitQual (qual=0x557c869f5b18, 
parent=parent@entry=0x557c86a05080) at execExpr.c:253
#7  0x557c84c8eadb in ExecInitWindowAgg (node=node@entry=0x557c869f4d20, 
estate=estate@entry=0x557c86a04e10, eflags=eflags@entry=16) at 
nodeWindowAgg.c:2420
#8  0x557c84b8edda in ExecInitNode (node=node@entry=0x557c869f4d20, 
estate=estate@entry=0x557c86a04e10, eflags=eflags@entry=16) at 
execProcnode.c:345
#9  0x557c84b70ea2 in InitPlan (queryDesc=queryDesc@entry=0x557c8695af50, 
eflags=eflags@entry=16) at execMain.c:938
#10 0x557c84b71658 in standard_ExecutorStart 
(queryDesc=queryDesc@entry=0x557c8695af50, eflags=16, eflags@entry=0) at 
execMain.c:265
#11 0x557c84b71ca4 in ExecutorStart 
(queryDesc=queryDesc@entry=0x557c8695af50, eflags=0) at execMain.c:144
#12 0x557c8525292b in PortalStart (portal=portal@entry=0x557c869a45e0, 
params=params@entry=0x0, eflags=eflags@entry=0, snapshot=snapshot@entry=0x0) at 
pquery.c:517
#13 0x557c8524b2a4 in exec_simple_query (
query_string=query_string@entry=0x557c86938af0 "with fakedata as (\n", ' ' 
, "select 'hello' word\n", ' ' , "union 
all\n", ' ' , "select 'world' word\n)\nselect *\nfrom (\n", 
' ' , "select word, count(*) over (partition by word) nb 
fro"...) at postgres.c:1204
#14 0x557c8524e8bd in PostgresMain (dbname=, 
username=username@entry=0x557c86964298 "pryzbyj") at postgres.c:4505
#15 0x557c85042db6 in BackendRun (port=port@entry=0x557c8695a910) at 
postmaster.c:4490
#16 0x557c8504a79a in BackendStartup (port=port@entry=0x557c8695a910) at 
postmaster.c:4218
#17 0x557c8504ae12 in ServerLoop () at postmaster.c:1808
#18 0x557c8504c926 in PostmasterMain (argc=3, argv=) at 
postmaster.c:1480
#19 0x557c84ce4209 in main (argc=3, argv=0x557c86933000) at main.c:197

(gdb) fr 7 
#7  0x557c84c8eadb in ExecInitWindowAgg (node=node@entry=0x557c869f4d20, 
estate=estate@entry=0x557c86a04e10, eflags=eflags@entry=16) at 
nodeWindowAgg.c:2420
2420winstate->runcondition = ExecInitQual(node->runCondition,

-- 
Justin




Re: fix typos

2022-08-04 Thread Tom Lane
John Naylor  writes:
> On Tue, Aug 2, 2022 at 1:11 AM Justin Pryzby  wrote:
>   ereport(ERROR,
>   (errcode(ERRCODE_OBJECT_IN_USE),
> - errmsg("could not drop replication origin with OID %d, in use by PID %d",
> + errmsg("could not drop replication origin with OID %u, in use by PID %d",

> RepOriginId is a typedef for uint16, so this can't print the wrong answer,
> but it is inconsistent with other uses. So it seems we don't need to
> backpatch this one?

Um ... if it's int16, then it can't be an OID, so I'd say this message has
far worse problems than %d vs %u.  It should not use that terminology.

regards, tom lane




Re: Introduce wait_for_subscription_sync for TAP tests

2022-08-04 Thread Tom Lane
Masahiko Sawada  writes:
> Yes. I've attached patches for backbranches.

FWIW, I'd recommend waiting till after next week's wrap before
pushing these.  While I'm definitely in favor of doing this,
the odds of introducing a bug are nonzero, so right before a
release deadline doesn't seem like a good time.

regards, tom lane




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Jonathan S. Katz

On 8/3/22 4:19 PM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> I did rule out wanting to do the "xid + $X" check after reviewing some
>> of the output. I think that both $X could end up varying, and it really
>> feels like a bandaid.
>
> It is that.  I wouldn't feel comfortable with $X less than 100 or so,
> which is probably sloppy enough to draw Robert's ire.  Still, realizing
> that what we want right now is a band-aid for 15beta3, I don't think
> it's an unreasonable short-term option.

Attached is the "band-aid / sloppy" version of the patch. Given that in the
test examples I kept seeing deltas over 100 for relfrozenxid, I chose 1000.
The mxid delta was smaller, but I kept it at 1000 for consistency (and
because I hope this test is short-lived in this state), though I can be
talked into otherwise.

>> Andres suggested upthread using "txid_current()" -- for the comparison,
>> that's one thing I looked at. Would any of the XID info from
>> "pg_control_checkpoint()" also serve for this test?
>
> I like the idea of txid_current(), but we have no comparable
> function for mxid do we?  While you could get both numbers from
> pg_control_checkpoint(), I doubt that's sufficiently up-to-date.

...unless we force a checkpoint in the test?

Jonathan

diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 4cbc75644c..ced889601f 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -226,6 +226,8 @@ ORDER BY c.oid::regclass::text
 EOM
 $horizon_query =~ s/\s+/ /g; # run it together on one line
 my $horizon1 = $oldnode->safe_psql('regression', $horizon_query);
+chomp($horizon1);
+my @horizon1_values = split(/\|/, $horizon1);
 
 # In a VPATH build, we'll be started in the source directory, but we want
 # to run pg_upgrade in the build directory so that any files generated finish
@@ -313,6 +315,8 @@ $newnode->command_ok(
 
 # And record the horizons from the upgraded cluster as well.
 my $horizon2 = $newnode->safe_psql('regression', $horizon_query);
+chomp($horizon2);
+my @horizon2_values = split(/\|/, $horizon2);
 
 # Compare the two dumps, there should be no differences.
 my $compare_res = compare("$tempdir/dump1.sql", "$tempdir/dump2.sql");
@@ -331,8 +335,13 @@ if ($compare_res != 0)
print "=== EOF ===\n";
 }
 
-# Compare the horizons, there should be no differences.
-my $horizons_ok = $horizon1 eq $horizon2;
+# Compare the horizons. There should be no change in the OID, so we will test
+# for equality. However, because autovacuum may have run between the two
+# horizons, we will check if the updated horizon XIDs are greater than or
+# equal to the originals.
+my $horizons_ok = $horizon2_values[0] eq $horizon1_values[0] &&
+   int($horizon2_values[1]) <= int($horizon1_values[1]) + 1000 &&
+   int($horizon2_values[2]) <= int($horizon1_values[2]) + 1000;
 ok($horizons_ok, 'old and new horizons match after pg_upgrade');
 
 # Provide more context if the horizons do not match.


Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Robert Haas
On Wed, Aug 3, 2022 at 7:19 PM Tom Lane  wrote:
> "Jonathan S. Katz"  writes:
> > I did rule out wanting to do the "xid + $X" check after reviewing some
> > of the output. I think that both $X could end up varying, and it really
> > feels like a bandaid.
>
> It is that.  I wouldn't feel comfortable with $X less than 100 or so,
> which is probably sloppy enough to draw Robert's ire.  Still, realizing
> that what we want right now is a band-aid for 15beta3, I don't think
> it's an unreasonable short-term option.

100 << 2^32, so it's not terrible, but I'm honestly coming around to
the view that we ought to just nuke this test case.

From my point of view, the assertion that disabling autovacuum during
this test case would make the test case useless seems to be incorrect.
The original purpose of the test was to make sure that the pre-upgrade
schema matched the post-upgrade schema. If having autovacuum running
or not running affects that, we have a serious problem, but this test
case isn't especially likely to find it, because whether autovacuum
runs or not during the brief window where the test is running is
totally unpredictable. Furthermore, if we do have such a problem, it
would probably indicate that vacuum is using the wrong horizons to
prune or test the visibility of the tuples. To find that out, we might
want to compare values upon which the behavior of vacuum might depend,
like relfrozenxid. But to do that, we have to disable autovacuum, so
that the value can't change under us. From my point of view, that's
making test coverage better, not worse, because any bugs in this area
that can be found without explicit testing of relevant horizons are
dependent on low-probability race conditions materializing in the
buildfarm. If we disable autovacuum and then compare relfrozenxid and
whatever else we care about explicitly, we can find bugs in that
category reliably.

However, if people don't accept that argument, then this new test case
is kind of silly. It's not the worst idea in the world to use a
threshold of 100 XIDs or something, but without disabling autovacuum,
we're basically comparing two things that can't be expected to be
equal, so we test and see if they're approximately equal and then call
that good enough. I don't know that I believe we'll ever find a bug
that way, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


enable-autovacuum-later.patch
Description: Binary data


Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Robert Haas
On Thu, Aug 4, 2022 at 10:02 AM Jonathan S. Katz  wrote:
> Attached is the "band-aid / sloppy" version of the patch. Given from the
> test examples I kept seeing deltas over 100 for relfrozenxid, I chose
> 1000. The mxid delta was less, but I kept it at 1000 for consistency
> (and because I hope this test is short lived in this state), but can be
> talked into otherwise.

ISTM that you'd need to loop over the rows and do this for each row.
Otherwise I think you're just comparing results for the first relation
and ignoring all the rest.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 8/3/22 4:19 PM, Tom Lane wrote:
>> I like the idea of txid_current(), but we have no comparable
>> function for mxid do we?  While you could get both numbers from
>> pg_control_checkpoint(), I doubt that's sufficiently up-to-date.

> ...unless we force a checkpoint in the test?

Hmm ... maybe if you take a snapshot and hold that open while
forcing the checkpoint and doing the subsequent checks.  That
seems messy though.  Also, while that should serve to hold back
global xmin, I'm not at all clear on whether that has a similar
effect on minmxid.
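
For reference, a rough sketch of the shape this could take in the test's SQL
(whether the just-checkpointed values are fresh enough, and whether a held
snapshot also pins minmxid, is exactly what's in question here):

    CHECKPOINT;
    SELECT next_xid, next_multixact_id FROM pg_control_checkpoint();
    SELECT pg_current_xact_id();   -- txid_current()'s equivalent in v13+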

regards, tom lane




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-04 Thread Noah Misch
On Thu, Aug 04, 2022 at 06:32:38AM -0400, David Steele wrote:
> On 8/4/22 04:06, Kyotaro Horiguchi wrote:
> >At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch  wrote in
> I think in this case a HINT might be sufficient to at least keep people from
> wasting time tracking down a problem that has already been fixed.
> >>
> >>Here's a patch to add that HINT.  I figure it's better to do this before next
> >>week's minor releases.  In the absence of objections, I'll push this around
> >>2022-08-05 14:00 UTC.
> >
> >+1
> 
> Looks good to me as well.

Thanks for reviewing.

> However, there is another issue [1] that might argue for a back patch if
> this patch (as I believe) would fix the issue.
> >>>
> [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> >>>
> >>>That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> >>>chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> >>>WAL directory grows without bound.  Is that roughly the theory in mind?
> >>
> >>On further reflection, I don't expect it to happen that way.  Each failure
> >>message is LOG-level, so the remaining recycles still happen.  (The race
> >>condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> >>already done at that point.)
> >
> >I agree to this.
> 
> Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
> without bound. Even if we are not sure what is causing it, how confident are
> we that the patches applied to v15 would fix it?

I'll say 7%; certainly too low to stop investigating.  The next things I'd want
to see are:

- select name, setting, source from pg_settings where setting <> boot_val;
- log_checkpoints log entries, and other log entries having the same PID

If the theory about checkpoint skipping holds, there should be a period where
the volume of WAL replayed greatly exceeds max_wal_size, yet 0-1 restartpoints
begin and 0 restartpoints complete.




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Tom Lane
Robert Haas  writes:
> 100 << 2^32, so it's not terrible, but I'm honestly coming around to
> the view that we ought to just nuke this test case.

I'd hesitated to suggest that, but I think that's a fine solution.
Especially since we can always put it back in later if we think
of a more robust way.

regards, tom lane




Re: [doc] fix a potential grammer mistake

2022-08-04 Thread Daniel Gustafsson
> On 4 Aug 2022, at 00:44, Junwang Zhao  wrote:

> Attachment is a patch with the "just" removed.

I think this is a change for better, so I've pushed it.  Thanks for the
contribution!

--
Daniel Gustafsson   https://vmware.com/





Re: Question about user/database-level parameters

2022-08-04 Thread Japin Li


On Thu, 04 Aug 2022 at 19:29, Daniel Verite  wrote:
>   Japin Li wrote:
>
>> However, if the database is in production, we cannot go into single-user
>> mode, should we provide an option to change this behavior on the fly?
>
> It already exists, through PGOPTIONS, which appears to work
> for local_preload_libraries, in a quick test.
>
> That is, you can log in by invoking psql with:
> PGOPTIONS="-c local_preload_libraries="
> and issue the ALTER USER to reset things back to normal
> without stopping the instance.

Oh, great! Thank you very much!

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Use fadvise in wal replay

2022-08-04 Thread Andrey Borodin



> On 18 Jul 2022, at 22:55, Robert Haas  wrote:
> 
> On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak  wrote:
>> Cool. As for GUC I'm afraid there's going to be resistance of adding yet 
>> another GUC (to avoid many knobs). Ideally it would be nice if we had some 
>> advanced/deep/hidden parameters , but there isn't such thing.
>> Maybe another option would be to use (N * maintenance_io_concurrency * 
>> XLOG_BLCKSZ), so N=1 that's 80kB and N=2 160kB (pretty close to default 
>> value, and still can be tweaked by enduser). Let's wait what others say?
> 
> I don't think adding more parameters is a problem intrinsically. A
> good question to ask, though, is how the user is supposed to know what
> value they should configure. If we don't have any idea what value is
> likely to be optimal, odds are users won't either.
We know that 128KB is optimal on some representative configuration and that
changing the value won't really affect performance much. 128KB is marginally
better than 8KB and removes some theoretical concerns about extra syscalls.

> It's not very clear to me that we have any kind of agreement on what
> the basic approach should be here, though.

Actually, the only question is the offset from the current read position: should
it be 1 block or the full readahead chunk. Again, this does not change much; it
is just a matter of choice.


Thanks!

Best regards, Andrey Borodin.



Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Robert Haas
On Thu, Aug 4, 2022 at 10:26 AM Tom Lane  wrote:
> Robert Haas  writes:
> > 100 << 2^32, so it's not terrible, but I'm honestly coming around to
> > the view that we ought to just nuke this test case.
>
> I'd hesitated to suggest that, but I think that's a fine solution.
> Especially since we can always put it back in later if we think
> of a more robust way.

IMHO it's 100% clear how to make it robust. If you want to check that
two values are the same, you can't let one of them be overwritten by
an unrelated event in the middle of the check. There are many specific
things we could do here, a few of which I proposed in my previous
email, but they all boil down to "don't let autovacuum screw up the
results".

But if you don't want to do that, and you also don't want to have
random failures, the only alternatives are weakening the check and
removing the test. It's kind of hard to say which is better, but I'm
inclined to think that if we just weaken the test we're going to think
we've got coverage for this kind of problem when we really don't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Andres Freund
Hi,

On 2022-08-04 12:43:49 -0400, Robert Haas wrote:
> On Thu, Aug 4, 2022 at 10:26 AM Tom Lane  wrote:
> > Robert Haas  writes:
> > > 100 << 2^32, so it's not terrible, but I'm honestly coming around to
> > > the view that we ought to just nuke this test case.
> >
> > I'd hesitated to suggest that, but I think that's a fine solution.
> > Especially since we can always put it back in later if we think
> > of a more robust way.
> 
> IMHO it's 100% clear how to make it robust. If you want to check that
> two values are the same, you can't let one of them be overwritten by
> an unrelated event in the middle of the check. There are many specific
> things we could do here, a few of which I proposed in my previous
> email, but they all boil down to "don't let autovacuum screw up the
> results".
>
> But if you don't want to do that, and you also don't want to have
> random failures, the only alternatives are weakening the check and
> removing the test. It's kind of hard to say which is better, but I'm
> inclined to think that if we just weaken the test we're going to think
> we've got coverage for this kind of problem when we really don't.

Why do you think it's better to not have the test than to have a very limited
amount of fuzziness (by using the next xid as an upper limit)? What's the bug
that will reliably pass the nextxid fuzzy comparison, but not an exact
comparison?

Greetings,

Andres Freund




Re: Patch to address creation of PgStat* contexts with null parent context

2022-08-04 Thread Reid Thompson
On Fri, 2022-07-29 at 11:53 +0900, Kyotaro Horiguchi wrote:
> 
> That makes the memorycontext-tree structure unstable because
> CacheMemoryContext can be created on-the-fly.
> 
> Honestly I don't like to call CreateCacheMemoryContext in the two
> functions on-the-fly.  Since every process that calls
> pgstat_initialize() necessarily calls pgstat_setup_memcxt() at latest
> at process termination, I think we can create at least
> CacheMemoryContext in pgstat_initialize(). 

Attached is a patch that creates CacheMemoryContext in pgstat_initialize()
rather than at the two locations used in the previous patch.

> Or couldn't we create the
> all three contexts in the function, instead of calling
> pgstat_setup_memcxt() on-the fly?

You note that pgstat_setup_memcxt() is called at the latest at process
termination -- was the intent to hold off on requesting memory for these
two contexts until it is needed?

> regards.
> 
> -- 
> Kyotaro Horiguchi
> NTT Open Source Software Center


-- 
Reid Thompson
Senior Software Engineer
Crunchy Data, Inc.

reid.thomp...@crunchydata.com
www.crunchydata.com



diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index ffd096405e..b5b69ead2b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -532,6 +532,8 @@ pgstat_initialize(void)
 	/* Set up a process-exit hook to clean up */
 	before_shmem_exit(pgstat_shutdown_hook, 0);
 
+	CreateCacheMemoryContext();
+
 #ifdef USE_ASSERT_CHECKING
 	pgstat_is_initialized = true;
 #endif


Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Tom Lane
Robert Haas  writes:
> IMHO it's 100% clear how to make it robust. If you want to check that
> two values are the same, you can't let one of them be overwritten by
> an unrelated event in the middle of the check. There are many specific
> things we could do here, a few of which I proposed in my previous
> email, but they all boil down to "don't let autovacuum screw up the
> results".

It doesn't really matter how robust a test case is, if it isn't testing
the thing you need to have tested.  So I remain unwilling to disable
autovac in a way that won't match real-world usage.

Note that the patch you proposed at [1] will not fix anything.
It turns off autovac in the new node, but the buildfarm failures
we've seen appear to be due to autovac running on the old node.
(I believe that autovac in the new node is *also* a hazard, but
it seems to be a lot less of one, presumably because of timing
considerations.)  To make it work, we'd have to shut off autovac
in the old node before starting pg_upgrade, and that would make it
unacceptably (IMHO) different from what real users will do.

Conceivably, we could move all of this processing into pg_upgrade
itself --- autovac disable/re-enable and capturing of the horizon
data --- and that would address my complaint.  I don't really want
to go there though, especially when in the final analysis IT IS
NOT A BUG if a rel's horizons advance a bit during pg_upgrade.
It's only a bug if they become inconsistent with the rel's data,
which is not what this test is testing for.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoZkBcMi%2BNikxfc54dgkWj41Q%3DZ4nuyHpheTcxA-qfS5Qg%40mail.gmail.com




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 9:44 AM Robert Haas  wrote:
> But if you don't want to do that, and you also don't want to have
> random failures, the only alternatives are weakening the check and
> removing the test. It's kind of hard to say which is better, but I'm
> inclined to think that if we just weaken the test we're going to think
> we've got coverage for this kind of problem when we really don't.

Perhaps amcheck's verify_heapam() function can be used here. What
could be better than exhaustively verifying that the relfrozenxid (and
relminmxid) invariants hold for every single tuple in the table? Those
are the exact conditions that we care about, as far as
relfrozenxid/relminmxid goes.

My sense is that that has a much better chance of detecting a real bug
at some point. This approach is arguably an example of property-based
testing.
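
For illustration, roughly what that could look like (a sketch, assuming
contrib/amcheck is installed in the upgraded cluster, with pg_largeobject
used here just as an example target; zero rows returned means no invariant
violations were found):

    CREATE EXTENSION IF NOT EXISTS amcheck;
    SELECT * FROM verify_heapam('pg_largeobject');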

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Tom Lane
Peter Geoghegan  writes:
> Perhaps amcheck's verify_heapam() function can be used here. What
> could be better than exhaustively verifying that the relfrozenxid (and
> relminmxid) invariants hold for every single tuple in the table?

How much will that add to the test's runtime?  I could get behind this
idea if it's not exorbitantly expensive.

regards, tom lane




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 11:07 AM Tom Lane  wrote:
> How much will that add to the test's runtime?  I could get behind this
> idea if it's not exorbitantly expensive.

I'm not sure offhand, but I suspect it wouldn't be too bad.

-- 
Peter Geoghegan




Re: [PATCH] CF app: add "Returned: Needs more interest"

2022-08-04 Thread Jacob Champion
Hi Andres,

My intention had not quite been for this to be a referendum on the
decision for every patch -- we can do that if it helps, but I don't
think we necessarily have to have unanimity on the bucketing for every
patch in order for the new state to be useful.

On 8/3/22 12:46, Andres Freund wrote:
>> - https://commitfest.postgresql.org/38/2482/
> 
> Hm - "Returned: Needs more interest" doesn't seem like it'd have been more
> descriptive? It was split off a patchset that was committed at the tail end of
> 15 (and which still has *severe* code quality issues). Imo having a CF entry
> before the rest of the jsonpath stuff made it in doesn't seem like a good
> idea
There were no comments about code quality issues on the thread that I
can see, and there were three people who independently said "I don't
know why this isn't getting review." Seems like a shoo-in for "needs
more interest".

>> - https://commitfest.postgresql.org/38/3338/
> 
> Here it'd have fit.

Okay. That's one.

>> - https://commitfest.postgresql.org/38/3181/
> 
> FWIW, I mentioned at least once that I didn't think this was worth pursuing.

(I don't see that comment on that thread? You mentioned it needed a rebase.)

IMO, mentioning that something is not worth pursuing is not actionable
feedback. It's a declaration of non-interest in the mildest case, and a
Rejection in the strongest case. But let's please not say "meh" and then
Return with Feedback; an author can't do anything with that.

>> - https://commitfest.postgresql.org/38/2918/
> 
> Hm, certainly not a lot of review activity.

That's two.

>> - https://commitfest.postgresql.org/38/2710/
> 
> A good bit of this was committed in some form with a decent amount of review
> activity for a while.

But then the rest of it stalled. Something has to be done with the open
entry.

>> - https://commitfest.postgresql.org/38/2266/ (this one was particularly
>> miscommunicated during the first RwF)
> 
> I'd say misunderstanding than miscommunication...

The CFM sending it said, "It seems there has been no activity since last
version of the patch so I don't think RwF is correct" [1], and then the
email sent said "you are encouraged to send a new patch [...] with the
suggested changes." But there were no suggested changes left to make.

This really highlights, for me, why the two states should not be
combined into one.

> It seems partially stalled due to the potential better approach based on
> https://www.postgresql.org/message-id/flat/15848.1576515643%40sss.pgh.pa.us ?
> In which case RwF doesn't seem to inappropriate.

Those comments are, as far as I can tell, not in the thread. (And the
new thread you linked is also stalled.)

>> - https://commitfest.postgresql.org/38/2218/
> 
> Yep.

That's three.

>> - https://commitfest.postgresql.org/38/3256/
> 
> Yep.

That's four.

>> - https://commitfest.postgresql.org/38/3310/
> 
> I don't really understand why this has been RwF'd, doesn't seem that long
> since the last review leading to changes.

Eight months without feedback, when we expect authors to turn around a
patch in two weeks or less to avoid being RwF'd, is a long time IMHO. I
don't think a patch should sit motionless in CF for eight months; it's
not at all fair to the author.

>> - https://commitfest.postgresql.org/38/3050/
> 
> Given that a non-author did a revision of the patch, listed a number of TODO
> items and said "I'll create regression tests firstly." - I don't think "lacks
> interest" would have been appropriate, and RwF is?

That was six months ago, and prior to that there was another six month
silence. I'd say that lacks interest, and I don't feel like it's
currently reviewable in CF.

>> (Even if they'd all received skeptical feedback, if the author replies in
>> good faith and is met with silence for months, we need to not keep stringing
>> them along.)
> 
> I agree very much with that - just am doubtful that "lacks interest" is a good
> way of dealing with it, unless we just want to treat it as a nicer sounding
> "rejected".
Tom summed up my position well: there's a difference between those two
that is both meaningful and actionable for contributors. Is there an
alternative you'd prefer?

Thanks for the discussion!
--Jacob

[1] https://www.postgresql.org/message-id/20211004071249.GA6304%40ahch-to





Re: Clarifying Commitfest policies

2022-08-04 Thread Jacob Champion
On Wed, Aug 3, 2022 at 2:05 PM Matthias van de Meent
 wrote:
> On Wed, 3 Aug 2022 at 20:04, Jacob Champion  wrote:
> > Is that enough, or should we do more?
>
> "The CF Checklist" seems to refer to only the page that is (or seems
> to be) intended for the CFM only. We should probably also update the
> pages of "Commitfest", "Submitting a patch", "Reviewing a Patch", "So,
> you want to be a developer?", and the "Developer FAQ" page, which
> doesn't have to be more than removing outdated information and
> refering to any (new) documentation on how to participate in the
> PostgreSQL development and/or Commitfest workflow as a non-CFM.

Agreed, a sweep of those materials would be helpful as well. I'm
personally focused on CFM tasks, since it's fresh in my brain and
documentation is almost non-existent for it, but if you have ideas for
those areas, I definitely don't want to shut down that line of the
conversation.

> Additionally, we might want to add extra text to the "developers"
> section of the main website [0] to refer to any new documentation.
> This suggestion does depend on whether the new documentation has a
> high value for potential community members.

Right. So what kinds of info do we want to highlight in this
documentation, to make it high-quality?

Drawing from some of the questions I've seen recently, we could talk about
- CF "power" structure (perhaps simply to highlight that the CFM has
no additional authority to get patches in)
- the back-and-forth process on the mailing list, maybe including
expected response times
- what to do when a patch is returned (or rejected)

> As an example, the GitHub mirror of the main PostgreSQL repository
> receives a decent amount of pull request traffic. When a project has a
> CONTRIBUTING.md file at the top level, people writing the pull request
> message will be pointed to those contributing guidelines. This could

(I think some text got cut here.)

The mirror bot will point you to the "So, you want to be a developer?"
wiki when you open a PR, but I agree that a CONTRIBUTING doc would
help prevent that small embarrassment.

--Jacob




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Robert Haas
On Thu, Aug 4, 2022 at 1:49 PM Tom Lane  wrote:
> Note that the patch you proposed at [1] will not fix anything.
> It turns off autovac in the new node, but the buildfarm failures
> we've seen appear to be due to autovac running on the old node.
> (I believe that autovac in the new node is *also* a hazard, but
> it seems to be a lot less of one, presumably because of timing
> considerations.)  To make it work, we'd have to shut off autovac
> in the old node before starting pg_upgrade,

Yeah, that's a fair point.

> and that would make it
> unacceptably (IMHO) different from what real users will do.

I don't agree with that, but as you say, it is a matter of opinion. In
any case, what exactly do you want to do now?

Jonathan Katz has proposed a patch to do the fuzzy comparison, which I
believe to be incorrect because I think it compares, at most, the
horizons for one table in the database.

I could go work on a better version of that, or he could, or you
could, but it seems like we're running out of time awfully quick here,
given that you wanted to have this resolved today and it's almost the
end of today.

I think the most practical alternative is to put this file back to the
way it was before I started tinkering with it, and revisit this issue
after the release. If you want to do something else, that's fine, but
I'm not going to be available to work on this issue over the weekend,
so if you want to do something else, you or someone else is going to
have to take responsibility for whatever further stabilization that
other approach may require between now and the release.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Tom Lane
Robert Haas  writes:
> I think the most practical alternative is to put this file back to the
> way it was before I started tinkering with it, and revisit this issue
> after the release.

Yeah, that seems like the right thing.  We are running too low on time
to have any confidence that a modified version of the test will be
reliable.

regards, tom lane




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Robert Haas
On Thu, Aug 4, 2022 at 12:59 PM Andres Freund  wrote:
> Why do you think it's better to not have the test than to have a very limited
> amount of fuzziness (by using the next xid as an upper limit)? What's the bug
> that will reliably pass the nextxid fuzzy comparison, but not an exact
> comparison?

I don't know. I mean, I guess one possibility is that the nextXid
value could be wrong too, because I doubt we have any separate test
that would catch that. But more generally, I don't have a lot of
confidence in fuzzy tests. It's too easy for things to look like
they're working when they really aren't.

Let's say that the value in the old cluster was 100 and the nextXid in
the new cluster is 1000. Well, it's not like every value between 100
and 1000 is OK. The overwhelming majority of those values could never
be produced, neither from the old cluster nor from any subsequent
vacuum. Given that the old cluster is suffering no new write
transactions, there's probably exactly two values that are legal: one
being the value from the old cluster, which we know, and the other
being whatever a vacuum of that table would produce, which we don't
know, although we do know that it's somewhere in that range. Let's
flip the question on its head: why should some hypothetical future bug
that we have in this area produce a value outside that range?
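
For concreteness, the values in question are roughly what a simplified
form of the test's horizon query returns for each relation on each cluster:

    SELECT c.oid::regclass, c.relfrozenxid, c.relminmxid
    FROM pg_class c
    WHERE c.relkind IN ('r', 'm', 't')   -- tables, matviews, toast tables
    ORDER BY c.oid::regclass::text;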

If it's a failure to set the value at all, or if it generates a value
at random, we'd likely still catch it. And those are pretty likely, so
maybe the value of such a test is not zero. On the other hand, subtle
breakage might be more likely to survive developer testing than
complete breakage.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 12:15 PM Robert Haas  wrote:
> Given that the old cluster is suffering no new write
> transactions, there's probably exactly two values that are legal: one
> being the value from the old cluster, which we know, and the other
> being whatever a vacuum of that table would produce, which we don't
> know, although we do know that it's somewhere in that range.

What about autoanalyze?

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Robert Haas
On Thu, Aug 4, 2022 at 3:23 PM Peter Geoghegan  wrote:
> On Thu, Aug 4, 2022 at 12:15 PM Robert Haas  wrote:
> > Given that the old cluster is suffering no new write
> > transactions, there's probably exactly two values that are legal: one
> > being the value from the old cluster, which we know, and the other
> > being whatever a vacuum of that table would produce, which we don't
> > know, although we do know that it's somewhere in that range.
>
> What about autoanalyze?

What about it?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Robert Haas
On Thu, Aug 4, 2022 at 3:10 PM Tom Lane  wrote:
> Robert Haas  writes:
> > I think the most practical alternative is to put this file back to the
> > way it was before I started tinkering with it, and revisit this issue
> > after the release.
>
> Yeah, that seems like the right thing.  We are running too low on time
> to have any confidence that a modified version of the test will be
> reliable.

Done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 12:31 PM Robert Haas  wrote:
> > What about autoanalyze?
>
> What about it?

It has a tendency to consume an XID, here or there, quite
unpredictably. I've noticed that this often involves an analyze of
pg_statistic. Have you accounted for that?

You said upthread that you don't like "fuzzy" tests, because it's too
easy for things to look like they're working when they really aren't.
I suppose that there may be some truth to that, but ISTM that there is
also a lot to be said for a test that can catch failures that weren't
specifically anticipated. Users won't be running pg_upgrade with
autovacuum disabled. And so ISTM that just testing that relfrozenxid
has been carried forward is more precise about one particular detail
(more precise than alternative approaches to testing), but less
precise about the thing that we actually care about.

-- 
Peter Geoghegan




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Robert Haas
On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar  wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

I have reviewed this patch and I don't see a problem with it. However,
it would be nice if Andres or someone else who understands this area
well (Tom? Thomas?) would also review it, because I also reviewed
what's in the tree now and that turns out to be buggy, which leads me
to conclude that I don't understand this area as well as would be
desirable.

I'm inclined to hold off on committing this until next week, not only
for that reason, but also because there's a wrap planned on Monday,
and committing anything now seems like it might have too much of a
risk of upsetting that plan.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pluggable toaster

2022-08-04 Thread Nikita Malakhov
Hi hackers!

I've made a rebase according to Andres's and Aleksander's comments.
The rebased branch resides here:
https://github.com/postgrespro/postgres/tree/toasterapi_clean

I've decided to leave only the first 2 patches for review and send the less
significant changes after the main patch has been straightened out.
So, here is
v13-0001-toaster-interface.patch - the main TOAST API patch, with the
reference TOAST mechanics left as-is.
v13-0002-toaster-default.patch - the reference TOAST re-implemented via the
TOAST API.


>I'm a bit confused by the patch numbering - why isn't there a patch 0001
>and 0005?

Sorry for the confusion, my fault. The first patch (CREATE TABLE .. SET
STORAGE) was already committed into v15, and several of the later patches
weren't included. I've rearranged the patch numbers in this iteration.

>I think 0002 needs to be split further - the relevant part isn't that it
>introduces the "dummy toaster" module, it's a large change doing lots of
>things, the addition of the contrib module is irrelevant comparatively.

Done, contrib/dummy_toaster is excluded from the main patch and placed in the
branch as a separate commit.

>As is the patches unfortunately are close to unreviewable. Lots of code
gets
>moved around in one patch, then again in the next patch, then again in the
>next.

So I've decided to put here only the first one while I'm working on the later
ones to clean them up - I agree, the code in the later patches needs some
refactoring. Working on it.

>Unfortunately, scanning through these patches, it seems this is a lot of
>complexity, with a (for me) commensurate benefit. There's a lot of more
>general improvements to toasting and the json type that we can do, that are
>a lot less complex than this.

We have very significant improvements for storing large JSON and a couple of
other TOAST improvements which make a lot of sense, but they are based on
this API. In the first patch, however, the reference TOAST is left as-is and
does not use the TOAST API.

>> 2) New VARATT_CUSTOM data structure with fixed header and variable
>> tail to store custom toasted data, with according macros set;

>That's adding overhead to every toast interaction, independent of any new
>infrastructure being used.

We've performed some tests on this and haven't detected significant
overhead.

>So we're increasing pg_attribute - often already the largest catalog table
>in a database.

A little bit, with an OID column storing the Toaster OID. We do not see any
other way to keep track of the Toaster used by a table's column, because it
could be changed at any time by ALTER TABLE ... SET TOASTER.

>Am I just missing something, or is atttoaster not actually used in this
>patch?  So most of the contrib module added is unreachable code?

It is necessary for Toasters implemented via the TOAST API; the first patch
does not use it directly because the reference TOAST is left unchanged. The
second one, which implements the reference TOAST via the TOAST API, uses it.

>That seems not great.

About Toaster deletion - we forbid dropping Toasters because if a Toaster is
dropped, the data TOASTed with it is lost. And, as was mentioned before, we
think that there won't be a lot of custom Toasters - likely fewer than a dozen.

>That moves the whole list around! On a cache hit. This would likely
>already be slower than syscache.

Thank you for the remark, it is a questionable approach. I've changed this in
the current iteration (patch attached) so that the Toaster list is append-only
- a Toaster is appended only if it was not found - and the Toaster cache is
kept as a straight list, where the first element in is the head of the list.

Also, documentation on the TOAST API is provided in README.toastapi in the
first patch - I'd be grateful for comments on it.

Thanks for the feedback!

On Tue, Aug 2, 2022 at 6:37 PM Andres Freund  wrote:

> Hi,
>
> On 2022-08-02 09:15:12 +0300, Nikita Malakhov wrote:
> > Attach includes:
> > v11-0002-toaster-interface.patch - contains TOAST API with default
> Toaster
> > left as-is (reference implementation) and Dummy toaster as an example
> > (will be removed later as a part of refactoring?).
> >
> > v11-0003-toaster-default.patch - implements reference TOAST as Default
> > Toaster
> > via TOAST API, so Heap AM calls Toast only via API, and does not have
> direct
> > calls to Toast functionality.
> >
> > v11-0004-toaster-snapshot.patch - supports row versioning for TOASTed
> values
> > and some refactoring.
>
> I'm a bit confused by the patch numbering - why isn't there a patch 0001
> and
> 0005?
>
> I think 0002 needs to be split further - the relevant part isn't that it
> introduces the "dummy toaster" module, it's a large change doing lots of
> things, the addition of the contrib module is irrelevant comparatively.
>
> As is the patches unfortunately are close to unreviewable. Lots of code
> gets
> moved around in one patch, then again in the next patch, then again in the
> next.
>
>
> Unfortunately, scanning through these patches, it seems this is a lot of
> complexity, with a (for me) commensurate benefit. There's a lot of mo

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Justin Pryzby
On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote:
> I'm inclined to hold off on committing this until next week, not only

+1

I don't see any reason to hurry to fix problems that occur when DROP DATABASE
is interrupted.

Sorry to beat up your patches so much and for such crappy test cases^C

-- 
Justin




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane
Justin Pryzby  writes:
> On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote:
>> I'm inclined to hold off on committing this until next week, not only

> +1

+1 ... there are some other v15 open items that I don't think we'll
see fixed for beta3, either.

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund
Hi,

On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

What's the motivation behind the explicit close?


> @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char 
> *srcpath)
>   Pagepage;
>   List   *rlocatorlist = NIL;
>   LockRelId   relid;
> - Relationrel;
>   Snapshotsnapshot;
> + SMgrRelationsmgr;
>   BufferAccessStrategy bstrategy;
>  
>   /* Get pg_class relfilenumber. */
> @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char 
> *srcpath)
>   rlocator.dbOid = dbid;
>   rlocator.relNumber = relfilenumber;
>  
> - /*
> -  * We can't use a real relcache entry for a relation in some other
> -  * database, but since we're only going to access the fields related to
> -  * physical storage, a fake one is good enough. If we didn't do this and
> -  * used the smgr layer directly, we would have to worry about
> -  * invalidations.
> -  */
> - rel = CreateFakeRelcacheEntry(rlocator);
> - nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
> - FreeFakeRelcacheEntry(rel);
> + smgr = smgropen(rlocator, InvalidBackendId);
> + nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
> + smgrclose(smgr);

Why are you opening and then closing again? Part of the motivation for the
question is that a local SMgrRelation variable may lead to it being used
further, opening up interrupt processing issues.


> + rlocator.locator = src_rlocator;
> + smgrcloserellocator(rlocator);
> +
> + rlocator.locator = dst_rlocator;
> + smgrcloserellocator(rlocator);

As mentioned above, it's not clear to me why this is now done...

Otherwise looks good to me.

Greetings,

Andres Freund




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund
Hi,

On 2022-08-04 16:07:01 -0400, Robert Haas wrote:
> On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar  wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
> 
> I have reviewed this patch and I don't see a problem with it. However,
> it would be nice if Andres or someone else who understands this area
> well (Tom? Thomas?) would also review it, because I also reviewed
> what's in the tree now and that turns out to be buggy, which leads me
> to conclude that I don't understand this area as well as would be
> desirable.

I don't think this issue is something I'd have caught "originally"
either. It's probably more of a "fake relcache entry" issue (or undocumented
limitation) than a bug in the new code.

I did a quick review upthread - some minor quibbles aside, I think it looks
good.


> I'm inclined to hold off on committing this until next week, not only
> for that reason, but also because there's a wrap planned on Monday,
> and committing anything now seems like it might have too much of a
> risk of upsetting that plan.

Makes sense. Unlikely to be a blocker for anybody.

Greetings,

Andres Freund




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund
Hi,

On 2022-08-04 16:38:35 +0530, Dilip Kumar wrote:
> So basically, from this we can say it is completely a problem with dropping
> databases; I mean, I can produce any behavior by interrupting DROP DATABASE:
> 1. If we created some tables/inserted data and the drop database got
> cancelled, it might still have a database directory and those objects are
> lost.
> 2.  Or you can even drop the database directory and then get cancelled
> before deleting the pg_database entry; then also you will end up with a
> corrupted database - it doesn't matter whether you created it with WAL_LOG
> or FILE_COPY.

Yea. I think at the very least we need to start holding interrupts before the
DropDatabaseBuffers() and do so until commit. That's probably best done by
doing the transaction commit inside dropdb.

Greetings,

Andres Freund




Re: [PATCH] CF app: add "Returned: Needs more interest"

2022-08-04 Thread Andres Freund
Hi,

On 2022-08-04 11:19:28 -0700, Jacob Champion wrote:
> My intention had not quite been for this to be a referendum on the
> decision for every patch -- we can do that if it helps, but I don't
> think we necessarily have to have unanimity on the bucketing for every
> patch in order for the new state to be useful.

Sorry, I should have been clearer. It wasn't mine either! I was just trying to
understand what you see as the usecase / get a better feel for it. I'm now a
bit more convinced it's useful than before.


> >> - https://commitfest.postgresql.org/38/3310/
> > 
> > I don't really understand why this has been RwF'd, doesn't seem that long
> > since the last review leading to changes.
> 
> Eight months without feedback, when we expect authors to turn around a
> patch in two weeks or less to avoid being RwF'd, is a long time IMHO.

Why is it better to mark it as "lacks interest" than RwF if there actually
*has* been feedback?


> I don't think a patch should sit motionless in CF for eight months; it's not
> at all fair to the author.

It's not great, I agree, but wishes don't conjure up resources :(


> >> - https://commitfest.postgresql.org/38/3050/
> > 
> > Given that a non-author did a revision of the patch, listed a number of TODO
> > items and said "I'll create regression tests firstly." - I don't think 
> > "lacks
> > interest" would have been appropriate, and RwF is?
> 
> That was six months ago, and prior to that there was another six month
> silence. I'd say that lacks interest, and I don't feel like it's
> currently reviewable in CF.

I don't think the entry needs more review - it needs changes:
https://www.postgresql.org/message-id/CAOKkKFtc45uNFoWYOCo4St19ayxrh-_%2B4TnZtwxGZz6-3k_GSA%40mail.gmail.com
That contains quite a few things that should be changed.

A patch that has gotten feedback, but where that feedback hasn't been
processed, is pretty much the definition of RwF, no?


> >> (Even if they'd all received skeptical feedback, if the author replies in
> >> good faith and is met with silence for months, we need to not keep 
> >> stringing
> >> them along.)
> > 
> > I agree very much with that - just am doubtful that "lacks interest" is a 
> > good
> > way of dealing with it, unless we just want to treat it as a nicer sounding
> > "rejected".
>
> Tom summed up my position well: there's a difference between those two
> that is both meaningful and actionable for contributors. Is there an
> alternative you'd prefer?

I agree that "lacks interest" could be useful. But I'm wary of it becoming
just a renaming if we end up marking patches that should be RwF or rejected as
"lacks interest".

Greetings,

Andres Freund




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane
Robert Haas  writes:
> I have reviewed this patch and I don't see a problem with it. However,
> it would be nice if Andres or someone else who understands this area
> well (Tom? Thomas?) would also review it, because I also reviewed
> what's in the tree now and that turns out to be buggy, which leads me
> to conclude that I don't understand this area as well as would be
> desirable.

FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry
here, because I do not think that mechanism is meant to be used
outside of WAL replay.  However, this patch fails to remove it from
CreateAndCopyRelationData, which seems likely to be just as much
at risk.

The "invalidation" comment bothered me for awhile, but I think it's
fine: we know that no other backend can connect to the source DB
because we have it locked, and we know that no other backend can
connect to the destination DB because it doesn't exist yet according
to the catalogs, so nothing could possibly occur to invalidate our
idea of where the physical files are.  It would be nice to document
these assumptions, though, rather than merely remove all the relevant
commentary.

While I'm at it, I would like to strenuously object to the current
framing of CreateAndCopyRelationData as a general-purpose copying
mechanism.  Because of the above assumptions, I think it's utterly
unsafe to use anywhere except in CREATE DATABASE.  The header comment
fails to warn about that at all, and placing it in bufmgr.c rather
than static in dbcommands.c is just an invitation to future misuse.
Perhaps I'm overly sensitive to that because I just finished cleaning
up somebody's misuse of non-general-purpose code (1aa8dad41), but
as this stands I think it's positively dangerous.

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane
Andres Freund  writes:
> Yea. I think at the very least we need to start holding interrupts before the
> DropDatabaseBuffers() and do so until commit. That's probably best done by
> doing the transaction commit inside dropdb.

We've talked before about ignoring interrupts across commit, but
I find the idea a bit scary.  In any case, DROP DATABASE is far
from the only place with a problem.

regards, tom lane




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-04 Thread Nathan Bossart
On Thu, Aug 04, 2022 at 02:58:14PM +0700, John Naylor wrote:
> Were you considering adding the new function to simd.h now that that's
> committed? It's a bit up in the air what should go in there, but this new
> function is low-level and generic enough to be a candidate...

I don't have a strong opinion.  I went with a separate file because I
envisioned a variety of possible linear search functions (e.g., char,
uint16, uint32), and some might not use SIMD instructions.  Furthermore, it
seemed less obvious to look in simd.h for linear search functions.  That
being said, it might make sense to just add it here for now.

> I wonder if the "pg_" prefix is appropriate here, as that is most often
> used for things that hide specific details *and* where the base name would
> clash, like OS calls or C library functions. I'm not quite sure where the
> line is drawn, but I mean that "linearsearch" is a generic algorithm and
> not a specific API we are implementing, if that makes sense.

Yeah, I was concerned about clashing with lsearch() and lfind().  I will
drop the prefix.

> The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(),
> pg_popcount32, etc). We'll likely want to search bytes sometime, so
> something to keep in mind as far as naming ("_int" vs "_byte"?).

How about something like lsearch32 or linearsearch32?

> I'm not a fan of "its" as a variable name, and I'm curious what it's
> intended to convey.

It's short for "iterations."  I'll spell it out completely to avoid this
kind of confusion.

> All the __m128i vars could technically be declared const, although I think
> it doesn't matter -- it's just a hint to the reader.

Will do.

> Out of curiosity do we know how much we get by loading four registers
> rather than two?

The small program I've been using for testing takes about 40% more time
with the two register approach.  The majority of this test involves
searching for elements that either don't exist in the array or that live
near the end of the array, so this is probably close to the worst case.
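
To make the four-register idea concrete, here is a minimal sketch of an
SSE2 linear search over uint32 values.  This is illustrative only -- the
function name, signature, and scalar tail are my assumptions, not the
actual patch:

#include <emmintrin.h>		/* SSE2 intrinsics */
#include <stdbool.h>
#include <stdint.h>

/* Return true if 'key' occurs in the first 'nelem' entries of 'base'. */
static bool
lfind32_sse2(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	const __m128i keys = _mm_set1_epi32((int32_t) key);	/* broadcast key */
	uint32_t	i = 0;

	/* Process 16 elements (four 128-bit registers) per iteration. */
	for (; i + 16 <= nelem; i += 16)
	{
		__m128i		v0 = _mm_loadu_si128((const __m128i *) &base[i]);
		__m128i		v1 = _mm_loadu_si128((const __m128i *) &base[i + 4]);
		__m128i		v2 = _mm_loadu_si128((const __m128i *) &base[i + 8]);
		__m128i		v3 = _mm_loadu_si128((const __m128i *) &base[i + 12]);
		__m128i		cmp;

		/* Compare every lane against the key and OR the results together. */
		cmp = _mm_or_si128(_mm_or_si128(_mm_cmpeq_epi32(keys, v0),
										_mm_cmpeq_epi32(keys, v1)),
						   _mm_or_si128(_mm_cmpeq_epi32(keys, v2),
										_mm_cmpeq_epi32(keys, v3)));
		if (_mm_movemask_epi8(cmp) != 0)
			return true;
	}

	/* Scalar tail for any leftover elements. */
	for (; i < nelem; i++)
		if (base[i] == key)
			return true;
	return false;
}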

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Justin Pryzby
On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> The "invalidation" comment bothered me for awhile, but I think it's
> fine: we know that no other backend can connect to the source DB
> because we have it locked,

About that - is it any problem that the currently-connected db can be used as a
template?  It's no issue for 2-phase commit, because "create database" cannot
run in a txn.

-- 
Justin




Re: ERREUR: cache lookup failed for function 0 with PostgreSQL 15 beta 2, no error with PostgreSQL 14.4

2022-08-04 Thread David Rowley
On Fri, 5 Aug 2022 at 01:20, Phil Florent  wrote:
> with fakedata as (
>select 'hello' word
>union all
>select 'world' word
> )
> select *
> from (
>select word, count(*) over (partition by word) nb from fakedata
> ) t where nb = 1;
> ERREUR:  cache lookup failed for function 0

> A DSS developer from my company, Julien Roze, reported an error to me that I
> cannot explain. Is it a new behavior or a bug?

Thank you for the report and the minimal self-contained test case.
That's highly useful for us.

I've now committed a fix for this ([1]).  It will appear in the next
beta release for PG15.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=270eb4b5d4986534f2d522ebb19f67396d13cf44




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund
Hi,

On 2022-08-04 18:05:25 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Yea. I think at the very least we need to start holding interrupts before 
> > the
> > DropDatabaseBuffers() and do so until commit. That's probably best done by
> > doing the transaction commit inside dropdb.
> 
> We've talked before about ignoring interrupts across commit, but
> I find the idea a bit scary.

I'm not actually suggesting to do so across commit, just until the
commit. Maintaining that seems easiest if dropdb() does the commit internally.


> In any case, DROP DATABASE is far from the only place with a problem.

What other place has a database corrupting potential of this magnitude just
because interrupts are accepted?  We throw valid s_b contents away and then
accept interrupts before committing - with predictable results. We also accept
interrupts as part of deleting the db data dir (due to catalog access).

Greetings,

Andres Freund




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane
I wrote:
> While I'm at it, I would like to strenuously object to the current
> framing of CreateAndCopyRelationData as a general-purpose copying
> mechanism.

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
not completely broken?

/* Read block from source relation. */
srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
   RBM_NORMAL, bstrategy_src,
   permanent);
srcPage = BufferGetPage(srcBuf);
if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
{
ReleaseBuffer(srcBuf);
continue;
}

/* Use P_NEW to extend the destination relation. */
dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
   RBM_NORMAL, bstrategy_dst,
   permanent);

You can't skip pages just because they are empty.  Well, maybe you could
if you were doing something to ensure that you zero-fill the corresponding
blocks on the destination side.  But this isn't doing that.  It's using
P_NEW for dstBuf, which will have the effect of silently collapsing out
such pages.  Maybe in isolation a heap could withstand that, but its
indexes won't be happy (and I guess t_ctid chain links won't either).

I think you should just lose the if() stanza.  There's no optimization to
be had here that's worth any extra complication.

(This seems worth fixing before beta3, as it looks like a rather
nasty data corruption hazard.)
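
Concretely, dropping the stanza (and, to guard the P_NEW assumption,
cross-checking the destination block number) would make the copy step
look roughly like this -- a sketch only, with illustrative elog wording:

	/* Read block from source relation; do not skip empty pages. */
	srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
									   RBM_NORMAL, bstrategy_src,
									   permanent);
	srcPage = BufferGetPage(srcBuf);

	/* Use P_NEW to extend the destination relation. */
	dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
									   RBM_NORMAL, bstrategy_dst,
									   permanent);

	/* The newly allocated destination block must line up with the source. */
	if (BufferGetBlockNumber(dstBuf) != blkno)
		elog(ERROR, "block %u in destination relation does not match source block %u",
			 BufferGetBlockNumber(dstBuf), blkno);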

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane
I wrote:
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?

[pile^2]  Also, what is the rationale for locking the target buffer
but not the source buffer?  That seems pretty hard to justify from
here, even granting the assumption that we don't expect any other
processes to be interested in these buffers (which I don't grant,
because checkpointer).

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund
Hi,

On 2022-08-04 19:01:06 -0400, Tom Lane wrote:
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?
> 
> /* Read block from source relation. */
> srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
>RBM_NORMAL, bstrategy_src,
>permanent);
> srcPage = BufferGetPage(srcBuf);
> if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
> {
> ReleaseBuffer(srcBuf);
> continue;
> }
> 
> /* Use P_NEW to extend the destination relation. */
> dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
>RBM_NORMAL, bstrategy_dst,
>permanent);
> 
> You can't skip pages just because they are empty.  Well, maybe you could
> if you were doing something to ensure that you zero-fill the corresponding
> blocks on the destination side.  But this isn't doing that.  It's using
> P_NEW for dstBuf, which will have the effect of silently collapsing out
> such pages.  Maybe in isolation a heap could withstand that, but its
> indexes won't be happy (and I guess t_ctid chain links won't either).
> 
> I think you should just lose the if() stanza.  There's no optimization to
> be had here that's worth any extra complication.
> 
> (This seems worth fixing before beta3, as it looks like a rather
> nasty data corruption hazard.)

Ugh, yes. And even with this fixed I think this should grow at least an
assertion that the block numbers match, probably even an elog.

Greetings,

Andres




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane
Andres Freund  writes:
> On 2022-08-04 18:05:25 -0400, Tom Lane wrote:
>> In any case, DROP DATABASE is far from the only place with a problem.

> What other place has a database corrupting potential of this magnitude just
> because interrupts are accepted?  We throw valid s_b contents away and then
> accept interrupts before committing - with predictable results. We also accept
> interrupts as part of deleting the db data dir (due to catalog access).

Those things would be better handled by moving the data-discarding
steps to post-commit.  Maybe that argues for having an internal
commit halfway through DROP DATABASE: remove pg_database row,
commit, start new transaction, clean up.
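
As a rough outline (the shape only, not a worked-out patch):

	/* Transaction 1: catalog changes that must still be able to roll back. */
	/* ... delete the pg_database row, drop shared dependencies ... */
	CommitTransactionCommand();

	/*
	 * Transaction 2: irreversible cleanup.  An interrupt here can at worst
	 * orphan on-disk files, not corrupt a still-catalog-visible database.
	 */
	StartTransactionCommand();
	DropDatabaseBuffers(db_id);		/* 'db_id' is just an illustrative name */
	/* ... remove the database directory on disk ... */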

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane
Andres Freund  writes:
> On 2022-08-04 19:01:06 -0400, Tom Lane wrote:
>> (This seems worth fixing before beta3, as it looks like a rather
>> nasty data corruption hazard.)

> Ugh, yes. And even with this fixed I think this should grow at least an
> assertion that the block numbers match, probably even an elog.

Yeah, the assumption that P_NEW would automatically match the source block
was making me itchy too.  An explicit test-and-elog seems worth the
cycles.

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund
Hi, 

On August 4, 2022 4:11:13 PM PDT, Tom Lane  wrote:
>I wrote:
>> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
>> not completely broken?
>
>[pile^2]  Also, what is the rationale for locking the target buffer
>but not the source buffer?  That seems pretty hard to justify from
>here, even granting the assumption that we don't expect any other
>processes to be interested in these buffers (which I don't grant,
>because checkpointer).

I'm not arguing it's good or should stay that way, but it's probably okayish 
that checkpointer / bgwriter have access, given that they will never modify 
buffers. They just take a lock to prevent concurrent modifications, which 
RelationCopyStorageUsingBuffer hopefully doesn't do. 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Andres Freund
Hi, 

On August 4, 2022 4:20:16 PM PDT, Tom Lane  wrote:
>Yeah, the assumption that P_NEW would automatically match the source block
>was making me itchy too.  An explicit test-and-elog seems worth the
>cycles.

Is there a good reason to rely on P_NEW at all? Both from an efficiency and 
robustness POV it seems like it'd be better to use smgrextend to bulk extend 
just after smgrcreate() and then fill s_b up using RBM_ZERO (or whatever it is 
called).  That bulk smgrextend would later be a good point to use fallocate so 
the FS can immediately size the file correctly, without a lot of pointless 
writes of zeroes. 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane
Andres Freund  writes:
> On August 4, 2022 4:11:13 PM PDT, Tom Lane  wrote:
>> [pile^2]  Also, what is the rationale for locking the target buffer
>> but not the source buffer?  That seems pretty hard to justify from
>> here, even granting the assumption that we don't expect any other
>> processes to be interested in these buffers (which I don't grant,
>> because checkpointer).

> I'm not arguing it's good or should stay that way, but it's probably okayish 
> that checkpointer / bgwriter have access, given that they will never modify 
> buffers. They just take a lock to prevent concurrent modifications, which 
> RelationCopyStorageUsingBuffer hopefully doesn't do. 

I'm not arguing that it's actively broken today --- but AFAIR,
every other access to a shared buffer takes a buffer lock.
It does not seem to me to be very future-proof for this code to
decide it's exempt from that rule, without so much as a comment
justifying it.  Furthermore, what's the gain?  We aren't expecting
contention here, I think.  If we were, then it probably *would* be
actively broken.

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Tom Lane
Andres Freund  writes:
> Is there a good reason to rely on P_NEW at all? Both from an efficiency
> and robustness POV it seems like it'd be better to use smgrextend to
> bulk extend just after smgrcreate() and then fill s_b up using RBM_ZERO
> (or whatever it is called).  That bulk smgrextend would later be a good
> point to use fallocate so the FS can immediately size the file
> correctly, without a lot of pointless writes of zeroes.

Hmm ... makes sense.  We'd be using smgrextend to write just the last page
of the file, relying on its API spec "Note that we assume writing a block
beyond current EOF causes intervening file space to become filled with
zeroes".  I don't know that we're using that assumption aggressively
today, but as long as it doesn't confuse the kernel it'd probably be fine.

regards, tom lane




Re: annoyance with .git-blame-ignore-revs

2022-08-04 Thread Peter Geoghegan
On Mon, Jul 11, 2022 at 12:30 PM Alvaro Herrera  wrote:
> $ git blame configure
> fatal: could not open object name list: .git-blame-ignore-revs
>
> My first workaround was to add empty .git-blame-ignore-revs in all
> checkouts.  This was moderately ok (shrug), until after a recent `tig`
> upgrade the empty file started to show up in the history as an untracked
> file.

Ping? Would be nice to get this done soon. I don't think that it
requires a great deal of care. If I was doing this myself, I would
probably make sure that the backbranch copies of the file won't
reference commits from later releases. But even that probably doesn't
matter; just backpatching the file from HEAD as-is wouldn't break
anybody's workflow.

Again, to reiterate: I see no reason to do anything on the
backbranches here more than once.

I mentioned already that somebody proposed a patch that fixes the
problem at the git level, which seems to have stalled. Here is the
discussion:

https://public-inbox.org/git/xmqq5ywehb69.fsf@gitster.g/T/

ISTM that we're working around what is actually a usability problem
with git (imagine that!). I think that that's fine. Just thought that
it was worth acknowledging it as such. We're certainly not the first
people to run into this exact annoyance.

-- 
Peter Geoghegan




Re: Allow file inclusion in pg_hba and pg_ident files

2022-08-04 Thread Michael Paquier
On Tue, Aug 02, 2022 at 07:32:54PM +0900, Michael Paquier wrote:
> As a quick update from my side, I intend to look and apply 0001~0003
> (not double-checked yet) shortly.

And a couple of days later, these look fine so done as of 47ab1ac and
718fe0a.  0002 and 0003 have been merged together.
--
Michael




RE: Introduce wait_for_subscription_sync for TAP tests

2022-08-04 Thread shiy.f...@fujitsu.com
On Thu, Aug 4, 2022 5:49 PM shiy.f...@fujitsu.com  wrote:
> 
> On Thu, Aug 4, 2022 2:28 PM Masahiko Sawada 
> wrote:
> >
> > On Thu, Aug 4, 2022 at 10:37 AM Amit Kapila 
> > wrote:
> > >
> > > On Wed, Aug 3, 2022 at 10:21 AM Amit Kapila
> 
> > wrote:
> > > >
> > > > Pushed this one and now I'll look at your other patch.
> > > >
> > >
> > > I have pushed the second patch as well after making minor changes in
> > > the comments. Alvaro [1] and Tom [2] suggest to back-patch this and
> > > they sound reasonable to me. Will you be able to produce back branch
> > > patches?
> >
> > Yes. I've attached patches for backbranches. The updates are
> > straightforward on v11 - v15. However, on v10, we don't use
> > wait_for_catchup() in some logical replication test cases. The commit
> > bbd3363e128dae refactored the tests to use wait_for_catchup but it's
> > not backpatched. So in the patch for v10, I didn't change the code
> > that was changed by the commit. Also, since wait_for_catchup requires
> > to specify $target_lsn, unlike the one in v11 or later, I changed
> > wait_for_subscription_sync() accordingly.
> >
> 
> Thanks for your patches.
> 
> In the patches for pg11 ~ pg14, it looks like we need to add a "=pod" before
> the current change in PostgresNode.pm. Right?
> 

By the way, I noticed that in 002_types.pl (on the master branch), it seems the
"as well" in the following comment should be removed. Is it worth fixing?

$node_subscriber->safe_psql('postgres',
"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' 
PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)"
);

# Wait for initial sync to finish as well
$node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');

Regards,
Shi yu



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Robert Haas
On Thu, Aug 4, 2022 at 6:02 PM Tom Lane  wrote:
> Robert Haas  writes:
> > I have reviewed this patch and I don't see a problem with it. However,
> > it would be nice if Andres or someone else who understands this area
> > well (Tom? Thomas?) would also review it, because I also reviewed
> > what's in the tree now and that turns out to be buggy, which leads me
> > to conclude that I don't understand this area as well as would be
> > desirable.
>
> FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry
> here, because I do not think that mechanism is meant to be used
> outside of WAL replay.  However, this patch fails to remove it from
> CreateAndCopyRelationData, which seems likely to be just as much
> at risk.

It looks to me like it does?

> The "invalidation" comment bothered me for awhile, but I think it's
> fine: we know that no other backend can connect to the source DB
> because we have it locked, and we know that no other backend can
> connect to the destination DB because it doesn't exist yet according
> to the catalogs, so nothing could possibly occur to invalidate our
> idea of where the physical files are.  It would be nice to document
> these assumptions, though, rather than merely remove all the relevant
> commentary.

I don't think that's the point. We could always suffer a sinval reset
or a PROCSIGNAL_BARRIER_SMGRRELEASE. But since the code avoids ever
reusing the smgr, it should be OK. I think.

> While I'm at it, I would like to strenuously object to the current
> framing of CreateAndCopyRelationData as a general-purpose copying
> mechanism.  Because of the above assumptions, I think it's utterly
> unsafe to use anywhere except in CREATE DATABASE.  The header comment
> fails to warn about that at all, and placing it in bufmgr.c rather
> than static in dbcommands.c is just an invitation to future misuse.
> Perhaps I'm overly sensitive to that because I just finished cleaning
> up somebody's misuse of non-general-purpose code (1aa8dad41), but
> as this stands I think it's positively dangerous.

OK. No objection to you revising the comments however you feel is appropriate.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Robert Haas
On Thu, Aug 4, 2022 at 7:01 PM Tom Lane  wrote:
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?

Ouch. That's pretty bad.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Robert Haas
On Thu, Aug 4, 2022 at 7:11 PM Tom Lane  wrote:
> [pile^2]  Also, what is the rationale for locking the target buffer
> but not the source buffer?  That seems pretty hard to justify from
> here, even granting the assumption that we don't expect any other
> processes to be interested in these buffers (which I don't grant,
> because checkpointer).

Ooph. I agree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-08-04 Thread Kyotaro Horiguchi
At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe  
wrote in 
> While this may mitigate the problem, I don't think it will deal with
> all the cases which could cause a transaction to end up committed locally,
> but not on the synchronous standby.  I think that only using the full
> power of two-phase commit can make this bulletproof.
> 
> Is it worth adding additional complexity that is not a complete solution?

I would agree with this. As with 2PC, whatever we do to make it
perfect, we're left with unresolvable problems, at least for now.

Doesn't it meet your requirements if we have a means to know whether the
last transaction on the current session was locally committed or aborted?

We are already internally managing the last committed LSN. I think we can
do the same for transaction aborts and the last inserted LSN, and we can
expose them in some way. This is far simpler than the (maybe)
uncompletable attempt to fill up the deep gap.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Introduce wait_for_subscription_sync for TAP tests

2022-08-04 Thread Amit Kapila
On Thu, Aug 4, 2022 at 7:13 PM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > Yes. I've attached patches for backbranches.
>
> FWIW, I'd recommend waiting till after next week's wrap before
> pushing these.  While I'm definitely in favor of doing this,
> the odds of introducing a bug are nonzero, so right before a
> release deadline doesn't seem like a good time.
>

Agreed. I was planning to do it only after next week's wrap. Thanks
for your suggestion.

-- 
With Regards,
Amit Kapila.




Re: BTMaxItemSize seems to be subtly incorrect

2022-08-04 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 11:20 AM Robert Haas  wrote:
> I think I'd feel more comfortable if you wrote the patch, if that's possible.

Okay, pushed a fix just now.

Thanks
-- 
Peter Geoghegan




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-04 Thread John Naylor
On Fri, Aug 5, 2022 at 5:15 AM Nathan Bossart 
wrote:
>
> On Thu, Aug 04, 2022 at 02:58:14PM +0700, John Naylor wrote:
> > Were you considering adding the new function to simd.h now that that's
> > committed? It's a bit up in the air what should go in there, but this new
> > function is low-level and generic enough to be a candidate...
>
> I don't have a strong opinion.  I went with a separate file because I
> envisioned a variety of possible linear search functions (e.g., char,
> uint16, uint32), and some might not use SIMD instructions.  Furthermore, it
> seemed less obvious to look in simd.h for linear search functions.

That is a good point. Maybe potential helpers in simd.h should only deal
specifically with vector registers, with its users providing C fallbacks.
I don't have any good ideas of where to put the new function, though.

> > I wonder if the "pg_" prefix is appropriate here, as that is most often
> > used for things that hide specific details *and* where the base name would
> > clash, like OS calls or C library functions. I'm not quite sure where the
> > line is drawn, but I mean that "linearsearch" is a generic algorithm and
> > not a specific API we are implementing, if that makes sense.
>
> Yeah, I was concerned about clashing with lsearch() and lfind().  I will
> drop the prefix.

Hmm, I didn't know about those. lfind() is similar enough that it would
make sense to have pg_lfind32() etc in src/include/port/pg_lsearch.h, at
least for the v4 version that returns the pointer. We already declare
bsearch_arg() in src/include/port.h and that's another kind of array
search. Returning bool is different enough to have a different name.
pg_lfind32_ispresent()?  *_noptr()? Meh.

Having said all that, the man page under BUGS [1] says "The naming is
unfortunate."

> > Out of curiosity do we know how much we get by loading four registers
> > rather than two?
>
> The small program I've been using for testing takes about 40% more time
> with the two register approach.  The majority of this test involves
> searching for elements that either don't exist in the array or that live
> near the end of the array, so this is probably close to the worst case.

Ok, sounds good.

[1] https://man7.org/linux/man-pages/man3/lsearch.3.html#BUGS

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Dilip Kumar
On Fri, Aug 5, 2022 at 4:31 AM Tom Lane  wrote:
>
> I wrote:
> > While I'm at it, I would like to strenuously object to the current
> > framing of CreateAndCopyRelationData as a general-purpose copying
> > mechanism.
>
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?
>
> /* Read block from source relation. */
> srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
>RBM_NORMAL, bstrategy_src,
>permanent);
> srcPage = BufferGetPage(srcBuf);
> if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
> {
> ReleaseBuffer(srcBuf);
> continue;
> }
>
> /* Use P_NEW to extend the destination relation. */
> dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
>RBM_NORMAL, bstrategy_dst,
>permanent);
>
> You can't skip pages just because they are empty.  Well, maybe you could
> if you were doing something to ensure that you zero-fill the corresponding
> blocks on the destination side.  But this isn't doing that.  It's using
> P_NEW for dstBuf, which will have the effect of silently collapsing out
> such pages.  Maybe in isolation a heap could withstand that, but its
> indexes won't be happy (and I guess t_ctid chain links won't either).
>
> I think you should just lose the if() stanza.  There's no optimization to
> be had here that's worth any extra complication.
>
> (This seems worth fixing before beta3, as it looks like a rather
> nasty data corruption hazard.)

Yeah this is broken.

--
Dilip
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Handle infinite recursion in logical replication setup

2022-08-04 Thread Amit Kapila
On Thu, Aug 4, 2022 at 6:31 PM Masahiko Sawada  wrote:
>
> On Tue, Aug 2, 2022 at 8:59 PM Amit Kapila  wrote:
> >
> > On Tue, Jul 26, 2022 at 9:07 AM Amit Kapila  wrote:
> >
> > Let me try to summarize the discussion so that it is easier for others
> > to follow. The work in this thread is to avoid loops, and duplicate
> > data in logical replication when the operations happened on the same
> > table in multiple nodes. It has been explained in email [1] with an
> > example of how a logical replication setup can lead to duplicate or
> > inconsistent data.
> >
> > The idea to solve this problem is that we don't replicate data that is
> > not generated locally which we can normally identify based on origin
> > of data in WAL. The commit 366283961a achieves that for replication
> > but still the problem can happen during initial sync which is
> > performed internally via copy. We can't differentiate the data in heap
> > based on origin. So, we decided to prohibit the subscription
> > operations that can perform initial sync (ex. Create Subscription,
> > Alter Subscription ... Refresh) by detecting that the publisher has
> > subscribed to the same table from some other publisher.
> >
> > To prohibit the subscription operations, the currently proposed patch
> > throws an error.  Then, it also provides a new copy_data option
> > 'force' under which the user will still be able to perform the
> > operation. This could be useful when the user intentionally wants to
> > replicate the initial data even if it contains data from multiple
> > nodes (for example, when in a multi-node setup, one decides to get the
> > initial data from just one node and then allow replication of data to
> > proceed from each of respective nodes).
> >
> > The other alternative discussed was to just give a warning for
> > subscription operations and probably document the steps for users to
> > avoid it. But the problem with that is once the user sees this
> > warning, it won't be able to do anything except recreate the setup, so
> > why not give an error in the first place?
> >
> > Thoughts?
>
> Thank you for the summary!
>
> I understand that this feature could help some cases, but I'm really
> not sure adding new value 'force' for them is worthwhile, and
> concerned it could reduce the usability.
>
> IIUC this feature would work only when origin = 'none' and copy_data =
> 'on', and the purpose is to prevent the data from being
> duplicated/conflicted by the initial table sync. But there are cases
> where duplication/conflict doesn't happen even if origin = 'none' and
> copy_data = 'on'. For instance, the table on the publisher might be
> empty.
>

Right, but if we want we can check the tables on publishers to ensure
that. Now, another case could be where the corresponding subscription
was disabled on publisher during create subscription but got enabled
just before copy, we can even catch that situation as we are doing for
column lists in fetch_remote_table_info(). Are there other cases of
false positives you can think of?  I see your point that we can
document that users should be careful with certain configurations to
avoid data inconsistency, but I am not able to completely convince myself
about that. I think the main thing to decide here is how much we want to
ask users to be careful by referring them to docs. Now, if you are not
convinced with giving an ERROR here then we can probably show a
WARNING (that we might copy data for multiple origins during initial
sync in spite of the user having specified origin as NONE)?

>
> Also, even with origin = 'any', copy_data = 'on', there is a
> possibility of data duplication/conflict. Why do we need to address
> only the case where origin = 'none'?
>

Because the user has specifically asked not to replicate any remote
data by specifying origin = NONE, which should be dealt with
differently, whereas 'any' doesn't have such a requirement. Now, if
tomorrow we want to support replication based on specific origin names,
say origin = 'node-1', we won't be able to identify the data during
initial sync there either, but I think 'none' is a special case where
giving some intimation to the user won't be a bad idea, especially
because we can identify it.

> I think that using origin =
> 'none' doesn't necessarily mean using bi-directional (or N-way)
> replication. Even when using uni-directional logical replication with
> two nodes, they may use origin = 'none'.
>

It is possible, but still, I think it is a must for bi-directional (or
N-way) replication; otherwise, there is a risk of loops.

-- 
With Regards,
Amit Kapila.




Re: How is this possible "publication does not exist"

2022-08-04 Thread Kyotaro Horiguchi
At Thu, 4 Aug 2022 10:56:45 +0200, Marco Slot  wrote in 
> On Wed, Aug 11, 2021 at 1:37 PM Amit Kapila  wrote:
> > I think it is and the context is generated via
> > output_plugin_error_callback. Is this reproducible for you and if so,
> > can you share a test case or some steps to reproduce this? Does this
> > work and suddenly start giving errors or it happens the very first
> > time you tried to set up publication/subscription? I think some more
> > details are required about your setup and steps to analyze this
> > problem. You might want to check publication-side logs, but I'm not sure
> > you'll get any better clue there.
> 
> This seems to regularly reproduce the issue on PostgreSQL 14.4:
> 
> drop subscription if exists local_sub;
> drop publication if exists local_pub;
> drop table if exists local;
> 
> select pg_create_logical_replication_slot('test','pgoutput');
> create table local (x int, y int);
> insert into local values (1,2);
> create publication local_pub for table local;
> create subscription local_sub connection 'host=localhost port=5432'
> publication local_pub with (create_slot = false, slot_name = 'test',
> copy_data = false);

> The local_pub does appear in pg_publication, but it seems a bit like
> the change_cb is using an old snapshot when reading from the catalog
> in GetPublicationByName.

Yes.  So the slot should be created *after* the publication is
created.  A discussion [1] was held on allowing missing publications
in such a case, but it has not reached a conclusion.

[1] 
https://www.postgresql.org/message-id/CAA4eK1LwQAEPJMTwVe3UYODeNMkK2QHf-WZF5aXp5ZcjDRcrUA%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: fix typos

2022-08-04 Thread John Naylor
On Thu, Aug 4, 2022 at 8:41 PM Tom Lane  wrote:
>
> John Naylor  writes:

> > RepOriginId is a typedef for uint16, so this can't print the wrong answer,
> > but it is inconsistent with other uses. So it seems we don't need to
> > backpatch this one?
>
> Um ... if it's int16, then it can't be an OID, so I'd say this message has
> far worse problems than %d vs %u.  It should not use that terminology.

The catalog has the following. Since it's not a real oid, maybe this column
should be rethought?

CATALOG(pg_replication_origin,6000,ReplicationOriginRelationId)
BKI_SHARED_RELATION
{
/*
* Locally known id that get included into WAL.
*
* This should never leave the system.
*
* Needs to fit into an uint16, so we don't waste too much space in WAL
* records. For this reason we don't use a normal Oid column here, since
* we need to handle allocation of new values manually.
*/
Oid roident;
[...]

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Dilip Kumar
On Fri, Aug 5, 2022 at 5:36 AM Andres Freund  wrote:
>
> Hi,
>
> On August 4, 2022 4:20:16 PM PDT, Tom Lane  wrote:
> >Yeah, the assumption that P_NEW would automatically match the source block
> >was making me itchy too.  An explicit test-and-elog seems worth the
> >cycles.
>
> Is there a good reason to rely on P_NEW at all?

I think there were 2 arguments for which we used bufmgr instead of
smgrextend for the destination database

1) (Comment from Andres) The big benefit would be that the *target*
database does not have to be written out / shared buffer is
immediately populated. [1]
2) (Comment from Robert) We wanted to avoid writing new code which
bypasses the shared buffers.

[1]https://www.postgresql.org/message-id/20210905202800.ji4fnfs3xzhvo7l6%40alap3.anarazel.de

 Both from an efficiency and robustness POV it seems like it'd be
better to use smgrextend to bulk extend just after smgrcreate() and
then fill s_b u using RBM_ZERO (or whatever it is called).  That bulk
smgrextend would later be a good point to use fallocate so the FS can
immediately size the file correctly, without a lot of pointless writes
of zeroes.

Yeah okay, so you mean that since we already know the nblocks in the
source file, we can do the smgrextend in bulk before the copy loop and
then copy block by block using the bufmgr with the proper blkno instead
of P_NEW.  I think this looks optimized, and it takes care of the above
2 points I mentioned: the target database pages will still be in shared
buffers, and we are not bypassing shared buffers either.
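
Something like the following sketch is what I have in mind -- the
locator variable, the zero-filled page buffer and the exact read mode
(RBM_ZERO_AND_LOCK here) are assumptions rather than final code:

	PGAlignedBlock zerobuf;

	memset(zerobuf.data, 0, BLCKSZ);

	/*
	 * Size the destination fork up front: writing the last block with
	 * smgrextend implicitly zero-fills everything before it.
	 */
	if (nblocks > 0)
		smgrextend(smgropen(dst_rlocator, InvalidBackendId), forkNum,
				   nblocks - 1, zerobuf.data, true);

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		/* Read the pre-sized destination block instead of using P_NEW. */
		dstBuf = ReadBufferWithoutRelcache(dst_rlocator, forkNum, blkno,
										   RBM_ZERO_AND_LOCK, bstrategy_dst,
										   permanent);
		/* ... copy the source page into dstBuf, WAL-log it, mark it dirty ... */
		UnlockReleaseBuffer(dstBuf);
	}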

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-04 Thread Dilip Kumar
On Fri, Aug 5, 2022 at 2:59 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
>
> What's the motivation behind the explicit close?
>
>
> > @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char 
> > *srcpath)
> >   Pagepage;
> >   List   *rlocatorlist = NIL;
> >   LockRelId   relid;
> > - Relationrel;
> >   Snapshotsnapshot;
> > + SMgrRelationsmgr;
> >   BufferAccessStrategy bstrategy;
> >
> >   /* Get pg_class relfilenumber. */
> > @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char 
> > *srcpath)
> >   rlocator.dbOid = dbid;
> >   rlocator.relNumber = relfilenumber;
> >
> > - /*
> > -  * We can't use a real relcache entry for a relation in some other
> > -  * database, but since we're only going to access the fields related 
> > to
> > -  * physical storage, a fake one is good enough. If we didn't do this 
> > and
> > -  * used the smgr layer directly, we would have to worry about
> > -  * invalidations.
> > -  */
> > - rel = CreateFakeRelcacheEntry(rlocator);
> > - nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
> > - FreeFakeRelcacheEntry(rel);
> > + smgr = smgropen(rlocator, InvalidBackendId);
> > + nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
> > + smgrclose(smgr);
>
> Why are you opening and then closing again? Part of the motivation for the
> question is that a local SMgrRelation variable may lead to it being used
> further, opening up interrupt processing issues.

Yeah okay, I think there is no reason to close it; in the previous
version I had something like the below, and I think that's a better idea.

nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM)

>
> > + rlocator.locator = src_rlocator;
> > + smgrcloserellocator(rlocator);
> > +
> > + rlocator.locator = dst_rlocator;
> > + smgrcloserellocator(rlocator);
>
> As mentioned above, it's not clear to me why this is now done...
>
> Otherwise looks good to me.

Yeah, maybe it is not necessary to close them, as these unowned smgrs
will automatically get closed at transaction end.  Actually, the
previous version of the patch had both these comments fixed.  The
reason for explicitly closing is that I noticed that in most of the
places where we do smgropen we also explicitly close the smgr, e.g.
index_copy_data(), heapam_relation_copy_data(); OTOH, in some places we
don't close it, e.g. IssuePendingWritebacks().  So now I think that in
our case it is better not to close it, because I do not like having
this specific code at the end just to close the smgr.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: BTMaxItemSize seems to be subtly incorrect

2022-08-04 Thread Thomas Munro
On Fri, Aug 5, 2022 at 3:56 PM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 11:20 AM Robert Haas  wrote:
> > I think I'd feel more comfortable if you wrote the patch, if that's 
> > possible.
>
> Okay, pushed a fix just now.

FYI florican and lapwing showed:

2022-08-05 01:04:29.903 EDT [34485:5] FATAL:  deduplication failed to
add heap tid to pending posting list
2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT:  WAL redo at 0/49708D8
for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1




Re: BTMaxItemSize seems to be subtly incorrect

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 10:25 PM Thomas Munro  wrote:
> FYI florican and lapwing showed:
>
> 2022-08-05 01:04:29.903 EDT [34485:5] FATAL:  deduplication failed to
> add heap tid to pending posting list
> 2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT:  WAL redo at 0/49708D8
> for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1

This very likely has something to do with the way nbtdedup.c uses
BTMaxItemSize(), which apparently won't work on these 32-bit systems
now.

I'll fix this tomorrow morning.

-- 
Peter Geoghegan




Re: Support logical replication of DDLs

2022-08-04 Thread Peter Smith
Hi Hou-san, here are my review comments for the patch v15-0001:

==

1. Commit Message

CREATE/ALTER/DROP TABLE (*)

At first, I thought "(*)" looks like a SQL syntax element.

SUGGESTION:

CREATE/ALTER/DROP TABLE - - Note #1, Note #2
...
Note #1 – blah blah
Note #2 – yada yada

==

2. src/backend/commands/ddl_deparse.c - General

2.1
Lots of the deparse_XXX functions are in random places scattered around
in this module. Since there are so many, I think it's better to have the
functions arranged alphabetically to make them easier to find. (And if
there are several functions that logically "belong" together, then those
should be renamed so they will be grouped together alphabetically.)

Same applies to other functions – not just the deparse_XXX ones

2.2
There are lots of 'tmp' (or 'tmp2') variables in this file. Sometimes
'tmp' is appropriate (or maybe 'tmpobj' would be better) but in other
cases it seems like there should be a better name than 'tmp'. Please
search all these and replace where you can use a more meaningful name
than tmp.

2.3
Pointer NULL comparisons are not done consistently all through the
file. E.g. sometimes you write tree->fmtinfo == NULL, but in other
places you write if (!tree->fmtinfo). It's no big deal whichever way you
want to use, but at least the comparisons involving the same variables
should IMO use the same style consistently.

~~~

3. src/backend/commands/ddl_deparse.c - format_type_detailed

3.1
+ * - typename is set to the type name, without quotes

But the param is called 'typname', not 'typename'

3.2
+ * - typmod is set to the typemod, if any, as a string with parens

I think you mean:
"typmod is set" -> "typemodstr is set"

3.3
+ if (IsTrueArrayType(typeform) &&
+ typeform->typstorage != TYPSTORAGE_PLAIN)
+ {
+ /* Switch our attention to the array element type */
+ ReleaseSysCache(tuple);
+ tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(array_base_type));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for type %u", type_oid);
+
+ typeform = (Form_pg_type) GETSTRUCT(tuple);
+ type_oid = array_base_type;
+ *typarray = true;
+ }
+ else
+ *typarray = false;

Maybe this if/else can be simplified

*typarray = IsTrueArrayType(typeform) && typeform->typstorage !=
TYPSTORAGE_PLAIN;
if (*typarray)
{
...
}

3.4
+ /* otherwise, WITH TZ is added by typmod. */

Uppercase comment

~~~

4. src/backend/commands/ddl_deparse.c - append_object_to_format_string

+ for (cp = sub_fmt; cp < end_ptr; cp++)
+ {
+ if (*cp == '{')
+ {
+ start_copy = true;
+ continue;
+ }

What's this logic going to do if it encounters "{{"? It looks like it
will just keep going, but wouldn't it be an error for a name to have a
"{" in it?

~~~

5. src/backend/commands/ddl_deparse.c - append_bool_object

+append_bool_object(ObjTree *tree, char *sub_fmt, bool value)
+{
+ ObjElem  *param;
+ char *object_name = sub_fmt;
+ bool   is_present_flag = false;
+
+ Assert(sub_fmt);
+
+ if (strcmp(sub_fmt, "present") == 0)
+ {
+ is_present_flag = true;
+ tree->present = value;
+ }
+
+ if (!verbose && !tree->present)
+ return;
+
+ if (!is_present_flag)
+ object_name = append_object_to_format_string(tree, sub_fmt);
+
+ param = new_object(ObjTypeBool, object_name);
+ param->value.boolean = value;
+ append_premade_object(tree, param);
+}

It feels like there is some subtle trickery going on here with the
conditions. Is there a simpler way to write this, or maybe it is just
lacking some explanatory comments?

~~~

6. src/backend/commands/ddl_deparse.c - append_array_object

+ if (!verbose)
+ {
+ ListCell *lc;
+
+ /* Extract the ObjElems whose present flag is true */
+ foreach(lc, array)
+ {
+ ObjElem *elem = (ObjElem *) lfirst(lc);
+
+ Assert(elem->objtype == ObjTypeObject);
+
+ if (!elem->value.object->present)
+ array = foreach_delete_current(array, lc);
+ }
+
+ if (list_length(array) == 0)
+ return;
+ }

Maybe it is OK as-is. I'm just wondering if this list_length(array)
check should be outside of the !verbose check?

~~~

7. src/backend/commands/ddl_deparse.c - objtree_to_jsonb_element

+ case ObjTypeObject:
+ /* recursively add the object into the existing parse state */

Uppercase comment

~~~

8. src/backend/commands/ddl_deparse.c - new_objtree_for_qualname_id

8.1
+ *
+ * Elements "schemaname" and "objname" are set.  If the object is a temporary
+ * object, the schema name is set to "pg_temp".

I'm not sure if this is the right place to say this, since it is not
really this function that sets that "pg_temp".

8.2
+ if (isnull)
+ elog(ERROR, "unexpected NULL namespace");
+ objname = heap_getattr(catobj, Anum_name, RelationGetDescr(catalog),
+&isnull);

Missing blank line after the elog?

~~~

9. src/backend/commands/ddl_deparse.c - deparse_ColumnIdentity

+ /* Definition elemets */

typo "elemets"

~~~

10. src/backend/commands/ddl_deparse.c - deparse_CreateTrigStmt

10.1
+ else
+ elog(ERROR, "unrecognized trigger timing value %d", node->timing);

should that say "unrecognized trigger timing type" (e.g. 
