RE: Perform streaming logical transactions by background workers and parallel apply

2022-08-11 Thread houzj.f...@fujitsu.com
On Wednesday, August 10, 2022 11:39 AM Amit Kapila 
wrote:
> 
> On Tue, Aug 9, 2022 at 5:39 PM Amit Kapila  wrote:
> >
> > On Tue, Aug 9, 2022 at 11:09 AM Dilip Kumar 
> wrote:
> > >
> > > Some more comments
> > >
> > > +/*
> > > + * Exit if any relation is not in the READY state and if any worker is
> > > + * handling the streaming transaction at the same time. Because for
> > > + * streaming transactions that is being applied in apply background
> > > + * worker, we cannot decide whether to apply the change for a relation
> > > + * that is not in the READY state (see should_apply_changes_for_rel)
> > > + * as we won't know remote_final_lsn by that time.
> > > + */
> > > +if (list_length(ApplyBgworkersFreeList) != list_length(ApplyBgworkersList) &&
> > > +    !AllTablesyncsReady())
> > > +{
> > > +    ereport(LOG,
> > > +            (errmsg("logical replication apply workers for subscription \"%s\" will restart",
> > > +                    MySubscription->name),
> > > +             errdetail("Cannot handle streamed replication transaction by apply "
> > > +                       "background workers until all tables are synchronized")));
> > > +
> > > +    proc_exit(0);
> > > +}
> > >
> > > How can this situation occur? I mean, while starting a background
> > > worker itself we can check whether all tables are sync ready or not,
> > > right?
> > >
> >
> > We are already checking at the start in apply_bgworker_can_start(), but
> > I think it is required to check at a later point of time as well,
> > because new rels can be added to pg_subscription_rel via Alter
> > Subscription ... Refresh. I feel if that reasoning is correct then we
> > can probably expand the comments to make it clear.
> >
> > > +/* Check the status of apply background worker if any. */
> > > +apply_bgworker_check_status();
> > > +
> > >
> > > What is the need of checking each worker's status on every commit?  I
> > > mean, if there are a lot of small transactions along with some
> > > streaming transactions, then won't it affect the apply performance for
> > > those small transactions?
> > >
> >
> > I don't think performance will be a concern because this won't do any
> > costly operation unless invalidation happens in which case it will
> > access system catalogs. However, if my above understanding is correct
> > that new tables can be added during the apply process then not sure
> > doing it at commit time is sufficient/correct because it can change
> > even during the transaction.
> >
> 
> One idea that may handle it cleanly is to check for the SUBREL_STATE_SYNCDONE
> state in should_apply_changes_for_rel() and error out for apply_bg_worker().
> For the SUBREL_STATE_READY state, it should return true and for any other
> state, it can return false. The one advantage of this approach could be that
> the parallel apply worker will give an error only if the corresponding
> transaction has performed any operation on the relation that has reached the
> SYNCDONE state.
> OTOH, checking at each transaction end can also lead to erroring out of
> workers even if the parallel apply transaction doesn't perform any operation
> on the relation which is not in the READY state.

I agree that it would be better to check at should_apply_changes_for_rel().

In addition, I think we should report an error if the table is not in the
READY state, and otherwise return true. Currently (on HEAD), if the table
state is not READY, we will skip all the changes related to that relation in
a transaction, because we invoke process_syncing_tables only at transaction
end, which means the state of a table won't change while a transaction is
being applied.

But while the apply bgworker is applying a streaming transaction, the main
apply worker could have applied several normal transactions, which could
change the state of the table several times (from INIT to READY). So, to
prevent the risk that we skip part of the changes before the state reaches
READY and then start applying the changes after READY within one transaction,
we had better error out if the table is not in the READY state and restart
without an apply background worker.
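
To make that concrete, here is a minimal standalone sketch of the rule being
proposed (stand-in types and names, not the actual patch code):

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for the sync states in pg_subscription_rel (hypothetical names). */
typedef enum { STATE_INIT, STATE_DATASYNC, STATE_SYNCDONE, STATE_READY } RelState;

static int in_apply_bgworker = 1;	/* stand-in for am_apply_bgworker() */

/*
 * Sketch: an apply background worker may only apply changes for relations in
 * the READY state; anything else errors out so that the apply workers restart
 * without using a background worker.
 */
static int
should_apply_changes_for_rel(RelState state)
{
	if (in_apply_bgworker)
	{
		if (state != STATE_READY)
		{
			fprintf(stderr, "ERROR: relation not READY, restart without apply bgworker\n");
			exit(1);
		}
		return 1;
	}

	/* Main apply worker: the existing remote_final_lsn-based logic goes here. */
	return state == STATE_READY;
}

int
main(void)
{
	printf("%d\n", should_apply_changes_for_rel(STATE_READY));	/* prints 1 */
	should_apply_changes_for_rel(STATE_DATASYNC);	/* errors out */
	return 0;
}
```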

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

2022-08-11 Thread houzj.f...@fujitsu.com
On Tuesday, August 9, 2022 4:49 PM Kuroda, Hayato/黒田 隼人 
 wrote:
> Dear Wang,
> 
> Thanks for updating patch sets! Followings are comments about v20-0001.
> 
> 1. config.sgml
> 
> ```
>
> Specifies maximum number of logical replication workers. This includes
> both apply workers and table synchronization workers.
>
> ```
> 
> I think you can add a description in the above paragraph, like
> " This includes apply main workers, apply background workers, and table
> synchronization workers."

Changed.

> 2. logical-replication.sgml
> 
> 2.a Configuration Settings
> 
> ```
>max_logical_replication_workers must be set to at
> least
>the number of subscriptions, again plus some reserve for the table
>synchronization.
> ```
> 
> I think you can add a description in the above paragraph, like
> "... the number of subscriptions, plus some reserve for the table
> synchronization
>  and the streaming transaction."

I think it's not a must to add the number of streaming transactions here, as
things still work even if no worker is available for the apply bgworker, as
explained in the documentation of the streaming option.

> 2.b Monitoring
> 
> ```
>   
>Normally, there is a single apply process running for an enabled
>subscription.  A disabled subscription or a crashed subscription will have
>zero rows in this view.  If the initial data synchronization of any
>table is in progress, there will be additional workers for the tables
>being synchronized.
>   
> ```
> 
> I think you can add a sentence in the above paragraph, like
> "... synchronized. Moreover, if the streaming transaction is applied 
> parallelly,
> there will be additional workers"

Added.

> 3. launcher.c
> 
> ```
> +   /* Sanity check : we don't support table sync in subworker. */
> ```
> 
> I think "Sanity check :" should be "Sanity check:", per other files.


Changed.

> 4. worker.c
> 
> 4.a handle_streamed_transaction()
> 
> ```
> -   /* not in streaming mode */
> -   if (!in_streamed_transaction)
> +   /* Not in streaming mode */
> +   if (!(in_streamed_transaction || am_apply_bgworker()))
> ```
> 
> I think the comment should also mention about apply background worker case.

Added.

> 4.b handle_streamed_transaction()
> 
> ```
> -   Assert(stream_fd != NULL);
> ```
> 
> I think this assertion seems reasonable in case of stream='on'.
> Could you revive it and move to later part in the function, like after
> subxact_info_add(current_xid)?

Added.

> 4.c apply_handle_prepare_internal()
> 
> ```
>  * BeginTransactionBlock is necessary to balance the
> EndTransactionBlock
>  * called within the PrepareTransactionBlock below.
>  */
> -   BeginTransactionBlock();
> +   if (!IsTransactionBlock())
> +   BeginTransactionBlock();
> +
> ```
> 
> I think the comment should be "We must be in transaction block to balance...".

Changed.

> 4.d apply_handle_stream_prepare()
> 
> ```
> - *
> - * Logic is in two parts:
> - * 1. Replay all the spooled operations
> - * 2. Mark the transaction as prepared
>   */
>  static void
>  apply_handle_stream_prepare(StringInfo s)
> ```
> 
> I think these comments are useful when stream='on',
> so it should be moved to later part.

I think we already have similar comments in later part.

> 5. applybgworker.c
> 
> 5.a apply_bgworker_setup()
> 
> ```
> +   elog(DEBUG1, "setting up apply worker #%u",
> list_length(ApplyBgworkersList) + 1);
> ```
> 
> "apply worker" should be "apply background worker".
> 
> 5.b LogicalApplyBgwLoop()
> 
> ```
> +   elog(DEBUG1, "[Apply BGW #%u] ended processing streaming chunk,"
> +        "waiting on shm_mq_receive", shared->worker_id);
> ```
> 
> A blank is needed after the comma. I checked the server log, and the message
> was output like:
> 
> ```
> [Apply BGW #1] ended processing streaming chunk,waiting on
> shm_mq_receive
> ```

Changed.

> 6.
> 
> When I started up the apply background worker and did `SELECT * from
> pg_stat_subscription`, I got following lines:
> 
> ```
> postgres=# select * from pg_stat_subscription;
>  subid | subname |  pid  | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
> -------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
>  16400 | sub     | 22383 |       |              | -infinity                     | -infinity                     |                | -infinity
>  16400 | sub     | 22312 |       | 0/6734740    | 2022-08-09 07:40:19.367676+00 | 2022-08-09 07:40:19.375455+00 | 0/6734740      | 2022-08-09 07:40:19.367676+00
> (2 rows)
> ```
> 
> 
> 6.a
> 
> It seems that the upper line represents the apply background worker, but I
> think last_msg_send_time and last_msg_receipt_time should be null.
> Is it

RE: Perform streaming logical transactions by background workers and parallel apply

2022-08-11 Thread houzj.f...@fujitsu.com
On Wednesday, August 10, 2022 5:40 PM Peter Smith  wrote:
> 
> Here are some review comments for the patch v20-0001:
> ==
> 
> 1. doc/src/sgml/catalogs.sgml
> 
> +   p = apply changes directly using a background
> +   worker, if available, otherwise, it behaves the same as 't'
> 
> The different char values 'f','t','p' are separated by comma (,) in
> the list, which is normal for the pgdocs AFAIK. However, because of
> this I don't think it is a good idea to use those other commas within
> the description for  'p', I suggest you remove those ones to avoid
> ambiguity with the separators.

Changed.

> ==
> 
> 2. doc/src/sgml/protocol.sgml
> 
> @@ -3096,7 +3096,7 @@ psql "dbname=postgres replication=database" -c
> "IDENTIFY_SYSTEM;"
>   
>
> Protocol version. Currently versions 1,
> 2,
> -   and 3 are supported.
> +   3 and 4 are supported.
>
> 
> Put a comma after the penultimate value like it had before.
> 


Changed.

> ==
> 
> 3. src/backend/replication/logical/applybgworker.c - 
> 
> There are multiple function comments and other code comments in this
> file that are missing a terminating period (.)
> 
> ==
> 

Changed.

> 4. src/backend/replication/logical/applybgworker.c - apply_bgworker_start
> 
> +/*
> + * Try to get a free apply background worker.
> + *
> + * If there is at least one worker in the free list, then take one. 
> Otherwise,
> + * try to start a new apply background worker. If successful, cache it in
> + * ApplyBgworkersHash keyed by the specified xid.
> + */
> +ApplyBgworkerState *
> +apply_bgworker_start(TransactionId xid)
> 
> SUGGESTION (for function comment)
> Return the apply background worker that will be used for the specified xid.
> 
> If an apply background worker is found in the free list then re-use
> it, otherwise start a fresh one. Cache the worker ApplyBgworkersHash
> keyed by the specified xid.
> 
> ~~~
> 

Changed.

> 5.
> 
> + /* Try to get a free apply background worker */
> + if (list_length(ApplyBgworkersFreeList) > 0)
> 
> if (list_length(ApplyBgworkersFreeList) > 0)
> 
> AFAIK a non-empty list is guaranteed to be not NIL, and an empty list
> is guaranteed to be NIL. So if you want to, the above can simply be
> written as:
> 
> if (ApplyBgworkersFreeList)
> 

Both ways are fine to me, so I kept the current style.

> ~~~
> 
> 6. src/backend/replication/logical/applybgworker.c - apply_bgworker_find
> 
> +/*
> + * Try to look up worker assigned before (see function
> apply_bgworker_get_free)
> + * inside ApplyBgworkersHash for requested xid.
> + */
> +ApplyBgworkerState *
> +apply_bgworker_find(TransactionId xid)
> 
> SUGGESTION (for function comment)
> Find the worker previously assigned/cached for this xid. (see function
> apply_bgworker_start)
> 

Changed.

> ~~~
> 
> 7.
> 
> + Assert(status == APPLY_BGWORKER_BUSY);
> +
> + return entry->wstate;
> + }
> + else
> + return NULL;
> 
> IMO here it is better to just remove that 'else' and unconditionally
> return NULL at the end of this function.
> 

Changed.

> ~~~
> 
> 8. src/backend/replication/logical/applybgworker.c -
> apply_bgworker_subxact_info_add
> 
> + * Inside apply background worker we can figure out that new subtransaction
> was
> + * started if new change arrived with different xid. In that case we can 
> define
> + * named savepoint, so that we were able to commit/rollback it separately
> + * later.
> + * Special case is if the first change comes from subtransaction, then
> + * we check that current_xid differs from stream_xid.
> + */
> +void
> +apply_bgworker_subxact_info_add(TransactionId current_xid)
> 
> It is not quite English. Can you improve it a bit?
> 
> SUGGESTION (maybe like this?)
> The apply background worker can figure out if a new subtransaction was
> started by checking if the new change arrived with different xid. In
> that case define a named savepoint, so that we are able to
> commit/rollback it separately later. A special case is when the first
> change comes from subtransaction – this is determined by checking if
> the current_xid differs from stream_xid.
> 

Changed.

> ==
> 
> 9. src/backend/replication/logical/launcher.c -
> WaitForReplicationWorkerAttach
> 
> + *
> + * Return false if the attach fails. Otherwise return true.
>   */
> -static void
> +static bool
>  WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
> 
> Why not just say "Return whether the attach was successful."
> 

Changed.

> ~~~
> 
> 10. src/backend/replication/logical/launcher.c - logicalrep_worker_stop
> 
> + /* Found the main worker, then try to stop it. */
> + if (worker)
> + logicalrep_worker_stop_internal(worker);
> 
> IMO the comment is kind of pointless because it only says what the
> code is clearly doing. If you really wanted to reinforce this worker
> is a main apply worker then you can do that with code like:
> 
> if (worker)
> {
> Assert(!worker->subworker);
> logicalrep_worker_stop_internal(worker);
> }
> 

Changed.

> ~~~
> 
> 

Re: making relfilenodes 56 bits

2022-08-11 Thread Dilip Kumar
On Thu, Aug 11, 2022 at 10:58 AM Dilip Kumar  wrote:
>
> On Tue, Aug 9, 2022 at 8:51 PM Robert Haas  wrote:
> >
> > On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar  wrote:
> > > I think even if we start the range from the 4 billion we can not avoid
> > > keeping two separate ranges for system and user tables otherwise the
> > > next upgrade where old and new clusters both have 56 bits
> > > relfilenumber will get conflicting files.  And, for the same reason we
> > > still have to call SetNextRelFileNumber() during upgrade.
> >
> > Well, my proposal to move everything from the new cluster up to higher
> > numbers would address this without requiring two ranges.
> >
> > > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > > range will start from 4 billion and user range maybe something around
> > > 4.1 (I think we can keep it very small though, just reserve 50k
> > > relfilenumber for system for future expansion and start user range
> > > from there).
> >
> > A disadvantage of this is that it basically means all the file names
> > in new clusters are going to be 10 characters long. That's not a big
> > disadvantage, but it's not wonderful. File names that are only 5-7
> > characters long are common today, and easier to remember.
>
> That's correct.
>
> > > So now system tables have no issues and also the user tables from the
> > > old cluster have no issues.  But pg_largeobject might get conflict
> > > when both old and new cluster are using 56 bits relfilenumber, because
> > > it is possible that in the new cluster some other system table gets
> > > that relfilenumber which is used by pg_largeobject in the old cluster.
> > >
> > > This could be resolved if we allocate pg_largeobject's relfilenumber
> > > from the user range, that means this relfilenumber will always be the
> > > first value from the user range.  So now if the old and new cluster
> > > both are using 56bits relfilenumber then pg_largeobject in both
> > > cluster would have got the same relfilenumber and if the old cluster
> > > is using the current 32 bits relfilenode system then the whole range
> > > of the new cluster is completely different than that of the old
> > > cluster.
> >
> > I think this can work, but it does rely to some extent on the fact
> > that there are no other tables which need to be treated like
> > pg_largeobject. If there were others, they'd need fixed starting
> > RelFileNumber assignments, or some other trick, like renumbering them
> > twice in the cluster, first to a known-unused value and then back to
> > the proper value. You'd have trouble if in the other cluster
> > pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> > cluster the reverse, without some hackery.
>
> Agreed, if there are more catalogs like pg_largeobject then it would
> require some hacking.
>
> > I do feel like your idea here has some advantages - my proposal
> > requires rewriting all the catalogs in the new cluster before we do
> > anything else, and that's going to take some time even though they
> > should be small. But I also feel like it has some disadvantages: it
> > seems to rely on complicated reasoning and special cases more than I'd
> > like.
>
> One other advantage with your approach is that we start the
> "nextrelfilenumber" after the old cluster's relfilenumber range. So we
> need to set the "nextrelfilenumber" only at the beginning; after that,
> while upgrading each object, we don't need to set the nextrelfilenumber
> every time, because it is already higher than the complete old cluster
> range.  In the other 2 approaches we would have to set the
> nextrelfilenumber every time we preserve a relfilenumber during the
> upgrade.

I was also wondering whether we will get the max "relfilenumber" from the old
cluster at the cluster level or per database.  If we want it per database we
can run a simple query on pg_class and get it, but there we will also need to
see how to handle the mapped relations if they have been rewritten.  I don't
think we can get the max relfilenumber from the old cluster at the cluster
level.  Maybe in newer versions we can expose a function from the server that
just returns the NextRelFileNumber, which would be the max relfilenumber, but
I'm not sure how to do that for old versions.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-11 Thread Matthias van de Meent
On Mon, 8 Aug 2022 at 18:48, Peter Geoghegan  wrote:
>
> On Mon, Aug 8, 2022 at 9:17 AM Matthias van de Meent
>  wrote:
> > Because if a subset of the pages of a relation contains more tuples
> > than your current total expected tuples in the table, you should
> > update your expectations regardless of which blocks or which number of
> > blocks you've scanned - the previous stored value is a strictly worse
> > estimation than your last measurement.
>
> The previous stored value could be -1, which represents the idea that
> we don't know the tuple density yet. So it doesn't necessarily follow
> that the new estimate is strictly better, even in this exact scenario.
>
> > A 33-block relation with first 32 1-tuple pages is still enough to
> > have a last page with 250 tuples, which would be ignored in that
> > scheme and have a total tuple count of 33 or so.
>
> The simple fact is that there is only so much we can do with the
> limited information/context that we have. Heuristics are not usually
> free of all bias. Often the bias is the whole point -- the goal can be
> to make sure that we have the bias that we know we can live with, and
> not the opposite bias, which is much worse. Details of which are
> usually very domain specific.
>
> I presented my patch with a very simple test case -- a very clear
> problem. Can you do the same for this scenario?

CREATE TABLE tst (id int primary key generated by default as identity,
payload text) with (fillfactor 50); -- fillfactor to make pages fill up fast
INSERT INTO tst (payload) select repeat('a', 5000) from
generate_series(1, 32); -- 32 pages filled with large tuples
INSERT INTO tst (payload) select repeat('a', 4); -- small tuple at last page
vacuum (verbose, freeze) tst; -- 33 tuples on 33 pages, with lots of
-- space left on the last page
INSERT INTO tst(payload) select repeat('a', 4) from
generate_series(1, 63); -- now, we have 64 tuples on the last page
vacuum verbose tst; -- with your patch it reports only 33 tuples total,
-- while the single page that was scanned contains 64 tuples, and the
-- table contains 96 tuples.

> I accept that it is possible that we'll keep an old reltuples which is
> provably less accurate than doing something with the latest
> information from vacuumlazy.c. But the conditions under which this can
> happen are *very* narrow. I am not inclined to do anything about it
> for that reason.

I think I understand your reasoning, but I don't agree with the
conclusion. The attached patch 0002 does fix that skew too, at what I
consider negligible cost. 0001 is your patch with a new version
number.

I'm fine with your patch as is, but would appreciate it if known
estimate mistakes would also be fixed.

An alternative solution could be double-vetting, where we ignore
tuples_scanned if <2% of the pages AND <2% of the previously estimated
tuples were scanned.
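
A minimal standalone sketch of that double-vetting idea (the 2% thresholds
come from this message; everything else is a stand-in, not vacuumlazy.c code):

```c
#include <stdio.h>

/*
 * Sketch: keep the old estimate only when the scanned sample is tiny by BOTH
 * measures; otherwise extrapolate the freshly measured density.  The caller
 * is assumed to pass scanned_pages > 0; -1 for old_rel_tuples means the
 * density is not known yet, so the measurement is always taken then.
 */
static double
estimate_reltuples(double old_rel_tuples, double total_pages,
				   double scanned_pages, double scanned_tuples)
{
	if (old_rel_tuples >= 0 &&
		scanned_pages < 0.02 * total_pages &&
		scanned_tuples < 0.02 * old_rel_tuples)
		return old_rel_tuples;

	return scanned_tuples / scanned_pages * total_pages;
}

int
main(void)
{
	/* 1 page of 1000 scanned, 30 tuples seen, old estimate 29000:
	 * both 2% vetting tests pass, so the old estimate is kept. */
	printf("%.0f\n", estimate_reltuples(29000, 1000, 1, 30));	/* 29000 */

	/* 500 pages of 1000 scanned: the sample dominates, so extrapolate. */
	printf("%.0f\n", estimate_reltuples(29000, 1000, 500, 20000));	/* 40000 */
	return 0;
}
```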

Kind regards,

Matthias van de Meent


v2-0001-Avoid-reltuples-distortion-in-very-small-tables.patch
Description: Binary data


v2-0002-Avoid-reltuples-distortion-in-very-small-tables.patch
Description: Binary data


Re: Cleaning up historical portability baggage

2022-08-11 Thread Thomas Munro
Here's a new batch of these:

Remove configure probe for sys/uio.h.
Remove configure probes for sys/un.h and struct sockaddr_un.
Remove configure probe for sys/select.h.
Remove configure probes for sys/ipc.h, sys/sem.h, sys/shm.h.
Remove configure probe for sys/resource.h and refactor.
Remove configure probe for shl_load library.

The most interesting things to say about these ones are:
 * The concept of a no-Unix-socket build is removed.  We should be
able to do that now, right?  Peter E seemed to say approximately that
in the commit message for 797129e5.  Or is there a thought that a new
operating system might show up that doesn't have 'em and we'd wish
we'd kept this stuff well marked out?
 * Instead of eg HAVE_SYS_RESOURCE_H I supplied <sys/resource.h> and
moved the relevant declarations there from the big random-win32-stuff
header, and likewise for <sys/un.h>

(I still plan to get rid of no-threads-builds soon, but I got stuck
figuring out how to make sure I don't break the recent magic ldap
library detection... more soon.)


0001-Remove-configure-probe-for-sys-uio.h.patch
Description: Binary data


0002-Remove-configure-probes-for-sys-un.h-and-struct-sock.patch
Description: Binary data


0003-Remove-configure-probe-for-sys-select.h.patch
Description: Binary data


0004-Remove-configure-probes-for-sys-ipc.h-sys-sem.h-sys-.patch
Description: Binary data


0005-Remove-configure-probe-for-sys-resource.h-and-refact.patch
Description: Binary data


0006-Remove-configure-probe-for-shl_load-library.patch
Description: Binary data


Re: designated initializers

2022-08-11 Thread Alvaro Herrera
Hello

On 2022-Aug-10, Andres Freund wrote:

> +1 I've fought with this one when fixing a conflict when rebasing a patch...

Right -- pushed, thanks.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: CFM Manager

2022-08-11 Thread Ibrar Ahmed
On Tue, Jul 5, 2022 at 4:31 PM Ibrar Ahmed  wrote:

>
>
> On Tue, Jul 5, 2022 at 8:50 AM Michael Paquier 
> wrote:
>
>> On Tue, Jul 05, 2022 at 08:17:26AM +0500, Ibrar Ahmed wrote:
>> > If nobody has already volunteered for the next upcoming commitfest.
>> > I'd like to volunteer. I think early to say is better as always.
>>
>> Jacob and Daniel have already volunteered.  Based on the number of
>> patches at hand (305 in total), getting more help is always welcome, I
>> guess.
>> --
>> Michael
>>
> I am happy to help, but I am talking about the next one.
>
>
> --
> Ibrar Ahmed
>
Is anybody else volunteering for that? If not, I am ready to take that
responsibility.


-- 
Ibrar Ahmed


Re: logical replication restrictions

2022-08-11 Thread Amit Kapila
On Tue, Aug 9, 2022 at 3:52 AM Euler Taveira  wrote:
>
> On Wed, Aug 3, 2022, at 10:27 AM, Amit Kapila wrote:
>
> Your explanation makes sense to me. The other point to consider is
> that there can be cases where we may not apply operation for the
> transaction because of empty transactions (we don't yet skip empty
> xacts for prepared transactions). So, won't it be better to apply the
> delay just before we apply the first change for a transaction? Do we
> want to apply the delay during table sync as we sometimes do need to
> enter apply phase while doing table sync?
>
> I thought about the empty transactions but decided to not complicate the code
> mainly because skipping transactions is not a code path that will slow down
> this feature. As explained in the documentation, there is no harm in delaying
> a transaction for more than min_apply_delay; it cannot apply earlier. Having
> said that I decided to do nothing. I'm also not sure if it deserves a comment
> or if this email is a possible explanation for this decision.
>

I don't know what makes you think it will complicate the code. But
anyway, thinking further about the way apply_delay is used at various
places in the patch, as pointed out by Peter Smith, it seems it won't
work for the parallel apply feature where we start applying the
transaction immediately after the stream starts. I was wondering why we
don't apply the delay after each commit of the transaction rather than
at the begin command. We can remember whether the transaction has made
any change, and if so, apply the delay after the commit. If we can do
that then it will alleviate the concern about empty and skipped xacts as
well.

Another thing I was wondering about is how to determine a good delay
time for tests; the current tests in replay_delay.pl use 3s, so should
we use the same for the apply delay tests in this patch as well?

-- 
With Regards,
Amit Kapila.




Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)

2022-08-11 Thread Peter Eisentraut

On 01.08.22 19:08, Ranier Vilela wrote:
Like how 
https://github.com/postgres/postgres/commit/9fd45870c1436b477264c0c82eb195df52bc0919 


New attempt to remove more MemSet calls, that are safe.

Attached v3 patch.


Note that struct initialization does not set padding bits.  So any 
struct that is used as a hash key or that goes to disk or something 
similar needs to be set with memset/MemSet instead.  Various places in 
the code make explicit comments about that, which your patch deletes, 
which is a mistake.  This patch needs to be adjusted carefully with this 
in mind before it can be considered.






Re: Checking pgwin32_is_junction() errors

2022-08-11 Thread r . zharkov

On 2022-08-11 07:55, Thomas Munro wrote:

I checked a few variants:

21.07.2022  15:11 HOME [\??\Volume{GUID}\]
09.08.2022  15:06 Test1 [\\?\Volume{GUID}\]
09.08.2022  15:06 Test2 [\\.\Volume{GUID}\]
09.08.2022  15:17 Test3 [\??\Volume{GUID}\]
09.08.2022  15:27 Test4 [C:\temp\1]
09.08.2022  15:28 Test5 [C:\HOME\Temp\1]


One more thing I wondered about, now that we're following junctions
outside PGDATA: can a junction point to another junction?  If so, I
didn't allow for that: stat() gives up after one hop, because I
figured that was enough for the stuff we expect inside PGDATA and I
couldn't find any evidence in the man pages that referred to chains.
But if you *are* allowed to create a junction "c:\huey" that points to
junction "c:\dewey" that points to "c:\louie", and then you do initdb
-D c:\huey\pgdata, then I guess it would fail.  Would you mind
checking if that is a real possibility, and if so, testing this
chain-following patch to see if it fixes it?


I made some junctions and rechecked both patches.

11.08.2022  16:11 donald [C:\huey]
11.08.2022  13:23 huey [C:\dewey]
11.08.2022  13:23 dewey [C:\louie]
11.08.2022  16:57  louie

With the small attached patch initdb succeeded in any of these
"directories". If the junction chain is too long, initdb fails with
"could not create directory" as expected.

initdb -D huey/pgdata
...
Success.

initdb -N -D donald
...
Success.

11.08.2022  17:32  1
11.08.2022  17:32 2 [C:\1]
11.08.2022  17:32 3 [C:\2]
11.08.2022  17:32 4 [C:\3]
11.08.2022  17:32 5 [C:\4]
11.08.2022  17:32 6 [C:\5]
11.08.2022  17:32 7 [C:\6]
11.08.2022  17:32 8 [C:\7]
11.08.2022  17:32 9 [C:\8]
11.08.2022  17:32 10 [C:\9]

initdb -D 10/pgdata
...
creating directory 10/pgdata ... initdb: error: could not create 
directory "10": File exists


initdb -D 9/pgdata
...
Success.
--- win32stat.c.bak	2022-08-11 17:15:10.775412600 +0700
+++ win32stat.c	2022-08-11 16:47:41.992805700 +0700
@@ -186,9 +186,12 @@
 {
 	int			loops = 0;
 	int			ret;
+	char		curr[MAXPGPATH];
 
 	ret = _pglstat64(name, buf);
 
+	strlcpy(curr, name, MAXPGPATH);
+
 	/* Do we need to follow a symlink (junction point)? */
 	while (ret == 0 && S_ISLNK(buf->st_mode))
 	{
@@ -211,7 +214,7 @@
 		 * That could be optimized, but stat() on symlinks is probably rare
 		 * and this way is simple.
 		 */
-		size = readlink(name, next, sizeof(next));
+		size = readlink(curr, next, sizeof(next));
 		if (size < 0)
 		{
 			if (errno == EACCES &&
@@ -230,6 +233,7 @@
 		next[size] = 0;
 
 		ret = _pglstat64(next, buf);
+		strcpy(curr, next);
 	}
 
 	return ret;


Re: Allow logical replication to copy tables in binary format

2022-08-11 Thread Amit Kapila
On Thu, Aug 11, 2022 at 7:34 AM Euler Taveira  wrote:
>
> On Wed, Aug 10, 2022, at 12:03 PM, Melih Mutlu wrote:
>
> I see that logical replication subscriptions have an option to enable binary 
> [1].
> When it's enabled, subscription requests publisher to send data in binary 
> format.
> But this is only the case for apply phase. In tablesync, tables are still 
> copied as text.
>
> This option could have been included in the commit 9de77b54531; it wasn't.
> Maybe it wasn't considered because the initial table synchronization can be a
> separate step in your logical replication setup idk. I agree that the binary
> option should be available for the initial table synchronization.
>
> To copy tables, COPY command is used and that command supports copying in 
> binary. So it seemed to me possible to copy in binary for tablesync too.
> I'm not sure if there is a reason to always copy tables in text format. But I 
> couldn't see why not to do it in binary if it's enabled.
>
> The reason to use the text format is that the binary format is error prone.
> There are restrictions while using the binary format. For example, if your
> schema has different data types for a certain column, the copy will fail.
>
>

Won't such restrictions hold true even during replication?

-- 
With Regards,
Amit Kapila.




Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)

2022-08-11 Thread Ranier Vilela
On Thu, Aug 11, 2022 at 07:38, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 01.08.22 19:08, Ranier Vilela wrote:
> > Like how
> >
> https://github.com/postgres/postgres/commit/9fd45870c1436b477264c0c82eb195df52bc0919
> > <
> https://github.com/postgres/postgres/commit/9fd45870c1436b477264c0c82eb195df52bc0919
> >
> > New attempt to remove more MemSet calls, that are safe.
> >
> > Attached v3 patch.
>
> Note that struct initialization does not set padding bits.

According to:
https://interrupt.memfault.com/blog/c-struct-padding-initialization

2. Individually setting all members to 0:

struct foo a = {
  .i = 0,
  .b = 0,
};

suffers from this problem.

3. Using the { 0 } zero-initializer does not.

  So any
> struct that is used as a hash key or that goes to disk or something
> similar needs to be set with memset/MemSet instead.  Various places in
> the code make explicit comments about that, which your patch deletes,
> which is a mistake.  This patch needs to be adjusted carefully with this
> in mind before it can be considered.
>
I think this needs better proof?

regards,
Ranier Vilela


Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)

2022-08-11 Thread Alvaro Herrera
On 2022-Aug-11, Ranier Vilela wrote:

> According to:
> https://interrupt.memfault.com/blog/c-struct-padding-initialization

Did you actually read it?

https://interrupt.memfault.com/blog/c-struct-padding-initialization#structure-zero-initialization

: This looks great! However, it’s not obvious (from looking at those snippets)
: what the value loaded into the padding region will be.
:
: The unfortunate answer is: it depends
:
: The C11 standard, chapter §6.2.6.1/6 says this:
:
: : When a value is stored in an object of structure or union type, including
: : in a member object, the bytes of the object representation that correspond
: : to any padding bytes take unspecified values.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)




Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)

2022-08-11 Thread Ranier Vilela
On Thu, Aug 11, 2022 at 08:48, Alvaro Herrera <
alvhe...@alvh.no-ip.org> wrote:

> On 2022-Aug-11, Ranier Vilela wrote:
>
> > According to:
> > https://interrupt.memfault.com/blog/c-struct-padding-initialization
>
> Did you actually read it?
>
Yes, today.


>
>
> https://interrupt.memfault.com/blog/c-struct-padding-initialization#structure-zero-initialization
>
> : This looks great! However, it’s not obvious (from looking at those
> snippets)
> : what the value loaded into the padding region will be.
> :
> : The unfortunate answer is: it depends
> :
> : The C11 standard, chapter §6.2.6.1/6 says this:
> :
> : : When a value is stored in an object of structure or union type,
> : : including in a member object, the bytes of the object representation
> : : that correspond to any padding bytes take unspecified values.
>
Did you see the Strategy 3 table, { 0 } ?

regards,
Ranier Vilela


Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)

2022-08-11 Thread Julien Rouhaud
On Thu, Aug 11, 2022 at 08:51:53AM -0300, Ranier Vilela wrote:
> On Thu, Aug 11, 2022 at 08:48, Alvaro Herrera <
> alvhe...@alvh.no-ip.org> wrote:
> 
> > On 2022-Aug-11, Ranier Vilela wrote:
> >
> > > According to:
> > > https://interrupt.memfault.com/blog/c-struct-padding-initialization
> >
> Did you see the Strategy 3 table, { 0 } ?

It explicitly shows that at least Ubuntu clang version 13.0.0-2 with -O1
doesn't do anything about the padding bytes (and that's after testing only 2
different compilers).  Even if those compilers didn't show any problem, we
still couldn't rely on undefined behavior and assume that no other compiler
behaves differently.
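
A small standalone demonstration of the point (not from the thread): zeroing
the members via an initializer is not the same as zeroing the bytes, which is
what matters whenever a struct's raw bytes are compared or hashed.

```c
#include <stdio.h>
#include <string.h>

/* On typical 64-bit ABIs there are 7 padding bytes after 'c'. */
struct key
{
	char		c;
	long		l;
};

int
main(void)
{
	struct key	a = {0};	/* members zeroed; padding unspecified (C11 6.2.6.1/6) */
	struct key	b;

	memset(&b, 0, sizeof(b));	/* every byte zeroed, padding included */

	/* The result of this comparison is unspecified: the members are equal,
	 * but the padding bytes of 'a' may contain anything. */
	printf("memcmp = %d\n", memcmp(&a, &b, sizeof(struct key)));
	return 0;
}
```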




Use array as object (src/fe_utils/parallel_slot.c)

2022-08-11 Thread Ranier Vilela
Hi,

Here is another suspicious case, which I think deserves a check.
In the function wait_on_slots (src/fe_utils/parallel_slot.c),
the variable "slots" is an array, but in the call to SetCancelConn
"slots" is used as a single object, which at the very least is suspicious.

Wouldn't cancelconn be the correct argument?
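
A standalone illustration of why the pattern compiles but is suspicious
(stand-in types; the real ones live in fe_utils): for an array, slots->member
is just slots[0].member, so such a call silently targets the first slot.

```c
#include <assert.h>

typedef struct
{
	int			connection;	/* stand-in for the slot's PGconn */
} ParallelSlot;

int
main(void)
{
	ParallelSlot slots[4] = {{10}, {11}, {12}, {13}};

	/* Array-to-pointer decay: slots->connection == slots[0].connection,
	 * so a call like SetCancelConn(slots->connection) would always use
	 * the first slot rather than the one just selected. */
	assert(slots->connection == slots[0].connection);
	return 0;
}
```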

regards,
Ranier Vilela


001-fix-cancel-conn.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2022-08-11 Thread Euler Taveira
On Thu, Aug 11, 2022, at 8:04 AM, Amit Kapila wrote:
> On Thu, Aug 11, 2022 at 7:34 AM Euler Taveira  wrote:
> >
> > The reason to use the text format is that the binary format is error prone.
> > There are restrictions while using the binary format. For example, if your
> > schema has different data types for a certain column, the copy will fail.
> >
> 
> Won't such restrictions hold true even during replication?
I expect that the COPY code matches the proto.c code. The point is that table
sync is decoupled from the logical replication. Hence, we should emphasize in
the documentation that the restrictions *also* apply to the initial table
synchronization.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Allow logical replication to copy tables in binary format

2022-08-11 Thread Melih Mutlu
Euler Taveira , on Thu, 11 Aug 2022 at 16:27, wrote:

> On Thu, Aug 11, 2022, at 8:04 AM, Amit Kapila wrote:
>
> On Thu, Aug 11, 2022 at 7:34 AM Euler Taveira  wrote:
> >
> > The reason to use the text format is that the binary format is error prone.
> > There are restrictions while using the binary format. For example, if your
> > schema has different data types for a certain column, the copy will fail.
> >
>
> Won't such restrictions hold true even during replication?
>
> I expect that the COPY code matches the proto.c code. The point is that
> table
> sync is decoupled from the logical replication. Hence, we should emphasize
> in
> the documentation that the restrictions *also* apply to the initial table
> synchronization.
>

If such restrictions already apply to the replication phase after the
initial table sync, then they shouldn't prevent us from enabling the
binary option for table sync, right?
But yes, it needs to be stated somewhere.

Euler Taveira , on Thu, 11 Aug 2022 at 05:03, wrote:

> I have a few points about your implementation.
>
> * Are we considering to support prior Postgres versions too? These releases
>   support binary mode but it could be an unexpected behavior (initial sync
>   in binary mode) for a publisher using 14 or 15 and a subscriber using 16.
>   IMO you should only allow it for publishers on 16 or later.
>

How is any issue that might occur due to a version mismatch handled right
now in replication after table sync?
What I understand from the documentation is that if replication can fail
due to using different pg versions, it just fails, so the binary option
cannot be used. [1]
Do you think this is more serious for table sync, so that we need to
restrict the binary option for different publisher and subscriber versions,
but not for replication?

> * Docs should say that the binary option also applies to initial table
>   synchronization and possibly emphasize some of the restrictions.
> * Tests. Are the current tests enough? 014_binary.pl.
>

You're right on both points. I just wanted to know your opinions on this
first. Then the patch will need some tests and proper documentation.

[1] https://www.postgresql.org/docs/15/sql-createsubscription.html


Best,
Melih


Re: tests and meson - test names and file locations

2022-08-11 Thread Andrew Dunstan


On 2022-08-11 Th 00:04, Andres Freund wrote:
>
> The runner now creates a test.start at the start of a test and either
> test.success or test.failure at the end. That should make it pretty easy for
> e.g. the buildfarm and CI to make the logs for a failed test easily
> accessible. I've spent far too much time going through the ~hundred logs in
> src/test/recovery/ that the buildfarm displays as one thing.


I do have work in hand to improve that markedly, just need a little time
to finish it.


>
>
> I really like having all the test data separately from the sources, but I get
> that that's not what we've done so far. It's doable to just mirror the current
> choice, but I don't think we should. But I won't push too hard against keeping
> things the same.


I also like that. I think we should take this opportunity for some
serious rationalization of this. Tests and associated data have grown
rather like Topsy, and we should fix that. So please don't feel too
constrained by current practice.


>
>
> I do wonder if we should put test data and log files in a separate directory
> tree, but that'd be a bit more work probably.
>
>
> Any comments on the above?
>
>
> = Example outputs =


[...]

> Full log written to /tmp/meson/meson-logs/testlog.txt


/tmp ?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: tests and meson - test names and file locations

2022-08-11 Thread Tom Lane
Andrew Dunstan  writes:
> I also like that. I think we should take this opportunity for some
> serious rationalization of this. Tests and associated data have grown
> rather like Topsy, and we should fix that. So please don't feel too
> constrained by current practice.

I'm definitely -1 on that.  Major rearrangement of the test scripts
would be a huge blocking factor for almost any back-patch.  I don't
care much if you want to rearrange how the tests are invoked, but
please don't touch the individual .sql and .pl scripts.

regards, tom lane




Re: tests and meson - test names and file locations

2022-08-11 Thread Andres Freund
Hi,

On 2022-08-11 10:06:35 -0400, Andrew Dunstan wrote:
> > Full log written to /tmp/meson/meson-logs/testlog.txt
>
> /tmp ?

I often put throwaway buildtrees in /tmp. So this is just because my buildtree
is in /tmp/meson, i.e. the log always is in $build_root/meson-logs/testlog.txt
(there's also a log of the configure run in there etc).

Greetings,

Andres Freund




Re: Cleaning up historical portability baggage

2022-08-11 Thread Tom Lane
Thomas Munro  writes:
> The most interesting things to say about these ones are:
>  * The concept of a no-Unix-socket build is removed.  We should be
> able to do that now, right?  Peter E seemed to say approximately that
> in the commit message for 797129e5.  Or is there a thought that a new
> operating system might show up that doesn't have 'em and we'd wish
> we'd kept this stuff well marked out?

I'm kind of down on removing that.  I certainly think it's premature
to do so today, when we haven't even yet shipped a release that
assumes we can always define it on Windows -- I won't be too surprised
if we get pushback on that after 15.0 is out.  But in general, Unix
sockets seem like kind of an obvious thing that might not be there
on some future platform.

Instead of what you did in 0002, I propose putting
"#define HAVE_UNIX_SOCKETS 1" in pg_config_manual.h, and keeping
the #if's that reference it as-is.

All the rest of this seems fine on a cursory readthrough.  Even if
we discover that header foo.h is less universal than we thought,
putting back one or two configure tests won't be hard.

regards, tom lane




Re: tests and meson - test names and file locations

2022-08-11 Thread Andres Freund
Hi,

On 2022-08-11 10:20:42 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > I also like that. I think we should take this opportunity for some
> > serious rationalization of this. Tests and associated data have grown
> > rather like Topsy, and we should fix that. So please don't feel too
> > constrained by current practice.
> 
> I'm definitely -1 on that.  Major rearrangement of the test scripts
> would be a huge blocking factor for almost any back-patch.  I don't
> care much if you want to rearrange how the tests are invoked, but
> please don't touch the individual .sql and .pl scripts.

I don't precisely know what Andrew was thinking of, but the relocation of log
files for example doesn't require many changes to .pl files - one change to
Utils.pm. The one exception to that is 010_tab_completion.pl, which encodes
tmp_check/ in its output.

Greetings,

Andres Freund




Re: pg_upgrade test writes to source directory

2022-08-11 Thread Andres Freund
Hi,

On 2022-06-01 10:55:28 -0400, Tom Lane wrote:
> [...] I'm definitely not happy with the proposed changes to
> 010_tab_completion.pl.  My recollection is that those tests
> were intentionally written to test tab completion involving a
> directory name, but this change just loses that aspect entirely.

How about creating a dedicated directory for the created files, to maintain
that? My goal of being able to redirect the test output elsewhere can be
achieved with just a hunk like this:

@@ -70,11 +70,13 @@ delete $ENV{LS_COLORS};
 # to run in the build directory so that we can use relative paths to
 # access the tmp_check subdirectory; otherwise the output from filename
 # completion tests is too variable.
-if ($ENV{TESTDIR})
+if ($ENV{TESTOUTDIR})
 {
-chdir $ENV{TESTDIR} or die "could not chdir to \"$ENV{TESTDIR}\": $!";
+chdir "$ENV{TESTOUTDIR}" or die "could not chdir to \"$ENV{TESTOUTDIR}\": 
$!";
 }
 
+mkdir "tmp_check" unless -d "tmp_check";
+
 # Create some junk files for filename completion testing.
 my $FH;
 open $FH, ">", "tmp_check/somefile"


Of course it'd need a comment adjustment etc. It's a bit ugly to use an
otherwise empty tmp_check/ directory just to reduce the diff size, but it's
also not too bad.

Greetings,

Andres Freund




Re: pg_upgrade test writes to source directory

2022-08-11 Thread Tom Lane
Andres Freund  writes:
> On 2022-06-01 10:55:28 -0400, Tom Lane wrote:
>> [...] I'm definitely not happy with the proposed changes to
>> 010_tab_completion.pl.  My recollection is that those tests
>> were intentionally written to test tab completion involving a
>> directory name, but this change just loses that aspect entirely.

> How about creating a dedicated directory for the created files, to maintain
> that? My goal of being able to redirect the test output elsewhere can be
> achieved with just a hunk like this:

Sure, there's no need for these files to be in the exact same place that
the output is collected.  I just want to keep their same relationship
to the test's CWD.

> Of course it'd need a comment adjustment etc. It's a bit ugly to use an
> otherwise empty tmp_check/ directory just to reduce the diff size, but it's
> also not too bad.

Given that it's no longer going to be the same tmp_check dir used
elsewhere, maybe we could s/tmp_check/tab_comp_dir/g or something
like that?  That'd add some clarity I think.

regards, tom lane




Re: tests and meson - test names and file locations

2022-08-11 Thread Tom Lane
Andres Freund  writes:
> I don't precisely know what Andrew was thinking of, but the relocation of log
> files for example doesn't require many changes to .pl files - one change to
> Utils.pm. The one exception to that is 010_tab_completion.pl, which encodes
> tmp_check/ in its output.

Ah.  That seems perfectly tolerable.  Andrew seemed to be thinking of
moving the tests' source files around.

regards, tom lane




Re: Switching XLog source from archive to streaming when primary available

2022-08-11 Thread Bharath Rupireddy
On Fri, Jul 8, 2022 at 9:16 PM Bharath Rupireddy
 wrote:
>
> On Sat, Jun 25, 2022 at 1:31 AM Cary Huang  wrote:
> >
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, passed
> > Implements feature:   tested, passed
> > Spec compliant:   not tested
> > Documentation:not tested
> >
> > Hello
> >
> > I tested this patch in a setup where the standby is in the middle of 
> > replicating and REDOing primary's WAL files during a very large data 
> > insertion. During this time, I keep killing the walreceiver process to 
> > cause a stream failure and force standby to read from archive. The system 
> > will restore from archive for "wal_retrieve_retry_interval" seconds before 
> > it attempts to steam again. Without this patch, once the streaming is 
> > interrupted, it keeps reading from archive until standby reaches the same 
> > consistent state of primary and then it will switch back to streaming 
> > again. So it seems that the patch does the job as described and does bring 
> > some benefit during a very large REDO job where it will try to re-stream 
> > after restoring some WALs from archive to speed up this "catch up" process. 
> > But if the recovery job is not a large one, PG is already switching back to 
> > streaming once it hits consistent state.
>
> Thanks a lot Cary for testing the patch.
>
> > Here's a v1 patch that I've come up with. I'm right now using the
> > existing GUC wal_retrieve_retry_interval to switch to stream mode from
> > archive mode as opposed to switching only after the failure to get WAL
> > from archive mode. If okay with the approach, I can add tests, change
> > the docs and add a new GUC to control this behaviour. I'm open to
> > thoughts and ideas here.
>
> It will be great if I can hear some thoughts on the above points (as
> posted upthread).

Here's the v2 patch with a separate GUC; a new GUC was necessary as the
existing GUC wal_retrieve_retry_interval is already loaded with multiple
usages. When the feature is enabled, it lets the standby switch to
stream mode, i.e. fetch WAL from the primary, before fetching from the
archive even fails. The switch from archive to stream mode happens in 2
scenarios: 1) when the standby is in initial recovery, 2) when there was
a failure receiving from the primary (the walreceiver got killed or
crashed or timed out, or connectivity to the primary was broken, for
whatever reason).

I also added test cases to the v2 patch.

Please review the patch.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/


v2-0001-Switch-WAL-source-to-stream-from-archive.patch
Description: Binary data


Use SetInstallXLogFileSegmentActive() for setting XLogCtl->InstallXLogFileSegmentActive

2022-08-11 Thread Bharath Rupireddy
Hi,

Here's a small patch replacing the explicit setting of
XLogCtl->InstallXLogFileSegmentActive with the existing function
SetInstallXLogFileSegmentActive(), which removes duplicate code and
saves 4 LOC.

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/


v1-0001-Use-SetInstallXLogFileSegmentActive.patch
Description: Binary data


Re: Assertion failure in WaitForWALToBecomeAvailable state machine

2022-08-11 Thread Bharath Rupireddy
On Mon, Feb 14, 2022 at 1:44 PM Kyotaro Horiguchi
 wrote:
>
> At Fri, 11 Feb 2022 22:25:49 +0530, Bharath Rupireddy 
>  wrote in
> > > I don't think
> > > > just making InstallXLogFileSegmentActive false is enough. By looking
> > > > at the comment [1], it doesn't make sense to move ahead for restoring
> > > > from the archive location without the WAL receiver fully stopped.
> > > > IMO, the real fix is to just remove WalRcvStreaming() and call
> > > > XLogShutdownWalRcv() unconditionally. Anyways, we have the
> > > > Assert(!WalRcvStreaming()); down below. I don't think it will create
> > > > any problem.
> > >
> > > If WalRcvStreaming() is returning false that means walreceiver is
> > > already stopped so we don't need to shutdown it externally.  I think
> > > like we are setting this flag outside start streaming we can reset it
> > > also outside XLogShutdownWalRcv.  Or I am fine even if we call
> > > XLogShutdownWalRcv() because if walreceiver is stopped it will just
> > > reset the flag we want it to reset and it will do nothing else.
> >
> > As I said, I'm okay with just calling XLogShutdownWalRcv()
> > unconditionally as it just returns in case walreceiver has already
> > stopped and updates the InstallXLogFileSegmentActive flag to false.
> >
> > Let's also hear what other hackers have to say about this.
>
> Firstly, good catch:)  And the direction seems right.
>
> It seems like an overlook of cc2c7d65fc. We cannot install new wal
> segments only while we're in archive recovery.  Conversely, we must
> turn off it when entering archive recovery (not exiting streaming
> recovery).  So, *I* feel like to do that at the beginning of
> XLOG_FROM_ARCHIVE/PG_WAL rather than the end of XLOG_FROM_STREAM.
>
> (And I would like to remove XLogShutDownWalRcv() and turn off the flag
>  in StartupXLOG explicitly, but it would be overdone.)
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -12800,6 +12800,16 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>  */
> Assert(!WalRcvStreaming());
>
> +   /*
> +    * WAL segment installation conflicts with archive
> +    * recovery. Make sure it is turned off.  XLogShutdownWalRcv()
> +    * does that but it is not done when the process has voluntarily
> +    * exited, for example on a replication timeout.
> +    */
> +   LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +   XLogCtl->InstallXLogFileSegmentActive = false;
> +   LWLockRelease(ControlFileLock);
> +
> /* Close any old file we might have open. */
> if (readFile >= 0)

Today I encountered the assertion failure [2] twice while working on
another patch [1]. The pattern seems to be that the walreceiver got
killed or crashed and set its state to WALRCV_STOPPING or
WALRCV_STOPPED by the time the WAL state machine moves to archive, and
hence the following XLogShutdownWalRcv() code will not get hit:

/*
 * Before we leave XLOG_FROM_STREAM state, make sure that
 * walreceiver is not active, so that it won't overwrite
 * WAL that we restore from archive.
 */
if (WalRcvStreaming())
ShutdownWalRcv();

I agree with Kyotaro-san to reset InstallXLogFileSegmentActive before
entering into the archive mode. Hence I tweaked the code introduced by
the following commit a bit, the result v1 patch is attached herewith.
Please review it.

commit cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
Author: Noah Misch 
Date:   Mon Jun 28 18:34:56 2021 -0700

Skip WAL recycling and preallocation during archive recovery.

[1] 
https://www.postgresql.org/message-id/CALj2ACUYz1z6QPduGn5gguCkfd-ko44j4hKcOMtp6fzv9xEWgw%40mail.gmail.com
[2]
2022-08-11 05:10:29.051 UTC [1487086] FATAL:  terminating walreceiver
due to timeout
cp: cannot stat '/home/ubuntu/archived_wal/0002.history': No such
file or directory
2022-08-11 05:10:29.054 UTC [1487075] LOG:  switched WAL source from
stream to archive after failure
2022-08-11 05:10:29.067 UTC [1487075] LOG:  restored log file
"00010034" from archive
TRAP: FailedAssertion("!IsInstallXLogFileSegmentActive()", File:
"xlogrecovery.c", Line: 4149, PID: 1487075)
postgres: startup waiting for
00010034(ExceptionalCondition+0xd6)[0x559d20e6deb1]
postgres: startup waiting for
00010034(+0x1e4e14)[0x559d20813e14]
postgres: startup waiting for
00010034(+0x1e50c0)[0x559d208140c0]
postgres: startup waiting for
00010034(+0x1e4354)[0x559d20813354]
postgres: startup wa

Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-08-11 Thread vignesh C
On Wed, Aug 10, 2022 at 2:52 PM Amit Kapila  wrote:
>
> On Wed, Aug 10, 2022 at 10:58 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2022-08-09 20:21:19 -0700, Mark Dilger wrote:
> > > > On Aug 9, 2022, at 7:26 PM, Andres Freund  wrote:
> > > >
> > > > The relevant code triggering it:
> > > >
> > > > newbuf = XLogInitBufferForRedo(record, 1);
> > > > _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
> > > >   xlrec->new_bucket_flag, true);
> > > > if (!IsBufferCleanupOK(newbuf))
> > > > elog(PANIC, "hash_xlog_split_allocate_page: failed to 
> > > > acquire cleanup lock");
> > > >
> > > > Why do we just crash if we don't already have a cleanup lock? That 
> > > > can't be
> > > > right. Or is there supposed to be a guarantee this can't happen?
> > >
> > > Perhaps the code assumes that when xl_hash_split_allocate_page record was
> > > written, the new_bucket field referred to an unused page, and so during
> > > replay it should also refer to an unused page, and being unused, that 
> > > nobody
> > > will have it pinned.  But at least in heap we sometimes pin unused pages
> > > just long enough to examine them and to see that they are unused.  Maybe
> > > something like that is happening here?
> >
> > I don't think it's a safe assumption that nobody would hold a pin on such a
> > page during recovery. While not the case here, somebody else could have used
> > pg_prewarm to read it in.
> >
> > But also, the checkpointer or bgwriter could have it temporarily pinned, to
> > write it out, or another backend could try to write it out as a victim 
> > buffer
> > and have it temporarily pinned.
> >
> >
> > static int
> > SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext 
> > *wb_context)
> > {
> > ...
> > /*
> >  * Pin it, share-lock it, write it.  (FlushBuffer will do nothing 
> > if the
> >  * buffer is clean by the time we've locked it.)
> >  */
> > PinBuffer_Locked(bufHdr);
> > LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
> >
> >
> > As you can see we acquire a pin without holding a lock on the page (and that
> > can't be changed!).
> >
>
> I think this could be the probable reason for failure though I didn't
> try to debug/reproduce this yet. AFAIU, this is possible during
> recovery/replay of WAL record XLOG_HASH_SPLIT_ALLOCATE_PAGE as via
> XLogReadBufferForRedoExtended, we can mark the buffer dirty while
> restoring from full page image. OTOH, because during normal operation
> we didn't mark the page dirty SyncOneBuffer would have skipped it due
> to check (if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))).

I'm trying to simulate the scenario in streaming replication using the below:
CREATE TABLE pvactst (i INT, a INT[], p POINT) with (autovacuum_enabled = off);
CREATE INDEX hash_pvactst ON pvactst USING hash (i);
INSERT INTO pvactst SELECT i, array[1,2,3], point(i, i+1) FROM
generate_series(1,1000) i;

With the above scenario, the standby is able to replay the allocation of a
page for the split operation. I will slightly change the above statements,
try to debug, and see if we can make the background writer process pin this
buffer to simulate the scenario. I will post my findings once I'm done with
the analysis.

Regards,
Vignesh




Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)

2022-08-11 Thread Ranier Vilela
Em qui., 11 de ago. de 2022 às 09:23, Julien Rouhaud 
escreveu:

> On Thu, Aug 11, 2022 at 08:51:53AM -0300, Ranier Vilela wrote:
> > Em qui., 11 de ago. de 2022 às 08:48, Alvaro Herrera <
> > alvhe...@alvh.no-ip.org> escreveu:
> >
> > > On 2022-Aug-11, Ranier Vilela wrote:
> > >
> > > > According to:
> > > > https://interrupt.memfault.com/blog/c-struct-padding-initialization
> > >
> > Did you see the Strategy 3 table, { 0 } ?
>
> It explicitly shows that at least Ubuntu clang version 13.0.0-2 with -O1
> doesn't do anything about the padding bytes (and that's after testing only
> 2
> different compilers).  Even if those compilers didn't show any problem, we
> still couldn't rely on an undefined behavior and assume that no other
> compilers
> behave differently.
>
Yeah, although it is not a problem with the main current compilers (clang,
gcc and msvc), it seems this cannot be relied on.
Since the filling of structure holes is undefined behavior, it seems to me
that one should always use MemSet or memset:
even a structure that has no holes today could be changed in the future and
silently become a bug.
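
As a minimal standalone illustration (not PostgreSQL code) of why "{ 0 }"
and memset are not interchangeable for structs with padding:

#include <stdio.h>
#include <string.h>

struct with_hole
{
	char	c;		/* 1 byte of data ... */
	int		i;		/* ... typically preceded by 3 padding bytes */
};

int
main(void)
{
	struct with_hole a = {0};	/* members zeroed; padding unspecified */
	struct with_hole b;

	memset(&b, 0, sizeof(b));	/* every byte zeroed, padding included */

	/* memcmp() compares raw bytes, so the padding participates */
	printf("byte-identical: %s\n",
		   memcmp(&a, &b, sizeof(a)) == 0 ? "yes" : "no (padding differs)");
	return 0;
}

Whether the two objects compare equal depends on the compiler and
optimization level, which is exactly the problem.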

regards,
Ranier Vilela


Re: tests and meson - test names and file locations

2022-08-11 Thread Tom Lane
Andres Freund  writes:
> = Log and Data locations =

> To make things like the selection of log files for a specific test easier,
> I've so far set it up so that test data and logs are stored in a separate
> directory from the sources.

> testrun///

> I do wonder if we should put test data and log files in a separate directory
> tree, but that'd be a bit more work probably.

I'm confused, didn't you just say you already did that?



> Here's an example output that you mostly should be able to make sense of now:

TBH, this seems to be almost all the same sort of useless noise that
we have worked to suppress in "make" output.  The only thing that's
of any interest at all is the report that the "cube" test failed,
and you have managed to make it so that that report is pretty
voluminous and yet contains not one useful detail.  I still have
to go and look at other log files to figure out what happened;
and if I need to see the postmaster log, it's not even apparent
where that is.  I also wonder where, say, a core dump might wind up.

I'm failing to see any advance at all here over what we have now.
If anything, the signal-to-noise ratio has gotten worse.

regards, tom lane




Re: tests and meson - test names and file locations

2022-08-11 Thread Andrew Dunstan


On 2022-08-11 Th 11:06, Andres Freund wrote:
> Hi,
>
> On 2022-08-11 10:20:42 -0400, Tom Lane wrote:
>> Andrew Dunstan  writes:
>>> I also like that. I think we should take this opportunity for some
>>> serious rationalization of this. Tests and associated data have grown
>>> rather like Topsy, and we should fix that. So please don't feel too
>>> constrained by current practice.
>> I'm definitely -1 on that.  Major rearrangement of the test scripts
>> would be a huge blocking factor for almost any back-patch.  I don't
>> care much if you want to rearrange how the tests are invoked, but
>> please don't touch the individual .sql and .pl scripts.
> I don't precisely know what Andrew was thinking of, but the relocation of log
> files for example doesn't require many changes to .pl files - one change to
> Utils.pm. The one exception to that is 010_tab_completion.pl, which encodes
> tmp_check/ in its output.


I meant test results, logs etc. I thought that was the topic under
discussion. Changing the location of test sources would be a whole other
topic. Sorry if I was not clear enough.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: SQL/JSON features for v15

2022-08-11 Thread Jonathan S. Katz

On 8/10/22 11:50 AM, Andrew Dunstan wrote:


On 2022-08-09 Tu 16:58, Jonathan S. Katz wrote:

Hi,

(Personal hat, not RMT hat unless otherwise noted).

This thread[1] raised some concerns around the implementation of the
SQL/JSON features that are slated for v15, which includes an
outstanding open item[2]. Given the current state of the discussion,
when the RMT met on Aug 8, they discussed several options, readable here[3].
Given we are now into the later part of the release cycle, we need to
make some decisions on how to proceed with this feature given the
concerns raised.

Per additional discussion on the thread, the group wanted to provide
more visibility into the discussion to get opinions on how to proceed
for the v15 release.

Without rehashing the thread, the options presented were:

1. Fixing the concerns addressed in the thread around the v15 SQL/JSON
features implementation, noting that this would likely entail at least
one more beta release and would push the GA date past our normal
timeframe.

2. Try to commit a subset of the features that caused less debate.
This was ruled out.

3. Revert the v15 SQL/JSON features work.


Based on the current release timing and the open issues presented on
the thread, the RMT recommended reverting, but preferred to
drive consensus on next steps.


 From a release advocacy standpoint, I need about 6 weeks lead time to
put together the GA launch. We're at the point where I typically
deliver a draft release announcement. Given that this involves a
high-visibility feature, I would want some clarity on what option we
would like to pursue.
begun (and this is when we have consensus on the release
announcement), it becomes more challenging to change things out.

 From a personal standpoint (restating from[3]), I would like to see
what we could do to include support for this batch of the SQL/JSON
features in v15. What is included looks like it closes most of the gap
on what we've been missing syntactically since the standard was
adopted, and the JSON_TABLE work is very convenient for converting
JSON data into a relational format. I believe having this feature set
is important for maintaining standards compliance, interoperability,
tooling support, and general usability. Plus, JSON still seems to be
pretty popular.

We're looking for additional input on what makes sense as a best
course of action, given what is presented in[3].



To preserve options I will start preparing reversion patches. Given
there are, I think, more than 20 commits all told, that could be fun and
will probably take me a little while. The sad part is that, to the best
of my knowledge, this code is producing correct results and is not
disturbing the stability or performance of anything else. There was a
performance issue, but it's been dealt with, AIUI.


Personally, I hope we don't need to revert. If everything from the open 
item standpoint is addressed, I want to ensure we capture and complete 
the remaining issues that were raised on the other thread, i.e.


* adding design docs
* simplifying the type-coercion code
* any other design concerns that were presented

We switched this discussion out to a different thread to get some more 
visibility on the issue and see if other folks would weigh in. Thus far, 
there has not been much additional input either way. It would be good if 
other folks chimed in.


Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-11 Thread Jonathan S. Katz

On 8/10/22 9:27 AM, Amit Langote wrote:

On Wed, Aug 10, 2022 at 3:57 AM Andres Freund  wrote:

One way this code could be drastically simplified is to force all
type-coercions to go through the "io coercion" path, which could be
implemented as a single execution step (which thus could trivially
start/finish a subtransaction) and would remove a lot of the complicated code
around coercions.


Could you please clarify how you think we might do the io coercion
wrapped with a subtransaction all as a single execution step?  I
would've thought that we couldn't do the sub-transaction without
leaving ExecInterpExpr() anyway, so maybe you meant the io coercion
itself was done using some code outside ExecInterpExpr()?

The current JsonExpr code does it by recursively calling
ExecInterpExpr() using the nested ExprState expressly for the
coercion.


With RMT hat on, Andres do you have any thoughts on this?

Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: Cleaning up historical portability baggage

2022-08-11 Thread Andres Freund
Hi,

On 2022-08-11 10:52:51 -0400, Tom Lane wrote:
> Thomas Munro  writes:
> > The most interesting things to say about these ones are:
> >  * The concept of a no-Unix-socket build is removed.  We should be
> > able to do that now, right?  Peter E seemed to say approximately that
> > in the commit message for 797129e5.  Or is there a thought that a new
> > operating system might show up that doesn't have 'em and we'd wish
> > we'd kept this stuff well marked out?
> 
> I'm kind of down on removing that.  I certainly think it's premature
> to do so today, when we haven't even yet shipped a release that
> assumes we can always define it on Windows

I think what might be good next step is to have tests default to using unix
sockets on windows, rather than requiring PG_TEST_USE_UNIX_SOCKETS. The
pg_regress.c and Utils.pm hunks disabling it by default on windows don't
really make sense anymore.

#if !defined(HAVE_UNIX_SOCKETS)
use_unix_sockets = false;
#elif defined(WIN32)

/*
 * We don't use Unix-domain sockets on Windows by default, even if the
 * build supports them.  (See comment at remove_temp() for a reason.)
 * Override at your own risk.
 */
use_unix_sockets = getenv("PG_TEST_USE_UNIX_SOCKETS") ? true : false;
#else
use_unix_sockets = true;
#endif

and

# Specifies whether to use Unix sockets for test setups.  On
# Windows we don't use them by default since it's not universally
# supported, but it can be overridden if desired.
$use_unix_sockets =
  (!$windows_os || defined $ENV{PG_TEST_USE_UNIX_SOCKETS});


Tests don't reliably run on windows without PG_TEST_USE_UNIX_SOCKETS, due to
the port conflict detection being incredibly racy. I see occasional failures
even without test concurrency, and with test concurrency it reliably fails.


I don't really know what to do about the warnings around remove_temp() and
trapsig(). I think we actually may be overreading the restrictions. To me the
documented restrictions read more like a high-level-ish explanation of what's
safe in a signal handler and what not. And it seems to not have caused a
problem on windows on several thousand CI cycles, including plenty failures.


Alternatively we could just default to putting the socketdir inside the data
directory on windows - I *think* windows doesn't have strict path length
limits for the socket location. If the socket dir isn't in some global temp
directory, we don't really need signal_remove_temp, given that we're ok with
leaving the much bigger data directory around.  The socket directory
determination doesn't really work on windows right now anyway, one manually
has to set a temp directory as TMPDIR isn't normally set on windows, and /tmp
doesn't exist.

char	   *template = psprintf("%s/pg_regress-XXXXXX",
							getenv("TMPDIR") ? getenv("TMPDIR") : "/tmp");

both TEMP and TMP would exist though...
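
For illustration, a minimal sketch of that fallback (test_tmpdir is an
invented name; only getenv() is assumed):

#include <stdlib.h>

static const char *
test_tmpdir(void)
{
	const char *dir;

	if ((dir = getenv("TMPDIR")) != NULL)	/* usually set on Unix */
		return dir;
	if ((dir = getenv("TEMP")) != NULL)		/* normally set on Windows */
		return dir;
	if ((dir = getenv("TMP")) != NULL)		/* ditto */
		return dir;
	return "/tmp";							/* last resort */
}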



> -- I won't be too surprised if we get pushback on that after 15.0 is out.

From what angle? I think our default behaviour doesn't change because
DEFAULT_PGSOCKET_DIR is "". And OS compatibility wise we apparently are good
as well?


> But in general, Unix sockets seem like kind of an obvious thing that might
> not be there on some future platform.

Maybe not with the precise API, but I can't imagine a new platform not having
something very similar. It serves a pretty obvious need, and I think the
security benefits have become more important over time...

Greetings,

Andres Freund




Re: Allow logical replication to copy tables in binary format

2022-08-11 Thread Euler Taveira
On Thu, Aug 11, 2022, at 10:46 AM, Melih Mutlu wrote:
> If such restrictions are already the case for replication phase after initial 
> table sync, then it shouldn't prevent us from enabling binary option for 
> table sync. Right?
I didn't carefully examine the COPY code but I won't expect significant
differences (related to text vs binary mode) from the logical replication
protocol. After inspecting the git history, I took my argument back after
checking the commit 670c0a1d474. The initial commit 9de77b54531 imposes some
restrictions (user-defined arrays and composite types) as mentioned in the
commit message but it was removed in 670c0a1d474. My main concern is breaking a
scenario that previously worked (14 -> 15) but, after a subscriber upgrade,
won't (14 -> 16). I would say that you should test some scenarios:
014_binary.pl and also custom data types, same column with different data
types, etc.

> How is any issue that might occur due to version mismatch being handled right 
> now in replication after table sync?
> What I understand from the documentation is if replication can fail due to 
> using different pg versions, it just fails. So binary option cannot be used. 
> [1]
> Do you think that this is more serious for table sync and we need to restrict 
> binary option with different publisher and subscriber versions? But not for 
> replication?
It is a conservative argument. Since a publisher on previous Postgres versions
was not allowed to run COPY in binary mode, we know that the existing text path
works. (At least there aren't bug reports for logical replication using the
binary option.) Since one of the main use cases for logical replication is
migration, I'm concerned that it may not work (even if the binary option
defaults to false, someone can decide to use it for performance reasons).

I did a quick test and the failure while using binary mode is not reported
clearly. Since you are modifying this code, you could probably provide
additional patch(es) to make it clear that the error is due to some
documented restriction.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-11 Thread Andres Freund
Hi,

On 2022-08-11 13:08:27 -0400, Jonathan S. Katz wrote:
> On 8/10/22 9:27 AM, Amit Langote wrote:
> > On Wed, Aug 10, 2022 at 3:57 AM Andres Freund  wrote:
> > > One way this code could be drastically simplified is to force all
> > > type-coercions to go through the "io coercion" path, which could be
> > > implemented as a single execution step (which thus could trivially
> > > start/finish a subtransaction) and would remove a lot of the complicated 
> > > code
> > > around coercions.
> > 
> > Could you please clarify how you think we might do the io coercion
> > wrapped with a subtransaction all as a single execution step?  I
> > would've thought that we couldn't do the sub-transaction without
> > leaving ExecInterpExpr() anyway, so maybe you meant the io coercion
> > itself was done using some code outside ExecInterpExpr()?
> > 
> > The current JsonExpr code does it by recursively calling
> > ExecInterpExpr() using the nested ExprState expressly for the
> > coercion.

The basic idea is to rip out all the type-dependent stuff out and replace it
with a single JSON_IOCERCE step, which has a parameter about whether to wrap
things in a subtransaction or not. That step would always perform the coercion
by calling the text output function of the input and the text input function
of the output.
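
As a rough sketch of what such a step's guts could look like
(json_coerce_via_io is an invented name; getTypeOutputInfo,
getTypeInputInfo and the Oid*FunctionCall helpers are the existing backend
APIs), with the input-function call being the part a subtransaction would
need to wrap:

#include "postgres.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

static Datum
json_coerce_via_io(Datum val, Oid srctype, Oid dsttype, int32 dsttypmod)
{
	Oid			outfunc;
	bool		outisvarlena;
	Oid			infunc;
	Oid			ioparam;
	char	   *str;

	/* any type can be rendered as text ... */
	getTypeOutputInfo(srctype, &outfunc, &outisvarlena);
	str = OidOutputFunctionCall(outfunc, val);

	/* ... and parsed back as the target type; this is the call that can
	 * error out, hence the optional subtransaction around it */
	getTypeInputInfo(dsttype, &infunc, &ioparam);
	return OidInputFunctionCall(infunc, str, ioparam, dsttypmod);
}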


> With RMT hat on, Andres do you have any thoughts on this?

I think I need to prototype what it'd look like to give a more detailed
answer. I have a bunch of meetings over the next few hours, but after that I
can give it a shot.

Greetings,

Andres Freund




Re: Cleaning up historical portability baggage

2022-08-11 Thread Tom Lane
Andres Freund  writes:
> On 2022-08-11 10:52:51 -0400, Tom Lane wrote:
>> -- I won't be too surprised if we get pushback on that after 15.0 is out.

> From what angle?

If I knew that, it'd be because we'd already received the pushback.
I'm just suspicious that very little beta testing happens on Windows,
and what does is probably mostly people running up-to-date Windows.
So I think there's plenty of chance for "hey, this no longer works"
complaints later.  Maybe we'll be able to reject it all with "sorry,
we desupported that version of Windows", but I dunno.

regards, tom lane




avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
Hi,
In cash_out(), we have the following code:

if (value < 0)
{
/* make the amount positive for digit-reconstruction loop */
value = -value;

The negation cannot be represented in type long when the value is LONG_MIN.
It seems we can error out when LONG_MIN is detected instead of continuing
with computation.
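
A well-defined alternative in C is to negate through an unsigned
intermediate, which avoids the undefined behavior while still letting
cash_out() print the value. A standalone sketch, not a patch:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
	int64_t		value = INT64_MIN;
	uint64_t	uvalue;

	/* "value = -value" is undefined behavior here; unsigned
	 * wraparound is not, and it yields the right magnitude */
	uvalue = (value < 0) ? 0 - (uint64_t) value : (uint64_t) value;

	printf("%" PRIu64 "\n", uvalue);	/* 9223372036854775808 */
	return 0;
}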

Please take a look at the patch and provide your feedback.

Thanks


cash-out-of-range.patch
Description: Binary data


Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Tom Lane
Zhihong Yu  writes:
> In cash_out(), we have the following code:
> if (value < 0)
> {
> /* make the amount positive for digit-reconstruction loop */
> value = -value;

> The negation cannot be represented in type long when the value is LONG_MIN.

Possibly not good, but it seems like the subsequent loop works anyway:

regression=# select '-92233720368547758.08'::money;
money
-
 -$92,233,720,368,547,758.08
(1 row)

Note that this exact test case appears in money.sql, so we know that
it works everywhere, not only my machine.

> It seems we can error out when LONG_MIN is detected instead of continuing
> with computation.

How could you think that that's an acceptable solution?  Once the
value is stored, we'd better be able to print it.

regards, tom lane




Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
On Thu, Aug 11, 2022 at 10:40 AM Tom Lane  wrote:

> Zhihong Yu  writes:
> > In cash_out(), we have the following code:
> > if (value < 0)
> > {
> > /* make the amount positive for digit-reconstruction loop */
> > value = -value;
>
> > The negation cannot be represented in type long when the value is
> LONG_MIN.
>
> Possibly not good, but it seems like the subsequent loop works anyway:
>
> regression=# select '-92233720368547758.08'::money;
> money
> -
>  -$92,233,720,368,547,758.08
> (1 row)
>
> Note that this exact test case appears in money.sql, so we know that
> it works everywhere, not only my machine.
>
> > It seems we can error out when LONG_MIN is detected instead of continuing
> > with computation.
>
> How could you think that that's an acceptable solution?  Once the
> value is stored, we'd better be able to print it.
>
> regards, tom lane
>

Thanks for taking a look.
I raised this thread due to the following assertion:

src/backend/utils/adt/cash.c:356:11: runtime error: negation of
-9223372036854775808 cannot be represented in type 'Cash' (aka 'long');
cast to an unsigned type to negate this value to itself

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
../../../../../../../src/postgres/src/backend/utils/adt/cash.c:356:11


Though '-92233720368547758.085'::money displays the correct error message in
other builds, this statement wouldn't pass in a build where
UndefinedBehaviorSanitizer is active.
I think we should fix this; otherwise, when a new assertion is triggered by
future changes around Cash (or other types covered by money.sql), we
wouldn't see it.

I am open to other ways of bypassing the above assertion.

Cheers


Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
On Thu, Aug 11, 2022 at 10:55 AM Zhihong Yu  wrote:

>
>
> On Thu, Aug 11, 2022 at 10:40 AM Tom Lane  wrote:
>
>> Zhihong Yu  writes:
>> > In cash_out(), we have the following code:
>> > if (value < 0)
>> > {
>> > /* make the amount positive for digit-reconstruction loop */
>> > value = -value;
>>
>> > The negation cannot be represented in type long when the value is
>> LONG_MIN.
>>
>> Possibly not good, but it seems like the subsequent loop works anyway:
>>
>> regression=# select '-92233720368547758.08'::money;
>> money
>> -
>>  -$92,233,720,368,547,758.08
>> (1 row)
>>
>> Note that this exact test case appears in money.sql, so we know that
>> it works everywhere, not only my machine.
>>
>> > It seems we can error out when LONG_MIN is detected instead of
>> continuing
>> > with computation.
>>
>> How could you think that that's an acceptable solution?  Once the
>> value is stored, we'd better be able to print it.
>>
>> regards, tom lane
>>
>
> Thanks for taking a look.
> I raise this thread due to the following assertion :
>
> src/backend/utils/adt/cash.c:356:11: runtime error: negation of
> -9223372036854775808 cannot be represented in type 'Cash' (aka 'long');
> cast to an unsigned type to negate this value to itself
>
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
> ../../../../../../../src/postgres/src/backend/utils/adt/cash.c:356:11
>
>
> Though '-92233720368547758.085'::money displays correct error message in
> other builds, this statement wouldn't pass the build where
> UndefinedBehaviorSanitizer is active.
> I think we should fix this otherwise when there is new assertion triggered
> due to future changes around Cash (or other types covered by money.sql), we
> wouldn't see it.
>
> I am open to other ways of bypassing the above assertion.
>
> Cheers
>

Here is sample output with patch:

# SELECT '-92233720368547758.085'::money;
ERROR:  value "-92233720368547758.085" is out of range for type money
LINE 1: SELECT '-92233720368547758.085'::money;

FYI


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-11 Thread Robert Haas
On Wed, Aug 10, 2022 at 1:01 AM Dilip Kumar  wrote:
> Done, along with that, I have also got the hunk of smgropen and
> smgrclose in ScanSourceDatabasePgClass() which I had in v1 patch[1].
> Because here we do not want to reuse the smgr of the pg_class again so
> instead of closing at the end with smgrcloserellocator() we can just
> keep the smgr reference and close immediately after getting the number
> of blocks.  Whereas in CreateAndCopyRelationData and
> RelationCopyStorageUsingBuffer() we are using the smgr of the source
> and dest relation multiple time so it make sense to not close it
> immediately and we can close while exiting the function with
> smgrcloserellocator().

As far as I know, this 0001 addresses all outstanding comments and
fixes the reported bug.

Does anyone think otherwise?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: tests and meson - test names and file locations

2022-08-11 Thread Andres Freund
Hi,

On 2022-08-11 13:06:35 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > = Log and Data locations =
> 
> > To make things like the selection of log files for a specific test easier,
> > I've so far set it up so that test data and logs are stored in a separate
> > directory from the sources.
> 
> > testrun///
> 
> > I do wonder if we should put test data and log files in a separate directory
> > tree, but that'd be a bit more work probably.
> 
> I'm confused, didn't you just say you already did that?

I did separate the source / build tree from the test log files / data files,
but not the test log files from the test data files. It'd be easy enough to
achieve for pg_regress tests, but perhaps a bit harder for things like
027_stream_regress.pl that run pg_regress themselves.


> > Here's an example output that you mostly should be able to make sense of 
> > now:
> 
> TBH, this seems to be almost all the same sort of useless noise that
> we have worked to suppress in "make" output.

Which part have we suppressed that's shown here? We've been talking about
cutting down the pointless information that pg_regress produces, but that
seems like a separate endeavor.


> The only thing that's of any interest at all is the report that the "cube"
> test failed, and you have managed to make it so that that report is pretty
> voluminous and yet contains not one useful detail.

It'll just print the list of tests and their success / failure, without
printing the test's output, if you don't pass --print-errorlogs.

The reason the command is shown is so you can copy-paste it to run the test
on its own, which can be useful for debugging. It'll include all the
environment variables set when the test was run, so it's actually the same
command (unlike right now, where some env variables are set via export in the
makefile).  We probably can make the command shorter - but that's pretty
similar to what's done for the make invocation of the tests.


> I still have to go and look at other log files to figure out what happened;
> and if I need to see the postmaster log, it's not even apparent where that
> is.

There's a hint at the start, but it could stand to be reformulated to point
that out more clearly:

# executing test in /tmp/meson/testrun/cube/pg_regress group cube test 
pg_regress, builddir /tmp/meson/contrib/cube


> I also wonder where, say, a core dump might wind up.

That'll depend on system configuration as before. Most commonly in the data
dir. Are you wondering where the data dir is?


> I'm failing to see any advance at all here over what we have now.
> If anything, the signal-to-noise ratio has gotten worse.

I'm surprised - I find it *vastly* more readable, because it'll show this
stuff only for the failed tests, and it'll tell you the failed tests at the
end. I don't know how many hours I've spent going backward through check-world
output to find the first failed tests, but it's many.

Greetings,

Andres Freund




Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread David Rowley
On Fri, 12 Aug 2022 at 05:58, Zhihong Yu  wrote:
> Here is sample output with patch:
>
> # SELECT '-92233720368547758.085'::money;
> ERROR:  value "-92233720368547758.085" is out of range for type money
> LINE 1: SELECT '-92233720368547758.085'::money;

I'm struggling to follow along here. There are already overflow checks
for this in cash_in(), which is exactly where they should be.

The above case already fails on master, there's even a regression test
to make sure it does for this exact case, just look at money.out:356.
So, if we're already stopping this from happening in cash_in(), why do
you think it also needs to happen in cash_out()?

I'm also not sure why you opted to use LONG_MIN for your check. The C
type "Cash" is based on int64, that's not long.

David




test failure with gcc-12 -O3 -march=native

2022-08-11 Thread Andres Freund
Hi,

For my optimized builds I've long used -O3 -march=native. After one of the
recent package updates (I'm not certain when exactly yet), the main regression
tests started to fail for me with that. Oddly enough in opr_sanity:

 -- Ask access methods to validate opclasses
 -- (this replaces a lot of SQL-level checks that used to be done in this file)
 SELECT oid, opcname FROM pg_opclass WHERE NOT amvalidate(oid);
- oid | opcname
--+-
-(0 rows)
+INFO:  operator family "array_ops" of access method hash contains function 
hash_array_extended(anyarray,bigint) with wrong signature for support number 2
+INFO:  operator family "bpchar_ops" of access method hash contains function 
hashbpcharextended(character,bigint) with wrong signature for support number 2
...
+ 16492 | part_test_int4_ops
+ 16497 | part_test_text_ops
+(43 rows)


Given that I did not encounter this problem with gcc-12 before, and that
gcc-12 has been released, it seems less likely to be a bug in our code
highlighted by a new optimization and more likely to be a bug in a gcc bugfix,
but it's definitely not clear.


I only investigated this a tiny bit so far. What fails is the
procform->prorettype != restype comparison in check_hash_func_signature().

Greetings,

Andres Freund




Re: Postgres NOT IN vs NOT EXISTS optimization

2022-08-11 Thread Bruce Momjian
On Tue, Jun 14, 2022 at 12:09:16PM -0400, Tom Lane wrote:
> "Dirschel, Steve"  writes:
> > Is Postgres able to drive the query the same way with the NOT IN as the
> > NOT EXISTS is doing or is that only available if the query has a NOT
> > EXISTS?
> 
> NOT IN is not optimized very well in PG, because of the strange
> semantics that the SQL spec demands when the sub-query produces any
> null values.  There's been some interest in detecting cases where
> we can prove that the subquery produces no nulls and then optimizing
> it into NOT EXISTS, but it seems like a lot of work for not-great
> return, so nothing's happened (yet).  Perhaps Oracle does something
> like that already, or perhaps they're just ignoring the semantics
> problem; they do not have a reputation for hewing closely to the
> spec on behavior regarding nulls.

[ Now sent to hackers, where it really belongs. ]

I was just now researching NOT IN behavior and remembered this thread,
so wanted to give a simplified example.  If you set up tables like this:

CREATE TABLE small AS
SELECT * FROM generate_series(1, 10) AS t(x);

CREATE TABLE large AS SELECT small.x
FROM small CROSS JOIN generate_series(1, 1000) AS t(x);

INSERT INTO small VALUES (11), (12);

ANALYZE small, large;

These IN and EXISTS/NOT EXISTS queries look fine, using hash joins:

EXPLAIN SELECT small.x
FROM small
WHERE small.x IN (SELECT large.x FROM large);
 QUERY PLAN

-
 Hash Join  (cost=170.22..171.49 rows=10 width=4)
   Hash Cond: (small.x = large.x)
   ->  Seq Scan on small  (cost=0.00..1.12 rows=12 width=4)
   ->  Hash  (cost=170.10..170.10 rows=10 width=4)
 ->  HashAggregate  (cost=170.00..170.10 rows=10 width=4)
   Group Key: large.x
   ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)

EXPLAIN SELECT small.x
FROM small
WHERE EXISTS (SELECT large.x FROM large WHERE large.x = small.x);
 QUERY PLAN

-
 Hash Join  (cost=170.22..171.49 rows=10 width=4)
   Hash Cond: (small.x = large.x)
   ->  Seq Scan on small  (cost=0.00..1.12 rows=12 width=4)
   ->  Hash  (cost=170.10..170.10 rows=10 width=4)
 ->  HashAggregate  (cost=170.00..170.10 rows=10 width=4)
   Group Key: large.x
   ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)

EXPLAIN SELECT small.x
FROM small
WHERE NOT EXISTS (SELECT large.x FROM large WHERE large.x = small.x);
  QUERY PLAN
---
 Hash Anti Join  (cost=270.00..271.20 rows=2 width=4)
   Hash Cond: (small.x = large.x)
   ->  Seq Scan on small  (cost=0.00..1.12 rows=12 width=4)
   ->  Hash  (cost=145.00..145.00 rows=10000 width=4)
 ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)

These NOT IN queries all use sequential scans, and IS NOT NULL does not help:

EXPLAIN SELECT small.x
FROM small
WHERE small.x NOT IN (SELECT large.x FROM large);
QUERY PLAN
---
 Seq Scan on small  (cost=170.00..171.15 rows=6 width=4)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
 ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)

EXPLAIN SELECT small.x
FROM small
WHERE small.x NOT IN (SELECT large.x FROM large WHERE large.x IS NOT 
NULL);
QUERY PLAN
---
 Seq Scan on small  (cost=170.00..171.15 rows=6 width=4)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
 ->  Seq Scan on large  (cost=0.00..145.00 rows=10000 width=4)
   Filter: (x IS NOT NULL)

Is converting NOT IN to NOT EXISTS our only option?  Couldn't we start
to create the hash and just switch to always returning NULL if we see
any NULLs while we are creating the hash?
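
To make the proposed semantics concrete, here is a toy standalone sketch
(not executor code; a linear array stands in for the hash table) of carrying
a seen-a-NULL flag from build time into three-valued probe results:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum {TV_FALSE, TV_TRUE, TV_NULL} TriBool;

typedef struct
{
	int		vals[8];
	int		nvals;
	bool	has_null;	/* set once during the build; poisons every miss */
} NotInSet;

static void
set_add(NotInSet *s, const int *val)
{
	if (val == NULL)
		s->has_null = true;
	else
		s->vals[s->nvals++] = *val;
}

static TriBool
not_in(const NotInSet *s, const int *val)
{
	int			i;

	if (val == NULL)
		return TV_NULL;			/* NULL NOT IN (...) is never true */
	for (i = 0; i < s->nvals; i++)
	{
		if (s->vals[i] == *val)
			return TV_FALSE;	/* a match makes NOT IN false */
	}
	/* a miss is only definitely true if the set held no NULLs */
	return s->has_null ? TV_NULL : TV_TRUE;
}

int
main(void)
{
	NotInSet	s = {{0}, 0, false};
	int			one = 1, two = 2;

	set_add(&s, &one);
	set_add(&s, NULL);
	/* 2 is absent, but a NULL was seen, so the row is filtered
	 * out as "unknown" rather than passing as true */
	printf("%s\n", not_in(&s, &two) == TV_TRUE ? "true" : "not true");
	return 0;
}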

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
On Thu, Aug 11, 2022 at 12:55 PM David Rowley  wrote:

> On Fri, 12 Aug 2022 at 05:58, Zhihong Yu  wrote:
> > Here is sample output with patch:
> >
> > # SELECT '-92233720368547758.085'::money;
> > ERROR:  value "-92233720368547758.085" is out of range for type money
> > LINE 1: SELECT '-92233720368547758.085'::money;
>
> I'm struggling to follow along here. There are already overflow checks
> for this in cash_in(), which is exactly where they should be.
>
> The above case already fails on master, there's even a regression test
> to make sure it does for this exact case, just look at money.out:356.
> So, if we're already stopping this from happening in cash_in(), why do
> you think it also needs to happen in cash_out()?
>
> I'm also not sure why you opted to use LONG_MIN for your check. The C
> type "Cash" is based on int64, that's not long.
>
> David
>

Hi, David:
I am very sorry for not having looked closer at the sample SQL statement
earlier.
Indeed, the previous statement didn't trigger cash_out().

I think this was because the sanitizer report may be separated from the
statement that actually triggered it.
I am still going over the test output, trying to pinpoint the statement.

Meanwhile, I want to thank you for pointing out the constant shouldn't be
used for the boundary check.

How about patch v2, which uses the same check as cash_in()?
I will see which statement triggers the assertion.

Cheers


cash-out-of-range-v2.patch
Description: Binary data


Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-08-11 Thread Andres Freund
Hi,

On 2022-08-10 14:52:36 +0530, Amit Kapila wrote:
> I think this could be the probable reason for failure though I didn't
> try to debug/reproduce this yet. AFAIU, this is possible during
> recovery/replay of WAL record XLOG_HASH_SPLIT_ALLOCATE_PAGE as via
> XLogReadBufferForRedoExtended, we can mark the buffer dirty while
> restoring from full page image. OTOH, because during normal operation
> we didn't mark the page dirty SyncOneBuffer would have skipped it due
> to check (if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))).

I think there might still be short-lived references from other paths, even if
not marked dirty, but it isn't really important.


> > I assume this is trying to defend against some sort of deadlock by not
> > actually getting a cleanup lock (by passing get_cleanup_lock = true to
> > XLogReadBufferForRedoExtended()).
> >
> 
> IIRC, this is just following what we do during normal operation and
> based on the theory that the meta-page is not updated yet so no
> backend will access it. I think we can do what you wrote unless there
> is some other reason behind this failure.

Well, it's not really the same if you silently continue in normal operation
and PANIC during recovery... If it's an optional operation the tiny race
around not getting the cleanup lock is fine, but it's a totally different
story during recovery.

Greetings,

Andres Freund




Reducing planning time of large IN queries on primary key / unique columns

2022-08-11 Thread Souvik Bhattacherjee
Hi hackers,

At ServiceNow, we frequently encounter queries with very large IN lists,
where the number of elements in the IN list ranges from a few hundred to
several thousand. For a significant fraction of these queries, the IN clauses
are constructed on primary key columns. While planning these queries, the
Postgres query planner loops over every element in the IN clause, computes
the selectivity of each element, and then uses that as an input to compute
the total selectivity of the IN clause. For IN clauses on primary key or
unique columns, it is easy to see that the selectivity of the IN predicate
is given by (number of elements in the IN clause / table cardinality) and
is independent of the selectivities of the individual elements. We use this
observation to avoid computing the selectivities of the individual
elements. This results in an improvement in planning time, especially
when the number of elements in the IN clause is relatively large.
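
As a sketch of the shortcut (the function name is invented; this is not the
actual patch), the per-element loop collapses to a single division, clamped
to 1.0:

/* selectivity of "col IN (v1, ..., vn)" when col is a PK / unique column */
static double
unique_in_list_selectivity(int nitems, double table_rows)
{
	double		sel;

	if (table_rows <= 0.0)
		return 0.0;
	/* each list element can match at most one row */
	sel = (double) nitems / table_rows;
	return (sel > 1.0) ? 1.0 : sel;
}

For NOT IN on the same column, the corresponding estimate is simply 1 minus
this value (before any NULL handling).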



The table below demonstrates the improvement in planning time (averaged
over 3 runs) for IN queries of the form SELECT COUNT(*) FROM table_a WHERE
sys_id IN ('000356e61b568510eabcca2b234bcb08',
'00035846db2f24101ad7f256b9961925', ...). Here sys_id is the primary key
column of type VARCHAR(32) and the table cardinality of table_a is around
10M.



Number of IN list elements | Planning time w/o optimization (ms) | Planning time w/ optimization (ms) | Speedup
---------------------------|-------------------------------------|------------------------------------|--------
                       500 |                               0.371 |                              0.236 |    1.57
                      5000 |                               2.019 |                              0.874 |    2.31
                     50000 |                              19.886 |                              8.273 |    2.40



Similarly, the selectivity of NOT IN clauses on a primary key or unique
column can be computed without evaluating the selectivities of the individual
elements. The query used is of the form SELECT COUNT(*) FROM
table_a WHERE sys_id NOT IN ('000356e61b568510eabcca2b234bcb08',
'00035846db2f24101ad7f256b9961925', ...).



Number of NOT IN list elements | Planning time w/o optimization (ms) | Planning time w/ optimization (ms) | Speedup
-------------------------------|-------------------------------------|------------------------------------|--------
                           500 |                               0.380 |                              0.248 |    1.53
                          5000 |                               2.534 |                              0.854 |    2.97
                         50000 |                              21.316 |                              9.009 |    2.36



We also measured the planning time of queries on a primary key column of type
INTEGER, on a table with about 10M rows, for both IN and NOT IN queries.


Number of IN list elements | Planning time w/o optimization (ms) | Planning time w/ optimization (ms) | Speedup
---------------------------|-------------------------------------|------------------------------------|--------
                       500 |                               0.370 |                              0.208 |    1.78
                      5000 |                               1.998 |                              0.816 |    2.45
                     50000 |                              18.073 |                              6.750 |    2.67




Number of NOT IN list elements | Planning time w/o optimization (ms) | Planning time w/ optimization (ms) | Speedup
-------------------------------|-------------------------------------|------------------------------------|--------
                           500 |                               0.342 |                              0.203 |    1.68
                          5000 |                               2.073 |                              0.822 |    3.29
                         50000 |                              19.551 |                              6.738 |    2.90



We see that the planning times of queries on unique columns are identical to
those we observed for primary key columns. The resulting patch file for the
changes above is small and we are happy to polish it up and share it.


Best,

Souvik Bhattacherjee

(ServiceNow)


Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-08-11 Thread Jacob Champion
On Wed, Aug 10, 2022 at 10:48 PM Drouvot, Bertrand  wrote:
> What do you think about adding a second field in ClientConnectionInfo
> for the auth method (as suggested by Michael upthread)?

Sure -- without a followup patch, it's not really tested, though.

v2 adjusts set_authn_id() to copy the auth_method over as well. It
"passes tests" but is otherwise unexercised.

Thanks,
--Jacob
commit 69cacd5e0869b18d64ff4233ef6a73123c513496
Author: Jacob Champion 
Date:   Thu Aug 11 15:16:15 2022 -0700

squash! Allow parallel workers to read authn_id

Add a copy of hba->auth_method to ClientConnectionInfo when
set_authn_id() is called.

diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 313a6ea701..9113f04189 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -333,9 +333,9 @@ auth_failed(Port *port, int status, const char *logdetail)
 
 
 /*
- * Sets the authenticated identity for the current user.  The provided string
- * will be copied into the TopMemoryContext.  The ID will be logged if
- * log_connections is enabled.
+ * Sets the authenticated identity for the current user. The provided string
+ * will be stored into MyClientConnectionInfo, alongside the current HBA method
+ * in use. The ID will be logged if log_connections is enabled.
  *
  * Auth methods should call this routine exactly once, as soon as the user is
  * successfully authenticated, even if they have reasons to know that
@@ -365,6 +365,7 @@ set_authn_id(Port *port, const char *id)
}
 
MyClientConnectionInfo.authn_id = MemoryContextStrdup(TopMemoryContext, 
id);
+   MyClientConnectionInfo.auth_method = port->hba->auth_method;
 
if (Log_connections)
{
@@ -372,8 +373,8 @@ set_authn_id(Port *port, const char *id)
errmsg("connection authenticated: 
identity=\"%s\" method=%s "
   "(%s:%d)",
   MyClientConnectionInfo.authn_id,
-  
hba_authname(port->hba->auth_method), HbaFileName,
-  port->hba->linenumber));
+  
hba_authname(MyClientConnectionInfo.auth_method),
+  HbaFileName, port->hba->linenumber));
}
 }
 
diff --git a/src/backend/utils/init/miscinit.c 
b/src/backend/utils/init/miscinit.c
index 973103374b..155ba92c67 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -954,6 +954,8 @@ EstimateClientConnectionInfoSpace(void)
if (MyClientConnectionInfo.authn_id)
size = add_size(size, strlen(MyClientConnectionInfo.authn_id) + 
1);
 
+   size = add_size(size, sizeof(UserAuth));
+
return size;
 }
 
@@ -981,6 +983,15 @@ SerializeClientConnectionInfo(Size maxsize, char 
*start_address)
maxsize -= len;
start_address += len;
}
+
+   {
+   UserAuth   *auth_method = (UserAuth*) start_address;
+
+   Assert(sizeof(*auth_method) <= maxsize);
+   *auth_method = MyClientConnectionInfo.auth_method;
+   maxsize -= sizeof(*auth_method);
+   start_address += sizeof(*auth_method);
+   }
 }
 
 /*
@@ -1001,6 +1012,13 @@ RestoreClientConnectionInfo(char *conninfo)

  conninfo);
conninfo += strlen(conninfo) + 1;
}
+
+   {
+   UserAuth   *auth_method = (UserAuth*) conninfo;
+
+   MyClientConnectionInfo.auth_method = *auth_method;
+   conninfo += sizeof(*auth_method);
+   }
 }
 
 
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index c900411fdd..0643733765 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -111,7 +111,7 @@ typedef struct
 {
/*
 * Authenticated identity.  The meaning of this identifier is dependent 
on
-* hba->auth_method; it is the identity (if any) that the user presented
+* auth_method; it is the identity (if any) that the user presented
 * during the authentication cycle, before they were assigned a database
 * role.  (It is effectively the "SYSTEM-USERNAME" of a pg_ident usermap
 * -- though the exact string in use may be different, depending on 
pg_hba
@@ -121,6 +121,12 @@ typedef struct
 * example if the "trust" auth method is in use.
 */
const char *authn_id;
+
+   /*
+* The HBA method that determined the above authn_id. This only has 
meaning
+* if authn_id is not NULL; otherwise it's undefined.
+*/
+   UserAuthauth_method;
 } ClientConnectionInfo;
 
 /*
From 32d465527678ad6ef2f177287c797cd87feba585 Mon Sep 17 00:00:00 2001
From: Jacob Ch

Re: Use SetInstallXLogFileSegmentActive() for setting XLogCtl->InstallXLogFileSegmentActive

2022-08-11 Thread Nathan Bossart
On Thu, Aug 11, 2022 at 09:42:18PM +0530, Bharath Rupireddy wrote:
> Here's a small patch replacing the explicit setting of
> XLogCtl->InstallXLogFileSegmentActive with the existing function
> SetInstallXLogFileSegmentActive(), which removes duplicate code and saves 4
> LOC.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: O(n) tasks cause lengthy startups and checkpoints

2022-08-11 Thread Nathan Bossart
On Wed, Jul 06, 2022 at 09:51:10AM -0700, Nathan Bossart wrote:
> Here's a new revision where I've attempted to address all the feedback I've
> received thus far.  Notably, the custodian now uses a queue for registering
> tasks and determining which tasks to execute.  Other changes include
> splitting the temporary file functions apart to avoid consecutive boolean
> flags, using a timestamp instead of an integer for the staging name for
> temporary directories, moving temporary directories to a dedicated
> directory so that the custodian doesn't need to scan relation files,
> ERROR-ing when something goes wrong when cleaning up temporary files,
> executing requested tasks immediately in single-user mode, and more.

Here is a rebased patch set for cfbot.  There are no other differences
between v7 and v8.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From a0b28421a7a170598f6e60b2c17a8d49fb0ffd55 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 5 Jan 2022 19:24:22 +
Subject: [PATCH v8 1/6] Introduce custodian.

The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing.  This commit simply adds the new process; it does
not yet do anything useful.
---
 src/backend/postmaster/Makefile |   1 +
 src/backend/postmaster/auxprocess.c |   8 +
 src/backend/postmaster/custodian.c  | 383 
 src/backend/postmaster/postmaster.c |  44 ++-
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/lmgr/proc.c |   1 +
 src/backend/utils/activity/wait_event.c |   3 +
 src/backend/utils/init/miscinit.c   |   3 +
 src/include/miscadmin.h |   3 +
 src/include/postmaster/custodian.h  |  32 ++
 src/include/storage/proc.h  |  11 +-
 src/include/utils/wait_event.h  |   1 +
 12 files changed, 488 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/postmaster/custodian.c
 create mode 100644 src/include/postmaster/custodian.h

diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
 	bgworker.o \
 	bgwriter.o \
 	checkpointer.o \
+	custodian.o \
 	fork_process.o \
 	interrupt.o \
 	pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
 #include "pgstat.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 		case CheckpointerProcess:
 			MyBackendType = B_CHECKPOINTER;
 			break;
+		case CustodianProcess:
+			MyBackendType = B_CUSTODIAN;
+			break;
 		case WalWriterProcess:
 			MyBackendType = B_WAL_WRITER;
 			break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 			CheckpointerMain();
 			proc_exit(1);
 
+		case CustodianProcess:
+			CustodianMain();
+			proc_exit(1);
+
 		case WalWriterProcess:
 			WalWriterMain();
 			proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 00..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc.  Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding).  However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode).  The custodian is not
+ * an essential process and can shutdown quickly when requested.  The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   src/backend/postmaster/custodian.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(Custodian

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-08-11 Thread Melanie Plageman
I've attached v27 of the patch.

I've renamed IOPATH to IOCONTEXT. I also have added assertions to
confirm that unexpected statistics are not being accumulated.

There are also assorted other cleanups and changes.

It would be good to confirm that the rows being skipped and cells that
are NULL in the view are the correct ones.
The startup process will never use a BufferAccessStrategy, right?


On Wed, Jul 20, 2022 at 12:50 PM Andres Freund  wrote:

>
> > Subject: [PATCH v26 3/4] Track IO operation statistics
>
> > @@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char
> relpersistence, ForkNumber forkNum,
> >
> >   bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) :
> BufHdrGetBlock(bufHdr);
> >
> > + if (isLocalBuf)
> > + io_path = IOPATH_LOCAL;
> > + else if (strategy != NULL)
> > + io_path = IOPATH_STRATEGY;
> > + else
> > + io_path = IOPATH_SHARED;
>
> Seems a bit ugly to have an if (isLocalBuf) just after an isLocalBuf ?.
>

Changed this.


>
>
> > + /*
> > +  * When a strategy is in use, reused buffers from
> the strategy ring will
> > +  * be counted as allocations for the purposes of
> IO Operation statistics
> > +  * tracking.
> > +  *
> > +  * However, even when a strategy is in use, if a
> new buffer must be
> > +  * allocated from shared buffers and added to the
> ring, this is counted
> > +  * as a IOPATH_SHARED allocation.
> > +  */
>
> There's a bit too much duplication between the paragraphs...
>

I actually think the two paragraphs are making separate points. I've
edited this, so see if you like it better now.


>
> > @@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
> >   /* flush database / relation / function / ... stats */
> >   partial_flush |= pgstat_flush_pending_entries(nowait);
> >
> > + /* flush IO Operations stats */
> > + partial_flush |= pgstat_flush_io_ops(nowait);
>
> Could you either add a note to the commit message that the stats file
> version needs to be increased, or just include that in the patch.
>
>
Bumped the stats file version in attached patchset.

- Melanie
From b382e216b4a3f1dae91b043c5c8d647ea17821b7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 11 Aug 2022 18:28:46 -0400
Subject: [PATCH v27 3/4] Track IO operation statistics

Introduce "IOOp", an IO operation done by a backend, and "IOContext",
the IO location source or target or IO type done by a backend. For
example, the checkpointer may write a shared buffer out. This would be
counted as an IOOp "write" on an IOContext IOCONTEXT_SHARED by
BackendType "checkpointer".

Each IOOp (alloc, extend, fsync, read, write) is counted per IOContext
(local, shared, or strategy) through a call to pgstat_count_io_op().
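
To make that concrete, here is a tiny standalone model of the counter
matrix (the enum and function names mirror the description above, but
this is an illustration, not the patch's actual code):

#include <stdio.h>

typedef enum IOContext
{
	IOCONTEXT_LOCAL,
	IOCONTEXT_SHARED,
	IOCONTEXT_STRATEGY,
	IOCONTEXT_NUM_TYPES
} IOContext;

typedef enum IOOp
{
	IOOP_ALLOC,
	IOOP_EXTEND,
	IOOP_FSYNC,
	IOOP_READ,
	IOOP_WRITE,
	IOOP_NUM_TYPES
} IOOp;

/* Backend-local counts, flushed to shared memory later. */
static long io_counts[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];

static void
pgstat_count_io_op(IOOp op, IOContext ctx)
{
	io_counts[ctx][op]++;
}

int
main(void)
{
	/* A backend allocates a shared buffer and reads a block into it. */
	pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_SHARED);
	pgstat_count_io_op(IOOP_READ, IOCONTEXT_SHARED);
	printf("shared reads: %ld\n", io_counts[IOCONTEXT_SHARED][IOOP_READ]);
	return 0;
}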

The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.

IOCONTEXT_LOCAL and IOCONTEXT_SHARED IOContexts concern operations on
local and shared buffers.

The IOCONTEXT_STRATEGY IOContext concerns buffers
alloc'd/extended/fsync'd/read/written as part of a BufferAccessStrategy.

IOOP_ALLOC is counted for IOCONTEXT_SHARED and IOCONTEXT_LOCAL whenever
a buffer is acquired through [Local]BufferAlloc(). IOOP_ALLOC for
IOCONTEXT_STRATEGY is counted whenever a buffer already in the strategy
ring is reused. And IOOP_WRITE for IOCONTEXT_STRATEGY is counted
whenever the reused dirty buffer is written out.
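
In terms of the toy model above, the strategy-ring rule would look
roughly like this (RingSlotState is invented purely for illustration):

typedef enum RingSlotState
{
	RING_SLOT_EMPTY,			/* slot not yet populated */
	RING_SLOT_CLEAN,			/* reusable buffer, no write needed */
	RING_SLOT_DIRTY				/* reusable buffer, must be written first */
} RingSlotState;

static void
count_ring_visit(RingSlotState slot)
{
	if (slot == RING_SLOT_EMPTY)
	{
		/* A new buffer is taken from shared buffers and added to the ring. */
		pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_SHARED);
	}
	else
	{
		/* Reusing a buffer already in the ring counts as a strategy alloc. */
		pgstat_count_io_op(IOOP_ALLOC, IOCONTEXT_STRATEGY);
		if (slot == RING_SLOT_DIRTY)
			pgstat_count_io_op(IOOP_WRITE, IOCONTEXT_STRATEGY);
	}
}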

Stats on IOOps for all IOContexts for a backend are initially
accumulated locally.

Later they are flushed to shared memory and accumulated with those from
all other backends, both exited and live. The accumulated stats in
shared memory could be extended in the future with per-backend stats --
useful for per-connection IO statistics and monitoring.

Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operations to flush their backend-local IO operation
statistics to shared memory in a timely manner.
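
The checkpointer is presumably one such backend (checkpointer.c appears
in the diffstat below). By analogy with the pgstat_report_stat() hunk
quoted earlier, the call would look something like the following; the
placement is an assumption, not taken from the patch:

	/*
	 * Flush backend-local IO operation counters at a natural quiet
	 * point, e.g. once a checkpoint completes.  nowait = false means
	 * we wait for the stats lock rather than skipping the flush.
	 */
	(void) pgstat_flush_io_ops(false);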

Author: Melanie Plageman 
Reviewed-by: Justin Pryzby , Kyotaro Horiguchi 
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml  |   2 +
 src/backend/postmaster/checkpointer.c |  12 +
 src/backend/storage/buffer/bufmgr.c   |  64 +++-
 src/backend/storage/buffer/freelist.c |  23 +-
 src/backend/storage/buffer/localbuf.c |   5 +
 src/backend/storage/sync/sync.c   |   9 +
 src/backend/utils/activity/Makefile   |   1 +
 src/backend/utils/activity/pgstat.c   |  31 ++
 src/backend/utils/activity/pgstat_bgwriter.c  |   7 +-
 .../utils/activity/pgstat_checkpoin

Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-08-11 Thread Masahiko Sawada
On Thu, Aug 11, 2022 at 3:10 PM Amit Kapila  wrote:
>
> On Mon, Aug 8, 2022 at 9:34 AM Amit Kapila  wrote:
> >
> > On Wed, Aug 3, 2022 at 1:20 PM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > Oops, thanks for pointing it out. I've fixed it and attached updated
> > > patches for all branches so as not to confuse the patch version. There
> > > is no update from v12 patch on REL12 - master patches.
> > >
> >
> > Thanks for the updated patches, the changes look good to me.
> > Horiguchi-San, and others, do you have any further comments on this or
> > do you want to spend time in review of it? If not, I would like to
> > push this after the current minor version release.
> >
>
> Pushed.

Thank you!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: test failure with gcc-12 -O3 -march=native

2022-08-11 Thread Justin Pryzby
On Thu, Aug 11, 2022 at 01:03:43PM -0700, Andres Freund wrote:
> Hi,
> 
> For my optimized builds I've long used -O3 -march=native. After one of the

On what kind of arch ?

> Given that I did not encounter this problem with gcc-12 before, and that
> gcc-12 has been released, it seems less likely to be a bug in our code
> highlighted by a new optimization and more likely to be a bug in a gcc bugfix,
> but it's definitely not clear.

debian testing is now defaulting to gcc-12.
https://tracker.debian.org/news/1348007/accepted-gcc-defaults-1198-source-into-unstable/

Are you sure you were building with gcc-12 and not gcc(default) which, until 3
weeks ago, was gcc-11  ?

-- 
Justin




Re: test failure with gcc-12 -O3 -march=native

2022-08-11 Thread Andres Freund
Hi,

On 2022-08-11 20:06:02 -0500, Justin Pryzby wrote:
> On Thu, Aug 11, 2022 at 01:03:43PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > For my optimized builds I've long used -O3 -march=native. After one of the
> 
> On what kind of arch ?

x86-64 cascadelake. I've since debugged this further. It's not even -march
that's the problem, it's the difference between -mtune=broadwell and
-mtune=skylake, even with -march=x86-64.


> > Given that I did not encounter this problem with gcc-12 before, and that
> > gcc-12 has been released, it seems less likely to be a bug in our code
> > highlighted by a new optimization and more likely to be a bug in a gcc 
> > bugfix,
> > but it's definitely not clear.
> 
> debian testing is now defaulting to gcc-12.
> https://tracker.debian.org/news/1348007/accepted-gcc-defaults-1198-source-into-unstable/
> 
> Are you sure you were building with gcc-12 and not gcc(default) which, until 3
> weeks ago, was gcc-11  ?

Yes.

I'm now bisecting...

Greetings,

Andres Freund




Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Tom Lane
Zhihong Yu  writes:
> How about patch v2 which uses the same check from cash_in() ?

I'm not sure which part of this statement you're not getting:
it is completely unacceptable for cash_out to fail on valid
values of the type.  And this value is valid.  cash_in goes
out of its way to take it, and you can also produce it via
arithmetic operators.

I understand that you're trying to get rid of an analyzer warning that
negating INT64_MIN is (pedantically, not in practice) undefined behavior.
But the way to fix that is to make the behavior conform to the C spec.
Perhaps it would work to do

Cash	value = PG_GETARG_CASH(0);
uint64  uvalue;

if (value < 0)
uvalue = -(uint64) value;
else
uvalue = value;

and then use uvalue instead of "(uint64) value" in the loop.
Of course, this begs the question of what negation means for
an unsigned value.  I believe that this formulation is allowed
by the C spec and produces the same results as what we do now,
but I'm not convinced that it's clearer for the reader.
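
A standalone demonstration that the convert-then-negate dance is
well-defined (using stdint.h spellings rather than the Cash typedef;
unsigned negation is reduction modulo 2^64, so it yields the correct
magnitude even for INT64_MIN):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t		value = INT64_MIN;
	uint64_t	uvalue;

	/*
	 * -value on the signed type would overflow, but negating the
	 * converted unsigned value is defined and gives 2^63 here.
	 */
	if (value < 0)
		uvalue = -(uint64_t) value;
	else
		uvalue = (uint64_t) value;

	printf("%" PRIu64 "\n", uvalue);	/* prints 9223372036854775808 */
	return 0;
}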

Another possibility is

if (value < 0)
{
if (value == INT64_MIN)
uvalue = however you wanna spell -INT64_MIN;
else
uvalue = (uint64) -value;
}
else
uvalue = value;

but this really seems to be letting pedantry get the best of us.

The short answer here is that the code works fine on every platform
we support.  We know that because we have a regression test checking
this exact case.  So it's not broken and I don't think there's a
very good argument that it needs to be fixed.  Maybe the right thing
is just to add a comment pointing out what happens for INT64_MIN.

regards, tom lane




Re: test failure with gcc-12 -O3 -march=native

2022-08-11 Thread Andres Freund
Hi,

On 2022-08-11 18:24:16 -0700, Andres Freund wrote:
> > > Given that I did not encounter this problem with gcc-12 before, and that
> > > gcc-12 has been released, it seems less likely to be a bug in our code
> > > highlighted by a new optimization and more likely to be a bug in a gcc 
> > > bugfix,
> > > but it's definitely not clear.
> > 
> > debian testing is now defaulting to gcc-12.
> > https://tracker.debian.org/news/1348007/accepted-gcc-defaults-1198-source-into-unstable/
> > 
> > Are you sure you were building with gcc-12 and not gcc(default) which, 
> > until 3
> > weeks ago, was gcc-11  ?
> 
> Yes.
> 
> I'm now bisecting...

I found the commit triggering it [1]. Oddly it's a change from a few months
ago, and I can reconstruct from dpkg.log and shell history that I definitely
ran the tests many times since upgrading the compiler.  I did however clean my
ccache cache yesterday; I wonder if somehow the 'old' version got stuck in
it. ccache says it checks the compiler's mtime though.

Greetings,

Andres Freund

[1] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1ceddd7497e




Re: avoid negating LONG_MIN in cash_out()

2022-08-11 Thread Zhihong Yu
On Thu, Aug 11, 2022 at 6:28 PM Tom Lane  wrote:

> Zhihong Yu  writes:
> > How about patch v2 which uses the same check from cash_in() ?
>
> I'm not sure which part of this statement you're not getting:
> it is completely unacceptable for cash_out to fail on valid
> values of the type.  And this value is valid.  cash_in goes
> out of its way to take it, and you can also produce it via
> arithmetic operators.
>
> I understand that you're trying to get rid of an analyzer warning that
> negating INT64_MIN is (pedantically, not in practice) undefined behavior.
> But the way to fix that is to make the behavior conform to the C spec.
> Perhaps it would work to do
>
> Cash	value = PG_GETARG_CASH(0);
> uint64  uvalue;
>
> if (value < 0)
> uvalue = -(uint64) value;
> else
> uvalue = value;
>
> and then use uvalue instead of "(uint64) value" in the loop.
> Of course, this begs the question of what negation means for
> an unsigned value.  I believe that this formulation is allowed
> by the C spec and produces the same results as what we do now,
> but I'm not convinced that it's clearer for the reader.
>
> Another possibility is
>
> if (value < 0)
> {
> if (value == INT64_MIN)
> uvalue = however you wanna spell -INT64_MIN;
> else
> uvalue = (uint64) -value;
> }
> else
> uvalue = value;
>
> but this really seems to be letting pedantry get the best of us.
>
> The short answer here is that the code works fine on every platform
> we support.  We know that because we have a regression test checking
> this exact case.  So it's not broken and I don't think there's a
> very good argument that it needs to be fixed.  Maybe the right thing
> is just to add a comment pointing out what happens for INT64_MIN.
>
> regards, tom lane
>
Hi,
Thanks for taking the time to contemplate various possibilities.

I thought of using uint64 as well - but as you have shown, the readability
isn't better.

I will keep this in the back of my mind.

Cheers


Re: test failure with gcc-12 -O3 -march=native

2022-08-11 Thread Andres Freund
Hi,

On 2022-08-11 19:08:14 -0700, Andres Freund wrote:
> On 2022-08-11 18:24:16 -0700, Andres Freund wrote:
> > I'm now bisecting...
> 
> I found the commit triggering it [1]. Oddly it's a change from a few months
> ago, and I can reconstruct from dpkg.log and shell history that I definitely
> ran the tests many times since upgrading the compiler.  I did however clean my
> ccache cache yesterday; I wonder if somehow the 'old' version got stuck in
> it. ccache says it checks the compiler's mtime though.

Spent a fair bit of time reducing this to something that triggers the
problem in isolation. This is somewhat scary - I'd be quite surprised if this
were the only place triggering the bug.

And I suspect that it doesn't actually require -mtune=skylake, but I'm not
sure.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106590

Greetings,

Andres Freund




Remove log_checkpoints = true from .pl tests

2022-08-11 Thread Bharath Rupireddy
Hi,

With commit 64da07c41a8c0a680460cdafc79093736332b6cf making the default
value of log_checkpoints on, do we need to remove the explicit settings
in Perl tests to save some (5) LOC?

Although it's harmless, here's a tiny patch to remove them.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/


v1-0001-Remove-log_checkpoints-true-from-.pl-tests.patch
Description: Binary data


Re: Remove log_checkpoints = true from .pl tests

2022-08-11 Thread Tom Lane
Bharath Rupireddy  writes:
> With commit 64da07c41a8c0a680460cdafc79093736332b6cf making default
> value of log_checkpoints to on, do we need to remove explicit settings
> in perl tests to save some (5) LOC?

I'm not particularly eager to do that, because I think defaulting
log_checkpoints to "on" was a bad decision that will eventually get
reverted.  Even if that doesn't happen, we have *far* better ways
to spend our time than removing five lines of code, or even
discussing whether to do so.

regards, tom lane




build remaining Flex files standalone

2022-08-11 Thread John Naylor
Starting a new thread to control clutter. [was: Re: [RFC] building
postgres with meson]

motivation: 
https://www.postgresql.org/message-id/20220810171935.7k5zgnjwqzalzmtm%40awork3.anarazel.de

On Thu, Aug 11, 2022 at 11:07 AM Andres Freund  wrote:
> I think we should consider compiling it separately from guc.c. guc.c already
> compiles quite slowly (iirc beat only by ecpg and main grammar), and it's a
> relatively commonly changed source file.

Done in the attached, and will do the rest in time. It seemed most
straightforward to put ProcessConfigFileInternal() in guc.c since
that's where most of its callers are, and it relies on some vars and
types declared there. There are a couple new extern declarations in
guc.h that are only for guc.c and guc-file.c:

+/* functions shared between guc.c and guc-file.l */
+extern int guc_name_compare(const char *namea, const char *nameb);
+extern ConfigVariable *ProcessConfigFileInternal(GucContext context,
+ bool applySettings, int elevel);
+extern void record_config_file_error(const char *errmsg,
+ const char *config_file,
+ int lineno,
+ ConfigVariable **head_p,
+ ConfigVariable **tail_p);

These might be better placed in a new guc_internal.h. Thoughts?
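
For reference, a guc_internal.h along those lines might look like this
(a hypothetical sketch of the idea, not something from the posted patch):

/*
 * guc_internal.h
 *	  Declarations shared only between guc.c and guc-file.l.
 */
#ifndef GUC_INTERNAL_H
#define GUC_INTERNAL_H

#include "utils/guc.h"

extern int	guc_name_compare(const char *namea, const char *nameb);
extern ConfigVariable *ProcessConfigFileInternal(GucContext context,
												 bool applySettings,
												 int elevel);
extern void record_config_file_error(const char *errmsg,
									 const char *config_file,
									 int lineno,
									 ConfigVariable **head_p,
									 ConfigVariable **tail_p);

#endif							/* GUC_INTERNAL_H */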

> It might even be a good idea to split guc.c so it only contains the settings
> arrays + direct dependencies...

Perhaps this can be a TODO item, one which falls under "[E] marks
items that are easier to implement". I've been slacking on removing
the old/intractable cruft from the TODO list, but we should also be
sticking small nice-but-not-necessary things in there. That said, if
this idea has any bearing on the guc_internal.h idea, it might be
better dealt with now.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From d723ba14acf56fd432e9e263db937fcc13fc0355 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 11 Aug 2022 19:38:37 +0700
Subject: [PATCH v1] Build guc-file.c standalone

The proposed Meson build system will need a way to ignore certain
generated files in order to coexist with the autoconf build system,
and #include'd C files generated by Flex make this more difficult.
Build guc-file.c separately from guc.c, as was done in 72b1e3a21.

TODO: other Flex-generated files

Discussion: https://www.postgresql.org/message-id/20220810171935.7k5zgnjwqzalzmtm%40awork3.anarazel.de
---
 src/backend/utils/misc/Makefile   |   5 +-
 src/backend/utils/misc/guc-file.l | 367 +-
 src/backend/utils/misc/guc.c  | 362 -
 src/include/utils/guc.h   |   9 +
 4 files changed, 370 insertions(+), 373 deletions(-)

diff --git a/src/backend/utils/misc/Makefile b/src/backend/utils/misc/Makefile
index 1d5327cf64..cf7ce9bc83 100644
--- a/src/backend/utils/misc/Makefile
+++ b/src/backend/utils/misc/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	guc.o \
+	guc-file.o \
 	help_config.o \
 	pg_config.o \
 	pg_controldata.o \
@@ -37,10 +38,6 @@ endif
 
 include $(top_srcdir)/src/backend/common.mk
 
-# guc-file is compiled as part of guc
-guc.o: guc-file.c
-
 # Note: guc-file.c is not deleted by 'make clean',
 # since we want to ship it in distribution tarballs.
 clean:
-	@rm -f lex.yy.c
diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l
index ce5633844c..08adb454de 100644
--- a/src/backend/utils/misc/guc-file.l
+++ b/src/backend/utils/misc/guc-file.l
@@ -7,7 +7,7 @@
  * src/backend/utils/misc/guc-file.l
  */
 
-%{
+%top{
 
 #include "postgres.h"
 
@@ -17,9 +17,12 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
+#include 
 #include "utils/guc.h"
+#include "utils/memutils.h"
+}
 
-
+%{
 /*
  * flex emits a yy_fatal_error() function that it calls in response to
  * critical errors like malloc failure, file I/O errors, and detection of
@@ -48,12 +51,6 @@ static sigjmp_buf *GUC_flex_fatal_jmp;
 
 static void FreeConfigVariable(ConfigVariable *item);
 
-static void record_config_file_error(const char *errmsg,
-	 const char *config_file,
-	 int lineno,
-	 ConfigVariable **head_p,
-	 ConfigVariable **tail_p);
-
 static int	GUC_flex_fatal(const char *msg);
 
 /* LCOV_EXCL_START */
@@ -159,358 +156,6 @@ ProcessConfigFile(GucContext context)
 	MemoryContextDelete(config_cxt);
 }
 
-/*
- * This function handles both actual config file (re)loads and execution of
- * show_all_file_settings() (i.e., the pg_file_settings view).  In the latter
- * case we don't apply any of the settings, but we make all the usual validity
- * checks, and we return the ConfigVariable list so that it can be printed out
- * by show_all_file_settings().
- */
-static ConfigVariable *
-ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel)
-{
-	bool		error = false;
-	bool		applying = false;
-	const char *ConfFileWithError;
-	ConfigVariable *item,
-			   *head,
-			   *tail;
-	int			i;
-
-	/* Parse the main config file into a list of option names and va

Re: Perform streaming logical transactions by background workers and parallel apply

2022-08-11 Thread Peter Smith
Here are some review comments for v20-0003:

(Sorry - the reviews are time-consuming, so I am lagging slightly
behind the latest posted version)

==

1. 

1a.
There are a few comment modifications in this patch (e.g. changing
FROM "in an apply background worker" TO "using an apply background
worker"). I noticed lots of these in worker.c, but they might be in
other files too.

Although these are good changes, these are just tweaks to new comments
introduced by patch 0001, so IMO such changes belong in that patch,
not in this one.

1b.
Actually, there are still some comments saying "by an apply background
worker...", some saying "using an apply background worker...", and some
saying "in the apply background worker...". Maybe they are all OK, but
it would be better if all such comments were searched for and given
consistent wording.

==

2. Commit message

2a.

Without these restrictions, the following scenario may occur:
The apply background worker lock a row when processing a streaming transaction,
after that the main apply worker tries to lock the same row when processing
another transaction. At this time, the main apply worker waits for the
streaming transaction to complete and the lock to be released, it won't send
subsequent data of the streaming transaction to the apply background worker;
the apply background worker waits to receive the rest of streaming transaction
and can't finish this transaction. Then the main apply worker will wait
indefinitely.

"background worker lock a row" -> "background worker locks a row"

"Then the main apply worker will wait indefinitely." -> really, you
already said the main apply worker is waiting, so I think this
sentence only needs to say: "Now a deadlock has occurred, so both
workers will wait indefinitely."

2b.

Text fragments are all common between:

i.   This commit message
ii.  Text in pgdocs CREATE SUBSCRIPTION
iii. Function comment for 'logicalrep_rel_mark_parallel_apply' in relation.c

After addressing other review comments please make sure all those 3
parts are worded the same.

==

3. doc/src/sgml/ref/create_subscription.sgml

+  There are two requirements for using parallel
+  mode: 1) the unique column in the table on the subscriber-side should
+  also be the unique column on the publisher-side; 2) there cannot be
+  any non-immutable functions used by the subscriber-side replicated
+  table.

3a.
I am not sure – is "requirements" the correct word here, or maybe it
should be "prerequisites".

3b.
Is it correct to say "should also be", or should that say "must also be"?

==

4. src/backend/replication/logical/applybgworker.c -
apply_bgworker_relation_check

+ /*
+ * Skip check if not using apply background workers.
+ *
+ * If any worker is handling the streaming transaction, this check needs to
+ * be performed not only in the apply background worker, but also in the
+ * main apply worker. This is because without these restrictions, main
+ * apply worker may block apply background worker, which will cause
+ * infinite waits.
+ */
+ if (!am_apply_bgworker() &&
+ (list_length(ApplyBgworkersFreeList) == list_length(ApplyBgworkersList)))
+ return;

I struggled a bit to reconcile the comment with the condition. Is the
!am_apply_bgworker() part of this even needed – isn't the
list_length() check enough?

~~~

5.

+ /* We are in error mode and should give user correct error. */

I still [1, #3.4a] don't see the value in saying "should give correct
error" (e.g. what's the alternative?).

Maybe instead of that comment it can just say:
rel->parallel_apply = PARALLEL_APPLY_UNSAFE;

==

6. src/backend/replication/logical/proto.c - RelationGetUniqueKeyBitmap

+ /* Add referenced attributes to idindexattrs */
+ for (i = 0; i < indexRel->rd_index->indnatts; i++)
+ {
+ int attrnum = indexRel->rd_index->indkey.values[i];
+
+ /*
+ * We don't include non-key columns into idindexattrs
+ * bitmaps. See RelationGetIndexAttrBitmap.
+ */
+ if (attrnum != 0)
+ {
+ if (i < indexRel->rd_index->indnkeyatts &&
+ !bms_is_member(attrnum - FirstLowInvalidHeapAttributeNumber, attunique))
+ attunique = bms_add_member(attunique,
+attrnum - FirstLowInvalidHeapAttributeNumber);
+ }
+ }

There are 2x comments in that code referring to 'idindexattrs', but I
think it is a cut/paste problem because that variable name does not
even exist in this copied function.

==

7. src/backend/replication/logical/relation.c -
logicalrep_rel_mark_parallel_apply

+ /* Initialize the flag. */
+ entry->parallel_apply = PARALLEL_APPLY_SAFE;

I have unsuccessfully repeated the same review comment several times
[1 #3.8] suggesting that this flag should not be initialized to SAFE.
IMO the state should remain as UNKNOWN until you are either sure it is
SAFE, or sure it is UNSAFE. Anyway, I'll give up on this point now;
let's see what other people think.

==

8. src/include/replication/logicalrelation.h

+/*
+ * States to determine if changes on on

Re: fix typos

2022-08-11 Thread John Naylor
I wrote:

> On Thu, Aug 4, 2022 at 8:41 PM Tom Lane  wrote:
> >
> > John Naylor  writes:
>
> > > RepOriginId is a typedef for uint16, so this can't print the wrong answer,
> > > but it is inconsistent with other uses. So it seems we don't need to
> > > backpatch this one?
> >
> > Um ... if it's int16, then it can't be an OID, so I'd say this message has
> > far worse problems than %d vs %u.  It should not use that terminology.
>
> The catalog has the following. Since it's not a real oid, maybe this column 
> should be rethought?

This is really a straw-man proposal, since I'm not volunteering to do
the work, nor suggesting anybody else should do the same. That being the
case, it seems we should just go ahead with Justin's patch for
consistency. Possibly we could also change the messages to say "ID"?

> CATALOG(pg_replication_origin,6000,ReplicationOriginRelationId) 
> BKI_SHARED_RELATION
> {
> /*
> * Locally known id that get included into WAL.
> *
> * This should never leave the system.
> *
> * Needs to fit into an uint16, so we don't waste too much space in WAL
> * records. For this reason we don't use a normal Oid column here, since
> * we need to handle allocation of new values manually.
> */
> Oid roident;

-- 
John Naylor
EDB: http://www.enterprisedb.com