RE: run pgindent on a regular basis / scripted manner
On Thu, Feb 9, 2023 6:10 AM Andrew Dunstan wrote:
> Thanks, I have committed this. Still looking at Robert's other request.

Hi,

Commit 068a243b7 allowed directories to be passed to pgindent as non-option arguments, but the help text doesn't mention that. Should we update it? Attached is a small patch that does so.

Regards,
Shi Yu

Attachment: v1-0001-Update-help-text-of-pgindent.patch
Re: Add index scan progress to pg_stat_progress_vacuum
On Wed, Feb 8, 2023 at 11:03 AM Imseih (AWS), Sami wrote:
> Hi,
>
> Thanks for your reply!
>
> I addressed the latest comments in v23.
>
> 1/ cleaned up the asserts as discussed.
> 2/ used pq_putmessage to send the message on index scan completion.

Thank you for updating the patch! Here are some comments on the v23 patch:

+ ParallelVacuumFinish
+ Waiting for parallel vacuum workers to finish index vacuum.

This change is out-of-date.

---

+ indexes_total bigint
+
+ Number of indexes that will be vacuumed or cleaned up. This value will be
+ 0 if the phase is not vacuuming indexes
+ or cleaning up indexes, INDEX_CLEANUP
+ is set to OFF, index vacuum is skipped due to very
+ few dead tuples in the table, or vacuum failsafe is triggered.
+ See for more on vacuum failsafe.

This explanation looks redundant: setting INDEX_CLEANUP to OFF essentially means the vacuum doesn't enter the vacuuming indexes phase. The same is true for the cases of skipping index vacuum and failsafe mode. How about the following?

    Total number of indexes that will be vacuumed or cleaned up. This
    number is reported as of the beginning of the vacuuming indexes
    phase or the cleaning up indexes phase.

---

+ indexes_completed bigint
+
+ Number of indexes vacuumed in the current vacuum cycle when the
+ phase is vacuuming indexes, or the number
+ of indexes cleaned up during the cleaning up indexes
+ phase.

How about simplifying the description to something like:

    Number of indexes processed. This counter only advances when the
    phase is vacuuming indexes or cleaning up indexes.

Also, index_processed sounds accurate to me. What do you think?

---

+ pcxt->parallel_progress_callback = NULL;
+ pcxt->parallel_progress_callback_arg = NULL;

I think these settings are not necessary since the pcxt is palloc0'ed.
---

+void
+parallel_vacuum_update_progress(void *arg)
+{
+    ParallelVacuumState *pvs = (ParallelVacuumState *) arg;
+
+    Assert(!IsParallelWorker());
+    Assert(pvs->pcxt->parallel_progress_callback_arg);
+
+    if (pvs)
+        pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
+            pg_atomic_add_fetch_u32(&(pvs->shared->nindexes_completed), 1));
+}

Assert(pvs->pcxt->parallel_progress_callback_arg) looks wrong to me. If 'arg' is NULL, a SEGV happens before the if (pvs) check is reached. I think it's better to update pvs->shared->nindexes_completed by both the leader and worker processes that processed the index.

---

+/* progress callback definition */
+typedef void (*ParallelProgressCallback) (void *parallel_progress_callback_state);
+
 typedef void (*parallel_worker_main_type) (dsm_segment *seg, shm_toc *toc);

I think it's better to make the function type consistent with the existing parallel_worker_main_type. How about parallel_progress_callback_type?

I've attached a patch that incorporates the above comments and has some suggestions for updating comments etc.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment: fix_v23_masahiko.patch
Re: Support logical replication of global object commands
On Thu, Feb 16, 2023 at 12:02 PM Amit Kapila wrote:
>
> On Tue, Aug 30, 2022 at 8:09 AM Zheng Li wrote:
> >
> > > > I think a publication of ALL OBJECTS sounds intuitive. Does it mean
> > > > we'll publish all DDL commands, all commit and abort operations in
> > > > every database if there is such a publication of ALL OBJECTS?
> > >
> > > Actually, I intend something for global objects. But the main thing
> > > that is worrying me about this is that we don't have a clean way to
> > > untie global object replication from database-specific object
> > > replication.
> >
> > I think ultimately we need a clean and efficient way to publish (and
> > subscribe to) any changes in all databases, preferably in one logical
> > replication slot.
>
> Agreed. I was thinking that currently, for logical replication, both the
> walsender and the slot are database-specific. So we need a way to
> distinguish the WAL for global objects and then avoid filtering based
> on the slot's database during decoding. I also thought about whether
> we want to have a WALSender that is not connected to a database for
> the replication of global objects, but I couldn't come up with a reason
> for doing so. Do you have any thoughts on this matter?

Another thing about the patch proposed here is that it LOGs the DDL for global objects without any consideration of whether that is required for logical replication. This is quite unlike what we are planning to do for other DDLs, where it will be logged only when the publication has defined an event trigger for it.

--
With Regards,
Amit Kapila.
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Thu, 16 Feb 2023 06:20:23 +, "Hayato Kuroda (Fujitsu)" wrote in
> Dear Horiguchi-san,
>
> Thank you for responding! Before modifying patches, I want to confirm
> something you said.
>
> > As Amit-K mentioned, we may need to change the name of the option in
> > this version, since the delay mechanism in this version causes a delay
> > in sending from the publisher rather than delaying apply on the
> > subscriber side.
>
> Right, will be changed.
>
> > I'm not sure why the output plugin is involved in the delay mechanism.
> > It appears to me that it would be simpler if the delay occurred in the
> > reorder buffer or logical decoder instead.
>
> I'm planning to change it, but..

Yeah, I don't think we've made up our minds about which way to go yet, so it's a bit too early to work on that.

> > Perhaps what I understand correctly is that we could delay right before
> > only sending commit records in this case. If we delay at the publisher
> > end, all changes will be sent at once if !streaming, and otherwise, all
> > changes in a transaction will be spooled at the subscriber end. In any
> > case, the apply worker won't be holding an active transaction
> > unnecessarily.
>
> What about the parallel case? The latest patch does not reject the
> combination of parallel streaming mode and delay. If the delay is done at
> commit and the subscriber uses a parallel apply worker, it may hold locks
> for a long time.

I didn't look too closely, but my guess is that transactions are conveyed in spool files in parallel mode, with each file storing a complete transaction.

> > Of course we need to add the mechanism to process keep-alive and status
> > report messages.
>
> Could you share a good way to handle keep-alive and status messages if you
> have one? If we moved this to the decoding layer, it would be strange to
> call walsender functions directly.

I'm sorry, but I don't have a concrete idea at the moment. When I read through the last patch, I missed that WalSndDelay is actually a subset of WalSndLoop. Although it can handle keep-alives correctly, I'm not sure we can accept that structure..

> > Those setups work fine when no apply-delay is involved, but they won't
> > work with the patches we're talking about because the subscriber won't
> > respond to the keep-alive packets from the peer. So when I wrote
> > "practically works" in the last mail, this is what I meant.
>
> I'm not sure about that part. I think in the latest patch, the subscriber
> can respond to the keepalive packets from the peer. Also, the publisher can
> respond to the peer. Could you please tell me if you know a case where the
> publisher or subscriber cannot respond to the opposite side? Note that if
> we apply the publisher-side patch, we don't have to apply the
> subscriber-side patch.

Sorry about that again; I missed that part in the last patch, as mentioned earlier..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Thu, Feb 16, 2023 at 2:25 PM Kyotaro Horiguchi wrote:
>
> At Thu, 16 Feb 2023 06:20:23 +, "Hayato Kuroda (Fujitsu)" wrote in
> > Dear Horiguchi-san,
> >
> > Thank you for responding! Before modifying patches, I want to confirm
> > something you said.
> >
> > > As Amit-K mentioned, we may need to change the name of the option in
> > > this version, since the delay mechanism in this version causes a delay
> > > in sending from publisher than delaying apply on the subscriber side.
> >
> > Right, will be changed.
> >
> > > I'm not sure why output plugin is involved in the delay mechanism. It
> > > appears to me that it would be simpler if the delay occurred in
> > > reorder buffer or logical decoder instead.
> >
> > I'm planning to change, but..
>
> Yeah, I don't think we've made up our minds about which way to go yet,
> so it's a bit too early to work on that.
>
> > > Perhaps what I understand correctly is that we could delay right before
> > > only sending commit records in this case. If we delay at publisher end,
> > > all changes will be sent at once if !streaming, and otherwise, all
> > > changes in a transaction will be spooled at subscriber end. In any
> > > case, apply worker won't be holding an active transaction
> > > unnecessarily.
> >
> > What about parallel case? Latest patch does not reject the combination of
> > parallel streaming mode and delay. If delay is done at commit and
> > subscriber uses an parallel apply worker, it may acquire lock for a long
> > time.
>
> I didn't looked too closely, but my guess is that transactions are
> conveyed in spool files in parallel mode, with each file storing a
> complete transaction.

No, we don't try to collect all the data in files for parallel mode. Having said that, it doesn't matter because we won't know the time of the commit (which is used to compute the delay) before we encounter the commit record in WAL. So, I feel for this approach, we can follow what you said.

> > > Of course we need add the mechanism to process keep-alive and status
> > > report messages.
> >
> > Could you share the good way to handle keep-alive and status messages if
> > you have? If we changed to the decoding layer, it is strange to call
> > walsender function directly.
>
> I'm sorry, but I don't have a concrete idea at the moment. When I read
> through the last patch, I missed that WalSndDelay is actually a subset
> of WalSndLoop. Although it can handle keep-alives correctly, I'm not
> sure we can accept that structure..

I think we can use update_progress_txn() for this purpose, but note that we are discussing changing the same in thread [1].

[1] - https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40awork3.anarazel.de

--
With Regards,
Amit Kapila.
Re: Allow logical replication to copy tables in binary format
Hi,

Please see the attached patch for the following changes.

Bharath Rupireddy wrote on Mon, 30 Jan 2023 at 15:34:
> On Mon, Jan 30, 2023 at 4:19 PM Melih Mutlu wrote:
>
> It'd be better and clearer to have a separate TAP test file IMO since
> the focus of the feature here isn't to just test for data types. With
> separate tests, you can verify "ERROR: incorrect binary data format
> in logical replication column 1" cases.

Moved some tests from 002_types.pl to 014_binary.pl since this is where most binary features are tested. It now covers "incorrect binary data format" cases too. Also added some regression tests for the copy_format parameter.

> With the above said, do you think checking for publisher versions is
> needed? The user can decide to enable/disable binary COPY based on the
> publisher's version, no?
>
> +    /* If the publisher is v16 or later, specify the format to copy data. */
> +    if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 16)
> +    {

If the user decides to enable it, then it might be nice not to allow it for previous versions. But I'm not sure. I'm okay with removing it if you all agree.

> Few more comments on v7:
> 1.
> +    Specifies the format in which pre-existing data on the publisher will
> +    be copied to the subscriber. Supported formats are
> +    text and binary. The default is text.
>
> It'll be good to call out in the documentation the cases where
> copy_format can be enabled and where it needs to be disabled.

Modified that description a bit. Can you check if it's okay now?

> 2.
> +    errmsg("%s value should be either \"text\" or \"binary\"",
>
> How about "value must be either "?

Done.

> 3.
> Why should the subscription's binary option and copy_format option be
> tied at all? Tying these two options hurts usability. Is there a
> fundamental reason? I think they both are/must be independent. One
> deals with data types and another deals with how initial table data is
> copied.

My initial purpose with this patch was just to make subscriptions with the binary option enabled fully binary, from initial copy to apply. Things changed a bit when we decided to move binary copy behind a parameter. I didn't actually think there would be any use case where a user wants the initial copy to be in binary format for a sub with binary = false. Do you think it would be useful to copy in binary even for a sub with binary disabled?

Thanks,
--
Melih Mutlu
Microsoft

Attachment: v8-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada wrote:
>
> On Tue, Feb 14, 2023 at 8:24 PM John Naylor wrote:
> > > I can think that something like traversing a HOT chain could visit
> > > offsets out of order. But fortunately we prune such collected TIDs
> > > before heap vacuum in heap case.
> >
> > Further, currently we *already* assume we populate the tid array in
> > order (for binary search), so we can just continue assuming that (with
> > an assert added since it's more public in this form). I'm not sure why
> > such basic common sense evaded me a few versions ago...
>
> Right. TidStore is implemented not only for heap, so loading
> out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.

> We have the following WIP comment in test_radixtree:
>
> // WIP: compiles with warnings because rt_attach is defined but not used
> // #define RT_SHMEM
>
> How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> and friends?

Sounds good to me, and the other fixes make sense as well.

> FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> seems to work fine.

That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)

> You removed the vacuum integration patch from v27, is there any reason
> for that?

Just an oversight.

Now for some general comments on the tid store...

+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

Do we need to do anything for this todo?

It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.

The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".

I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?

maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/

It might be worth using typedefs for the key and value types. Actually, since the key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?

The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.

Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.

I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.

Some comments on vacuum:

I think we'd better get some real-world testing of this, fairly soon. I had an idea: if it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During the index scan, check both the new store and the array, and log a warning (we don't want to exit or crash; better to try to investigate while live if possible) if the results don't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.

Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can get excited about. We also want to verify that progress reporting works as designed and has no weird corner cases.

 * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
...
+ * create a TidStore with the maximum bytes that can be used by the TidStore.

This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already mentioned in the previous paragraph that we set an upper bound.

- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (lo
Re: Allow logical replication to copy tables in binary format
Hi,

Thanks for reviewing. Please see v8 here [1].

Takamichi Osumi (Fujitsu) wrote on Wed, 1 Feb 2023 at 06:05:
> (1) general comment
>
> I wondered if the addition of the new option/parameter can introduce some
> confusion to the users.
>
> case 1. When binary = true and copy_format = text, the table sync is
> conducted by text.
> case 2. When binary = false and copy_format = binary, the table sync is
> conducted by binary.
> (Note that case 2 can be accepted after addressing the 3rd comment of
> Bharath-san in [1]. I agree with the 3rd comment by itself.)

I replied to Bharath's comment [1]; can you please check to see if that makes sense?

> The name of the new subscription parameter looks confusing.
> How about "init_sync_format" or something?

The option to enable initial sync is named "copy_data", so I named the new parameter "copy_format" to refer to the copy meant by "copy_data". Maybe "copy_data_format" would be better. I can change it if you think it's better.

> (2) The commit message doesn't get updated.

Done.

> (3) whitespace errors.

Done.

> (4) pg_dump.c

Done.

> (5) describe.c

Done.

> (6)
> + * Extract the copy format value from a DefElem.
> + */
> +char
> +defGetCopyFormat(DefElem *def)
>
> Shouldn't this function be static, removing the change to
> subscriptioncmds.h?

I wanted to make "defGetCopyFormat" consistent with "defGetStreamingMode" since they're basically doing the same work for different parameters. That function isn't static, so I didn't make "defGetCopyFormat" static either.

> (7) catalogs.sgml

Done.

> (8) create_subscription.sgml

Done.

> Also; the latest patch hasn't been updated for more than two weeks
> after some review comments. If you don't have time,
> I would like to help update the patch for you and other reviewers.
>
> Kindly let me know your status.

Sorry for the delay. This patch is currently one of my priorities. Hopefully I will share quicker updates from now on.
[1] https://www.postgresql.org/message-id/CAGPVpCQYi9AYQSS%3DRmGgVNjz5ZEnLB8mACwd9aioVhLmbgiAMA%40mail.gmail.com Thanks, -- Melih Mutlu Microsoft
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
FYI the last patch does not apply cleanly anymore. So a rebase is needed.
Re: Move defaults toward ICU in 16?
On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis wrote:
> It feels very wrong to me to explain that sort order is defined by the
> operating system on which Postgres happens to run. Saying that it's
> defined by ICU, which is part of the Unicode consortium, is much better.
> It doesn't eliminate versioning issues, of course, but I think it's a
> better explanation for users.

The fact that we can't use ICU on Windows, though, weakens this argument a lot. In my experience, we have a lot of Windows users, and they're not any happier with the operating system collations than Linux users. Possibly less so.

I feel like this is a very difficult kind of change to judge. If everyone else feels this is a win, we should go with it, and hopefully we'll end up better off. I do feel like there are things that could go wrong, though, between the imperfect documentation, the fact that a substantial chunk of our users won't be able to use it because they run Windows, and everybody having to adjust to the behavior change.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Weird failure with latches in curculio on v15
On Thu, Feb 9, 2023 at 10:53 PM Nathan Bossart wrote:
> I've been thinking about this, actually. I'm wondering if we could provide
> a list of files to the archiving callback (configurable via a variable in
> ArchiveModuleState), and then have the callback return a list of files that
> are archived. (Or maybe we just put the list of files that need archiving
> in ArchiveModuleState.) The returned list could include files that were
> sent to the callback previously. The archive module would be responsible
> for creating background worker(s) (if desired), dispatching files
> to-be-archived to its background worker(s), and gathering the list of
> archived files to return.

Hmm. So in this design, the archiver doesn't really do the archiving any more, because the interface makes that impossible. It has to use a separate background worker process for that, full stop. I don't think that's a good design. It's fine if some people want to implement it that way, but it shouldn't be forced by the interface.

--
Robert Haas
EDB: http://www.enterprisedb.com
Missing default value of createrole_self_grant in document
Hi hackers,

I noticed that the documentation of the GUC createrole_self_grant doesn't mention its default value. The attached patch adds that.

Regards,
Shi Yu

Attachment: v1-0001-Add-default-value-of-createrole_self_grant-in-doc.patch
Re: Weird failure with latches in curculio on v15
On Fri, Feb 10, 2023 at 12:59 AM Andres Freund wrote:
> I'm somewhat concerned about that too, but perhaps from a different
> angle. First, I think we don't do our users a service by defaulting the
> in-core implementation to something that doesn't scale to even a moderately
> busy server.

+1.

> Second, I doubt we'll get the API for any of this right without an
> actual user that does something more complicated than restoring one-by-one
> in a blocking manner.

Fair.

> I don't think it's that hard to imagine problems. To be reasonably fast, a
> decent restore implementation will have to 'restore ahead'. Which also
> provides ample things to go wrong. E.g.
>
> - WAL source is switched, restore module needs to react to that, but doesn't;
>   we end up with lots of wasted work, or worse, filename conflicts
> - recovery follows a timeline, restore module doesn't catch on quickly enough
> - end of recovery happens, restore just continues on

I don't see how you can prevent those things from happening. If the restore process is working in some way that requires an event loop, and I think that will be typical for any kind of remote archiving, then either it has control most of the time, so the event loop can be run inside the restore process, or, as Nathan proposes, we don't let the archiver have control and it needs to run that restore process in a separate background worker. The hazards that you mention here exist either way. If the event loop is running inside the restore process, it can decide not to call the functions that we provide in a timely fashion and thus fail to react as it should. If the event loop runs inside a separate background worker, then that process can fail to be responsive in precisely the same way. Fundamentally, if the author of a restore module writes code to have multiple I/Os in flight at the same time and does not write code to cancel those I/Os if something changes, then such cancellation will not occur. That remains true no matter which process is performing the I/O.

> > I don't quite see how you can make asynchronous and parallel archiving
> > work if the archiver process only calls into the archive module at
> > times that it chooses. That would mean that the module has to return
> > control to the archiver when it's in the middle of archiving one or
> > more files -- and then I don't see how it can get control back at the
> > appropriate time. Do you have a thought about that?
>
> I don't think the archiver is the hard part; that already has a dedicated
> process, and it also has something of a queuing system already. The startup
> process imo is the complicated one...
>
> If we had a 'restorer' process, and startup fed some sort of a queue with
> things to restore in the near future, it might be more realistic to do
> something you describe?

Some kind of queueing system might be a useful part of the interface, and a dedicated restorer process does sound like a good idea. But the archiver doesn't have this solved, precisely because you have to archive a single file, return control, and wait to be invoked again for the next file. That does not scale.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: wrong query result due to wang plan
On Thu, Feb 16, 2023 at 3:16 PM tender wang wrote:
> select ref_1.r_comment as c0, subq_0.c1 as c1 from public.region as
> sample_0 right join public.partsupp as sample_1 right join public.lineitem
> as sample_2 on (cast(null as path) = cast(null as path)) on (cast(null as
> "timestamp") < cast(null as "timestamp")) inner join public.lineitem as
> ref_0 on (true) left join (select sample_3.ps_availqty as c1,
> sample_3.ps_comment as c2 from public.partsupp as sample_3 where false
> order by c1, c2 ) as subq_0 on (sample_1.ps_supplycost = subq_0.c1 ) right
> join public.region as ref_1 on (sample_1.ps_availqty = ref_1.r_regionkey )
> where ref_1.r_comment is not NULL order by c0, c1;

The repro can be reduced to the query below.

create table t (a int, b int);

# explain (costs off) select * from t t1 left join
    (t t2 inner join t t3 on false left join t t4 on t2.b = t4.b)
    on t1.a = t2.a;
        QUERY PLAN
--------------------------
 Result
   One-Time Filter: false
(2 rows)

As we can see, the joinrel at the final level is marked as dummy, which is wrong. I traced this issue down to distribute_qual_to_rels() when we handle a variable-free clause. If such a clause is not an outer-join clause, and it contains no volatile functions either, we assign it the full relid set of the current JoinDomain. I doubt this is always correct. In the query above, the clause 'false' is assigned relids {t2, t3, t4, t2/t4}, and that makes it a pushed-down restriction to the second left join. This is all right if we plan this query in the user-given order. But if we've commuted the two left joins, which is legal, this pushed-down constant-false restriction would make the final joinrel dummy.

It seems we still need to check whether a variable-free qual comes from somewhere that is below the nullable side of an outer join before we decide that it can be evaluated at join domain level, just like we did before. So I wonder if we can add a 'below_outer_join' flag in JoinTreeItem, fill its value during deconstruct_recurse, and check it in distribute_qual_to_rels() like

       /* eval at join domain level if not below outer join */
-      relids = bms_copy(jtitem->jdomain->jd_relids);
+      relids = jtitem->below_outer_join ?
+          bms_copy(qualscope) : bms_copy(jtitem->jdomain->jd_relids);

Thanks
Richard
Re: Normalization of utility queries in pg_stat_statements
Hi,

On 2/16/23 1:34 AM, Michael Paquier wrote:
> While wondering about this stuff over the last few days and discussing with
> Bertrand, I have changed my mind on the point that there is no need to be
> that aggressive yet with the normalization of the A_Const nodes, because
> the query string normalization of pg_stat_statements is not prepared yet to
> handle cases where an A_Const value uses a non-quoted value with
> whitespace. The two cases where I saw an impact are the commands that can
> define an isolation level: SET TRANSACTION and BEGIN. For example, applying
> normalization to A_Const nodes does the following as of HEAD:
>
> 1) BEGIN TRANSACTION READ ONLY, READ WRITE, DEFERRABLE, NOT DEFERRABLE;
>    BEGIN TRANSACTION $1 ONLY, $2 WRITE, $3, $4 DEFERRABLE
> 2) SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
>    SET TRANSACTION ISOLATION LEVEL $1 COMMITTED
>
> On top of that, specifying a different isolation level may cause these
> commands to be grouped, which is not really cool. All that could be done
> incrementally later on, in 17~ or later depending on the adjustments that
> make sense.

Thanks for those patches! Yeah, agree about the proposed approach.

> 0002 has been completed with a couple of commands to track all the commands
> with A_Const, so as we never lose sight of what happens. 0004 is what I
> think could be done for PG16, where normalization affects only Const. At
> the end of the day, this reflects the following commands that use Const
> nodes because they use queries directly, so the same rules as SELECT and
> DMLs apply to them:
> - DECLARE
> - EXPLAIN
> - CREATE MATERIALIZED VIEW
> - CTAS, SELECT INTO

0001:

I like the idea of splitting the existing tests into dedicated files. What do you think about removing

    SET pg_stat_statements.track_utility = FALSE;
    SET pg_stat_statements.track_planning = TRUE;

in the new pg_stat_statements.sql? That way pg_stat_statements.sql would always behave with the default values for those (currently we are setting both of them to non-default values). Then, with the default values in place, if we feel that some tests are missing, we could add them in utility.sql or planning.sql accordingly.

0002:

Produces:

    v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:834: trailing whitespace.
    CREATE VIEW view_stats_1 AS
    v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:838: trailing whitespace.
    CREATE VIEW view_stats_1 AS
    warning: 2 lines add whitespace errors.

+-- SET TRANSACTION ISOLATION
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
+SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

What about adding things like "SET SESSION CHARACTERISTICS AS TRANSACTION ..." too?

0003 and 0004:

Thanks for keeping 0003; it's useful to see the impact of A_Const normalization. Looking at the diff they produce, I also think that 0004 is what could be done for PG16.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: wrong query result due to wang plan
On Thu, Feb 16, 2023 at 5:50 PM Richard Guo wrote:
> It seems we still need to check whether a variable-free qual comes from
> somewhere that is below the nullable side of an outer join before we
> decide that it can be evaluated at join domain level, just like we did
> before. So I wonder if we can add a 'below_outer_join' flag in
> JoinTreeItem, fill its value during deconstruct_recurse, and check it in
> distribute_qual_to_rels() like
>
>       /* eval at join domain level if not below outer join */
> -     relids = bms_copy(jtitem->jdomain->jd_relids);
> +     relids = jtitem->below_outer_join ?
> +         bms_copy(qualscope) : bms_copy(jtitem->jdomain->jd_relids);

To be concrete, I mean something like attached.

Thanks
Richard

v1-0001-Fix-variable-free-clause-distribution.patch
Description: Binary data
Re: Move defaults toward ICU in 16?
On Thu, 2023-02-16 at 15:05 +0530, Robert Haas wrote:
> On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis wrote:
> > It feels very wrong to me to explain that sort order is defined by the
> > operating system on which Postgres happens to run. Saying that it's
> > defined by ICU, which is part of the Unicode consortium, is much better.
> > It doesn't eliminate versioning issues, of course, but I think it's a
> > better explanation for users.
>
> The fact that we can't use ICU on Windows, though, weakens this
> argument a lot. In my experience, we have a lot of Windows users, and
> they're not any happier with the operating system collations than
> Linux users. Possibly less so.
>
> I feel like this is a very difficult kind of change to judge. If
> everyone else feels this is a win, we should go with it, and hopefully
> we'll end up better off. I do feel like there are things that could go
> wrong, though, between the imperfect documentation, the fact that a
> substantial chunk of our users won't be able to use it because they
> run Windows, and everybody having to adjust to the behavior change.

Unless I misunderstand, the lack of Windows support is not a matter of principle and can be added later on, right?

I am in favor of changing the default.

It might be good to add a section to the documentation in "Server setup and operation" recommending that if you go with the default choice of ICU, you should configure your package manager not to upgrade the ICU library.

Yours,
Laurenz Albe
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
On Wed, Feb 15, 2023 at 3:35 AM Nathan Bossart wrote:
>
> On Sat, Jan 21, 2023 at 06:42:08AM +0100, Drouvot, Bertrand wrote:
> > On 1/20/23 9:01 PM, Nathan Bossart wrote:
> >> Should we also change the related
> >> variables (e.g., ndeletable in _hash_vacuum_one_page()) to uint16?
> >
> > Yeah, I thought about it too. What I saw is that there is other places that
> > would be good candidates for the same kind of changes (see the int
> > ntodelete argument in gistXLogDelete() being assigned to
> > gistxlogDelete.ntodelete (uint16) for example).
> >
> > So, what do you think about:
> >
> > 1) keep this patch as it is (to "only" address the struct field and avoid
> > possible future "useless" padding size increase)
> > and
> > 2) create a new patch (once this one is committed) to align the types for
> > variables/arguments with the structs (related to XLOG records) fields when
> > they are not?
>
> Okay. I've marked this one as ready-for-committer, then.

LGTM. I think the padding space we are trying to save here can be used for the patch [1], right?

BTW, feel free to create the second patch (to align the types for variables/arguments) as that would be really helpful after we commit this one. I think this would require XLOG_PAGE_MAGIC as it changes the WAL record.

BTW, how about a commit message like:

    Change xl_hash_vacuum_one_page.ntuples from int to uint16.

    This will create two bytes of padding space in xl_hash_vacuum_one_page
    which can be used for future patches. This makes the datatype of
    xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which
    is advisable as both are used for the same purpose.

[1] - https://www.postgresql.org/message-id/2d62f212-fce6-d639-b9eb-2a5bc4bec3b4%40gmail.com

--
With Regards,
Amit Kapila.
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Thu, Feb 16, 2023 at 10:03 AM David Rowley wrote:
> I suspect it's slower because the final sort must sort the entire
> array still without knowledge that portions of it are pre-sorted. It
> would be very interesting to improve this and do some additional work
> and keep track of the "memtupsortedto" index by pushing them onto a
> List each time we cross the availMem boundary, then qsort just
> the final portion of the array in tuplesort_performsort() before doing
> a k-way merge on each segment rather than qsorting the entire thing
> again. I suspect this would be faster when work_mem exceeds L3 by some
> large amount.

Sounds like a reasonable thing to try. It seems like in-memory merge could still use abbreviation, unlike external merge.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PATCH] Align GSS and TLS error handling in PQconnectPoll()
Patch looks good to me. Definitely an improvement over the status quo.

Looking at the TLS error handling though I see these two lines:

    && conn->allow_ssl_try    /* redundant? */
    && !conn->wait_ssl_try)   /* redundant? */

Are they actually redundant like the comment suggests? If so, we should probably remove them (in another patch). If not (or if we don't know), should we have these same checks for GSS?
Re: Missing default value of createrole_self_grant in document
> On 16 Feb 2023, at 10:47, shiy.f...@fujitsu.com wrote: > I noticed that the document of GUC createrole_self_grant doesn't mention its > default value. The attached patch adds that. Agreed, showing the default value in the documentation is a good pattern IMO. Unless objected to I'll go apply this in a bit. -- Daniel Gustafsson
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi Shveta and Shi,

Thanks for your investigations.

shveta malik wrote on Wed, Feb 8, 2023 at 16:49:
> On Tue, Feb 7, 2023 at 8:18 AM shiy.f...@fujitsu.com wrote:
> > I reproduced the problem I reported before with latest patch (v7-0001,
> > v10-0002), and looked into this problem. It is caused by a similar reason.
> > Here is some analysis for the problem I reported [1].#6.
> >
> > First, a tablesync worker (worker-1) started for "tbl1", its originname is
> > "pg_16398_1". And it exited because of unique constraint. In
> > LogicalRepSyncTableStart(), originname in pg_subscription_rel is updated
> > when updating table state to DATASYNC, and the origin is created when
> > updating table state to FINISHEDCOPY. So when it exited with state
> > DATASYNC, the origin is not created but the originname has been updated
> > in pg_subscription_rel.
> >
> > Then a tablesync worker (worker-2) started for "tbl2", its originname is
> > "pg_16398_2". After tablesync of "tbl2" finished, this worker moved to
> > sync table "tbl1". In LogicalRepSyncTableStart(), it got the originname of
> > "tbl1" - "pg_16398_1", by calling ReplicationOriginNameForLogicalRep(),
> > and tried to drop the origin (although it is not actually created before).
> > After that, it called replorigin_by_name to get the originid whose name is
> > "pg_16398_1" and the result is InvalidOid. Origin won't be created in this
> > case because the sync worker has created a replication slot (when it
> > synced tbl2), so the originid was still invalid and it caused an assertion
> > failure when calling replorigin_advance().
> >
> > It seems we don't need to drop previous origin in worker-2 because the
> > previous origin was not created in worker-1. I think one way to fix it is
> > to not update originname of pg_subscription_rel when setting state to
> > DATASYNC, and only do that when setting state to FINISHEDCOPY.
> > If so, the originname in pg_subscription_rel will be set at the same time
> > the origin is created.
>
> +1. Update of system-catalog needs to be done carefully and only when
> origin is created.

I see that setting originname in the catalog before actually creating it causes issues. My concern with setting originname when setting the state to FINISHEDCOPY is that if the worker waits until FINISHEDCOPY to update the slot/origin name but fails somewhere after creating the slot/origin and before reaching FINISHEDCOPY, then the newly created slot/origin will be left behind. It wouldn't be possible to find and drop them since their names are not stored in the catalog. Eventually, this might also cause hitting the max_replication_slots limit in case of such failures between origin creation and FINISHEDCOPY.

One fix I can think of is to update the catalog right after creating a new origin. But this would also require committing the current transaction to actually persist the originname. I guess this action of committing the transaction in the middle of initial sync could hurt the copy process. What do you think?

Also; working on an updated patch to address your other comments. Thanks again.

Best,
--
Melih Mutlu
Microsoft
Re: Considering additional sort specialisation functions for PG16
I wrote: > Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null: Hacking on this has gotten only as far as the "compiles but segfaults" stage, but I wanted to note an idea that occurred to me: Internal qsort doesn't need the srctape member, and removing both that and isnull1 would allow 16-byte "init-tuples" for qsort, which would save a bit of work_mem space, binary space for qsort specializations, and work done during swaps. During heap sort, we already copy one entry into a stack variable to keep from clobbering it, so it's not a big step to read a member from the init array and form a regular sorttuple from it. -- John Naylor EDB: http://www.enterprisedb.com
Re: proposal: psql: psql variable BACKEND_PID
On Thu, 16 Feb 2023 at 12:44, Pavel Stehule wrote: > To find and use pg_backend_pid is not rocket science. But use :BACKEND_PID is > simpler. I wanted to call out that if there's a connection pooler (e.g. PgBouncer) in the middle, then BACKEND_PID (and %p) are incorrect, but pg_backend_pid() would work for the query. This might be an edge case, but if BACKEND_PID is added it might be worth listing this edge case in the docs somewhere.
Re: Some revises in adding sorting path
Hi Richard,

On Tue, Jan 10, 2023 at 8:06 PM Richard Guo wrote:
> In add_paths_with_pathkeys_for_rel, we do not try incremental sort atop
> of the epq_path, which I think we can do. I'm not sure how useful this
> is in real world since the epq_path is used only for EPQ checks, but it
> seems doing that doesn't cost too much.

I'm not sure this is a good idea, because the epq_path will return at most one tuple in an EPQ recheck. The reason an extra Sort node is injected into the epq_path is only to label it with the correct sort order so that it can be used as an input for the EPQ merge-join path of a higher-level foreign join, so shouldn't we keep this step as simple as possible and save cycles, even if only a little?

Sorry for being late to the party.

Best regards,
Etsuro Fujita
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
Hi,

On 2/16/23 12:00 PM, Amit Kapila wrote:
> On Wed, Feb 15, 2023 at 3:35 AM Nathan Bossart wrote:
> > On Sat, Jan 21, 2023 at 06:42:08AM +0100, Drouvot, Bertrand wrote:
> > > On 1/20/23 9:01 PM, Nathan Bossart wrote:
> > > > Should we also change the related variables (e.g., ndeletable in
> > > > _hash_vacuum_one_page()) to uint16?
> > >
> > > Yeah, I thought about it too. What I saw is that there is other places
> > > that would be good candidates for the same kind of changes (see the int
> > > ntodelete argument in gistXLogDelete() being assigned to
> > > gistxlogDelete.ntodelete (uint16) for example).
> > >
> > > So, what do you think about:
> > > 1) keep this patch as it is (to "only" address the struct field and avoid
> > > possible future "useless" padding size increase)
> > > and
> > > 2) create a new patch (once this one is committed) to align the types for
> > > variables/arguments with the structs (related to XLOG records) fields
> > > when they are not?
> >
> > Okay. I've marked this one as ready-for-committer, then.
>
> LGTM.

Thanks for looking at it!

> I think the padding space we are trying to save here can be used for the
> patch [1], right?

Yes exactly, without the current patch and adding isCatalogRel (from the patch [1] you mentioned) we would get:

(gdb) ptype /o struct xl_hash_vacuum_one_page
/* offset    |  size */  type = struct xl_hash_vacuum_one_page {
/*      0    |     4 */    TransactionId snapshotConflictHorizon;
/*      4    |     4 */    int ntuples;
/*      8    |     1 */    _Bool isCatalogRel;
                           /* XXX 3-byte padding */

                           /* total size (bytes):   12 */
                         }

While with the proposed patch:

(gdb) ptype /o struct xl_hash_vacuum_one_page
/* offset    |  size */  type = struct xl_hash_vacuum_one_page {
/*      0    |     4 */    TransactionId snapshotConflictHorizon;
/*      4    |     2 */    uint16 ntuples;
/*      6    |     1 */    _Bool isCatalogRel;
                           /* XXX 1-byte padding */

                           /* total size (bytes):    8 */
                         }

> BTW, feel free to create the second patch (to align the types for
> variables/arguments) as that would be really helpful after we commit
> this one.

Yes, will do.

> I think this would require XLOG_PAGE_MAGIC as it changes the WAL record.
Oh, I was not aware of that, thanks! Will do in V2 (and in the logical decoding on standby patch, as it adds the new field mentioned above).

> BTW, how about a commit message like:
>
> Change xl_hash_vacuum_one_page.ntuples from int to uint16.
>
> This will create two bytes of padding space in xl_hash_vacuum_one_page
> which can be used for future patches. This makes the datatype of
> xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which
> is advisable as both are used for the same purpose.

LGTM, will add it to V2!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: pg_walinspect memory leaks
On Tue, Feb 14, 2023 at 4:07 PM Bharath Rupireddy wrote: > > On Tue, Feb 14, 2023 at 6:25 AM Andres Freund wrote: > > > > Hi, > > > > On 2023-02-13 15:22:02 -0800, Peter Geoghegan wrote: > > > More concretely, it looks like GetWALRecordInfo() calls > > > CStringGetTextDatum/cstring_to_text in a way that accumulates way too > > > much memory in ExprContext. > > > > Additionally, we leak two stringinfos for each record. > > > > > > > This could be avoided by using a separate memory context that is reset > > > periodically, or something else along the same lines. > > > > Everything other than a per-row memory context that's reset each time seems > > hard to manage in this case. > > > > Somehwat funnily, GetWALRecordsInfo() then ends up being unnecessarily > > dilligent about cleaning up O(1) memory, after not caring about O(N) > > memory... > > Thanks for reporting. I'll get back to you on this soon. The memory usage goes up with many WAL records in GetWALRecordsInfo(). The affected functions are pg_get_wal_records_info() and pg_get_wal_records_info_till_end_of_wal(). I think the best way to fix this is to use a temporary memory context (like the jsonfuncs.c), reset it after every tuple is put into the tuple store. This fix keeps the memory under limits. I'm attaching the patches here. For HEAD, I'd want to be a bit defensive and use the temporary memory context for pg_get_wal_fpi_info() too. And, the fix also needs to be back-patched to PG15. 
[1] HEAD: PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 1105979 ubuntu20 0 28.5g 28.4g 150492 R 80.7 93.0 1:47.12 postgres PATCHED: PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 13149 ubuntu20 0 173244 156872 150688 R 79.0 0.5 1:25.09 postgres postgres=# select count(*) from pg_get_wal_records_info_till_end_of_wal('0/100'); count -- 35285649 (1 row) postgres=# select pg_backend_pid(); pg_backend_pid 13149 (1 row) -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 56ce258a6c4b4118a4cbe6612fbafb6cb172f7e3 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 15 Feb 2023 06:26:30 + Subject: [PATCH v1] Limit memory usage of pg_walinspect functions Some of the pg_walinspect functions can overuse and leak memory across WAL records iterations and can cause OOM if there are many WAL records to output are present. Fix this by using a temporary memory context that's reset for each WAL record iteraion. Reported-by: Peter Geoghegan Author: Bharath Rupireddy Discussion: https://www.postgresql.org/message-id/CAH2-WznLEJjn7ghmKOABOEZYuJvkTk%3DGKU3m0%2B-XBAH%2BerPiJQ%40mail.gmail.com Backpatch-through: 15 --- contrib/pg_walinspect/pg_walinspect.c | 28 +++ 1 file changed, 28 insertions(+) diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c index 91b740ed27..9d429c6d73 100644 --- a/contrib/pg_walinspect/pg_walinspect.c +++ b/contrib/pg_walinspect/pg_walinspect.c @@ -304,6 +304,8 @@ pg_get_wal_fpi_info(PG_FUNCTION_ARGS) XLogRecPtr start_lsn; XLogRecPtr end_lsn; XLogReaderState *xlogreader; + MemoryContext old_cxt; + MemoryContext tmp_cxt; start_lsn = PG_GETARG_LSN(0); end_lsn = PG_GETARG_LSN(1); @@ -314,14 +316,26 @@ pg_get_wal_fpi_info(PG_FUNCTION_ARGS) xlogreader = InitXLogReaderState(start_lsn); + tmp_cxt = AllocSetContextCreate(CurrentMemoryContext, + "pg_get_wal_fpi_info temporary cxt", + ALLOCSET_DEFAULT_SIZES); + while (ReadNextXLogRecord(xlogreader) 
&& xlogreader->EndRecPtr <= end_lsn) { + /* Use the tmp context so we can clean up after each tuple is done */ + old_cxt = MemoryContextSwitchTo(tmp_cxt); + GetWALFPIInfo(fcinfo, xlogreader); + /* clean up and switch back */ + MemoryContextSwitchTo(old_cxt); + MemoryContextReset(tmp_cxt); + CHECK_FOR_INTERRUPTS(); } + MemoryContextDelete(tmp_cxt); pfree(xlogreader->private_data); XLogReaderFree(xlogreader); @@ -440,23 +454,37 @@ GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn, ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; Datum values[PG_GET_WAL_RECORDS_INFO_COLS] = {0}; bool nulls[PG_GET_WAL_RECORDS_INFO_COLS] = {0}; + MemoryContext old_cxt; + MemoryContext tmp_cxt; InitMaterializedSRF(fcinfo, 0); xlogreader = InitXLogReaderState(start_lsn); + tmp_cxt = AllocSetContextCreate(CurrentMemoryContext, + "GetWALRecordsInfo temporary cxt", + ALLOCSET_DEFAULT_SIZES); + while (ReadNextXLogRecord(xlogreader) && xlogreader->EndRecPtr <= end_lsn) { + /* Use the tmp context so we can clean up after each tuple is done */ + old_cxt = MemoryContextSwitchTo(tmp_cxt); + GetWALRecordInfo(xlogreader, values, nulls, PG_GET_WAL_RECORDS_INFO_COLS); tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
Re: Allow logical replication to copy tables in binary format
On Mon, Jan 30, 2023 at 4:19 PM Melih Mutlu wrote:
>
> Hi Bharath,
>
> Thanks for reviewing.
>
> Bharath Rupireddy wrote on Wed, Jan 18, 2023 at 10:17:
>>
>> On Thu, Jan 12, 2023 at 1:53 PM Melih Mutlu wrote:
>> 1. The performance numbers posted upthread [1] look impressive for the
>> use-case tried, that's a 2.25X improvement or 55.6% reduction in
>> execution times. However, it'll be great to run a few more varied
>> tests to confirm the benefit.
>
> Sure, do you have any specific test case or suggestion in mind?
>
>> 2. It'll be great to capture the perf report to see if the time spent
>> in copy_table() is reduced with the patch.
>
> Will provide that in another email soon.
>
>> 3. I think blending initial table sync's binary copy option with
>> data-type level binary send/receive is not good. Moreover, data-type
>> level binary send/receive has its own restrictions per 9de77b5453.
>> IMO, it'll be good to have a new option, say copy_data_format synonyms
>> with COPY command's FORMAT option but with only text (default) and
>> binary values.
>
> Added a "copy_format" option for subscriptions with text as default value.
> So it would be possible to create a binary subscription but copy tables in
> text format to avoid restrictions that you're concerned about.
>
>> 4. Why to add tests to existing 002_types.pl? Can we add a new file
>> with all the data types covered?
>
> Since 002_types.pl is where replication of data types is covered, I thought
> it would be okay to test replication with the binary option in that file.
> Sure, we can add a new file with different data types for testing
> subscriptions with the binary option. But then I feel like it would have
> too many duplicated lines with 002_types.pl.
> If you think that 002_types.pl lacks some data types that need to be
> tested, then we should add those into 002_types.pl too, whether we test
> subscription with the binary option in that file or some other place,
> right?
>
>> 5.
It's not clear to me as to why these rows were removed in the patch? >> -(1, '{1, 2, 3}'), >> -(2, '{2, 3, 1}'), >> (3, '{3, 2, 1}'), >> (4, '{4, 3, 2}'), >> (5, '{5, NULL, 3}'); >> >> -- test_tbl_arrays >> INSERT INTO tst_arrays (a, b, c, d) VALUES >> -('{1, 2, 3}', '{"a", "b", "c"}', '{1.1, 2.2, 3.3}', '{"1 >> day", "2 days", "3 days"}'), >> -('{2, 3, 1}', '{"b", "c", "a"}', '{2.2, 3.3, 1.1}', '{"2 >> minutes", "3 minutes", "1 minute"}'), > > > Previously, it wasn't actually testing the initial table sync since all > tables were empty when subscription was created. > I just simply split the data initially inserted to test initial table sync. > > With this patch, it inserts the first two rows for all data types before > subscriptions get created. > You can see these lines: >> >> +# Insert initial test data >> +$node_publisher->safe_psql( >> + 'postgres', qq( >> + -- test_tbl_one_array_col >> + INSERT INTO tst_one_array (a, b) VALUES >> + (1, '{1, 2, 3}'), >> + (2, '{2, 3, 1}'); >> + >> + -- test_tbl_arrays >> + INSERT INTO tst_arrays (a, b, c, d) VALUES > > > >> >> 6. BTW, the subbinary description is missing in pg_subscription docs >> https://www.postgresql.org/docs/devel/catalog-pg-subscription.html? >> - Specifies whether the subscription will request the publisher to >> - send the data in binary format (as opposed to text). >> + Specifies whether the subscription will copy the initial data to >> + synchronize relations in binary format and also request the >> publisher >> + to send the data in binary format too (as opposed to text). > > > Done. > >> >> 7. A nitpick - space is needed after >= before 16. >> +if (walrcv_server_version(LogRepWorkerWalRcvConn) >=16 && > > > Done. > >> >> 8. Note that the COPY binary format isn't portable across platforms >> (Windows to Linux for instance) or major versions >> https://www.postgresql.org/docs/devel/sql-copy.html, whereas, logical >> replication is >> https://www.postgresql.org/docs/devel/logical-replication.html. 
>> I don't see any handling of such cases in copy_table but only a check >> for the publisher version. I think we need to account for all the >> cases - allow binary COPY only when publisher and subscriber are of >> same versions, architectures, platforms. The trick here is how we >> identify if the subscriber is of the same type and taste >> (architectures and platforms) as the publisher. Looking for >> PG_VERSION_STR of publisher and subscriber might be naive, but I'm not >> sure if there's a better way to do it. > > > I think having the "copy_format" option with text as default, like I replied > to your 3rd review above, will keep things as they are now. > The patch now will only allow users to choose binary copy as well, if they > want it and acknowledge the limitations that come with binary
Re: Silent overflow of interval type
On Thu, Feb 16, 2023 at 1:12 AM Tom Lane wrote: > Yeah, I don't think this would create a performance problem, at least not > if you're using a compiler that implements pg_sub_s64_overflow reasonably. > (And if you're not, and this bugs you, the answer is to get a better Please find attached the v2 of the said patch with the tests added. I tested and it applies with all tests passing both on REL_14_STABLE, REL_15_STABLE and master. I don't know how the decision on backpatching is made and whether it makes sense here or not. If any additional work is required, please let me know. > By chance did you look at all other nearby cases, is it the only place > with overflow? Not really, no. The other place where it could overflow was in the interval justification function and it was fixed about a year ago. That wasn't backpatched afaict. See https://postgr.es/m/caavxfhenqsj2xyfbpuf_8nnquijqkag04nw6abqq0dbzsxf...@mail.gmail.com Regards, Nick From 52d49e90b73d13c9acfd2b85f1ae38dfb0f64f9d Mon Sep 17 00:00:00 2001 From: Nick Babadzhanian Date: Thu, 16 Feb 2023 13:38:34 +0100 Subject: [PATCH] Address interval overflow and add corresponding tests --- src/backend/utils/adt/timestamp.c | 6 +- src/test/regress/expected/interval.out | 9 + src/test/regress/sql/interval.sql | 4 3 files changed, 18 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index f70f829d83..3ff51102a8 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -2688,7 +2688,11 @@ timestamp_mi(PG_FUNCTION_ARGS) (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), errmsg("cannot subtract infinite timestamps"))); - result->time = dt1 - dt2; + /* Subtract dt1 and dt2 with overflow detection */ + if (unlikely(pg_sub_s64_overflow(dt1, dt2, &result->time))) + ereport(ERROR, +(errcode(ERRCODE_INTERVAL_FIELD_OVERFLOW), + errmsg("interval field out of range"))); result->month = 0; result->day = 0; diff --git 
a/src/test/regress/expected/interval.out b/src/test/regress/expected/interval.out index 78c1fab2b6..280f25c218 100644 --- a/src/test/regress/expected/interval.out +++ b/src/test/regress/expected/interval.out @@ -235,6 +235,15 @@ LINE 1: INSERT INTO INTERVAL_TBL_OF (f1) VALUES ('-2147483648 years'... -- Test edge-case overflow detection in interval multiplication select extract(epoch from '256 microseconds'::interval * (2^55)::float8); ERROR: interval out of range +-- Test edge-case overflow in timestamp[tz] subtraction +SELECT timestamptz'294276-12-31 23:59:59 UTC' - timestamptz'1999-12-23 19:59:04.224193 UTC' AS doesnt_overflow; +doesnt_overflow + + 106751991 days 04:00:54.775807 +(1 row) + +SELECT timestamptz'294276-12-31 23:59:59 UTC' - timestamptz'1999-12-23 19:59:04.224192 UTC' AS overflows; +ERROR: interval field out of range SELECT r1.*, r2.* FROM INTERVAL_TBL_OF r1, INTERVAL_TBL_OF r2 WHERE r1.f1 > r2.f1 diff --git a/src/test/regress/sql/interval.sql b/src/test/regress/sql/interval.sql index 55a449b617..a228cf1445 100644 --- a/src/test/regress/sql/interval.sql +++ b/src/test/regress/sql/interval.sql @@ -76,6 +76,10 @@ INSERT INTO INTERVAL_TBL_OF (f1) VALUES ('-2147483648 years'); -- Test edge-case overflow detection in interval multiplication select extract(epoch from '256 microseconds'::interval * (2^55)::float8); +-- Test edge-case overflow in timestamp[tz] subtraction +SELECT timestamptz'294276-12-31 23:59:59 UTC' - timestamptz'1999-12-23 19:59:04.224193 UTC' AS doesnt_overflow; +SELECT timestamptz'294276-12-31 23:59:59 UTC' - timestamptz'1999-12-23 19:59:04.224192 UTC' AS overflows; + SELECT r1.*, r2.* FROM INTERVAL_TBL_OF r1, INTERVAL_TBL_OF r2 WHERE r1.f1 > r2.f1 -- 2.39.1
RE: Support logical replication of DDLs
On Wednesday, February 15, 2023 5:51 PM Amit Kapila wrote:
>
> On Wed, Feb 15, 2023 at 2:02 PM Alvaro Herrera wrote:
> >
> > On 2023-Feb-15, Peter Smith wrote:
> >
> > > On Thu, Feb 9, 2023 at 8:55 PM Ajin Cherian wrote:
> > > >
> > > > On Fri, Feb 3, 2023 at 11:41 AM Peter Smith wrote:
> > > > > 3. ExecuteGrantStmt
> > > > >
> > > > > + /* Copy the grantor id needed for DDL deparsing of Grant */
> > > > > + istmt.grantor_uid = grantor;
> > > > >
> > > > > SUGGESTION (comment)
> > > > > Copy the grantor id to the parsetree, needed for DDL deparsing
> > > > > of Grant
> > > >
> > > > didn't change this, as Alvaro said this was not a parsetree.
> > >
> > > Perhaps there is more to do here? Sorry, I did not understand the
> > > details of Alvaro's post [1], but I did not recognize the difference
> > > between ExecuteGrantStmt and ExecSecLabelStmt so it was my
> > > impression either one or both of these places are either wrongly
> > > commented, or maybe are doing something that should not be done.
> >
> > These two cases are different. In ExecGrantStmt we're adding the
> > grantor OID to the InternalGrant struct, which is not a parse node, so
> > there's no strong reason not to modify it (and also the suggested
> > comment change is factually wrong). I don't know if the idea is
> > great, but at least I see no strong objection.
> >
> > In the other case, as I said in [1], the patch proposes to edit the
> > parse node to add the grantor, but I think a better idea might be to
> > change the signature to ExecSecLabelStmt(SecLabelStmt *stmt,
> > ObjectAddress *provider) so that the function can set the provider
> > there; and caller passes &secondaryObject, which is the method we
> > adopted for this kind of thing.
>
> +1, that is a better approach to make the required change in
> ExecSecLabelStmt().

I did some research on this. The provider does not seem to be a database object; a provider has to be registered from C code via register_label_provider.
And ObjectAddress only has three fields (classId, objectId, objectSubId), so it seems hard to set a provider name in an ObjectAddress. We cannot get the correct provider name from the catalog either, because there could be multiple providers for the same database object.

So, if we don't want to modify the parsetree, maybe we can introduce a function GetDefaultProvider() which can be used if the user doesn't specify the provider in the DDL command. GetDefaultProvider would return the first provider in the providers list, as is done in ExecSecLabelStmt(). What do you think?

Best Regards,
Hou zj
Refactor calculations to use instr_time
Hi, In 'instr_time.h' it is stated that: * When summing multiple measurements, it's recommended to leave the * running sum in instr_time form (ie, use INSTR_TIME_ADD or * INSTR_TIME_ACCUM_DIFF) and convert to a result format only at the end. So, I refactored 'PendingWalStats' to use 'instr_time' instead of 'PgStat_Counter' while accumulating 'wal_write_time' and 'wal_sync_time'. Also, I refactored some calculations to use 'INSTR_TIME_ACCUM_DIFF' instead of using 'INSTR_TIME_SUBTRACT' and 'INSTR_TIME_ADD'. What do you think? Regards, Nazir Bilal Yavuz Microsoft From 5216c70da53ff32b8e9898595445ad6f4880b06d Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Thu, 2 Feb 2023 15:06:48 +0300 Subject: [PATCH v1] Refactor instr_time calculations --- src/backend/access/transam/xlog.c | 6 ++ src/backend/storage/file/buffile.c | 6 ++ src/backend/utils/activity/pgstat_wal.c | 27 - src/include/pgstat.h| 17 +++- 4 files changed, 33 insertions(+), 23 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f9f0f6db8d1..3c35dc1ca23 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2206,8 +2206,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) instr_time duration; INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - PendingWalStats.wal_write_time += INSTR_TIME_GET_MICROSEC(duration); + INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration, start); } PendingWalStats.wal_write++; @@ -8201,8 +8200,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) instr_time duration; INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - PendingWalStats.wal_sync_time += INSTR_TIME_GET_MICROSEC(duration); + INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, duration, start); } PendingWalStats.wal_sync++; diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index 0a51624df3b..e55f86b675e 100644 --- 
a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -469,8 +469,7 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) { INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time); + INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start); } /* we choose not to advance curOffset here */ @@ -544,8 +543,7 @@ BufFileDumpBuffer(BufFile *file) if (track_io_timing) { INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time); + INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start); } file->curOffset += bytestowrite; diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c index e8598b2f4e0..56211fbc61a 100644 --- a/src/backend/utils/activity/pgstat_wal.c +++ b/src/backend/utils/activity/pgstat_wal.c @@ -21,7 +21,7 @@ #include "executor/instrument.h" -PgStat_WalStats PendingWalStats = {0}; +PgStat_PendingWalUsage PendingWalStats = {0}; /* * WAL usage counters saved from pgWALUsage at the previous call to @@ -89,26 +89,25 @@ pgstat_flush_wal(bool nowait) * previous counters from the current ones. 
*/ WalUsageAccumDiff(&diff, &pgWalUsage, &prevWalUsage); - PendingWalStats.wal_records = diff.wal_records; - PendingWalStats.wal_fpi = diff.wal_fpi; - PendingWalStats.wal_bytes = diff.wal_bytes; if (!nowait) LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE); else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE)) return true; -#define WALSTAT_ACC(fld) stats_shmem->stats.fld += PendingWalStats.fld - WALSTAT_ACC(wal_records); - WALSTAT_ACC(wal_fpi); - WALSTAT_ACC(wal_bytes); - WALSTAT_ACC(wal_buffers_full); - WALSTAT_ACC(wal_write); - WALSTAT_ACC(wal_sync); - WALSTAT_ACC(wal_write_time); - WALSTAT_ACC(wal_sync_time); +#define WALSTAT_ACC(fld, var_to_add) \ + (stats_shmem->stats.fld += var_to_add.fld) +#define WALLSTAT_ACC_INSTR_TIME_TYPE(fld) \ + (stats_shmem->stats.fld += INSTR_TIME_GET_MICROSEC(PendingWalStats.fld)) + WALSTAT_ACC(wal_records, diff); + WALSTAT_ACC(wal_fpi, diff); + WALSTAT_ACC(wal_bytes, diff); + WALSTAT_ACC(wal_buffers_full, PendingWalStats); + WALSTAT_ACC(wal_write, PendingWalStats); + WALSTAT_ACC(wal_sync, PendingWalStats); + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_write_time); + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_sync_time); #undef WALSTAT_ACC - LWLockRelease(&stats_shmem->lock); /* diff --git a/src/include/pgstat.h b/src/include/pgstat.h index db9675884f3..295c5eabf38 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -445,6 +445,21 @@ typedef struct PgStat_WalStats TimestampTz stat_reset_timestamp; } PgStat_WalStats; +/* Created f
Re: Dead code in ps_status.c
Thomas Munro writes: > On Thu, Feb 16, 2023 at 6:34 PM Tom Lane wrote: >> Hm, is "defined(sun)" true on any live systems at all? > My GCC compile farm account seems to have expired, or something, so I > couldn't check on wrasse's host (though whether wrasse is "live" is > debatable: Solaris 11.3 has reached EOL, it's just that the CPU is too > old to be upgraded, so it's not testing a real OS that anyone would > actually run PostgreSQL on). But from some googling[1], I think > __sun, __sun__ and sun should all be defined. My account still works, and what I see on wrasse's host is tgl@gcc-solaris11:~$ gcc -x c /dev/null -dM -E | grep -i svr #define __SVR4 1 #define __svr4__ 1 tgl@gcc-solaris11:~$ gcc -x c /dev/null -dM -E | grep -i sun #define __sun 1 #define sun 1 #define __sun__ 1 I don't know a way to get the list of predefined macros out of the compiler wrasse is actually using (/opt/developerstudio12.6/bin/cc), but doing some experiments with #ifdef confirmed that it defines __sun, __sun__, and __svr4__, but not __svr5__. regards, tom lane
Re: Move defaults toward ICU in 16?
On 2/16/23 4:35 AM, Robert Haas wrote: On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis wrote: It feels very wrong to me to explain that sort order is defined by the operating system on which Postgres happens to run. Saying that it's defined by ICU, which is part of the Unicode consortium, is much better. It doesn't eliminate versioning issues, of course, but I think it's a better explanation for users. The fact that we can't use ICU on Windows, though, weakens this argument a lot. In my experience, we have a lot of Windows users, and they're not any happier with the operating system collations than Linux users. Possibly less so. This is one reason why we're discussing ICU as the "preferred default" vs. "the default." While it may not completely eliminate platform-dependent behavior for collations, it takes a step forward. And AIUI, it does sound like ICU is available on newer versions of Windows[1]. I feel like this is a very difficult kind of change to judge. If everyone else feels this is a win, we should go with it, and hopefully we'll end up better off. I do feel like there are things that could go wrong, though, between the imperfect documentation, the fact that a substantial chunk of our users won't be able to use it because they run Windows, and everybody having to adjust to the behavior change. We should continue to improve our documentation. Personally, I found the biggest challenge was understanding how to set ICU locales / rules, particularly for nondeterministic collations, as it was challenging to find where these were listed. I was able to overcome this with the examples in our docs + blogs, but I agree it's an area we can continue to improve upon. Thanks, Jonathan [1] https://learn.microsoft.com/en-us/dotnet/core/extensions/globalization-icu OpenPGP_signature Description: OpenPGP digital signature
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
Hi, On 2/16/23 1:26 PM, Drouvot, Bertrand wrote: Hi, On 2/16/23 12:00 PM, Amit Kapila wrote: I think this would require XLOG_PAGE_MAGIC as it changes the WAL record. Oh, I was not aware of it, thanks! Will do in V2 (and in the logical decoding on standby patch as it adds the new field mentioned above). BTW, how about a commit message like: Change xl_hash_vacuum_one_page.ntuples from int to uint16. This will create two bytes of padding space in xl_hash_vacuum_one_page which can be used for future patches. This makes the datatype of xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which is advisable as both are used for the same purpose. LGTM, will add it to V2! Please find V2 attached. The commit message also mentions the XLOG_PAGE_MAGIC bump. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
From a1f627a6e8c347bdb3112033ae08ace57fbf0a10 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Thu, 16 Feb 2023 15:01:29 + Subject: [PATCH v2] Change xl_hash_vacuum_one_page.ntuples from int to uint16. This will create two bytes of padding space in xl_hash_vacuum_one_page which can be used for future patches. This makes the datatype of xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which is advisable as both are used for the same purpose. Bump XLOG_PAGE_MAGIC.
--- src/include/access/hash_xlog.h | 4 ++-- src/include/access/xlog_internal.h | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) 100.0% src/include/access/ diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h index a2f0f39213..9894ab9afe 100644 --- a/src/include/access/hash_xlog.h +++ b/src/include/access/hash_xlog.h @@ -251,13 +251,13 @@ typedef struct xl_hash_init_bitmap_page typedef struct xl_hash_vacuum_one_page { TransactionId snapshotConflictHorizon; - int ntuples; + uint16 ntuples; /* TARGET OFFSET NUMBERS FOLLOW AT THE END */ } xl_hash_vacuum_one_page; #define SizeOfHashVacuumOnePage \ - (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(int)) + (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(uint16)) extern void hash_redo(XLogReaderState *record); extern void hash_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h index 59fc7bc105..8edae7bb07 100644 --- a/src/include/access/xlog_internal.h +++ b/src/include/access/xlog_internal.h @@ -31,7 +31,7 @@ /* * Each page of XLOG file has a header like this: */ -#define XLOG_PAGE_MAGIC 0xD111 /* can be used as WAL version indicator */ +#define XLOG_PAGE_MAGIC 0xD112 /* can be used as WAL version indicator */ typedef struct XLogPageHeaderData { -- 2.34.1
Re: PATCH: Using BRIN indexes for sorted output
On Thu, Feb 16, 2023 at 03:07:59PM +0100, Tomas Vondra wrote: > Rebased version of the patches, fixing only minor conflicts. Per cfbot, the patch fails on 32 bit builds. +ERROR: count mismatch: 0 != 1000 And causes warnings in mingw cross-compile. On Sun, Oct 23, 2022 at 11:32:37PM -0500, Justin Pryzby wrote: > I think new GUCs should be enabled during patch development. > Maybe in a separate 0002 patch "for CI only not for commit". > That way "make check" at least has a chance to hit that new code > paths. > > Also, note that indxpath.c had the var initialized to true. In your patch, the amstats guc is still being set to false during startup by the guc machinery. And the tests crash everywhere if it's set to on: TRAP: failed Assert("(nmatches_unique >= 1) && (nmatches_unique <= unique[nvalues-1])"), File: "../src/backend/access/brin/brin_minmax.c", Line: 644, PID: 25519 > . Some typos in your other patches: "heuristics heuristics". ste. >lest (least). These are still present. -- Justin
Re: Refactor calculations to use instr_time
Hi, On 2023-02-16 16:19:02 +0300, Nazir Bilal Yavuz wrote: > What do you think? Here's a small review: > +#define WALSTAT_ACC(fld, var_to_add) \ > + (stats_shmem->stats.fld += var_to_add.fld) > +#define WALLSTAT_ACC_INSTR_TIME_TYPE(fld) \ > + (stats_shmem->stats.fld += INSTR_TIME_GET_MICROSEC(PendingWalStats.fld)) > + WALSTAT_ACC(wal_records, diff); > + WALSTAT_ACC(wal_fpi, diff); > + WALSTAT_ACC(wal_bytes, diff); > + WALSTAT_ACC(wal_buffers_full, PendingWalStats); > + WALSTAT_ACC(wal_write, PendingWalStats); > + WALSTAT_ACC(wal_sync, PendingWalStats); > + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_write_time); > + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_sync_time); > #undef WALSTAT_ACC > - > LWLockRelease(&stats_shmem->lock); WALSTAT_ACC is undefined, but WALLSTAT_ACC_INSTR_TIME_TYPE isn't. I'd not remove the newline before LWLockRelease(). > /* > diff --git a/src/include/pgstat.h b/src/include/pgstat.h > index db9675884f3..295c5eabf38 100644 > --- a/src/include/pgstat.h > +++ b/src/include/pgstat.h > @@ -445,6 +445,21 @@ typedef struct PgStat_WalStats > TimestampTz stat_reset_timestamp; > } PgStat_WalStats; > > +/* Created for accumulating wal_write_time and wal_sync_time as a > instr_time Minor code-formatting point: In postgres we don't put code in the same line as a multi-line comment starting with the /*. So either /* single line comment */ or /* * multi line * comment */ > + * but instr_time can't be used as a type where it ends up on-disk > + * because its units may change. PgStat_WalStats type is used for > + * in-memory/on-disk data. So, PgStat_PendingWalUsage is created for > + * accumulating intervals as a instr_time. > + */ > +typedef struct PgStat_PendingWalUsage > +{ > + PgStat_Counter wal_buffers_full; > + PgStat_Counter wal_write; > + PgStat_Counter wal_sync; > + instr_time wal_write_time; > + instr_time wal_sync_time; > +} PgStat_PendingWalUsage; > + I wonder if we should try to put pgWalUsage in here. But that's probably better done as a separate patch. 
Greetings, Andres Freund
Re: Move defaults toward ICU in 16?
Hi, On 2023-02-16 15:05:10 +0530, Robert Haas wrote: > The fact that we can't use ICU on Windows, though, weakens this > argument a lot. In my experience, we have a lot of Windows users, and > they're not any happier with the operating system collations than > Linux users. Possibly less so. Why can't you use ICU on windows? It works today, afaict? Greetings, Andres Freund
Re: Weird failure with latches in curculio on v15
Hi, On 2023-02-16 15:18:57 +0530, Robert Haas wrote: > On Fri, Feb 10, 2023 at 12:59 AM Andres Freund wrote: > > I don't think it's that hard to imagine problems. To be reasonably fast, a > > decent restore implementation will have to 'restore ahead'. Which also > > provides ample things to go wrong. E.g. > > > > - WAL source is switched, restore module needs to react to that, but > > doesn't, > > we end up lots of wasted work, or worse, filename conflicts > > - recovery follows a timeline, restore module doesn't catch on quickly > > enough > > - end of recovery happens, restore just continues on > > I don't see how you can prevent those things from happening. If the > restore process is working in some way that requires an event loop, > and I think that will be typical for any kind of remote archiving, > then either it has control most of the time, so the event loop can be > run inside the restore process, or, as Nathan proposes, we don't let > the archiver have control and it needs to run that restore process in > a separate background worker. The hazards that you mention here exist > either way. If the event loop is running inside the restore process, > it can decide not to call the functions that we provide in a timely > fashion and thus fail to react as it should. If the event loop runs > inside a separate background worker, then that process can fail to be > responsive in precisely the same way. Fundamentally, if the author of > a restore module writes code to have multiple I/Os in flight at the > same time and does not write code to cancel those I/Os if something > changes, then such cancellation will not occur. That remains true no > matter which process is performing the I/O. IDK. I think we can make that easier or harder. Right now the proposed API doesn't provide anything to allow to address this. 
> > > I don't quite see how you can make asynchronous and parallel archiving > > > work if the archiver process only calls into the archive module at > > > times that it chooses. That would mean that the module has to return > > > control to the archiver when it's in the middle of archiving one or > > > more files -- and then I don't see how it can get control back at the > > > appropriate time. Do you have a thought about that? > > > > I don't think archiver is the hard part, that already has a dedicated > > process, and it also has something of a queuing system already. The startup > > process imo is the complicated one... > > > > If we had a 'restorer' process, startup fed some sort of a queue with things > > to restore in the near future, it might be more realistic to do something > > you > > describe? > > Some kind of queueing system might be a useful part of the interface, > and a dedicated restorer process does sound like a good idea. But the > archiver doesn't have this solved, precisely because you have to > archive a single file, return control, and wait to be invoked again > for the next file. That does not scale. But there's nothing inherent in that. We know for certain which files we're going to archive. And we don't need to work one-by-one. The archiver could just start multiple subprocesses at the same time. All the blocking it does right now is artificially imposed by the use of system(). We could instead just use something popen()-like and have a configurable number of processes running at the same time. What I was trying to point out was that the work a "restorer" process has to do is more speculative, because we don't know when we'll promote, whether we'll follow a timeline increase, whether the to-be-restored WAL already exists. That's solvable, but a bunch of the relevant work ought to be solved in core code, instead of just in archive modules. Greetings, Andres Freund
Re: run pgindent on a regular basis / scripted manner
On 2023-02-16 Th 03:26, shiy.f...@fujitsu.com wrote: On Thu, Feb 9, 2023 6:10 AM Andrew Dunstan wrote: Thanks, I have committed this. Still looking at Robert's other request. Hi, Commit #068a243b7 supported directories to be non-option arguments of pgindent. But the help text doesn't mention that. Should we update it? Attach a small patch which did that. Thanks, pushed. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2023-02-16 16:22:56 +0700, John Naylor wrote: > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada > > Right. TidStore is implemented not only for heap, so loading > > out-of-order TIDs might be important in the future. > > That's what I was probably thinking about some weeks ago, but I'm having a > hard time imagining how it would come up, even for something like the > conveyor-belt concept. We really ought to replace the tid bitmap used for bitmap heap scans. The hashtable we use is a pretty awful data structure for it. And that's not filled in-order, for example. Greetings, Andres Freund
Re: Support logical replication of DDLs
Hi, On 2/14/23 10:01 PM, houzj.f...@fujitsu.com wrote: Here is the new version patch which addressed the above comments. I also fixed a bug in the deparsing of CREATE RULE, where it didn't add parentheses for the rule action list. I started testing this change set from this patch. I'm doing a mix of happy path, "making mistakes" path, and "real world" testing, and testing this both with unidirectional and "origin=none" replication. I wanted to report an issue I came up with using one of my real world cases. I had previously built a demo scheduling app to demonstrate several features of PostgreSQL to help with various kinds of data synchronization[1]. The first example uses a series of functions and triggers[2] to keep a calendar table up-to-date. I set up an experiment as such: 1. Create two different clusters. In each cluster, create a DB. 2. On Cluster 1, run: CREATE PUBLICATION ddl FOR ALL TABLES WITH (ddl='all'); 3. On Cluster 2, run: CREATE SUBSCRIPTION ddl CONNECTION '' PUBLICATION ddl; 4. On Cluster 1, run the commands in [2]. Note that I reproduced the error both by running the commands individually and as part of a single transaction. 5. The transactions (or the single transaction) complete successfully on Cluster 1. 6.
Cluster 2 reports the following error: 2023-02-16 16:11:10.537 UTC [25207] LOG: logical replication apply worker for subscription "ddl" has started 2023-02-16 16:11:10.570 UTC [25207] ERROR: relation "availability" does not exist at character 279 2023-02-16 16:11:10.570 UTC [25207] CONTEXT: processing remote data for replication origin "pg_16733" during message type "DDL" in transaction 890, finished at 0/BF298CC0 2023-02-16 16:11:10.570 UTC [25207] STATEMENT: CREATE OR REPLACE FUNCTION public.availability_rule_bulk_insert ( IN availability_rule public.availability_rule, IN day_of_week pg_catalog.int4 ) RETURNS pg_catalog.void LANGUAGE sql VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT SECURITY INVOKER COST 100 AS $_$ INSERT INTO availability ( room_id, availability_rule_id, available_date, available_range ) SELECT $1.room_id, $1.id, available_date::date + $2 - 1, tstzrange( /** start of range */ (available_date::date + $2 - 1) + $1.start_time, /** end of range */ /** check if there is a time wraparound, if so, increment by a day */ CASE $1.end_time <= $1.start_time WHEN TRUE THEN (available_date::date + $2) + $1.end_time ELSE (available_date::date + $2 - 1) + $1.end_time END ) FROM generate_series( date_trunc('week', CURRENT_DATE), date_trunc('week', CURRENT_DATE) + ($1.generate_weeks_into_future::text || ' weeks')::interval, '1 week'::interval ) available_date; $_$ 2023-02-16 16:11:10.573 UTC [15348] LOG: background worker "logical replication worker" (PID 25207) exited with exit code 1 I attempted this with both async and sync logical replication. In sync mode, the publisher hangs and is unable to accept any more writes. When I went in and explicitly schema qualified the tables in the functions[3], the example executed successfully. My high level guess without looking at the code is that the apply worker is not aware of the search_path to use when processing functions during creation. 
Provided that the publisher/subscriber environments are similar (if not identical), I would expect that if the function creation succeeds on the publisher, it should also succeed on the subscriber. Thanks, Jonathan [1] https://github.com/CrunchyData/postgres-realtime-demo [2] https://github.com/CrunchyData/postgres-realtime-demo/blob/main/examples/demo/demo1.sql [3] https://gist.github.com/jkatz/5655c10da1a4c8691094e951ea07b036 OpenPGP_signature Description: OpenPGP digital signature
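For reference, the schema-qualification workaround in [3] boils down to the following pattern; the table and function names here are invented simplifications of the demo's objects, not the actual fix:

```sql
-- Before: "availability" is resolved via search_path when the function
-- body is parsed/executed, which may differ in the subscriber's apply worker.
CREATE FUNCTION touch_availability(p_room int) RETURNS void
LANGUAGE sql AS
$$ INSERT INTO availability (room_id) VALUES (p_room); $$;

-- After: explicit qualification removes the search_path dependency,
-- so the replicated CREATE FUNCTION behaves the same on both nodes.
CREATE FUNCTION touch_availability(p_room int) RETURNS void
LANGUAGE sql AS
$$ INSERT INTO public.availability (room_id) VALUES (p_room); $$;
```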
Re: Weird failure with latches in curculio on v15
On Thu, Feb 16, 2023 at 03:08:14PM +0530, Robert Haas wrote: > On Thu, Feb 9, 2023 at 10:53 PM Nathan Bossart > wrote: >> I've been thinking about this, actually. I'm wondering if we could provide >> a list of files to the archiving callback (configurable via a variable in >> ArchiveModuleState), and then have the callback return a list of files that >> are archived. (Or maybe we just put the list of files that need archiving >> in ArchiveModuleState.) The returned list could include files that were >> sent to the callback previously. The archive module would be responsible >> for creating background worker(s) (if desired), dispatching files >> to-be-archived to its background worker(s), and gathering the list of >> archived files to return. > > Hmm. So in this design, the archiver doesn't really do the archiving > any more, because the interface makes that impossible. It has to use a > separate background worker process for that, full stop. > > I don't think that's a good design. It's fine if some people want to > implement it that way, but it shouldn't be forced by the interface. I don't think it would force you to use a background worker, but if you wanted to, the tools would be available. At least, that is the intent. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Support logical replication of DDLs
On 2023-Feb-16, Jonathan S. Katz wrote: [replication tries to execute this command] > 2023-02-16 16:11:10.570 UTC [25207] STATEMENT: CREATE OR REPLACE FUNCTION > public.availability_rule_bulk_insert ( IN availability_rule > public.availability_rule, IN day_of_week pg_catalog.int4 ) RETURNS > pg_catalog.void LANGUAGE sql VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT > SECURITY INVOKER COST 100 AS $_$ > INSERT INTO availability ( > room_id, > availability_rule_id, > available_date, > available_range > ) [which results in:] > 2023-02-16 16:11:10.570 UTC [25207] ERROR: relation "availability" does not > exist at character 279 I don't think this is the fault of logical replication. Consider that for the backend server, the function source code is just an opaque string that is given to the plpgsql engine to interpret. So there's no way for the logical DDL replication engine to turn this into runnable code if the table name is not qualified. (The fact that this is a security-invoker function prevents you from attaching a SET search_path clause to the function, I believe? Which means it is extra dangerous to have an unqualified table reference there.) > My high level guess without looking at the code is that the apply worker is > not aware of the search_path to use when processing functions during > creation. Provided that the publisher/subscriber environments are similar > (if not identical), I would expect that if the function create succeeds on > the publisher, it should also succeed on the subscriber. If we're going to force search_path and all other settings to be identical, then we might as well give up the whole deparsing design and transmit the original string for execution in the replica; it is much simpler. But this idea was rejected outright when this stuff was first proposed years ago. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Crear es tan difícil como ser libre" (Elsa Triolet)
Re: ATTACH PARTITION seems to ignore column generation status
Hello, 11.01.2023 23:58, Tom Lane wrote: Amit Langote writes: I've updated your disallow-generated-child-columns-2.patch to do this, and have also merged the delta post that I had attached with my last email, whose contents you sound to agree with. Pushed with some further work to improve the handling of multiple- inheritance cases. We still need to insist that all or none of the parent columns are generated, but we don't have to require their generation expressions to be alike: that can be resolved by letting the child table override the expression, much as we've long done for plain default expressions. (This did need some work in pg_dump after all.) I'm pretty happy with where this turned out. I've encountered a query that triggers an assert added in that commit: CREATE TABLE t(a int, b int GENERATED ALWAYS AS (a) STORED) PARTITION BY RANGE (a); CREATE TABLE tp PARTITION OF t(b DEFAULT 1) FOR VALUES FROM (0) to (1); Core was generated by `postgres: law regression [local] CREATE TABLE '. Program terminated with signal SIGABRT, Aborted. warning: Section `.reg-xstate/3152655' in core file too small. #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140460372887360) at ./nptl/pthread_kill.c:44 44 ./nptl/pthread_kill.c: No such file or directory. 
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140460372887360) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140460372887360) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140460372887360, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x7fbf79f0e476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x7fbf79ef47f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x55e76b35b322 in ExceptionalCondition ( conditionName=conditionName@entry=0x55e76b4a2240 "!(coldef->generated && !restdef->generated)", fileName=fileName@entry=0x55e76b49ec71 "tablecmds.c", lineNumber=lineNumber@entry=3028) at assert.c:66
#6 0x55e76afef8c3 in MergeAttributes (schema=0x55e76d480318, supers=supers@entry=0x55e76d474c18, relpersistence=112 'p', is_partition=true, supconstr=supconstr@entry=0x7ffe945a3768) at tablecmds.c:3028
#7 0x55e76aff0ef2 in DefineRelation (stmt=stmt@entry=0x55e76d44b2d8, relkind=relkind@entry=114 'r', ownerId=10, ownerId@entry=0, typaddress=typaddress@entry=0x0, queryString=queryString@entry=0x55e76d44a408 "CREATE TABLE tp PARTITION OF t(b DEFAULT 1) FOR VALUES FROM (0) to (1);") at tablecmds.c:861
...
Without asserts enabled, the partition is created successfully, and INSERT INTO t VALUES(0); SELECT * FROM t; yields:
 a | b
---+---
 0 | 1
(1 row)
Is this behavior expected? Best regards, Alexander
Re: [PATCH] Align GSS and TLS error handling in PQconnectPoll()
On Thu, Feb 16, 2023 at 3:31 AM Jelte Fennema wrote: > > Patch looks good to me. Definitely an improvement over the status quo. Thanks for the review! > Looking at the TLS error handling though I see these two lines: > > && conn->allow_ssl_try /* redundant? */ > && !conn->wait_ssl_try) /* redundant? */ > > Are they actually redundant like the comment suggests? If so, we > should probably remove them (in another patch). If not (or if we don't > know), should we have these same checks for GSS? It's a fair point. GSS doesn't have an "allow" encryption mode, so they can't be the exact same checks. And we're already not checking the probably-redundant information, so I'd vote against speculatively adding it back. (try_gss is already based on gssencmode, which we're using here. So I think rechecking try_gss would only help if we wanted to clear it manually while in the middle of processing a GSS exchange. From a quick inspection, I don't think that's happening today -- and I'm not really sure that it could be useful in the future, because I'd think prefer-mode is supposed to guarantee a retry on failure.) I suspect this is a much deeper rabbit hole; I think it's work that needs to be done, but I can't sign myself up for it at the moment. The complexity of this function is off the charts (for instance, why do we recheck conn->try_gss above, if the only apparent way to get to CONNECTION_GSS_STARTUP is by having try_gss = true to begin with? is there a goto/retry path I'm missing?). I think it either needs heavy assistance from a committer who already has intimate knowledge of this state machine and all of its possible branches, or from a static analysis tool that can help with a step-by-step simplification. Thanks, --Jacob
Re: Support logical replication of DDLs
On 2023-Feb-16, houzj.f...@fujitsu.com wrote: > I did some research for this. > > The provider does not seem to be a database object; a user needs to register a provider from C code via register_label_provider. And ObjectAddress only has three fields (classId, objectId, objectSubId), so it seems hard to attach the provider name to an ObjectAddress. We cannot get the correct provider name from the catalog either, because there could be multiple providers for the same db object. Oh, I didn't realize that the provider wasn't an ObjectAddress. You're right, that idea won't fly. > So, if we don't want to modify the parsetree, maybe we can introduce a function GetDefaultProvider() which can be used if the user doesn't specify the provider in the DDL command. GetDefaultProvider will return the first provider in the providers list, as is done in ExecSecLabelStmt(). What do you think? One of the design principles is that the DDL execution must resolve the object just once, and thereafter all references must use the same resolution. Resolving twice via this new GetDefaultProvider would violate that (consider if you made it a GUC and the user sent the SIGHUP that changed it just in between.) I think another workable possibility is to create a new value in CollectedCommandType, a separate struct in the union of CollectedCommand, and stash the info about the whole command there, including the provider; then tell ProcessUtilitySlow that the command was already stashed. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La conclusión que podemos sacar de esos estudios es que no podemos sacar ninguna conclusión de ellos" (Tanenbaum)
Re: psql: Add role's membership options to the \du+ command
On 16.02.2023 00:37, David G. Johnston wrote: I mean, either you accept the change in how this meta-command presents its information or you don't. Yes, that's the main issue of this patch. On the one hand, it would be nice to see the membership options with the psql command. On the other hand, I don't have an exact understanding of how best to do it. That's why I proposed a variant for discussion. It is quite possible that if there is no consensus, it would be better to leave it as is, and get information by queries to the system catalog. - Pavel Luzanov
Re: ATTACH PARTITION seems to ignore column generation status
On 2023-Feb-16, Alexander Lakhin wrote: > I've encountered a query that triggers an assert added in that commit: > CREATE TABLE t(a int, b int GENERATED ALWAYS AS (a) STORED) PARTITION BY > RANGE (a); > CREATE TABLE tp PARTITION OF t(b DEFAULT 1) FOR VALUES FROM (0) to (1); It seems wrong that this command is accepted. It should have given an error, because the partition is not allowed to override the generation of the value that is specified in the parent table. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)
Add WAL read stats to pg_stat_wal
Hi, While working on [1], I needed to know WAL read stats (number of times and amount of WAL data read from disk, time taken to read) to measure the benefit. I had to write a developer patch to capture WAL read stats, as pg_stat_wal currently emits only WAL write stats. With recent work on pg_stat_io, which emits data read IO stats too, I think it's better to not miss WAL read stats. It might help others who keep an eye on IOPS of the production servers for various reasons. The WAL read stats I'm thinking useful are wal_read_bytes - total amount of WAL read, wal_read - total number of times WAL is read from disk, wal_read_time - total amount of time spent reading WAL (tracked only when the existing GUC track_wal_io_timing is on). I came up with a patch and attached it here. The WAL readers that add to WAL read stats are WAL senders, the startup process, and other backends using xlogreader for logical replication or pg_walinspect SQL functions. They all report stats to shared memory by calling pgstat_report_wal() in appropriate locations. In standby mode, calling pgstat_report_wal() for every record seems to be costly. Therefore, I chose to report stats every 1024 WAL records (a random number; suggestions for a better way are welcome here). Note that the patch needs a bit more work, per [2]. With the patch, the WAL senders (processes exiting after checkpointer) will generate stats, and we need to either let all, or only one, WAL sender write stats to disk. Allowing one WAL sender to write might be tricky. Allowing all WAL senders to write might make too many writes to the stats file. And, we need a lock to let only one process write. I can't think of a good way here at the moment. Thoughts? [1] https://www.postgresql.org/message-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=q...@mail.gmail.com [2] /* * Write out stats after shutdown. This needs to be called by exactly one * process during a normal shutdown, and since checkpointer is shut down * very late... 
* * Walsenders are shut down after the checkpointer, but currently don't * report stats. If that changes, we need a more complicated solution. */ before_shmem_exit(pgstat_before_server_shutdown, 0); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From c123e4f5e94cbb4881ce4d65c9dec3c4d5d72778 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 15 Feb 2023 13:29:22 + Subject: [PATCH v1] Add WAL read stats to pg_stat_wal --- doc/src/sgml/monitoring.sgml | 32 + src/backend/access/transam/xlogreader.c | 32 + src/backend/access/transam/xlogrecovery.c | 34 +++ src/backend/catalog/system_views.sql | 3 ++ src/backend/replication/walsender.c | 12 src/backend/utils/activity/pgstat_wal.c | 6 +++- src/backend/utils/adt/pgstatfuncs.c | 24 ++-- src/include/catalog/pg_proc.dat | 6 ++-- src/include/pgstat.h | 9 ++ src/test/regress/expected/rules.out | 5 +++- 10 files changed, 155 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index dca50707ad..20b16405b5 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4088,6 +4088,38 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + + wal_read_bytes numeric + + + Total amount of WAL read from disk in bytes, either by recovery process + for replay, or by WAL senders for replication, or by backend processes. + + + + + + wal_read bigint + + + Number of times WAL is read from disk, either by recovery process for + replay, or by WAL senders for replication, or by backend processes. + + + + + + wal_read_time double precision + + + Total amount of time spent reading WAL from disk, either by recovery + process for replay, or by WAL senders for replication, or by backend + processes, in milliseconds (if + is enabled, otherwise zero). 
+ + + stats_reset timestamp with time zone diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index aa6c929477..29a0938dba 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -180,6 +180,11 @@ XLogReaderFree(XLogReaderState *state) pfree(state->readRecordBuf); pfree(state->readBuf); pfree(state); + +#ifndef FRONTEND + /* Report pending statistics to the cumulative stats system */ + pgstat_report_wal(true); +#endif } /* @@ -1506,6 +1511,9 @@ WALRead(XLogReaderState *state, uint32 startoff; int segbytes; int readbytes; +#ifndef FRONTEND + instr_time start; +#endif
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Thu, Feb 16, 2023 at 1:35 AM Jelte Fennema wrote: > > FYI the last patch does not apply cleanly anymore. So a rebase is needed. Thanks for the nudge, v7 rebases over the configure conflict from 9244c11afe. --Jacob 1: e7e2d43b18 ! 1: b07af1c564 libpq: add sslrootcert=system to use default CAs @@ configure: fi ## configure.ac ## @@ configure.ac: if test "$with_ssl" = openssl ; then - # thread-safety. In 1.1.0, it's no longer required, and CRYPTO_lock() - # function was removed. AC_CHECK_FUNCS([CRYPTO_lock]) + # Function introduced in OpenSSL 1.1.1. + AC_CHECK_FUNCS([X509_get_signature_info]) + # Let tests differentiate between vanilla OpenSSL and LibreSSL. + AC_CHECK_DECLS([LIBRESSL_VERSION_NUMBER], [], [], [#include ]) AC_DEFINE([USE_OPENSSL], 1, [Define to 1 to build with OpenSSL support. (--with-ssl=openssl)]) 2: e4d9731e1e = 2: 5de1c458b1 libpq: force sslmode=verify-full for system CAs From 5de1c458b19dc73608b1518d87f153b733ce1921 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Mon, 24 Oct 2022 15:24:11 -0700 Subject: [PATCH v7 2/2] libpq: force sslmode=verify-full for system CAs Weaker verification modes don't make much sense for public CAs. --- doc/src/sgml/libpq.sgml | 15 + doc/src/sgml/runtime.sgml | 6 ++-- src/interfaces/libpq/fe-connect.c | 56 +++ src/interfaces/libpq/t/001_uri.pl | 30 +++-- src/test/ssl/t/001_ssltests.pl| 12 +++ 5 files changed, 107 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index a293431020..364a82adfd 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -1733,15 +1733,16 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname locations may be further modified by the SSL_CERT_DIR and SSL_CERT_FILE environment variables. - + - When using sslrootcert=system, it is critical to - also use the strongest certificate verification method, - sslmode=verify-full. 
In most cases it is trivial for - anyone to obtain a certificate trusted by the system for a hostname - they control, rendering the verify-ca mode useless. + When using sslrootcert=system, the default + sslmode is changed to verify-full, + and any weaker setting will result in an error. In most cases it is + trivial for anyone to obtain a certificate trusted by the system for a + hostname they control, rendering verify-ca and all + weaker modes useless. - + diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index b93184537a..dbe23db54f 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -2009,9 +2009,9 @@ pg_dumpall -p 5432 | psql -d postgres -p 5433 verify-full and have the appropriate root certificate file installed (). Alternatively the system CA pool can be used using sslrootcert=system; in - this case, sslmode=verify-full must be used for safety, - since it is generally trivial to obtain certificates which are signed by a - public CA. + this case, sslmode=verify-full is forced for safety, since + it is generally trivial to obtain certificates which are signed by a public + CA. diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 50b5df3490..69826eaadf 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -1259,6 +1259,22 @@ connectOptions2(PGconn *conn) goto oom_error; } +#ifndef USE_SSL + /* + * sslrootcert=system is not supported. Since setting this changes the + * default sslmode, check this _before_ we validate sslmode, to avoid + * confusing the user with errors for an option they may not have set. 
+ */ + if (conn->sslrootcert + && strcmp(conn->sslrootcert, "system") == 0) + { + conn->status = CONNECTION_BAD; + libpq_append_conn_error(conn, "sslrootcert value \"%s\" invalid when SSL support is not compiled in", +conn->sslrootcert); + return false; + } +#endif + /* * validate sslmode option */ @@ -1305,6 +1321,21 @@ connectOptions2(PGconn *conn) goto oom_error; } +#ifdef USE_SSL + /* + * If sslrootcert=system, make sure our chosen sslmode is compatible. + */ + if (conn->sslrootcert + && strcmp(conn->sslrootcert, "system") == 0 + && strcmp(conn->sslmode, "verify-full") != 0) + { + conn->status = CONNECTION_BAD; + libpq_append_conn_error(conn, "weak sslmode \"%s\" may not be used with sslrootcert=system", +conn->sslmode); + return false; + } +#endif + /* * Validate TLS protocol versions for ssl_min_protocol_version and * ssl_max_protocol_version. @@ -5806,6 +5837,7 @@ static bool conninfo_add_defaults(PQconninfoOption *options, PQExpBuffer errorMessage) { PQconninfoOption *option; + PQconninfoOption *sslmode_de
Re: [PoC] Let libpq reject unexpected authentication requests
v14 rebases over the test and solution conflicts from 9244c11afe2. Thanks, --Jacob 1: 542d330310 ! 1: eec891c519 libpq: let client reject unexpected auth methods @@ src/test/ssl/t/002_scram.pl: $node->connect_ok( +qr/channel binding is required, but server did not offer an authentication method that supports channel binding/ + ); +} ++ + # Now test with a server certificate that uses the RSA-PSS algorithm. + # This checks that the certificate can be loaded and that channel binding + # works. (see bug #17760) +@@ src/test/ssl/t/002_scram.pl: if ($supports_rsapss_certs) + qr/connection authenticated: identity="ssltestuser" method=scram-sha-256/ + ]); + } + done_testing(); 2: 815fdc4393 ! 2: dc7f94f24c Add sslcertmode option for client certificates @@ src/tools/msvc/Solution.pm: sub GenerateFiles HAVE_STDINT_H=> 1, HAVE_STDLIB_H=> 1, @@ src/tools/msvc/Solution.pm: sub GenerateFiles - - my ($digit1, $digit2, $digit3) = $self->GetOpenSSLVersion(); - + $define{HAVE_HMAC_CTX_NEW} = 1; + $define{HAVE_OPENSSL_INIT_SSL} = 1; + } ++ ++ # Symbols needed with OpenSSL 1.0.2 and above. + if ( ($digit1 >= '3' && $digit2 >= '0' && $digit3 >= '0') + || ($digit1 >= '1' && $digit2 >= '1' && $digit3 >= '0') + || ($digit1 >= '1' && $digit2 >= '0' && $digit3 >= '2')) + { + $define{HAVE_SSL_CTX_SET_CERT_CB} = 1; + } -+ - # More symbols are needed with OpenSSL 1.1.0 and above. 
- if ( ($digit1 >= '3' && $digit2 >= '0' && $digit3 >= '0') - || ($digit1 >= '1' && $digit2 >= '1' && $digit3 >= '0')) + } + + $self->GenerateConfigHeader('src/include/pg_config.h', \%define, 1); 3: 6c3b1f28bc = 3: 9a84af5936 require_auth: decouple SASL and SCRAM From 9a84af59361e09446d087cf0c79f1a9a3b61e39d Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Tue, 18 Oct 2022 16:55:36 -0700 Subject: [PATCH v14 3/3] require_auth: decouple SASL and SCRAM Rather than assume that an AUTH_REQ_SASL* code refers to SCRAM-SHA-256, future-proof by separating the single allowlist into a list of allowed authentication request codes and a list of allowed SASL mechanisms. The require_auth check is now separated into two tiers. The AUTH_REQ_SASL* codes are always allowed. If the server sends one, responsibility for the check then falls to pg_SASL_init(), which compares the selected mechanism against the list of allowed mechanisms. (Other SCRAM code is already responsible for rejecting unexpected or out-of-order AUTH_REQ_SASL_* codes, so that's not explicitly handled with this addition.) Since there's only one recognized SASL mechanism, conn->sasl_mechs currently only points at static hardcoded lists. Whenever a second mechanism is added, the list will need to be managed dynamically. --- src/interfaces/libpq/fe-auth.c| 34 ++ src/interfaces/libpq/fe-connect.c | 42 +++ src/interfaces/libpq/libpq-int.h | 2 ++ src/test/authentication/t/001_password.pl | 10 +++--- src/test/ssl/t/002_scram.pl | 6 5 files changed, 83 insertions(+), 11 deletions(-) diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c index a474e17fb2..34b0573a98 100644 --- a/src/interfaces/libpq/fe-auth.c +++ b/src/interfaces/libpq/fe-auth.c @@ -522,6 +522,40 @@ pg_SASL_init(PGconn *conn, int payloadlen) goto error; } + /* + * Before going ahead with any SASL exchange, ensure that the user has + * allowed (or, alternatively, has not forbidden) this particular mechanism. 
+ * + * In a hypothetical future where a server responds with multiple SASL + * mechanism families, we would need to instead consult this list up above, + * during mechanism negotiation. We don't live in that world yet. The server + * presents one auth method and we decide whether that's acceptable or not. + */ + if (conn->sasl_mechs) + { + const char **mech; + bool found = false; + + Assert(conn->require_auth); + + for (mech = conn->sasl_mechs; *mech; mech++) + { + if (strcmp(*mech, selected_mechanism) == 0) + { +found = true; +break; + } + } + + if ((conn->sasl_mechs_denied && found) + || (!conn->sasl_mechs_denied && !found)) + { + libpq_append_conn_error(conn, "auth method \"%s\" requirement failed: server requested unacceptable SASL mechanism \"%s\"", + conn->require_auth, selected_mechanism); + goto error; + } + } + if (conn->channel_binding[0] == 'r' && /* require */ strcmp(selected_mechanism, SCRAM_SHA_256_PLUS_NAME) != 0) { diff --git a/src/interfaces/libpq/fe-con
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-16 16:58:23 +0900, Michael Paquier wrote: > On Wed, Feb 15, 2023 at 01:00:00PM +0530, Bharath Rupireddy wrote: > > The v3 patch reduces initialization of iovec array elements which is a > > clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ > > many times (I assume this is what is needed for the relation extension > > lock improvements feature). However, it increases the number of iovec > > initialization with zerobuf for the cases when pg_pwrite_zeros is > > called for sizes far greater than BLCKSZ (for instance, WAL file > > initialization). In those cases the cost of initializing the IOV doesn't matter, relative to the other costs. The important point is to not initialize a lot of elements if they're not even needed. Because we need to overwrite the trailing iov element, it doesn't seem worth trying to "pre-initialize" the iov. Referencing a static variable is more expensive than accessing an on-stack variable. Having a first-call check is more expensive than not having it. Thus making the iov and zbuf_sz static isn't helpful. Particularly the latter seems like a bad idea, because it's a compile-time constant. > It seems to me that v3 would do extra initializations only if > pg_pwritev_with_retry() does *not* retry its writes, but that's not > the case as it retries on a partial write as per its name. The number > of iov buffers is strictly capped by remaining_size. I don't really understand this bit? Greetings, Andres Freund
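To illustrate the point being argued here, below is a minimal, self-contained sketch of a pg_pwrite_zeros-style helper. It is not PostgreSQL's actual implementation: the function name, the 8-slot iovec array, and ZBUF_SZ are all illustrative. The idea it demonstrates is the one from the discussion: fill only as many iovec slots as the remaining size actually needs, let the trailing element simply be shorter, and handle partial writes by looping rather than by pre-initializing anything statically.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define ZBUF_SZ 1024			/* illustrative zero-buffer size */

/* Write 'count' zero bytes at 'offset'; returns count, or -1 on error. */
static ssize_t
write_zeros(int fd, size_t count, off_t offset)
{
	static const char zbuf[ZBUF_SZ];	/* static storage: all zeroes */
	size_t		remaining = count;

	while (remaining > 0)
	{
		struct iovec iov[8];	/* illustrative cap on slots per call */
		int			niov = 0;
		size_t		batch = 0;
		ssize_t		written;

		/* Initialize only the iovec slots the remaining size needs. */
		while (niov < 8 && batch < remaining)
		{
			size_t		chunk = remaining - batch;

			if (chunk > ZBUF_SZ)
				chunk = ZBUF_SZ;	/* trailing element is just shorter */
			iov[niov].iov_base = (void *) zbuf;
			iov[niov].iov_len = chunk;
			batch += chunk;
			niov++;
		}

		written = pwritev(fd, iov, niov, offset);
		if (written < 0)
			return -1;
		/* On a partial write, simply loop; the data is all zeroes anyway. */
		offset += written;
		remaining -= (size_t) written;
	}
	return (ssize_t) count;
}
```

The real pg_pwritev_with_retry() additionally retries on EINTR and caps the per-call size differently, but the shape of the loop, rebuilding an on-stack iovec each iteration rather than keeping a static one, is the point under discussion.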
Re: Support logical replication of DDLs
On 2/16/23 12:53 PM, Alvaro Herrera wrote: On 2023-Feb-16, Jonathan S. Katz wrote: [replication tries to execute this command] 2023-02-16 16:11:10.570 UTC [25207] STATEMENT: CREATE OR REPLACE FUNCTION public.availability_rule_bulk_insert ( IN availability_rule public.availability_rule, IN day_of_week pg_catalog.int4 ) RETURNS pg_catalog.void LANGUAGE sql VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT SECURITY INVOKER COST 100 AS $_$ INSERT INTO availability ( room_id, availability_rule_id, available_date, available_range ) [which results in:] 2023-02-16 16:11:10.570 UTC [25207] ERROR: relation "availability" does not exist at character 279 I don't think this is the fault of logical replication. Consider that for the backend server, the function source code is just an opaque string that is given to the plpgsql engine to interpret. So there's no way for the logical DDL replication engine to turn this into runnable code if the table name is not qualified. Sure, that's fair. That said, the example above would fall under a "typical use case", i.e. I'm replicating functions that call tables without schema qualification. This is pretty common, and as logical replication becomes used for more types of workloads (e.g. high availability), we'll definitely see this. (The fact that this is a security-invoker function prevents you from attaching a SET search_path clause to the function, I believe? Which means it is extra dangerous to have an unqualified table reference there.) Yes, but the level of danger would depend on how the schema is actually used. And while the above pattern is not great, it is still widely common. My high level guess without looking at the code is that the apply worker is not aware of the search_path to use when processing functions during creation. Provided that the publisher/subscriber environments are similar (if not identical), I would expect that if the function create succeeds on the publisher, it should also succeed on the subscriber. 
If we're going to force search_path and all other settings to be identical, then we might as well give up the whole deparsing design and transmit the original string for execution in the replica; it is much simpler. But this idea was rejected outright when this stuff was first proposed years ago. Hm, maybe we go the other way in terms of execution of function bodies, i.e. we don't try to run/parse it on the subscriber? If the function body is just passed in as a string, can we just insert it without doing any evaluation on the source code? I'd have to think a little bit more about the SQL standard bodies (BEGIN ATOMIC)...though AIUI it would possibly be a similar flow (execute on publisher, just copy w/o execution into subscriber)? If I'm using DDL replication, I'm trying to keep my publisher/subscribers synchronized to a reasonable level of consistency, so it is highly likely the function should work when it's called. I know things can go wrong and break, particularly if I've made independent changes to the schema on the subscriber, but that can happen anyway today with functions on a single instance. Thanks, Jonathan
Re: Add WAL read stats to pg_stat_wal
Hi, On 2023-02-16 23:39:00 +0530, Bharath Rupireddy wrote: > While working on [1], I was in need to know WAL read stats (number of > times and amount of WAL data read from disk, time taken to read) to > measure the benefit. I had to write a developer patch to capture WAL > read stats as pg_stat_wal currently emits WAL write stats. With recent > works on pg_stat_io which emit data read IO stats too, I think it's > better to not miss WAL read stats. It might help others who keep an > eye on IOPS of the production servers for various reasons. The WAL > read stats I'm thinking useful are wal_read_bytes - total amount of > WAL read, wal_read - total number of times WAL is read from disk, > wal_read_time - total amount of time spent reading WAL (tracked only > when an existing GUC track_wal_io_timing is on). It doesn't really seem useful to have this in pg_stat_wal, because you can't really figure out where those reads are coming from. Are they crash recovery? Walsender? ...? I think this'd better be handled by adding WAL support for pg_stat_io. Then the WAL reads would be attributed to the relevant backend type, making it easier to answer such questions. Going forward I want to add support for seeing pg_stat_io for individual connections, which'd then automatically support this feature for the WAL reads as well. Eventually I think pg_stat_wal should only track wal_records, wal_fpi, wal_buffers_full and fill the other columns from pg_stat_io. However, this doesn't "solve" the following issue: > Note that the patch needs a bit more work, per [2]. With the patch, > the WAL senders (processes exiting after checkpointer) will generate > stats and we need to either let all or only one WAL sender to write > stats to disk. Allowing one WAL sender to write might be tricky. > Allowing all WAL senders to write might make too many writes to the > stats file. And, we need a lock to let only one process write. I can't > think of a best way here at the moment. > > Thoughts? 
> > [1] > https://www.postgresql.org/message-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=q...@mail.gmail.com > [2] > /* > * Write out stats after shutdown. This needs to be called by exactly one > * process during a normal shutdown, and since checkpointer is shut down > * very late... > * > * Walsenders are shut down after the checkpointer, but currently don't > * report stats. If that changes, we need a more complicated solution. > */ > before_shmem_exit(pgstat_before_server_shutdown, 0); I wonder if we should keep the checkpointer around for longer. If we have checkpointer signal postmaster after it wrote the shutdown checkpoint, postmaster could signal walsenders to shut down, and checkpointer could do some final work, like writing out the stats. I suspect this could be useful for other things as well. It's awkward that we don't have a place to put "just before shutting down" type tasks. And checkpointer seems well suited for that. Greetings, Andres Freund
Re: Time delayed LR (WAS Re: logical replication restrictions)
Hi, On 2023-02-16 14:21:01 +0900, Kyotaro Horiguchi wrote: > I'm not sure why output plugin is involved in the delay mechanism. +many The output plugin absolutely never should be involved in something like this. It was a grave mistake that OutputPluginUpdateProgress() calls were added to the commit callback and then proliferated. > It appears to me that it would be simpler if the delay occurred in reorder > buffer or logical decoder instead. This is a feature specific to walsender. So the triggering logic should either directly live in the walsender, or in a callback set in LogicalDecodingContext. That could be called from decode.c or such. Greetings, Andres Freund
Re: recovery modules
Hi, On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote: > On Wed, Feb 15, 2023 at 03:38:21PM +0900, Michael Paquier wrote: > > I may tweak a bit the comments, but nothing more. And I don't think I > > have more to add. Andres, do you have anything you would like to > > mention? Just some minor comments below. None of them needs to be addressed. > @@ -144,10 +170,12 @@ basic_archive_configured(void) > * Archives one file. > */ > static bool > -basic_archive_file(const char *file, const char *path) > +basic_archive_file(ArchiveModuleState *state, const char *file, const char > *path) > { > sigjmp_buf local_sigjmp_buf; Not related to the things changed here, but this should never have been pushed down into individual archive modules. There's absolutely no way that we're going to keep this up2date and working correctly in random archive modules. And it would break if archive modules are ever called outside of pgarch.c. > +static void > +basic_archive_shutdown(ArchiveModuleState *state) > +{ > + BasicArchiveData *data = (BasicArchiveData *) (state->private_data); The parens around (state->private_data) are imo odd. > + basic_archive_context = data->context; > + Assert(CurrentMemoryContext != basic_archive_context); > + > + if (MemoryContextIsValid(basic_archive_context)) > + MemoryContextDelete(basic_archive_context); I guess I'd personally be paranoid and clean data->context after this. Obviously doesn't matter right now, but at some later date it could be that we'd error out after this point, and re-entered the shutdown callback. > + > +/* > + * Archive module callbacks > + * > + * These callback functions should be defined by archive libraries and > returned > + * via _PG_archive_module_init(). ArchiveFileCB is the only required > callback. > + * For more information about the purpose of each callback, refer to the > + * archive modules documentation. 
> + */ > +typedef void (*ArchiveStartupCB) (ArchiveModuleState *state); > +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state); > +typedef bool (*ArchiveFileCB) (ArchiveModuleState *state, const char *file, > const char *path); > +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state); > + > +typedef struct ArchiveModuleCallbacks > +{ > + ArchiveStartupCB startup_cb; > + ArchiveCheckConfiguredCB check_configured_cb; > + ArchiveFileCB archive_file_cb; > + ArchiveShutdownCB shutdown_cb; > +} ArchiveModuleCallbacks; If you wanted you could just define the callback types in the struct now, as we don't need asserts for the types. Greetings, Andres Freund
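As a quick illustration of the state-passing pattern under review, here is a toy, self-contained mock, not server code: ArchiveModuleState is pared down to just private_data, and the demo_* module is hypothetical. What it shows is why each callback receives the state pointer: the module keeps its bookkeeping there (allocated in startup, cleared in shutdown, the defensive reset the review suggests) instead of in globals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Pared-down stand-ins for the server-side types quoted above. */
typedef struct ArchiveModuleState
{
	void	   *private_data;
} ArchiveModuleState;

typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
typedef bool (*ArchiveFileCB) (ArchiveModuleState *state, const char *file, const char *path);
typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state);

typedef struct ArchiveModuleCallbacks
{
	ArchiveStartupCB startup_cb;
	ArchiveCheckConfiguredCB check_configured_cb;
	ArchiveFileCB archive_file_cb;
	ArchiveShutdownCB shutdown_cb;
} ArchiveModuleCallbacks;

/* Toy module: counts archived files in per-state private data. */
typedef struct DemoArchiveData
{
	int			files_archived;
} DemoArchiveData;

static void
demo_startup(ArchiveModuleState *state)
{
	state->private_data = calloc(1, sizeof(DemoArchiveData));
}

static bool
demo_check_configured(ArchiveModuleState *state)
{
	(void) state;
	return true;
}

static bool
demo_archive_file(ArchiveModuleState *state, const char *file, const char *path)
{
	(void) file;
	(void) path;
	((DemoArchiveData *) state->private_data)->files_archived++;
	return true;
}

static void
demo_shutdown(ArchiveModuleState *state)
{
	free(state->private_data);
	state->private_data = NULL;	/* clear it, in case shutdown is re-entered */
}

/* Drive the callbacks the way an archiver loop might; returns files archived. */
static int
demo_run(int nfiles)
{
	ArchiveModuleCallbacks cb = {
		demo_startup, demo_check_configured, demo_archive_file, demo_shutdown
	};
	ArchiveModuleState state = {NULL};
	int			total;

	cb.startup_cb(&state);
	for (int i = 0; i < nfiles; i++)
		if (cb.check_configured_cb(&state))
			(void) cb.archive_file_cb(&state, "segment", "path/segment");
	total = ((DemoArchiveData *) state.private_data)->files_archived;
	cb.shutdown_cb(&state);
	return total;
}
```

Because all per-module data hangs off the state object, two modules (or two instances of one module) could run in the same process without clobbering each other, which is the motivation for threading ArchiveModuleState through every callback.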
Re: Make ON_ERROR_STOP stop on shell script failure
On 23/11/2022 00:16, Matheus Alcantara wrote: --- Original Message --- On Tuesday, November 22nd, 2022 at 20:10, bt22nakamorit wrote: There was a mistake in the error message for \! so I updated the patch. Best, Tatsuhiro Nakamori Hi I was checking your patch and seems that it failed to be applied into the master cleanly. Could you please rebase it? Was also looking into this patch. Tatsuhiro: can you please send a rebased version? Thanks -- Andreas 'ads' Scherbaum German PostgreSQL User Group European PostgreSQL User Group - Board of Directors Volunteer Regional Contact, Germany - PostgreSQL Project
Re: Make ON_ERROR_STOP stop on shell script failure
On 16/02/2023 20:33, Andreas 'ads' Scherbaum wrote: On 23/11/2022 00:16, Matheus Alcantara wrote: --- Original Message --- On Tuesday, November 22nd, 2022 at 20:10, bt22nakamorit wrote: There was a mistake in the error message for \! so I updated the patch. Best, Tatsuhiro Nakamori Hi I was checking your patch and seems that it failed to be applied into the master cleanly. Could you please rebase it? Was also looking into this patch. Tatsuhiro: can you please send a rebased version? The email address is bouncing. That might be why ... -- Andreas 'ads' Scherbaum German PostgreSQL User Group European PostgreSQL User Group - Board of Directors Volunteer Regional Contact, Germany - PostgreSQL Project
Re: Support logical replication of DDLs
On 2023-Feb-16, Jonathan S. Katz wrote: > On 2/16/23 12:53 PM, Alvaro Herrera wrote: > > I don't think this is the fault of logical replication. Consider that > > for the backend server, the function source code is just an opaque > > string that is given to the plpgsql engine to interpret. So there's no > > way for the logical DDL replication engine to turn this into runnable > > code if the table name is not qualified. > > Sure, that's fair. That said, the example above would fall under a "typical > use case", i.e. I'm replicating functions that call tables without schema > qualification. This is pretty common, and as logical replication becomes > used for more types of workloads (e.g. high availability), we'll definitely > see this. Hmm, I think you're saying that replay should turn check_function_bodies off, and I think I agree with that. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
Re: Support logical replication of DDLs
On 2/16/23 2:38 PM, Alvaro Herrera wrote: On 2023-Feb-16, Jonathan S. Katz wrote: On 2/16/23 12:53 PM, Alvaro Herrera wrote: I don't think this is the fault of logical replication. Consider that for the backend server, the function source code is just an opaque string that is given to the plpgsql engine to interpret. So there's no way for the logical DDL replication engine to turn this into runnable code if the table name is not qualified. Sure, that's fair. That said, the example above would fall under a "typical use case", i.e. I'm replicating functions that call tables without schema qualification. This is pretty common, and as logical replication becomes used for more types of workloads (e.g. high availability), we'll definitely see this. Hmm, I think you're saying that replay should turn check_function_bodies off, and I think I agree with that. Yes, exactly. +1 The docs seem to think that is the correct approach too: "Set this parameter to off before loading functions on behalf of other users"[1]. [1] https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-CHECK-FUNCTION-BODIES
Re: recovery modules
Thanks for reviewing. On Thu, Feb 16, 2023 at 11:29:56AM -0800, Andres Freund wrote: > On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote: >> @@ -144,10 +170,12 @@ basic_archive_configured(void) >> * Archives one file. >> */ >> static bool >> -basic_archive_file(const char *file, const char *path) >> +basic_archive_file(ArchiveModuleState *state, const char *file, const char >> *path) >> { >> sigjmp_buf local_sigjmp_buf; > > Not related the things changed here, but this should never have been pushed > down into individual archive modules. There's absolutely no way that we're > going to keep this up2date and working correctly in random archive > modules. And it would break if archive modules are ever called outside of > pgarch.c. Yeah. IIRC I did briefly try to avoid this, but the difficulty was that each module will have its own custom cleanup logic. There's no requirement that a module creates an exception handler, but I imagine that any sufficiently complex one will. In any case, I agree that it's worth trying to pull this out of the individual modules. >> +static void >> +basic_archive_shutdown(ArchiveModuleState *state) >> +{ >> +BasicArchiveData *data = (BasicArchiveData *) (state->private_data); > > The parens around (state->private_data) are imo odd. Oops, removed. >> +basic_archive_context = data->context; >> +Assert(CurrentMemoryContext != basic_archive_context); >> + >> +if (MemoryContextIsValid(basic_archive_context)) >> +MemoryContextDelete(basic_archive_context); > > I guess I'd personally be paranoid and clean data->context after > this. Obviously doesn't matter right now, but at some later date it could be > that we'd error out after this point, and re-entered the shutdown callback. Done. >> +/* >> + * Archive module callbacks >> + * >> + * These callback functions should be defined by archive libraries and >> returned >> + * via _PG_archive_module_init(). ArchiveFileCB is the only required >> callback. 
>> + * For more information about the purpose of each callback, refer to the >> + * archive modules documentation. >> + */ >> +typedef void (*ArchiveStartupCB) (ArchiveModuleState *state); >> +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state); >> +typedef bool (*ArchiveFileCB) (ArchiveModuleState *state, const char *file, >> const char *path); >> +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state); >> + >> +typedef struct ArchiveModuleCallbacks >> +{ >> +ArchiveStartupCB startup_cb; >> +ArchiveCheckConfiguredCB check_configured_cb; >> +ArchiveFileCB archive_file_cb; >> +ArchiveShutdownCB shutdown_cb; >> +} ArchiveModuleCallbacks; > > If you wanted you could just define the callback types in the struct now, as > we don't need asserts for the types. This crossed my mind. I thought it was nice to have a declaration for each callback that we can copy into the docs, but I'm sure we could do without it, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 5969db5576f9d475169f1b0e25fcb4d9d331c9f4 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 9 Feb 2023 13:49:46 -0800 Subject: [PATCH v15 1/1] restructure archive modules API --- contrib/basic_archive/basic_archive.c | 86 --- doc/src/sgml/archive-modules.sgml | 35 ++-- src/backend/Makefile | 2 +- src/backend/archive/Makefile | 18 src/backend/archive/meson.build | 5 ++ .../{postmaster => archive}/shell_archive.c | 32 --- src/backend/meson.build | 1 + src/backend/postmaster/Makefile | 1 - src/backend/postmaster/meson.build| 1 - src/backend/postmaster/pgarch.c | 27 +++--- src/backend/utils/misc/guc_tables.c | 1 + src/include/archive/archive_module.h | 58 + src/include/archive/shell_archive.h | 24 ++ src/include/postmaster/pgarch.h | 39 - 14 files changed, 243 insertions(+), 87 deletions(-) create mode 100644 src/backend/archive/Makefile create mode 100644 src/backend/archive/meson.build rename src/backend/{postmaster => archive}/shell_archive.c (77%) create mode 
100644 src/include/archive/archive_module.h create mode 100644 src/include/archive/shell_archive.h diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index 36b7a4814a..cd852888ce 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -30,9 +30,9 @@ #include #include +#include "archive/archive_module.h" #include "common/int.h" #include "miscadmin.h" -#include "postmaster/pgarch.h" #include "storage/copydir.h" #include "storage/fd.h" #include "utils/guc.h" @@ -40,14 +40,27 @@ PG_MODULE_MAGIC; +typedef struct BasicArchiveData +{ + MemoryContext context; +} BasicArchiveData; + static char *archive_dire
Re: Support logical replication of DDLs
On 2/16/23 2:43 PM, Jonathan S. Katz wrote: On 2/16/23 2:38 PM, Alvaro Herrera wrote: On 2023-Feb-16, Jonathan S. Katz wrote: On 2/16/23 12:53 PM, Alvaro Herrera wrote: I don't think this is the fault of logical replication. Consider that for the backend server, the function source code is just an opaque string that is given to the plpgsql engine to interpret. So there's no way for the logical DDL replication engine to turn this into runnable code if the table name is not qualified. Sure, that's fair. That said, the example above would fall under a "typical use case", i.e. I'm replicating functions that call tables without schema qualification. This is pretty common, and as logical replication becomes used for more types of workloads (e.g. high availability), we'll definitely see this. Hmm, I think you're saying that replay should turn check_function_bodies off, and I think I agree with that. Yes, exactly. +1 I drilled into this a bit more using the SQL standard bodies (BEGIN ATOMIC) to see if there were any other behaviors we needed to account for. Overall, it worked well but I ran into one issue. First, functions with "BEGIN ATOMIC" ignores "check_function_bodies" which is by design based on how this feature works. We should still turn "check_function_bodies" to "off" though, per above discussion. In the context of DDL replication, "BEGIN ATOMIC" does support schema-unqualified functions, presumably because it includes the parsed content? I created an updated example[1] where I converted the SQL functions to use the standard syntax and I returned the table names to be schema unqualified. 
This seemed to work, but I ran into a weird case with this function: CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, calendar_date date) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH delete_calendar AS ( DELETE FROM calendar WHERE room_id = $1 AND calendar_date = $2 ) INSERT INTO calendar (room_id, status, calendar_date, calendar_range) SELECT $1, c.status, $2, c.calendar_range FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; END; This produced an error on the subscriber, with the following message: 2023-02-16 20:58:24.096 UTC [26864] ERROR: missing FROM-clause entry for table "calendar_1" at character 322 2023-02-16 20:58:24.096 UTC [26864] CONTEXT: processing remote data for replication origin "pg_18658" during message type "DDL" in transaction 980, finished at 0/C099A7D8 2023-02-16 20:58:24.096 UTC [26864] STATEMENT: CREATE OR REPLACE FUNCTION public.calendar_manage ( IN room_id pg_catalog.int4, IN calendar_date pg_catalog.date ) RETURNS pg_catalog.void LANGUAGE sql VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT SECURITY INVOKER COST 100 BEGIN ATOMIC WITH delete_calendar AS ( DELETE FROM public.calendar WHERE ((calendar_1.room_id OPERATOR(pg_catalog.=) calendar_manage.room_id) AND (calendar_1.calendar_date OPERATOR(pg_catalog.=) calendar_manage.calendar_date)) ) INSERT INTO public.calendar (room_id, status, calendar_date, calendar_range) SELECT calendar_manage.room_id, c.status, calendar_manage.calendar_date, c.calendar_range FROM public.calendar_generate_calendar(calendar_manage.room_id, pg_catalog.tstzrange((calendar_manage.calendar_date)::timestamp with time zone, ((calendar_manage.calendar_date OPERATOR(pg_catalog.+) 1))::timestamp with time zone)) c(status, calendar_range); END This seemed to add an additional, incorrect reference to the origin table for the "room_id" and "calendar_date" attributes within the CTE of this function. 
I don't know if this is directly related to the DDL replication patch, but reporting it as I triggered the behavior through it. Thanks, Jonathan [1] https://gist.github.com/jkatz/fe29006b724fd6f32ee849a96dc01608 OpenPGP_signature Description: OpenPGP digital signature
Re: recovery modules
Hi, On 2023-02-16 12:15:12 -0800, Nathan Bossart wrote: > Thanks for reviewing. > > On Thu, Feb 16, 2023 at 11:29:56AM -0800, Andres Freund wrote: > > On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote: > >> @@ -144,10 +170,12 @@ basic_archive_configured(void) > >> * Archives one file. > >> */ > >> static bool > >> -basic_archive_file(const char *file, const char *path) > >> +basic_archive_file(ArchiveModuleState *state, const char *file, const > >> char *path) > >> { > >>sigjmp_buf local_sigjmp_buf; > > > > Not related the things changed here, but this should never have been pushed > > down into individual archive modules. There's absolutely no way that we're > > going to keep this up2date and working correctly in random archive > > modules. And it would break if archive modules are ever called outside of > > pgarch.c. > > Yeah. IIRC I did briefly try to avoid this, but the difficulty was that > each module will have its own custom cleanup logic. It can use PG_TRY/CATCH for that, if the top-level sigsetjmp is in pgarch.c. Or you could add a cleanup callback to the API, to be called after the top-level cleanup in pgarch.c. I'm quite baffled by: /* Close any files left open by copy_file() or compare_files() */ AtEOSubXact_Files(false, InvalidSubTransactionId, InvalidSubTransactionId); in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() completely outside the context of a transaction environment. And it only does the thing you want because you pass parameters that aren't actually valid in the normal use in AtEOSubXact_Files(). I really don't understand how that's supposed to be ok. Greetings, Andres Freund
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
Hi, On 2023-02-15 09:21:48 +0100, Drouvot, Bertrand wrote: > diff --git a/src/backend/utils/activity/pgstat_relation.c > b/src/backend/utils/activity/pgstat_relation.c > index f793ac1516..b26e2a5a7a 100644 > --- a/src/backend/utils/activity/pgstat_relation.c > +++ b/src/backend/utils/activity/pgstat_relation.c > @@ -471,20 +471,46 @@ pgstat_fetch_stat_tabentry_ext(bool shared, Oid reloid) > * Find any existing PgStat_TableStatus entry for rel_id in the current > * database. If not found, try finding from shared tables. > * > + * If an entry is found, copy it and increment the copy's counters with their > + * subtransactions counterparts. Then return the copy. There is no need for > the > + * caller to pfree the copy as the MemoryContext will be reset soon after. > + * The "There is no need" bit seems a bit off. Yes, that's true for the current callers, but who says that it has to stay that way? Otherwise this looks ready, on a casual scan. Greetings, Andres Freund
Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
On 2023-02-16 12:37:57 +0530, Robert Haas wrote: > The patch creates 100_bugs.pl What's the story behind 100_bugs.pl? This name clearly is copied from src/test/subscription/t/100_bugs.pl - but I've never understood why that is outside of the normal numbering space.
Re: Missing free_var() at end of accum_sum_final()?
Hi, On 2023-02-16 15:26:26 +0900, Michael Paquier wrote: > On Thu, Feb 16, 2023 at 06:59:13AM +0100, Joel Jacobson wrote: > > I noticed the NumericVar's pos_var and neg_var are not free_var()'d > > at the end of accum_sum_final(). > > > > The potential memory leak seems small, since the function is called > > only once per sum() per worker (and from a few more places), but > > maybe they should be free'd anyways for correctness? > > Indeed, it is true that any code path of numeric.c that relies on a > NumericVar with an allocation done in its buffer is careful enough to > free it, except for generate_series's SRF where one step of the > computation is done. I don't see directly why you could not do the > following: > @@ -11973,6 +11973,9 @@ accum_sum_final(NumericSumAccum *accum, NumericVar > *result) > /* And add them together */ > add_var(&pos_var, &neg_var, result); > > + free_var(&pos_var); > + free_var(&neg_var); > + > /* Remove leading/trailing zeroes */ > strip_var(result); But why do we need it? Most SQL callable functions don't need to be careful about not leaking O(1) memory, the exception being functions backing btree opclasses. In fact, the detailed memory management often is *more* expensive than just relying on the calling memory context being reset. Of course, numeric.c doesn't really seem to have gotten that message, so there's a consistency argument here. Greetings, Andres Freund
Re: Add connection active, idle time to pg_stat_activity
On Wed, Feb 1, 2023 at 12:46 PM Sergey Dudoladov wrote: > > I've sketched the first version of a patch to add pg_stat_session. > Please review this early version. Hi Sergey! I've taken a look at the patch and have some notes. 1. It is hard to understand what the fastpath backend state is. What do fastpath metrics mean for a user? 2. Anyway, the path "if (PGSTAT_IS_FASTPATH(beentry))" seems unreachable to me. I'm a bit surprised that compilers do not produce warnings about it. Maybe I'm just wrong. 3. Tests do not check any incrementation logic. I think we can have a test that verifies the delta of select some_counter from pg_stat_session where pid = pg_backend_pid(); 4. Macros like PGSTAT_IS_RUNNING do not look like a net win in code readability, and the PGSTAT prefix carries no semantic load. That's all I've found so far. Thank you! Best regards, Andrey Borodin. PS. We did an on-air review session [0]; I hope Nik will chime in with the "usability part of a review". [0] https://youtu.be/vTV8XhWf3mo?t=2404
Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Hi everyone, I've been working on a federated database project that heavily relies on foreign data wrappers. During benchmarking, we noticed high system CPU usage in OLTP-related cases, which we traced back to multiple brk calls resulting from block frees in AllocSetReset upon ExecutorEnd's FreeExecutorState. This happens because FDWs allocate their own derived execution states and required data structures within this context; these exceed the initial 8K allocation and must be cleaned up. Increasing the default query context allocation from ALLOCSET_DEFAULT_SIZES to a larger initial "appropriate size" solved the issue and almost doubled the throughput. However, the "appropriate size" is workload- and implementation-dependent, so making it configurable may be better than increasing the defaults, which would negatively impact users (memory-wise) who aren't encountering this scenario. I have a patch to make it configurable, but before submitting it, I wanted to hear your thoughts and feedback on this and any other alternative ideas you may have. -- Jonah H. Harris
Re: recovery modules
On Thu, Feb 16, 2023 at 01:17:54PM -0800, Andres Freund wrote: > On 2023-02-16 12:15:12 -0800, Nathan Bossart wrote: >> On Thu, Feb 16, 2023 at 11:29:56AM -0800, Andres Freund wrote: >> > On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote: >> >> @@ -144,10 +170,12 @@ basic_archive_configured(void) >> >> * Archives one file. >> >> */ >> >> static bool >> >> -basic_archive_file(const char *file, const char *path) >> >> +basic_archive_file(ArchiveModuleState *state, const char *file, const >> >> char *path) >> >> { >> >> sigjmp_buf local_sigjmp_buf; >> > >> > Not related the things changed here, but this should never have been pushed >> > down into individual archive modules. There's absolutely no way that we're >> > going to keep this up2date and working correctly in random archive >> > modules. And it would break if archive modules are ever called outside of >> > pgarch.c. >> >> Yeah. IIRC I did briefly try to avoid this, but the difficulty was that >> each module will have its own custom cleanup logic. > > It can use PG_TRY/CATCH for that, if the top-level sigsetjmp is in pgarch.c. > Or you could add a cleanup callback to the API, to be called after the > top-level cleanup in pgarch.c. Yeah, that seems workable. > I'm quite baffled by: > /* Close any files left open by copy_file() or compare_files() > */ > AtEOSubXact_Files(false, InvalidSubTransactionId, > InvalidSubTransactionId); > > in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() > completely outside the context of a transaction environment. And it only does > the thing you want because you pass parameters that aren't actually valid in > the normal use in AtEOSubXact_Files(). I really don't understand how that's > supposed to be ok. Hm. Should copy_file() and compare_files() have PG_FINALLY blocks that attempt to close the files instead? What would you recommend? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [PATCH] Add pretty-printed XML output option
On 16.02.23 00:13, Peter Smith wrote: Today I fetched and tried the latest v11. It is failing too, but only just. - see attached file pretty-v11-results It looks only due to a whitespace EOF issue in the xml_2.out @@ -1679,4 +1679,4 @@ -- XML format: empty string SELECT xmlformat(''); ERROR: invalid XML document -\set VERBOSITY default \ No newline at end of file +\set VERBOSITY default -- The attached patch update (v12-0002) fixes the xml_2.out for me. I'm squashing v12-0001 and v12-0002 (v13 attached). There is still an open discussion on renaming the function to xmlserialize,[1] but it shouldn't be too difficult to change it later in case we reach a consensus :) Thanks! Jim 1- https://www.postgresql.org/message-id/CANNMO%2BKwb4_87G8qDeN%2BVk1B1vX3HvgoGW%2B13fJ-b6rj7qbAww%40mail.gmail.com From e28e9da7890d07e10f412ad61318d7a9ce4d058c Mon Sep 17 00:00:00 2001 From: Jim Jones Date: Thu, 16 Feb 2023 22:36:19 +0100 Subject: [PATCH v13] Add pretty-printed XML output option This small patch introduces a XML pretty print function. It basically takes advantage of the indentation feature of xmlDocDumpFormatMemory from libxml2 to format XML strings. --- doc/src/sgml/func.sgml | 34 ++ src/backend/utils/adt/xml.c | 45 + src/include/catalog/pg_proc.dat | 3 + src/test/regress/expected/xml.out | 101 src/test/regress/expected/xml_1.out | 53 +++ src/test/regress/expected/xml_2.out | 101 src/test/regress/sql/xml.sql| 33 + 7 files changed, 370 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e09e289a43..a621192425 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -14861,6 +14861,40 @@ SELECT xmltable.* ]]> + + +xmlformat + + + xmlformat + + + +xmlformat ( xml ) xml + + + + Converts the given XML value to pretty-printed, indented text. 
+ + + + Example: + + + + diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 079bcb1208..ec12707b5c 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -473,6 +473,51 @@ xmlBuffer_to_xmltype(xmlBufferPtr buf) } #endif +Datum +xmlformat(PG_FUNCTION_ARGS) +{ +#ifdef USE_LIBXML + + xmlDocPtr doc; + xmlChar*xmlbuf = NULL; + text *arg = PG_GETARG_TEXT_PP(0); + StringInfoData buf; + int nbytes; + + doc = xml_parse(arg, XMLOPTION_DOCUMENT, false, GetDatabaseEncoding(), NULL); + + if(!doc) + elog(ERROR, "could not parse the given XML document"); + + /** + * xmlDocDumpFormatMemory ( + * xmlDocPtr doc, # the XML document + * xmlChar **xmlbuf, # the memory pointer + * int *nbytes, # the memory length + * int format # 1 = node indenting + *) + */ + + xmlDocDumpFormatMemory(doc, &xmlbuf, &nbytes, 1); + + xmlFreeDoc(doc); + + if(!nbytes) + elog(ERROR, "could not indent the given XML document"); + + initStringInfo(&buf); + appendStringInfoString(&buf, (const char *)xmlbuf); + + xmlFree(xmlbuf); + + PG_RETURN_XML_P(stringinfo_to_xmltype(&buf)); + +#else + NO_XML_SUPPORT(); +return 0; +#endif +} + Datum xmlcomment(PG_FUNCTION_ARGS) diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index c0f2a8a77c..54e8a6262a 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -8842,6 +8842,9 @@ { oid => '3053', descr => 'determine if a string is well formed XML content', proname => 'xml_is_well_formed_content', prorettype => 'bool', proargtypes => 'text', prosrc => 'xml_is_well_formed_content' }, +{ oid => '4642', descr => 'Indented text from xml', + proname => 'xmlformat', prorettype => 'xml', + proargtypes => 'xml', prosrc => 'xmlformat' }, # json { oid => '321', descr => 'I/O', diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out index 3c357a9c7e..e45116aaa7 100644 --- a/src/test/regress/expected/xml.out +++ b/src/test/regress/expected/xml.out @@ 
-1599,3 +1599,104 @@ SELECT * FROM XMLTABLE('.' PASSING XMLELEMENT(NAME a) columns a varchar(20) PATH |(1 row) +-- XML format: single line XML string +SELECT xmlformat('Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650'); +xmlformat +-- + + + + + Belgian Waffles+ + $5.95+ + Two of our famous Belgian Waffles with plenty of real maple syrup+ + 650
Introduce list_reverse() to make lcons() usage less inefficient
While working on [1] to improve the query planner's speed at finding EquivalenceMembers in an EquivalenceClass, performance tests with that patch (which has a large performance impact of its own) started to highlight some other places that bottleneck the planner's performance. One of those places is in generate_orderedappend_paths(), when we find that the required sort order is the same as the reverse of the partition order. In this case, we build a list of paths for each partition using the lcons() function. Since Lists are now arrays in PostgreSQL, lcons() isn't as efficient as it once was: it must memmove the entire existing contents of the list up one element to make room to prepend the new element. This is effectively quadratic and becomes noticeable with a large number of partitions. One way we could solve that is to just lappend() the new item and reverse the order of the list only when we need to. This has the added advantage of removing a bunch of semi-duplicated code from generate_orderedappend_paths(). It also has a noticeable impact on the planner's performance. 
I did a quick test with: create table lp (a int, b int) partition by list(a); select 'create table lp'||x::text||' partition of lp for values in('||x::text||');' from generate_Series(1,1)x; \gexec create index on lp(a); Using: psql -c "explain (analyze, timing off) select * from lp order by a desc" postgres | grep "Planning Time" master: Planning Time: 6034.765 ms Planning Time: 5919.914 ms Planning Time: 5720.529 ms master + eclass idx (from [1]) (yes, it really is this much faster) Planning Time: 549.262 ms Planning Time: 489.023 ms Planning Time: 497.803 ms master + eclass idx + list_reverse (attached) Planning Time: 517.067 ms Planning Time: 463.613 ms Planning Time: 463.036 ms I suspect there won't be much controversy here and there's certainly not much complexity, so in absence of anyone voicing an opinion here, I'm inclined to not waste too much time on this one and just get it done. I'll leave it for a few days. David [1] https://postgr.es/m/flat/caj2pmkzncgoukse+_5lthd+kbxkvq6h2hqn8esxpxd+cxmg...@mail.gmail.com diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index a709d23ef1..5f1f675e44 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -1654,6 +1654,37 @@ list_copy_deep(const List *oldlist) return newlist; } +/* + * Reverses the order of 'list' elements in place and returns the input list + */ +List * +list_reverse(List *list) +{ + ListCell *head; + ListCell *tail; + + if (list == NIL) + return NIL; + + head = &list->elements[0]; + tail = &list->elements[list->length - 1]; + + while (head < tail) + { + ListCell tmp; + + /* Swap data at the head and tail position */ + memcpy(&tmp, head, sizeof(ListCell)); + memcpy(head, tail, sizeof(ListCell)); + memcpy(tail, &tmp, sizeof(ListCell)); + + head++; + tail--; + } + + return list; +} + /* * Sort a list according to the specified comparator function. 
* diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index ae0f9bdc8a..4356102399 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -1801,7 +1801,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel, * Collect the appropriate child paths. The required logic varies * for the Append and MergeAppend cases. */ - if (match_partition_order) + if (match_partition_order || match_partition_order_desc) { /* * We're going to make a plain Append path. We don't need @@ -1822,25 +1822,6 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel, fractional_subpaths = lappend(fractional_subpaths, cheapest_fractional); } } - else if (match_partition_order_desc) - { - /* -* As above, but we need to reverse the order of the children, -* because nodeAppend.c doesn't know anything about reverse -* ordering and will scan the children in the order presented. -*/ - cheapest_startup = get_singleton_append_subpath(cheapest_startup); - cheapest_total = get_singleton_append_subpath(cheapest_total); - - startup_subpaths = lcons(cheapest_st
Re: ATTACH PARTITION seems to ignore column generation status
Alvaro Herrera writes: > On 2023-Feb-16, Alexander Lakhin wrote: >> I've encountered a query that triggers an assert added in that commit: >> CREATE TABLE t(a int, b int GENERATED ALWAYS AS (a) STORED) PARTITION BY >> RANGE (a); >> CREATE TABLE tp PARTITION OF t(b DEFAULT 1) FOR VALUES FROM (0) to (1); > It seems wrong that this command is accepted. It should have given an > error, because the partition is not allowed to override the generation > of the value that is specified in the parent table. Agreed. We missed a check somewhere, will fix. regards, tom lane
psql \watch 2nd argument: iteration count
Hi hackers! From time to time I want to collect some stats from locks, activity and other stat views into one table at different points in time. In this case the \watch psql command is very handy. However, it's not currently possible to specify the number of times a query is performed. Also, if we do not provide a timespan, 2 seconds are selected; but if we provide an incorrect argument, 1 second is selected. PFA a patch that adds an iteration count argument and makes the timespan argument more consistent. What do you think? Best regards, Andrey Borodin. 0001-Add-iteration-count-argument-to-psql-watch-command.patch Description: Binary data
Re: [PATCH] Add pretty-printed XML output option
On Thu, Feb 16, 2023 at 2:12 PM Jim Jones wrote: > > I'm squashing v12-0001 and v12-0002 (v13 attached). I've looked into the patch. The code looks to conform to usual expectations. One nit: this comment should have just one asterisk. + /** And I have a dumb question: is this function protected from using external XML namespaces? What if the user passes some xmlns that will force it to read namespace data from the server filesystem? Or is it not possible? I see there are a lot of calls to xml_parse() anyway, but still... Best regards, Andrey Borodin.
Re: Introduce list_reverse() to make lcons() usage less inefficient
Hi, On 2023-02-17 11:36:40 +1300, David Rowley wrote: > While working on [1] to make improvements in the query planner around > the speed to find EquivalenceMembers in an EquivalenceClass, because > that patch does have a large impact in terms of performance > improvements, some performance tests with that patch started to > highlight some other places that bottleneck the planner's performance. > > One of those places is in generate_orderedappend_paths() when we find > that the required sort order is the same as the reverse of the > partition order. In this case, we build a list of paths for each > partition using the lcons() function. Since Lists are now arrays in > PostgreSQL, lcons() isn't as efficient as it once was and it must > memmove the entire existing contents of the list up one element to > make way to prepend the new element. This is effectively quadratic and > becomes noticeable with a large number of partitions. I have wondered before if we eventually ought to switch to embedded lists for some planner structures, including paths. add_path() inserts/deletes at points in the middle of the list, which isn't great. > One way we could solve that is to just lappend() the new item and then > just reverse the order of the list only when we need to. That's not generally the same as lcons() ing, but I guess it's fine here, because we build the lists from scratch, so the reversing actually yields the correct result. But wouldn't an even cheaper way here be to iterate over the children in reverse order when match_partition_order_desc? We can do that efficiently now. Looks like we don't have a readymade helper for it, but it'd be easy enough to add or open code. Greetings, Andres Freund
Re: Progress report of CREATE INDEX for nested partitioned tables
On Wed, Feb 08, 2023 at 04:40:49PM -0600, Justin Pryzby wrote: > This squishes together 001/2 as the main patch. > I believe it's ready. Update to address a compiler warning in the supplementary patches adding assertions. >From 71427bf7cd9927af04513ba3fe99e481a8ba1f61 Mon Sep 17 00:00:00 2001 From: Ilya Gladyshev Date: Tue, 31 Jan 2023 19:13:07 +0400 Subject: [PATCH 1/3] fix CREATE INDEX progress report with nested partitions The progress reporting was added in v12 (ab0dfc961) but the original patch didn't seem to consider the possibility of nested partitioning. When called recursively, DefineIndex() would clobber the number of completed partitions, and it was possible to end up with the TOTAL counter greater than the DONE counter. This clarifies/re-defines that the progress reporting counts both direct and indirect children, but doesn't count intermediate partitioned tables: - The TOTAL counter is set once at the start of the command. - For indexes which are newly-built, the recursively-called DefineIndex() increments the DONE counter. - For pre-existing indexes which are ATTACHed rather than built, DefineIndex() increments the DONE counter, and if the attached index is partitioned, the counter is incremented to account for each of its leaf partitions. 
Author: Ilya Gladyshev Reviewed-By: Justin Pryzby, Tomas Vondra, Dean Rasheed, Alvaro Herrera, Matthias van de Meent Discussion: https://www.postgresql.org/message-id/flat/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel%40gmail.com --- doc/src/sgml/monitoring.sgml | 10 ++- src/backend/commands/indexcmds.c | 70 +-- src/backend/utils/activity/backend_progress.c | 28 src/include/utils/backend_progress.h | 1 + 4 files changed, 102 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index dca50707ad4..945d64ed2fa 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -6896,7 +6896,10 @@ FROM pg_stat_get_backend_idset() AS backendid; When creating an index on a partitioned table, this column is set to - the total number of partitions on which the index is to be created. + the total number of partitions on which the index is to be created or attached. + In the case of intermediate partitioned tables, this includes both + direct and indirect partitions, but excludes the intermediate + partitioned tables themselves. This field is 0 during a REINDEX. @@ -6907,7 +6910,10 @@ FROM pg_stat_get_backend_idset() AS backendid; When creating an index on a partitioned table, this column is set to - the number of partitions on which the index has been created. + the number of partitions on which the index has been created or attached. + In the case of intermediate partitioned tables, this includes both + direct and indirect partitions, but excludes the intermediate + partitioned tables themselves. This field is 0 during a REINDEX. diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 16ec0b114e6..147265ddcca 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -130,6 +130,30 @@ typedef struct ReindexErrorInfo char relkind; } ReindexErrorInfo; + +/* + * Count the number of direct and indirect leaf partitions. 
Note that this + * also excludes foreign tables. + */ +static int +count_leaf_partitions(Oid relid) +{ + int nleaves = 0; + List *childs = find_all_inheritors(relid, NoLock, NULL); + ListCell *lc; + + foreach(lc, childs) + { + Oid partrelid = lfirst_oid(lc); + + if (RELKIND_HAS_STORAGE(get_rel_relkind(partrelid))) + nleaves++; + } + + list_free(childs); + return nleaves; +} + /* * CheckIndexCompatible * Determine whether an existing index definition is compatible with a @@ -1219,8 +1243,18 @@ DefineIndex(Oid relationId, Relation parentIndex; TupleDesc parentDesc; - pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL, - nparts); + /* + * Set the total number of partitions at the start of the command; + * don't update it when being called recursively. + */ + if (!OidIsValid(parentIndexId)) + { +int total_parts; + +total_parts = count_leaf_partitions(relationId); +pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL, + total_parts); + } /* Make a local copy of partdesc->oids[], just for safety */ memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts); @@ -1250,6 +1284,7 @@ DefineIndex(Oid relationId, { Oid childRelid = part_oids[i]; Relation childrel; +char child_relkind; Oid child_save_userid; int child_save_sec_context; int child_save_nestlevel; @@ -1259,6 +1294,7 @@ DefineIndex(Oid relationId, bool found = false; childrel = tabl
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Hi, On 2023-02-16 16:49:07 -0500, Jonah H. Harris wrote: > I've been working on a federated database project that heavily relies on > foreign data wrappers. During benchmarking, we noticed high system CPU > usage in OLTP-related cases, which we traced back to multiple brk calls > resulting from block frees in AllocSetReset upon ExecutorEnd's > FreeExecutorState. This is because FDWs allocate their own derived > execution states and required data structures within this context, > exceeding the initial 8K allocation, that need to be cleaned-up. What PG version? Do you have a way to reproduce this with core code, e.g. postgres_fdw/file_fdw? What is all that memory used for? Is it possible that the real issue is too many tiny allocations, due to some allocation growing slowly? > Increasing the default query context allocation from ALLOCSET_DEFAULT_SIZES > to a larger initial "appropriate size" solved the issue and almost doubled > the throughput. However, the "appropriate size" is workload and > implementation dependent, so making it configurable may be better than > increasing the defaults, which would negatively impact users (memory-wise) > who aren't encountering this scenario. > > I have a patch to make it configurable, but before submitting it, I wanted > to hear your thoughts and feedback on this and any other alternative ideas > you may have. This seems way too magic to expose to users. How would they ever know how to set it? And it will depend heavily on the specific queries, so a global config won't work well. If the issue is a specific FDW needing to make a lot of allocations, I can see adding an API to tell a memory context that it ought to be ready to allocate a certain amount of memory efficiently (e.g. by increasing the block size of the next allocation by more than 2x). Greetings, Andres Freund
Re: Make set_ps_display faster and easier to use
Hi, On 2023-02-16 14:19:24 +1300, David Rowley wrote: > After fixing up the set_ps_display()s to use set_ps_display_with_len() > where possible, I discovered some not so nice code which appends " > waiting" onto the process title. Basically, there's a bunch of code > that looks like this: > > const char *old_status; > int len; > > old_status = get_ps_display(&len); > new_status = (char *) palloc(len + 8 + 1); > memcpy(new_status, old_status, len); > strcpy(new_status + len, " waiting"); > set_ps_display(new_status); > new_status[len] = '\0'; /* truncate off " waiting" */ Yea, that code is atrocious... It took me a while to figure out that no, LockBufferForCleanup() isn't leaking memory, because it'll always reach the cleanup path *further up* in the function. Avoiding the allocation across loop iterations seems like a completely pointless optimization in these paths - we add the " waiting", precisely because it's a slow path. But of course not allocating memory would be even better... > Seeing that made me wonder if we shouldn't just have something more > generic for setting a suffix on the process title. I came up with > set_ps_display_suffix() and set_ps_display_remove_suffix(). The above > code can just become: > > set_ps_display_suffix("waiting"); > > then to remove the "waiting" suffix, just: > > set_ps_display_remove_suffix(); That'd definitely be better. It's not really a topic for this patch, but somehow the fact that we have these set_ps_display() calls all over feels wrong, particularly because most of them are paired with a pgstat_report_activity() call. It's not entirely obvious how it should be instead, but it doesn't feel right. > +/* > + * set_ps_display_suffix > + * Adjust the process title to append 'suffix' onto the end with a > space > + * between it and the current process title. > + */ > +void > +set_ps_display_suffix(const char *suffix) > +{ > + size_t len; Think this will give you an unused-variable warning in the PS_USE_NONE case. 
> +#ifndef PS_USE_NONE > + /* update_process_title=off disables updates */ > + if (!update_process_title) > + return; > + > + /* no ps display for stand-alone backend */ > + if (!IsUnderPostmaster) > + return; > + > +#ifdef PS_USE_CLOBBER_ARGV > + /* If ps_buffer is a pointer, it might still be null */ > + if (!ps_buffer) > + return; > +#endif This bit is now repeated three times. How about putting it into a helper? > +#ifndef PS_USE_NONE > +static void > +set_ps_display_internal(void) Very very minor nit: Perhaps this should be update_ps_display() or flush_ps_display() instead? Greetings, Andres Freund
Re: Introduce list_reverse() to make lcons() usage less inefficient
Andres Freund writes: > On 2023-02-17 11:36:40 +1300, David Rowley wrote: >> One of those places is in generate_orderedappend_paths() when we find >> that the required sort order is the same as the reverse of the >> partition order. In this case, we build a list of paths for each >> partition using the lcons() function. Since Lists are now arrays in >> PostgreSQL, lcons() isn't as efficient as it once was and it must >> memmove the entire existing contents of the list up one element to >> make way to prepend the new element. This is effectively quadratic and >> becomes noticeable with a large number of partitions. > I have wondered before if we eventually ought to switch to embedded lists for > some planner structures, including paths. add_path() inserts/deletes at points > in the middle of the list, which isn't great. I'm not hugely excited about that, because it presumes that paths appear in only one list, which isn't true. We could perhaps privilege RelOptInfo.pathlist over other cases, but that'd be asymmetrical and probably bug-inducing. regards, tom lane
Re: Dead code in ps_status.c
On Fri, Feb 17, 2023 at 3:38 AM Tom Lane wrote: > My account still works, and what I see on wrasse's host is > > tgl@gcc-solaris11:~$ gcc -x c /dev/null -dM -E | grep -i svr > #define __SVR4 1 > #define __svr4__ 1 > tgl@gcc-solaris11:~$ gcc -x c /dev/null -dM -E | grep -i sun > #define __sun 1 > #define sun 1 > #define __sun__ 1 > > I don't know a way to get the list of predefined macros out of the > compiler wrasse is actually using (/opt/developerstudio12.6/bin/cc), > but doing some experiments with #ifdef confirmed that it defines > __sun, __sun__, and __svr4__, but not __svr5__. Thanks. I went with __sun, because a random man page google found me for Sun "cc" mentioned that but not __sun__. Pushed. http://www.polarhome.com/service/man/?qf=cc&tf=2&of=Solaris&sf=1
Re: REASSIGN OWNED vs ALTER TABLE OWNER TO permission inconsistencies
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Wed, Feb 15, 2023 at 9:01 AM Stephen Frost wrote: > > I'm not really a fan of just dropping the CREATE check. If we go with > > "recipient needs CREATE rights" then at least without superuser > > intervention and excluding cases where REVOKE's or such are happening, > > we should be able to see that only objects where the owners of those > > objects have CREATE rights exist in the system. If we drop the CREATE > > check entirely then clearly any user who happens to have access to > > multiple roles can arrange to have objects owned by any of their roles > > in any schema or database they please without any consideration for > > what the owner of the parent object wishes. > > That's true, and it is a downside of dropping the CREATE check, but > it's also a bit hard to believe that anyone's really getting a lot of > value out of the current inconsistent checks. I agree that we should be consistent about these checks. I'm just more inclined to have that consistent result include the CREATE check than have it be dropped. Not sure that it's a huge thing, but being able to control what set of owner roles are allowed to have objects in a given schema seems useful and was certainly the intent, as I recall anyhow. Thanks, Stephen
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Thu, Feb 16, 2023 at 7:32 PM Andres Freund wrote: > What PG version? > Hey, Andres. Thanks for the reply. Given not much changed regarding that allocation context IIRC, I’d think all recents. It was observed in 13, 14, and 15. Do you have a way to reproduce this with core code, > e.g. postgres_fdw/file_fdw? I’ll have to create one - it was most evident on a TPC-C or sysbench test using the Postgres, MySQL, SQLite, and Oracle FDWs. It may be reproducible with pgbench as well. What is all that memory used for? Is it possible that the real issue is too > many tiny allocations, due to some allocation growing slowly? The FDW state management allocations and whatever each FDW needs to accomplish its goals. Different FDWs do different things. This seems way too magic to expose to users. How would they ever know how to > set it? And it will depend heavily on the specific queries, so a global config > won't > work well. Agreed on the nastiness of exposing it directly. Not that we don’t give users control of memory anyway, but that one is easier to mess up without at least putting some custom set bounds on it. If the issue is a specific FDW needing to make a lot of allocations, I can > see > adding an API to tell a memory context that it ought to be ready to > allocate a > certain amount of memory efficiently (e.g. by increasing the block size of > the > next allocation by more than 2x). While I’m happy to be wrong, it seems it is an inherent problem not really specific to FDW implementations themselves but the general expectation that all FDWs are using more of that context than non-FDW cases for similar types of operations, which wasn’t really a consideration in that allocation over time. If we come up with some sort of alternate allocation strategy, I don’t know how it would be very clean API-wise, but it’s definitely an idea. -- Jonah H. Harris
Re: Missing free_var() at end of accum_sum_final()?
On Thu, Feb 16, 2023 at 01:35:54PM -0800, Andres Freund wrote: > But why do we need it? Most SQL callable functions don't need to be careful > about not leaking O(1) memory, the exception being functions backing btree > opclasses. > > In fact, the detailed memory management often is *more* expensive than just > relying on the calling memory context being reset. > > Of course, numeric.c doesn't really seem to have gotten that message, so > there's a consistency argument here. I don't know which final result is better. The arguments go two ways: 1) Should numeric.c be simplified by removing the extra pfree()s, making it more consistent with the global assumptions rather than just this file's own conventions? This has the disadvantage of creating more noise in backpatching, while increasing the risk of leaks if some of the removed parts are allocated in a tight loop within the same context. This makes memory management less complicated. That's how I am understanding your point. 2) Should the style within numeric.c be made more consistent? This is how I am understanding this proposal. As you say, this makes memory management more complicated (though I am not convinced about that for the internals of numerics?), while making the file more consistent. At the end, perhaps that's not worth bothering with, but 2) prevails when it comes to the rule of making some code consistent with its surroundings, and 1) has more risks seeing how old this code is. -- Michael
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
On Thu, Feb 16, 2023 at 8:39 PM Drouvot, Bertrand wrote: > > On 2/16/23 1:26 PM, Drouvot, Bertrand wrote: > > Hi, > > > > On 2/16/23 12:00 PM, Amit Kapila wrote: > >> I think this would require XLOG_PAGE_MAGIC as it changes the WAL record. > >> > > > > Oh, I Was not aware about it, thanks! Will do in V2 (and in the logical > > decoding on standby patch as it adds the new field mentioned above). > > > >> BTW, how about a commit message like: > >> Change xl_hash_vacuum_one_page.ntuples from int to uint16. > >> > >> This will create two bytes of padding space in xl_hash_vacuum_one_page > >> which can be used for future patches. This makes the datatype of > >> xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which > >> is advisable as both are used for the same purpose. > >> > > > > LGTM, will add it to V2! > > > Please find V2 attached. > The commit message also mention the XLOG_PAGE_MAGIC bump. > Thanks, I was not completely sure about whether we need to bump XLOG_PAGE_MAGIC for this patch as this makes the additional space just by changing the datatype of one of the members of the existing WAL record. We normally change it for the addition/removal of new fields aka change in WAL record format, or addition of a new type of WAL record. Does anyone else have an opinion/suggestion on this matter? -- With Regards, Amit Kapila.
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov wrote: > First of all it seemed to me that is not a problem at all since msdn > guarantees sector-by-sector atomicity. > "Physical Sector: The unit for which read and write operations to the device > are completed in a single operation. This is the unit of atomic write..." > https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering > https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile > (Of course, only if the 512 bytes lays from the beginning of the file with a > zero > offset, but this is our case. The current size of ControlFileData is > 296 bytes at offset = 0.) There are two kinds of atomicity that we rely on for the control file today: * atomicity on power loss (= device property, in case of overwrite filesystems) * atomicity of concurrent reads and writes (= VFS or kernel buffer pool interlocking policy) I assume that documentation is talking about the first thing (BTW, I suspect that documentation is also now wrong in one special case: NTFS filesystems mounted on DAX devices don't have sectors or sector-based atomicity unless you turn on BTT and slow them down[1]; that might eventually be something for PostgreSQL to think about, and it applies to other OSes too). With this patch we would stop relying on the second thing. Supposedly POSIX requires read/write atomicity, and many file systems offer it in a strong form (internal range locking) or maybe a weak accidental form (page-level locking). Since some extremely popular implementations just don't care, and Windows isn't even a POSIX, we just have to do it ourselves. 
BTW there are at least two other places where PostgreSQL already knows that concurrent reads and writes are possibly non-atomic (and we also don't even try to get the alignment right, making the question moot): pg_basebackup, which enables full_page_writes implicitly if you didn't have the GUC on already, and pg_rewind, which refuses to run if you haven't enabled full_page_writes explicitly (as recently discussed on another thread; that's an annoying difference, and also an annoying behaviour if you know your storage doesn't really need it!) > I tried to verify this fact experimentally and was very surprised. > Unfortunately it crashed in an hour during torture test: > 2023-02-13 15:10:52.675 MSK [5704] LOG: starting PostgreSQL 16devel, > compiled by Visual C++ build 1929, 64-bit > 2023-02-13 15:10:52.768 MSK [5704] LOG: database system is ready to accept > connections > @@ sizeof(ControlFileData) = 296 > . > 2023-02-13 16:10:41.997 MSK [9380] ERROR: calculated CRC checksum does not > match value stored in file Thanks. I'd seen reports of this in discussions on the 'net, but those had no authoritative references or supporting test results. The fact that fsync made it take longer (in your following email) makes sense and matches Linux. It inserts a massive pause in the middle of the experiment loop, affecting the probabilities. Therefore, we need a solution for Windows too. I tried to write the equivalent code, in the attached. I learned a few things: Windows locks are mandatory (that is, if you lock a file, reads/writes can fail, unlike Unix). Combined with async release, that means we probably also need to lock the file when reading it in xlog.c:ReadControlFile() (see comments). Before I added that, on a CI run, I found that the read in there would sometimes fail, and adding the locking fixed that.
I am a bit confused, though, because I expected that to be necessary only if someone managed to crash while holding the write lock, which the CI tests shouldn't do. Do you have any ideas? While contemplating what else a mandatory file lock might break, I remembered that basebackup.c also reads the control file. Hrmph. Not addressed yet; I guess it might need to acquire/release around sendFile(sink, XLOG_CONTROL_FILE, ...)? > > I would > > only want to consider this if we also stop choosing "open_datasync" as > > a default on at least Windows. > > I didn't quite understand this point. Could you clarify it for me, please? If > the performance > of UpdateMinRecoveryPoint() wasn't a problem we could just use fsync in all > platforms. The level of durability would be weakened on Windows. On all systems except Linux and FreeBSD, we default to wal_sync_method=open_datasync, so then we would start using FILE_FLAG_WRITE_THROUGH for the control file too. We know from reading and experimentation that FILE_FLAG_WRITE_THROUGH doesn't flush the drive cache on consumer Windows hardware, but its fdatasync-like thing does[2]. I have not thought too hard about the consequences of using different durability levels for different categories of file, but messing with write ordering can't be good for crash recovery, so I'd rather increase WAL durability than decrease control file durability. If a Windows user complains that it makes their fancy non-volati
Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
On Fri, Feb 17, 2023 at 2:59 AM Andres Freund wrote: > > On 2023-02-16 12:37:57 +0530, Robert Haas wrote: > > The patch creates 100_bugs.pl > > What's the story behind 100_bugs.pl? This name clearly is copied from > src/test/subscription/t/100_bugs.pl - but I've never understood why that is > outside of the normal numbering space. > Yeah, I have also previously wondered about this name for src/test/subscription/t/100_bugs.pl. My guess is that it has been kept to distinguish it from the other feature tests which have numbering starting from 001. -- With Regards, Amit Kapila.
Re: Introduce list_reverse() to make lcons() usage less inefficient
On Fri, 17 Feb 2023 at 13:23, Andres Freund wrote: > But wouldn't an even cheaper way here be to iterate over the children in > reverse order when match_partition_order_desc? We can do that efficiently > now. Looks like we don't have a readymade helper for it, but it'd be easy > enough to add or open code. That seems fair. I think open coding is a better option. I had a go at foreach_reverse recently and decided to keep clear of it due to behavioural differences with foreach_delete_current(). I've attached a patch for this. It seems to have similar performance to the list_reverse() version:

$ psql -c "explain (analyze, timing off) select * from lp order by a desc" postgres | grep "Planning Time"
Planning Time: 522.554 ms <- cold relcache
Planning Time: 467.776 ms
Planning Time: 466.424 ms

David

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ae0f9bdc8a..f4d4018960 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1703,6 +1703,9 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 	ListCell   *lcr;
 	bool		match_partition_order;
 	bool		match_partition_order_desc;
+	int			end_element;
+	int			first_element;
+	int			direction;
 
 	/*
 	 * Determine if this sort ordering matches any partition pathkeys we
@@ -1723,10 +1726,31 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		 (!partition_pathkeys_desc_partial &&
 		  pathkeys_contained_in(partition_pathkeys_desc, pathkeys)));
 
+	/*
+	 * When the required pathkeys match the reverse of the partition
+	 * order, we must build the list of paths in reverse starting with the
+	 * last matching partition first.  We can get away without making any
+	 * special cases for this in the loop by just looping backward over
+	 * the child relations in this case.
+	 */
+	if (match_partition_order_desc)
+	{
+		first_element = list_length(live_childrels) - 1;
+		end_element = -1;
+		direction = -1;
+		match_partition_order = true;
+	}
+	else
+	{
+		first_element = 0;
+		end_element = list_length(live_childrels);
+		direction = 1;
+	}
+
 	/* Select the child paths for this ordering... */
-	foreach(lcr, live_childrels)
+	for (int i = first_element; i != end_element; i += direction)
 	{
-		RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
+		RelOptInfo *childrel = list_nth_node(RelOptInfo, live_childrels, i);
 		Path	   *cheapest_startup,
 				   *cheapest_total,
 				   *cheapest_fractional = NULL;
@@ -1822,25 +1846,6 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				fractional_subpaths = lappend(fractional_subpaths, cheapest_fractional);
 			}
 		}
-		else if (match_partition_order_desc)
-		{
-			/*
-			 * As above, but we need to reverse the order of the children,
-			 * because nodeAppend.c doesn't know anything about reverse
-			 * ordering and will scan the children in the order presented.
-			 */
-			cheapest_startup = get_singleton_append_subpath(cheapest_startup);
-			cheapest_total = get_singleton_append_subpath(cheapest_total);
-
-			startup_subpaths = lcons(cheapest_startup, startup_subpaths);
-			total_subpaths = lcons(cheapest_total, total_subpaths);
-
-			if (cheapest_fractional)
-			{
-				cheapest_fractional = get_singleton_append_subpath(cheapest_fractional);
-				fractional_subpaths = lcons(cheapest_fractional, fractional_subpaths);
-			}
-		}
 		else
 		{
 			/*
@@ -1859,7 +1864,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 	}
 
 	/* ... and build the Append or MergeAppend paths */
-
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Hi, On 2023-02-16 21:34:18 -0500, Jonah H. Harris wrote: > On Thu, Feb 16, 2023 at 7:32 PM Andres Freund wrote: > Given not much changed regarding that allocation context IIRC, I’d think > all recents. It was observed in 13, 14, and 15. We did have a fair bit of changes in related code in the last few years, including some in 16. I wouldn't expect them to help *hugely*, but also wouldn't be surprised if it showed. > I’ll have to create one - it was most evident on a TPC-C or sysbench test > using the Postgres, MySQL, SQLite, and Oracle FDWs. It may be reproducible > with pgbench as well. I'd like a workload that hits a perf issue with this, because I think there likely are some general performance improvements that we could make, without changing the initial size or the "growth rate". Perhaps, as a starting point, you could get MemoryContextStats(queryDesc->estate->es_query_cxt) both at the end of standard_ExecutorStart() and at the beginning of standard_ExecutorFinish(), for one of the queries triggering the performance issues? > > If the issue is a specific FDW needing to make a lot of allocations, I can > > see adding an API to tell a memory context that it ought to be ready to > > allocate a certain amount of memory efficiently (e.g. by increasing the > > block size of the next allocation by more than 2x). > > While I’m happy to be wrong, it seems it is an inherent problem not really > specific to FDW implementations themselves but the general expectation that > all FDWs are using more of that context than non-FDW cases for similar > types of operations, which wasn’t really a consideration in that allocation > over time. Lots of things can end up in the query context, it's really not FDW specific for it to be of nontrivial size. E.g. most tuples passed around end up in it. Similar performance issues also exist for plenty of other memory contexts. Which for me is a reason *not* to make it configurable just for CreateExecutorState. 
Or were you proposing to make ALLOCSET_DEFAULT_INITSIZE configurable? That would end up with a lot of waste, I think. The executor context case might actually be a comparatively easy case to address. There's really two "phases" of use for es_query_ctx. First, we create the entire executor tree in it, during standard_ExecutorStart(). Second, during query execution, we allocate things with query lifetime (be that because they need to live till the end, or because they are otherwise managed, like tuples). Even very simple queries end up with multiple blocks at the end. E.g. SELECT relname FROM pg_class WHERE relkind = 'r' AND relname = 'frak'; yields:

ExecutorState: 43784 total in 3 blocks; 8960 free (5 chunks); 34824 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
Grand total: 51976 bytes in 4 blocks; 16888 free (5 chunks); 35088 used

So quite justifiably we could just increase the hardcoded size in CreateExecutorState. I'd expect that starting a few size classes up would help noticeably. But I think we likely could do better here. The amount of memory that ends up in es_query_cxt during "phase 1" strongly correlates with the complexity of the statement, as the whole executor tree ends up in it. Using information about the complexity of the planned statement to influence es_query_cxt's block sizes would make sense to me. I suspect it's a decent enough proxy for "phase 2" as well. Medium-long term I really want to allocate at least all the executor nodes themselves in a single allocation. But that's a bit further out than what we're talking about here. Greetings, Andres Freund
Should CreateExprContext() be using ALLOCSET_DEFAULT_SIZES?
Hi, The thread around https://postgr.es/m/caduqk8uqw5qauqldd-0sbcvzvncre3jmjb9+ydwo_umv_ht...@mail.gmail.com reminded me of the following: ISTM that we really shouldn't use ALLOCSET_DEFAULT_SIZES for expression contexts, as they most commonly see only a few small, or no, allocations. That's true for OLTPish queries, but it is very often true even for analytics queries. Just because I had it loaded, here's the executor state for TPCH-Q01, which is pretty expression heavy:

ExecutorState: 65536 total in 4 blocks; 42512 free (11 chunks); 23024 used
  TupleSort main: 32832 total in 2 blocks; 7320 free (7 chunks); 25512 used
    TupleSort sort: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
      Caller tuples: 8192 total in 1 blocks (9 chunks); 6488 free (0 chunks); 1704 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
Grand total: 139328 bytes in 11 blocks; 88032 free (18 chunks); 51296 used

As you can see very little was allocated in the ExprContexts. ISTM that we could save a reasonable amount of memory by using a smaller initial size. Not so sure if a smaller max size should be used though. Greetings, Andres Freund
Re: proposal: psql: psql variable BACKEND_PID
čt 16. 2. 2023 v 12:49 odesílatel Jelte Fennema napsal: > On Thu, 16 Feb 2023 at 12:44, Pavel Stehule > wrote: > > To find and use pg_backend_pid is not rocket science. But use > :BACKEND_PID is simpler. > > I wanted to call out that if there's a connection pooler (e.g. > PgBouncer) in the middle, then BACKEND_PID (and %p) are incorrect, but > pg_backend_pid() would work for the query. This might be an edge case, > but if BACKEND_PID is added it might be worth listing this edge case > in the docs somewhere. > good note Regards Pavel
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
Hi On 2023-02-17 08:30:09 +0530, Amit Kapila wrote: > Thanks, I was not completely sure about whether we need to bump > XLOG_PAGE_MAGIC for this patch as this makes the additional space just > by changing the datatype of one of the members of the existing WAL > record. We normally change it for the addition/removal of new fields > aka change in WAL record format, or addition of a new type of WAL > record. Does anyone else have an opinion/suggestion on this matter? I'd definitely change it - a change in the width of a field still means you can't really parse the old WAL sensibly anymore. Greetings, Andres Freund
Re: Support logical replication of DDLs
> > I've implemented a prototype to allow replicated objects to have the > > same owner from the publisher in > > v69-0008-Allow-replicated-objects-to-have-the-same-owner-from.patch. > > I also think it would be a helpful addition for users. Thanks for supporting this addition. > A few points > that come to my mind are: (a) Shouldn't the role have the same > privileges (for ex. rolbypassrls or rolsuper) on both sides before we > allow this? (b) Isn't it better to first have a replication of roles? > I think if we have (b) then it would be probably a bit easier because > if the subscription has allowed replicating roles and we can confirm > that the role is replicated then we don't need to worry about the > differences. Yes, having role replication will help further reduce the manual effort. But even if we don't end up doing role replication soon, I think we can still provide this subscription option (match_ddl_owner, off by default) and document that the same roles need to exist on both sides for it to work. > Now, coming to implementation, won't it be better if we avoid sending > the owner to the subscriber unless it is changed for the replicated > command? Consider the current case of tables where we send schema only > if it is changed. This is not a direct mapping but it would be better > to avoid sending additional information and then process it on the > subscriber for each command. Right, we can do some optimization here: only send the owner for commands that create objects (CREATE TABLE/FUNCTION/INDEX etc.) Note that ALTER TABLE/OBJECT OWNER TO is replicated, so we don't need to worry about owner changes. Regards, Zane
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Fri, 17 Feb 2023 at 16:40, Andres Freund wrote: > I'd like a workload that hits a perf issue with this, because I think there > likely are some general performance improvements that we could make, without > changing the initial size or the "growth rate". I didn't hear it mentioned explicitly here, but I suspect it's faster when increasing the initial size due to the memory context caching code that reuses aset MemoryContexts (see context_freelists[] in aset.c). Since we reset the context before caching it, then it'll remain fast when we can reuse a context, provided we don't need to do a malloc for an additional block beyond the initial block that's kept in the cache. Maybe we should think of a more general-purpose way of doing this caching which just keeps a global-to-the-process dclist of blocks laying around. We could see if we have any free blocks both when creating the context and also when we need to allocate another block. I see no reason why this couldn't be shared among the other context types rather than keeping this cache stuff specific to aset.c. slab.c might need to be pickier if the size isn't exactly what it needs, but generation.c should be able to make use of it the same as aset.c could. I'm unsure what'd we'd need in the way of size classing for this, but I suspect we'd need to pay attention to that rather than do things like hand over 16MBs of memory to some context that only wants a 1KB initial block. David
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Thu, Feb 16, 2023 at 11:26 PM David Rowley wrote: > I didn't hear it mentioned explicitly here, but I suspect it's faster > when increasing the initial size due to the memory context caching > code that reuses aset MemoryContexts (see context_freelists[] in > aset.c). Since we reset the context before caching it, then it'll > remain fast when we can reuse a context, provided we don't need to do > a malloc for an additional block beyond the initial block that's kept > in the cache. This is what we were seeing. The larger initial size reduces/eliminates the multiple smaller blocks that are malloced and freed in each per-query execution. Maybe we should think of a more general-purpose way of doing this > caching which just keeps a global-to-the-process dclist of blocks > laying around. We could see if we have any free blocks both when > creating the context and also when we need to allocate another block. > I see no reason why this couldn't be shared among the other context > types rather than keeping this cache stuff specific to aset.c. slab.c > might need to be pickier if the size isn't exactly what it needs, but > generation.c should be able to make use of it the same as aset.c > could. I'm unsure what'd we'd need in the way of size classing for > this, but I suspect we'd need to pay attention to that rather than do > things like hand over 16MBs of memory to some context that only wants > a 1KB initial block. Yeah. There’s definitely a smarter and more reusable approach than I was proposing. A lot of that code is fairly mature and I figured more people wouldn’t want to alter it in such ways - but I’m up for it if an approach like this is the direction we’d want to go in. -- Jonah H. Harris
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Fri, 17 Feb 2023 at 17:40, Jonah H. Harris wrote: > Yeah. There’s definitely a smarter and more reusable approach than I was > proposing. A lot of that code is fairly mature and I figured more people > wouldn’t want to alter it in such ways - but I’m up for it if an approach > like this is the direction we’d want to go in. I've spent quite a bit of time in this area recently and I think that context_freelists[] is showing its age now. It does seem that slab and generation were added before context_freelists[] (9fa6f00b), but not by much, and those new contexts had fewer users back then. It feels a little unfair that aset should get to cache but the other context types don't. I don't think each context type should have some separate cache either as that probably means more memory wasted. Having something agnostic to if it's allocating a new context or adding a block to an existing one seems like a good idea to me. I think the tricky part will be the discussion around which size classes to keep around and in which cases can we use a larger allocation without worrying too much that it'll be wasted. We also don't really want to make the minimum memory that a backend can keep around too bad. Patches such as [1] are trying to reduce that. Maybe we can just keep a handful of blocks of 1KB, 8KB and 16KB around, or more accurately put, ALLOCSET_SMALL_INITSIZE, ALLOCSET_DEFAULT_INITSIZE and ALLOCSET_DEFAULT_INITSIZE * 2, so that it works correctly if someone adjusts those definitions. I think you'll want to look at what the maximum memory a backend can keep around in context_freelists[] and not make the worst-case memory consumption worse than it is today. I imagine this would be some new .c file in src/backend/utils/mmgr which aset.c, generation.c and slab.c each call a function from to see if we have any cached blocks of that size. 
You'd want to call that in all places we call malloc() from those files apart from when aset.c and generation.c malloc() for a dedicated block. You can probably get away with replacing all of the free() calls with a call to another function where you pass the pointer and the size of the block to have it decide if it's going to free() it or cache it. I doubt you need to care too much if the block is from a dedicated allocation or a normal block. We'd just always free() if it's not in the size classes that we care about. David [1] https://commitfest.postgresql.org/42/3867/
Re: Move defaults toward ICU in 16?
On Thu, Feb 16, 2023 at 9:45 PM Andres Freund wrote: > On 2023-02-16 15:05:10 +0530, Robert Haas wrote: > > The fact that we can't use ICU on Windows, though, weakens this > > argument a lot. In my experience, we have a lot of Windows users, and > > they're not any happier with the operating system collations than > > Linux users. Possibly less so. > > Why can't you use ICU on windows? It works today, afaict? Uh, I had the contrary impression from the discussion upthread, but it sounds like I might be misunderstanding the situation? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Support logical replication of global object commands
> > > Actually, I intend something for global objects. But the main thing
> > > that is worrying me about this is that we don't have a clean way to
> > > untie global object replication from database-specific object
> > > replication.
> >
> > I think ultimately we need a clean and efficient way to publish (and
> > subscribe to) any changes in all databases, preferably in one logical
> > replication slot.
>
> Agreed. I was thinking currently for logical replication both
> walsender and slot are database-specific. So we need a way to
> distinguish the WAL for global objects and then avoid filtering based
> on the slot's database during decoding.

But which walsender should handle the WAL for global objects if we don't filter by database? Is there any specific problem you see with decoding global object commands in a database-specific walsender?

> I also thought about whether
> we want to have a WALSender that is not connected to a database for
> the replication of global objects but I couldn't come up with a reason
> for doing so.

Do you have any thoughts on this matter?

Regards,
Zane
Re: Move defaults toward ICU in 16?
Hi,

On February 16, 2023 9:14:05 PM PST, Robert Haas wrote:
> On Thu, Feb 16, 2023 at 9:45 PM Andres Freund wrote:
>> On 2023-02-16 15:05:10 +0530, Robert Haas wrote:
>> > The fact that we can't use ICU on Windows, though, weakens this
>> > argument a lot. In my experience, we have a lot of Windows users, and
>> > they're not any happier with the operating system collations than
>> > Linux users. Possibly less so.
>>
>> Why can't you use ICU on windows? It works today, afaict?
>
> Uh, I had the contrary impression from the discussion upthread, but it
> sounds like I might be misunderstanding the situation?

That was about the build environment in CI / cfbot, I think. Jeff was making ICU a hard requirement by default, but ICU wasn't installed in a usable way, so the build failed. The patch he referred to was just building ICU during the CI run.

I do remember encountering issues with the mkvcbuild.pl build not building against a downloaded modern ICU build, but that was just about library naming or directory structure, or some such.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Move defaults toward ICU in 16?
On Fri, Feb 17, 2023 at 10:44:05AM +0530, Robert Haas wrote:
> Uh, I had the contrary impression from the discussion upthread, but it
> sounds like I might be misunderstanding the situation?

IMO, it would be nice to have the automatic detection of meson work in the CFbot to see how this patch goes. Perhaps that's not reason enough to hold back this patch, though..

Separate question: what's the state of the Windows installers provided by the community regarding libicu? Is it bundled in the MSI?

--
Michael

signature.asc
Description: PGP signature
should get_extension_schema be visible?
Hi,

More than once I have needed to get an extension's assigned namespace. There is already a suitable function, get_extension_schema, but it is static. I need to find a function with a known name, but a possibly unknown schema, belonging to a known extension.

Regards

Pavel