How to send patch with so many files changes?

2024-09-24 Thread Tony Wayne
Hi hackers,
I am new to contributing to Postgres. If we want to send a patch that
contains an extension and also has changes in the PG source, what's the
way to do it?
It seems odd, right? If we are writing an extension, then why change PG?

This is the 1st version of the idea; if it goes well 🤞 we can move the
changes from the extension into PG.

regards,
Tony Wayne


Re: How to send patch with so many files changes?

2024-09-24 Thread Tony Wayne
On Wed, Sep 25, 2024 at 6:02 AM Tony Wayne 
wrote:

>
> I am new to contributing to Postgres. If we want to send a patch that
> contains an extension and also has changes in the PG source, what's the
> way to do it?
>
> is git diff enough?


Re: race condition in pg_class

2024-09-24 Thread Noah Misch
On Thu, Sep 19, 2024 at 02:33:46PM -0700, Noah Misch wrote:
> On Mon, Sep 09, 2024 at 10:55:32AM +0530, Nitin Motiani wrote:
> > On Sat, Sep 7, 2024 at 12:25 AM Noah Misch  wrote:
> > > https://commitfest.postgresql.org/49/5090/ remains in status="Needs 
> > > review".
> > > When someone moves it to status="Ready for Committer", I will commit
> > > inplace090, inplace110, and inplace120 patches.  If one of you is 
> > > comfortable
> > > with that, please modify the status.
> > 
> > Done.
> 
> FYI, here are the branch-specific patches.  I plan to push these after the v17
> release freeze lifts next week.

Pushed, but the pushes contained at least one defect:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=akepa&dt=2024-09-24%2022%3A29%3A02

I will act on that and other buildfarm failures that show up.




Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges

2024-09-24 Thread Michael Banck
Hi,

On Tue, Oct 24, 2023 at 11:42:15AM -0400, Tom Lane wrote:
> Christoph Berg  writes:
> > Anyway, if this doesn't raise any "oh we didn't think of this"
> > concerns, we'll just remove the old operators in pgsphere.
> 
> Well, the idea was exactly to forbid that sort of setup.
> However, if we get sufficient pushback maybe we should
> reconsider --- for example, maybe it'd be sane to enforce
> the restriction in ALTER but not CREATE?
> 
> I'm inclined to wait and see if there are more complaints.

FWIW, rdkit also fails, but that seems to be an ancient thing as well:

https://github.com/rdkit/rdkit/issues/7843

I guess there's no way to make that error a bit more helpful, like
printing out the offending SQL command, presumably because we are
loading an extension?


Michael




Re: Add contrib/pg_logicalsnapinspect

2024-09-24 Thread Peter Smith
On Wed, Sep 25, 2024 at 2:51 AM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Mon, Sep 23, 2024 at 12:27:27PM +1000, Peter Smith wrote:
> > My review comments for v8-0001
> >
> > ==
> > contrib/pg_logicalinspect/pg_logicalinspect.c
> >
> > 1.
> > +/*
> > + * Lookup table for SnapBuildState.
> > + */
> > +
> > +#define SNAPBUILD_STATE_INCR 1
> > +
> > +static const char *const SnapBuildStateDesc[] = {
> > + [SNAPBUILD_START + SNAPBUILD_STATE_INCR] = "start",
> > + [SNAPBUILD_BUILDING_SNAPSHOT + SNAPBUILD_STATE_INCR] = "building",
> > + [SNAPBUILD_FULL_SNAPSHOT + SNAPBUILD_STATE_INCR] = "full",
> > + [SNAPBUILD_CONSISTENT + SNAPBUILD_STATE_INCR] = "consistent",
> > +};
> > +
> > +/*
> >
> > nit - the SNAPBUILD_STATE_INCR made this code appear more complicated
> > than it is. Please take a look at the attachment for an alternative
> > implementation which includes an explanatory comment. YMMV. Feel free
> > to ignore it.
>
> Thanks for the feedback!
>
> I like the comment, so added it in v9 attached. OTOH I think it's better
> to keep SNAPBUILD_STATE_INCR as those "+1" are all linked, and it would be
> easy to miss the one in pg_get_logical_snapshot_info() should we change the
> increment in the future.
>

I see SNAPBUILD_STATE_INCR more as an "offset" (to get the lowest enum
value to be at lookup index [0]) than an "increment" (between the enum
values), so I'd be naming that differently. But, maybe I am straying
into just personal opinion instead of giving useful feedback, so let's
say I have no more review comments. Patch v9 looks OK to me.
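
To illustrate what I mean by "offset", here is a minimal standalone sketch; the
enum is a hand copy of SnapBuildState from snapbuild.h, and
SNAPBUILD_STATE_OFFSET is just my suggested name, not anything from the patch:

```
#include <stdio.h>

/*
 * Stand-in copy of the SnapBuildState enum from snapbuild.h; the real enum
 * starts at -1, which is why an increment/offset is needed at all.
 */
typedef enum
{
	SNAPBUILD_START = -1,
	SNAPBUILD_BUILDING_SNAPSHOT = 0,
	SNAPBUILD_FULL_SNAPSHOT = 1,
	SNAPBUILD_CONSISTENT = 2,
} SnapBuildState;

/* "Offset" naming: shift the lowest enum value down to array index [0]. */
#define SNAPBUILD_STATE_OFFSET (-SNAPBUILD_START)

static const char *const SnapBuildStateDesc[] = {
	[SNAPBUILD_START + SNAPBUILD_STATE_OFFSET] = "start",
	[SNAPBUILD_BUILDING_SNAPSHOT + SNAPBUILD_STATE_OFFSET] = "building",
	[SNAPBUILD_FULL_SNAPSHOT + SNAPBUILD_STATE_OFFSET] = "full",
	[SNAPBUILD_CONSISTENT + SNAPBUILD_STATE_OFFSET] = "consistent",
};

int
main(void)
{
	SnapBuildState state = SNAPBUILD_CONSISTENT;

	printf("%s\n", SnapBuildStateDesc[state + SNAPBUILD_STATE_OFFSET]);
	return 0;
}
```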

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: How to send patch with so many files changes?

2024-09-24 Thread David G. Johnston
On Tue, Sep 24, 2024 at 5:37 PM Tony Wayne 
wrote:

>
> On Wed, Sep 25, 2024 at 6:02 AM Tony Wayne 
> wrote:
>
>>
>> I am new to contributing to Postgres. If we want to send a patch that
>> contains an extension and also has changes in the PG source, what's the
>> way to do it?
>>
>> is git diff enough?
>

Usually you'd want to use format-patch so your commit message(s) make it
into the artifact.  Especially for something complex/large.

David J.


Re: How to send patch with so many files changes?

2024-09-24 Thread David G. Johnston
On Tue, Sep 24, 2024 at 5:32 PM Tony Wayne 
wrote:

>
> This is the 1st version of the idea; if it goes well 🤞 we can move the
> changes from the extension into PG.
>
>
If you are saying you are planning to add something to the contrib directory,
then you should just post the patch(es) that do it.  Your ability to make
it digestible will highly influence whether anyone is willing to review it.

If this isn't intended for core or contrib, you are not in the correct
place.  If you wish to share an external extension you are publishing, the
-general channel would be the place to discuss such things.

David J.


Re: AIO writes vs hint bits vs checksums

2024-09-24 Thread Thomas Munro
On Wed, Sep 25, 2024 at 8:30 AM Andres Freund  wrote:
> Just ran that. There probably is a performance difference, but it's small
> (<0.5%) making it somewhat hard to be certain. It looks like the main reason
> for that is ConditionVariableBroadcast() on the iocv shows up even though
> nobody is waiting.

. o O { Gotta fix that.  Memory barriers might be enough to check for
empty wait list?, and even in the slow path, atomic wait lists or
something better than spinlocks... }

> However, our habit of modifying buffers while IO is going on is
> causing issues with filesystem level checksums as well, as evidenced by the
> fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
> tend to think it'd be better to just stop doing that altogether (we also do
> that for WAL, when writing out a partial page, but a potential fix there would
> be different, I think).

+many.  Interesting point re the WAL variant.  For the record, here's
some discussion and a repro for that problem, which Andrew currently
works around in a build farm animal with mount options:

https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com




Re: How to send patch with so many files changes?

2024-09-24 Thread Tony Wayne
These changes are for core. I think it would be better to either move the
whole set of changes into core, or into contrib as an extension.

On Wed, Sep 25, 2024 at 6:09 AM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Tue, Sep 24, 2024 at 5:37 PM Tony Wayne 
> wrote:
>
>>
>> On Wed, Sep 25, 2024 at 6:02 AM Tony Wayne 
>> wrote:
>>
>>>
>>> I am new to contributing to Postgres. If we want to send a patch that
>>> contains an extension and also has changes in the PG source, what's the
>>> way to do it?
>>>
>>> is git diff enough?
>>
>
> Usually you'd want to use format-patch so your commit message(s) make it
> into the artifact.  Especially for something complex/large.
>
> David J.
>


Re: How to send patch with so many files changes?

2024-09-24 Thread Michael Paquier
On Wed, Sep 25, 2024 at 06:19:39AM +0530, Tony Wayne wrote:
> These changes are for core. I think it would be better to either move the
> whole set of changes into core, or into contrib as an extension.

Please avoid top-posting.  The community mailing lists use
bottom-posting, to ease discussions.  See:
https://en.wikipedia.org/wiki/Posting_style#Bottom-posting

> On Wed, Sep 25, 2024 at 6:09 AM David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>> Usually you'd want to use format-patch so your commit message(s) make it
>> into the artifact.  Especially for something complex/large.

The community wiki has some guidelines about all that:
https://wiki.postgresql.org/wiki/Submitting_a_Patch

In my experience, it is much easier to sell a feature to the community
if a patch is organized into independent, useful pieces, with the
refactoring pieces presented on top of the actual feature.  In order
to achieve that, `git format-patch` is essential because it lets you
present a patch set that organizes your ideas, so that others need to
spend less time figuring out what the patch set is doing when they
review it.  format-patch with `git am` is also quite good at tracking
the addition of new files or the removal of old files.  Writing your
ideas in the commit logs can also bring a lot of insight for anybody
reading your patches.

For simpler and localized changes, something like git diff that can be
applied with a simple `patch` command is also fine.  I've done plenty
of work with patches sent to the lists this way for bug fixes.  Of
course this is case-by-case; for rather complex bug fixes, format-patch
can still save a lot of time when reading somebody else's ideas on a
specific matter.
--
Michael




Re: AIO writes vs hint bits vs checksums

2024-09-24 Thread Noah Misch
On Tue, Sep 24, 2024 at 04:30:25PM -0400, Andres Freund wrote:
> On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
> > On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> > > Besides that, the need to copy the buffers makes checkpoints with AIO
> > > noticeably slower when checksums are enabled - it's not the checksum but 
> > > the
> > > copy that's the biggest source of the slowdown.

How big is that copy's contribution to the slowdown there?  A measurable CPU
overhead on writes likely does outweigh the unmeasurable overhead on index
scans, but ...

> > > Does this sound like a reasonable idea?  Counterpoints?

> > How should we think about comparing the distributed cost of the buffer
> > header manipulations during index scans vs. the costs of bounce buffers?
> 
> Well, the cost of bounce buffers would be borne as long as postgres is up,
> whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
> really show up.

... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision.  From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers.  I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability.  Bounce
buffers have an implementation.  They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.

> Zooming out (a lot) more: I like the idea of having a way to get the
> permission to perform some kinds of modifications on a page without an
> exlusive lock. While obviously a lot more work, I do think there's some
> potential to have some fast-paths that perform work on a page level without
> blocking out readers. E.g. some simple cases of insert could correctly be done
> without blocking out readers (by ordering the update of the max page offset

True.




Re: Add on_error and log_verbosity options to file_fdw

2024-09-24 Thread Fujii Masao




On 2024/09/24 20:08, torikoshia wrote:

Thanks for the explanation and suggestion.
Since there is almost the same code in copyfrom.c, I attached the 0003 patch
to refactor both.


Thanks for updating the patches!

Regarding 0002.patch, I think it’s better to include the refactored code
from the start rather than adding redundant code intentionally.
How about leaving just the refactor in copyfrom.c to 0003.patch?
If that works, as a refactoring, you could also replace "skipped" with
"cstate->num_errors" in that patch, as you suggested earlier.

While reviewing again, I noticed that running ANALYZE on a file_fdw
foreign table also calls NextCopyFrom(), but it doesn’t seem to
skip erroneous rows when on_error is set to "ignore." This could lead
to inaccurate statistics. Shouldn’t ANALYZE on file_fdw foreign tables
with on_error=ignore also skip erroneous rows?


The tab-completion needs to be updated to support the "silent" option?


Yes, updated 0002 patch.


Thanks! Also, this should be part of 0001.patch since "silent" is
introduced there, right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: pgsql: Improve default and empty privilege outputs in psql.

2024-09-24 Thread Tom Lane
Christoph Berg  writes:
> Re: Tom Lane
>> Improve default and empty privilege outputs in psql.

> I'm sorry to report this when 17.0 has already been wrapped, but this
> change is breaking `psql -l` against 9.3-or-earlier servers:
> ERROR:  42883: function pg_catalog.cardinality(aclitem[]) does not exist

Grumble.  Well, if that's the worst bug in 17.0 we should all be
very pleased indeed ;-).  I'll see about fixing it after the
release freeze lifts.

> The psql docs aren't really explicit about which old versions are
> still supported, but there's some mentioning that \d should work back
> to 9.2:

Yes, that's the expectation.  I'm sure we can think of a more
backwards-compatible way to test for empty datacl, though.

regards, tom lane




Re: Add llvm version into the version string

2024-09-24 Thread Andres Freund
Hi,

On 2024-09-24 13:53:49 +0100, Alastair Turner wrote:
> Since the build and runtime versions may differ for some things like llvm,
> libxml2 and the interpreters behind some of the PLs, it may be valuable to
> expand the view and show two values - a build time (or configure time)
> value and a runtime value for these.

+1

Somewhat orthogonal: I've wondered before whether it'd be useful to have a
view showing the file path and perhaps the soversion of libraries dynamically
loaded into postgres. That's currently hard to figure out over a connection
(which is all a lot of our users have access to).
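
As a toy illustration of where such information could come from on Linux, here
is a standalone sketch using dl_iterate_phdr(); this is not a proposal for how
the view would be implemented, just the kind of data the OS already exposes:

```
/*
 * Toy sketch only (Linux/glibc), not PostgreSQL code: dl_iterate_phdr()
 * reports the paths of the shared objects loaded into the current process,
 * which is the kind of information such a view could surface.
 */
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

static int
print_shared_object(struct dl_phdr_info *info, size_t size, void *data)
{
	(void) size;
	(void) data;

	/* The main executable shows up with an empty name; skip it. */
	if (info->dlpi_name != NULL && info->dlpi_name[0] != '\0')
		printf("%s\n", info->dlpi_name);
	return 0;					/* keep iterating */
}

int
main(void)
{
	dl_iterate_phdr(print_shared_object, NULL);
	return 0;
}
```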

Greetings,

Andres Freund




RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-09-24 Thread Amonson, Paul D
Hi all,

I will be retiring from Intel at the end of this week. I wanted to introduce 
the engineer who will be taking over the CRC32c proposal and commit fest entry.

Devulapalli, Raghuveer 

I have brought him up to speed and he will be the go-to for technical review 
comments and questions. Please welcome him into the community.

Thanks,
Paul





Re: Conflict detection for update_deleted in logical replication

2024-09-24 Thread Masahiko Sawada
On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila  wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
> >
>
> I think you meant to say vacuum_committs_age idea.
>
> > we’ve given more thought to our
> > approach for retaining dead tuples and have come up with another idea that 
> > can
> > reliably detect conflicts without requiring users to choose a wise value for
> > the vacuum_committs_age. This new idea could also reduce the performance
> > impact. Thanks a lot to Amit for off-list discussion.
> >
> > The concept of the new idea is that, the dead tuples are only useful to 
> > detect
> > conflicts when applying *concurrent* transactions from remotes. Any 
> > subsequent
> > UPDATE from a remote node after removing the dead tuples should have a later
> > timestamp, meaning it's reasonable to detect an update_missing scenario and
> > convert the UPDATE to an INSERT when applying it.
> >
> > To achieve above, we can create an additional replication slot on the
> > subscriber side, maintained by the apply worker. This slot is used to retain
> > the dead tuples. The apply worker will advance the slot.xmin after 
> > confirming
> > that all the concurrent transactions on the publisher have been applied locally.
> >
> > The process of advancing the slot.xmin could be:
> >
> > 1) the apply worker call GetRunningTransactionData() to get the
> > 'oldestRunningXid' and consider this as 'candidate_xmin'.
> > 2) the apply worker send a new message to walsender to request the latest 
> > wal
> > flush position(GetFlushRecPtr) on publisher, and save it to
> > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message 
> > or
> > extend the existing keepalive message(e,g extends the requestReply bit in
> > keepalive message to add a 'request_wal_position' value)
> > 3) The apply worker can continue to apply changes. After applying all the 
> > WALs
> > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all concurrent
> > transactions have been applied. It can be effective for both bidirectional 
> > and
> > non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples) to
> > control whether this feature is enabled. Each subscription intending to 
> > detect
> > update-delete conflicts should set retain_dead_tuples to true.
> >
>
> As each apply worker needs a separate slot to retain deleted rows, the
> requirement for slots will increase. The other possibility is to
> maintain one slot by launcher or some other central process that
> traverses all subscriptions, remember the ones marked with
> retain_dead_rows (let's call this list as retain_sub_list). Then using
> running_transactions get the oldest running_xact, and then get the
> remote flush location from the other node (publisher node) and store
> those as candidate values (candidate_xmin and
> candidate_remote_wal_lsn) in slot. We can probably reuse existing
> candidate variables of the slot. Next, we can check the remote_flush
> locations from all the origins corresponding in retain_sub_list and if
> all are ahead of candidate_remote_wal_lsn, we can update the slot's
> xmin to candidate_xmin.

Yeah, I think that such an idea to reduce the number required slots
would be necessary.

>
> I think in the above idea we can add an optimization to combine the
> request for remote wal LSN from different subscriptions pointing to
> the same node to avoid sending multiple requests to the same node. I
> am not sure if using pg_subscription.subconninfo is sufficient for
> this, if not we can probably leave this optimization.
>
> If this idea is feasible then it would reduce the number of slots
> required to retain the deleted rows but the launcher needs to get the
> remote wal location corresponding to each publisher node. There are
> two ways to achieve that (a) launcher requests one of the apply
> workers corresponding to subscriptions pointing to the same publisher
> node to get this information; (b) launcher launches another worker to
> get the remote wal flush location.

I think the remote wal flush location is asked using a replication
protocol. Therefore, if a new worker is responsible for asking wal
flush location from multiple publishers (like the idea (b)), the
corresponding process would need to be launched on publisher sides and
logical replication would also need to start on each connection. I
think it would be better to get the remote wal flush location using
the existing logical replication connection (i.e., between the logical
wal sender and the apply worker), and advertise the locations on the
shared memory. Then, the central process who holds the slot to retain
the deleted row versions traverses them and increases slot.xmin if
possible.

The cost of requesting the remote wal flush location would not be huge

Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-24 Thread Tom Lane
Peter Eisentraut  writes:
> On 23.09.24 22:51, Shayon Mukherjee wrote:
>> I am happy to draft a patch for this as well. I think I have a working
>> idea so far of where the necessary checks might go. However if you don’t
>> mind, can you elaborate further on how the effect would be similar to
>> enable_indexscan?

> Planner settings like enable_indexscan used to just add a large number 
> (disable_cost) to the estimated plan node costs.  It's a bit more 
> sophisticated in PG17.  But in any case, I imagine "disabling an index" 
> could use the same mechanism.  Or maybe not, maybe the setting would 
> just completely ignore the index.

Yeah, I'd be inclined to implement this by having create_index_paths
just not make any paths for rejected indexes.  Or you could back up
another step and keep plancat.c from building IndexOptInfos for them.
The latter might have additional effects, such as preventing the plan
from relying on a uniqueness condition enforced by the index.  Not
clear to me if that's desirable or not.

[ thinks... ]  One good reason for implementing it in plancat.c is
that you'd have the index relation open and be able to see its name
for purposes of matching to the filter.  Anywhere else, getting the
name would involve additional overhead.

regards, tom lane




Re: [PATCH] Add native windows on arm64 support

2024-09-24 Thread Dave Cramer
On Tue, 13 Feb 2024 at 16:28, Dave Cramer  wrote:

>
>
>
> On Tue, 13 Feb 2024 at 12:52, Andres Freund  wrote:
>
>> Hi,
>>
>> On 2024-02-13 12:49:33 -0500, Dave Cramer wrote:
>> > > I think I might have been on to something - if my human emulation of a
>> > > preprocessor isn't wrong, we'd end up with
>> > >
>> > > #define S_UNLOCK(lock)  \
>> > > do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
>> > >
>> > > on msvc + arm. And that's entirely insufficient - _ReadWriteBarrier()
>> just
>> > > limits *compiler* level reordering, not CPU level reordering.  I
>> think it's
>> > > even insufficient on x86[-64], but it's definitely insufficient on
>> arm.
>> > >
>> > In fact ReadWriteBarrier has been deprecated _ReadWriteBarrier |
>> Microsoft
>> > Learn
>> > <
>> https://learn.microsoft.com/en-us/cpp/intrinsics/readwritebarrier?view=msvc-170
>> >
>>
>> I'd just ignore that, that's just pushing towards more modern stuff that's
>> more applicable to C++ than C.
>>
>>
>> > I did try using atomic_thread_fence as per atomic_thread_fence -
>> > cppreference.com
>> > 
>>
>> The semantics of atomic_thread_fence are, uh, very odd.  I'd just use
>> MemoryBarrier().
>>
>> #define S_UNLOCK(lock)  \
> do { MemoryBarrier(); (*(lock)) = 0; } while (0)
>
> #endif
>
> Has no effect.
>
> I have no idea if that is what you meant that I should do ?
>
> Dave
>


Revisiting this:

Andrew, can you explain the difference between ninja test (which passes)
and what the build farm does? The buildfarm crashes.

Dave


Re: [PATCH] Support Int64 GUCs

2024-09-24 Thread Aleksander Alekseev
Hi, Alexander!

> Thank you for your work on this subject.

Thanks for your feedback.

> It doesn't look like these *_age GUCs could benefit from being 64-bit,
> before 64-bit transaction ids get in.  However, I think there are some
> better candidates.
>
> autovacuum_vacuum_threshold
> autovacuum_vacuum_insert_threshold
> autovacuum_analyze_threshold
>
> This GUCs specify number of tuples before vacuum/analyze.  That could
> be more than 2^31.  With large tables of small tuples, I can't even
> say this is always impractical to have values greater than 2^31.

Sounds good to me. Fixed.

> > Secondly, DefineCustomInt64Variable() is not test-covered. Turned out
> > it was not even defined (although declared) in the original patch.
> > This was fixed in the attached version. Maybe one of the test modules
> > could use it even if it makes little sense for this particular module?
> > For instance, test/modules/worker_spi/ could use it for
> > worker_spi.naptime.
>
> I don't think there are good candidates among existing extension GUCs.
> I think we could add something for pure testing purposes somewhere in
> src/test/modules.

I found a great candidate in src/test/modules/delay_execution:

```
DefineCustomIntVariable("delay_execution.post_planning_lock_id",
"Sets the advisory lock ID to be
locked/unlocked after planning.",
```

Advisory lock IDs are bigints [1]. I modified the module to use Int64's.

I guess it may also answer Nathan's question.

> > Last but not least, large values like 12345678912345 could be
> > difficult to read. Perhaps we should also support 12_345_678_912_345
> > syntax? This is not implemented in the attached patch and arguably
> > could be discussed separately when and if we merge it.
>
> I also think we're good with 12345678912345 so far.

Fair enough.

PFA the updated patch.

[1]: 
https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS

--
Best regards,
Aleksander Alekseev


v2-0001-Support-64-bit-integer-GUCs.patch
Description: Binary data


Re: Add contrib/pg_logicalsnapinspect

2024-09-24 Thread Bertrand Drouvot
Hi,

On Tue, Sep 24, 2024 at 09:15:31AM +0530, shveta malik wrote:
> On Fri, Sep 20, 2024 at 12:22 PM Bertrand Drouvot
>  wrote:
> >
> >
> > Please find attached v8, that:
> >
> 
> Thank you for the patch. In one of my tests, I noticed that I got a
> negative checksum:
> 
> postgres=# SELECT * FROM pg_get_logical_snapshot_meta('0/3481F20');
>magic|  checksum  | version
> ++-
>  1369563137 | -266346460 |   6
> 
> But pg_crc32c is uint32. Is it because we are getting it as
> Int32GetDatum(ondisk.checksum) in pg_get_logical_snapshot_meta()?
> Instead should it be UInt32GetDatum?

Thanks for the testing.

As the checksum could be > 2^31 - 1, v9 (just shared up-thread) changes it
to an int8 in the pg_logicalinspect--1.0.sql file. So, to avoid a CI failure on
the 32-bit build, v9 uses Int64GetDatum() instead of UInt32GetDatum().
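
The effect is easy to reproduce outside the backend; here is a small standalone
sketch, with plain C casts standing in for Int32GetDatum()/Int64GetDatum() and
the int4/int8 SQL types, showing why the checksum in the test above came out
as -266346460:

```
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* The unsigned CRC32C value behind the -266346460 seen above. */
	uint32_t	checksum = 4028620836u;

	/*
	 * What the v8 code effectively did: squeeze the value through a signed
	 * 32-bit path (Int32GetDatum + an int4 column), which flips it negative
	 * on the usual two's-complement platforms.
	 */
	int32_t		as_int32 = (int32_t) checksum;

	/*
	 * What v9 does instead: widen to 64 bits (Int64GetDatum + an int8
	 * column), which preserves the full uint32 range.
	 */
	int64_t		as_int64 = (int64_t) checksum;

	printf("raw crc32c value:     %lu\n", (unsigned long) checksum);
	printf("through int32 / int4: %ld\n", (long) as_int32);
	printf("through int64 / int8: %lld\n", (long long) as_int64);
	return 0;
}
```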

> Same goes for below:
> values[i++] = Int32GetDatum(ondisk.magic);
> values[i++] = Int32GetDatum(ondisk.magic);

The two other fields (magic and version) are unlikely to be > 2^31 - 1, so v9
makes use of UInt32GetDatum() and keeps int4 in the sql file.

> We need to recheck the rest of the fields in the info() function as well.

I think that pg_get_logical_snapshot_info()'s fields are OK (I did spend some
time debugging CI failures on the 32-bit build for some of them before
submitting v1).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: not null constraints, again

2024-09-24 Thread Alvaro Herrera
On 2024-Sep-24, jian he wrote:

> sql-createtable.html
> SECTION: LIKE source_table [ like_option ... ]
> INCLUDING CONSTRAINTS
> CHECK constraints will be copied. No distinction is made between
> column constraints and table constraints. Not-null constraints are
> always copied to the new table.
> 
> drop table if exists t, t_1,ssa;
> create table t(a int, b int, not null a no inherit);
> create table ssa (like t INCLUDING all);
> 
> Here CREATE TABLE ... LIKE won't include the NO INHERIT not-null constraint,
> which seems to conflict with the doc?

Hmm, actually I think this is a bug, because if you have a CHECK
constraint with NO INHERIT, it will be copied:

create table t (a int check (a > 0) no inherit);
create table ssa (like t including constraints);

55490 18devel 141626=# \d+ ssa
                          Table "public.ssa"
 Column │  Type   │ Collation │ Nullable │ Default │ Storage >
────────┼─────────┼───────────┼──────────┼─────────┼─────────>
 a      │ integer │           │          │         │ plain   >
Check constraints:
    "t_a_check" CHECK (a > 0) NO INHERIT
Access method: heap

It seems that NOT NULL constraint should behave the same as CHECK
constraints in this regard, i.e., we should not heed NO INHERIT in this
case.


I have made these changes and added some tests, and will be posting a v5
shortly.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
 really, I see PHP as like a strange amalgamation of C, Perl, Shell
 inflex: you know that "amalgam" means "mixture with mercury",
   more or less, right?
 i.e., "deadly poison"




Re: Cleaning up ERRCODE usage in our XML code

2024-09-24 Thread Tom Lane
Daniel Gustafsson  writes:
> On 23 Sep 2024, at 19:17, Tom Lane  wrote:
>> Yeah, I looked at that but wasn't sure what to do with it.  We should
>> have validated the decl header when the XML value was created, so if
>> we get here then either the value got corrupted on-disk or in-transit,
>> or something forgot to do that validation, or libxml has changed its
>> mind since then about what's a valid decl.  At least some of those
>> cases are probably legitimately called INTERNAL_ERROR.  I thought for
>> awhile about ERRCODE_DATA_CORRUPTED, but I'm not convinced that's a
>> better fit.

> I agree that it might not be an obvious better fit, but also not an obvious
> worse fit.  It will make it easier to filter on during fleet analysis so I
> would be inclined to change it, but the main value of the patch are other 
> hunks
> so feel free to ignore.

Fair enough.  Pushed with ERRCODE_DATA_CORRUPTED used there.
Thanks again for reviewing.

regards, tom lane




Re: Conflict detection for update_deleted in logical replication

2024-09-24 Thread Masahiko Sawada
On Tue, Sep 24, 2024 at 12:14 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada 
>  wrote:
> >
> > On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> >  wrote:
> > > > I'm still studying this idea but let me confirm the following scenario.
> > > >
> > > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > > and XIDs and commit LSNs of T2 and T3 are the following:
> > > >
> > > > Node A
> > > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> > commit-LSN:1000
> > > >
> > > > Node B
> > > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > > commit-LSN:5000
> > > >
> > > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > > latest flush WAL position of Node-A and Node-B are following:
> > > >
> > > > Node A
> > > >   current XID: 300
> > > >   latest flush LSN; 3000
> > > >
> > > > Node B
> > > >   current XID: 700
> > > >   latest flush LSN: 7000
> > > >
> > > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > > (i.e., the logical replication is delaying for 5 min).
> > > >
> > > > Consider the following scenario:
> > > >
> > > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > > gets 301 (set as candidate_xmin).
> > > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is 
> > > > now
> > 8000.
> > > > 4. The apply worker on Node-A continues applying changes, and
> > > > applies the transactions up to remote (commit) LSN 7100.
> > > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > > 301 (candidate_xmin).
> > > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > > deleted by T2.
> > > >
> > > > Here, on Node-B, there might be a transition between LSN 7100 and
> > > > 8000 that might require the tuple that is deleted by T2.
> > > >
> > > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > > still depends on when vacuum removes the tuple deleted by T2.
> > >
> > > I think in this case, no matter whether we detect "update_delete" or
> > > "update_missing", the final data is the same. Because T4's commit
> > > timestamp should be later than
> > > T2 on node A, so in the case of "update_deleted", it will compare the
> > > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > > T4 should win, which means we will convert the update into insert and
> > > apply. Even if the deleted tuple is deleted and "update_missing" is
> > > detected, the update will still be converted into insert and applied. So, 
> > > the
> > result is the same.
> >
> > Is "latest_timestamp_wins" the default resolution method for
> > "update_deleted"? When I checked the wiki page[1], "skip" was the
> > default resolution method for that.
>
> Right, I think the wiki needs some update.
>
> I think using 'skip' as default for update_delete could easily cause data
> divergence when the dead tuple is deleted by an old transaction while the
> UPDATE has a newer timestamp like the case you mentioned. It's necessary to
> follow the last update win strategy when the incoming update has later
> timestamp, which is to convert update to insert.

Right. If "latest_timestamp_wins" is the default resolution for
"update_deleted", I think your idea works fine unless I'm missing
corner cases.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new COPY option REJECT_LIMIT

2024-09-24 Thread Fujii Masao




On 2024/09/24 14:25, torikoshia wrote:

Updated the patch.


Thanks for updating the patch!

+REJECT_LIMIT { integer }

The curly braces {} seem unnecessary here.

+  When a positive integer value is specified, COPY 
limits
+  the maximum tolerable number of errors while converting a column's input
+  value into its data type.
+  If input data caused more errors than the specified value, entire
+  COPY fails.
+  Otherwise, COPY discards the input row and continues
+  with the next one.
+  This option must be used with ON_ERROR to be set to
+  ignore.
+  Just setting ON_ERROR to ignore
+  tolerates unlimited number of errors.

Regarding the description of REJECT_LIMIT, how about clarifying what the option 
specifies up front, like this:


Specifies the maximum number of errors tolerated while converting a column's
input value to its data type, when on_error is set to "ignore." If the input
causes more errors than the specified value, the COPY command fails,
even with on_error set to "ignore." This value must be positive and used with
on_error="ignore". If not specified, on_error="ignore" allows an unlimited
number of errors, meaning COPY will skip all erroneous data.


+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("number for REJECT_LIMIT must be greater 
than zero")));
+   }
+   else
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("value for REJECT_LIMIT must be positive 
integer")));

The phrases "number for" and "value for" seem unnecessary in the error messages.
Also, "positive number" should be "a positive number." Would it be better to 
display
the actual specified value in the error message, like: errmsg("REJECT_LIMIT 
(%s) must
be a positive number", value)?

Lastly, during my testing, if nothing is specified for REJECT_LIMIT
(e.g., COPY ... WITH (..., REJECT_LIMIT)), the COPY command caused a 
segmentation fault.



I'd like to hear if anyone has an opinion on the need for supporting ratio.


Since there was no opinion about it, I attached a patch that only supports
specifying REJECT_LIMIT as a number.


+1


As there are no opposing opinions, removed 'INFINITY' as well.


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: pg_ctl/miscinit: print "MyStartTime" as a long long instead of long to avoid 2038 problem.

2024-09-24 Thread Nathan Bossart
On Tue, Sep 24, 2024 at 07:33:24PM +, Max Johnson wrote:
> I have found an instance of a time overflow with the start time that is
> written in "postmaster.pid". On a 32-bit Linux system, if the system date
> is past 01/19/2038, when you start Postgres with `pg_ctl start -D
> {datadir} ...`,  the start time will have rolled back to approximately
> 1900. This is an instance of the "2038 problem". On my system, pg_ctl
> will not exit if the start time has overflowed.

Nice find.  I think this has been the case since the start time was added
to the lock files [0].

> - snprintf(buffer, sizeof(buffer), "%d\n%s\n%ld\n%d\n%s\n",
> + snprintf(buffer, sizeof(buffer), "%d\n%s\n%lld\n%d\n%s\n",
>amPostmaster ? (int) my_pid : -((int) my_pid),
>DataDir,
> -  (long) MyStartTime,
> +  (long long) MyStartTime,
>PostPortNumber,
>socketDir);

I think we should use INT64_FORMAT here.  That'll choose the right length
modifier for the platform.  And I don't think we need to cast MyStartTime,
since it's a pg_time_t (which is just an int64).
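
A standalone sketch of the difference (pg_time_t and INT64_FORMAT are local
stand-ins here; in the backend they come from the tree's own headers):

```
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Stand-ins for the backend's pg_time_t and INT64_FORMAT definitions. */
typedef int64_t pg_time_t;
#define INT64_FORMAT "%" PRId64

int
main(void)
{
	pg_time_t	MyStartTime = 2203891200LL;	/* a timestamp past 2038-01-19 */
	char		buffer[64];

	/* Old pattern: on a platform with 32-bit long, the cast truncates. */
	snprintf(buffer, sizeof(buffer), "%ld\n", (long) MyStartTime);
	fputs(buffer, stdout);

	/*
	 * Suggested pattern: no cast, and INT64_FORMAT expands to the right
	 * length modifier for the platform, so the value survives past 2038.
	 */
	snprintf(buffer, sizeof(buffer), INT64_FORMAT "\n", MyStartTime);
	fputs(buffer, stdout);
	return 0;
}
```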

[0] https://postgr.es/c/30aeda4

-- 
nathan




Re: AIO writes vs hint bits vs checksums

2024-09-24 Thread Andres Freund
Hi,

On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
> On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> > So far the AIO patchset has solved this by introducing a set of "bounce
> > buffers", which can be acquired and used as the source/target of IO when 
> > doing
> > it in-place into shared buffers isn't viable.
> >
> > I am worried about that solution however, as either acquisition of bounce
> > buffers becomes a performance issue (that's how I did it at first, it was 
> > hard
> > to avoid regressions) or we reserve bounce buffers for each backend, in 
> > which
> > case the memory overhead for instances with relatively small amount of
> > shared_buffers and/or many connections can be significant.
>
> > But: We can address this and improve performance over the status quo! Today 
> > we
> > determine tuple visibility one-by-one, even when checking the
> > visibility of an entire page worth of tuples. That's not exactly free. I've
> > prototyped checking visibility of an entire page of tuples at once and it
> > indeed speeds up visibility checks substantially (in some cases seqscans are
> > over 20% faster!).
>
> Nice!  It sounds like you refactored the relationship between
> heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
> bit setting upward or the iterate-over-tuples downward.  Is that about right?

I've tried about five variations, so I don't have one answer to this yet :).

One problem is that having repeated loops doing PageGetItemId(),
PageGetItem(), ItemIdGetLength() isn't exactly free. To some degree it can be
hidden by allowing for better superscalar execution, but not entirely.

I've been somewhat confused by the compiler generated code around ItemId
handling for a while, it looks way more expensive than it should - it
regularly is a bottleneck due to the sheer number of instructions being
executed leading to being frontend bound. But never quite looked into it
deeply enough to figure out what's causing it / how to fix it.


> > Once we have page-level visibility checks we can get the right to set hint
> > bits once for an entire page instead of doing it for every tuple - with that
> > in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
> > wins.
>
> How did page-level+BM_SETTING_HINTS performance compare to performance of the
> page-level change w/o the BM_SETTING_HINTS change?

Just ran that. There probably is a performance difference, but it's small
(<0.5%) making it somewhat hard to be certain. It looks like the main reason
for that is ConditionVariableBroadcast() on the iocv shows up even though
nobody is waiting.

I've been fighting that with AIO as well, so maybe it's time to figure out the
memory ordering rules that'd allow to check that without a full spinlock
acquisition.

If we figure it out, we perhaps should use the chance to get rid of
BM_PIN_COUNT_WAITER...


> > Does this sound like a reasonable idea?  Counterpoints?
>
> I guess the main part left to discuss is index scans or other scan types where
> we'd either not do page-level visibility or we'd do page-level visibility
> including tuples we wouldn't otherwise use.  BM_SETTING_HINTS likely won't
> show up so readily in index scan profiles, but the cost is still there.

I could indeed not make it show up in some simple index lookup heavy
workloads.  I need to try some more extreme cases though (e.g. fetching all
tuples in a table via an index or having very long HOT chains).

If it's not visible cost-wise compared to all the other costs of index scans -
does it matter?  If it's not visible it's either because it proportionally is
very small or because it's completely hidden by superscalar execution.


Another thing I forgot to mention that probably fits into the "tradeoffs"
bucket: Because BM_SETTING_HINTS would be just a single bit, one backend
setting hint bits would block out another backend setting hint bits. In most
situations that'll be fine, or even faster than not doing so due to reducing
cache line ping-pong, but in cases of multiple backends doing index lookups to
different unhinted tuples on the same page it could be a bit worse.

But I suspect that's fine because it's hard to believe that you could have
enough index lookups to unhinted tuples for that to be a bottleneck -
something has to produce all those unhinted tuples after all, and that's
rather more expensive. And for single-tuple visibility checks the window in
which hint bits are set is very small.
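
To make that single-bit trade-off concrete, here is a toy standalone model
using C11 atomics (the names and structure are mine; the real buffer-header
flags and protocol are more involved): whichever backend grabs the bit first
gets to set hint bits, and a concurrent setter simply skips.

```
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_SETTING_HINTS	(1u << 0)

typedef struct
{
	atomic_uint	state;			/* stand-in for a buffer header's flag bits */
} ToyBufferDesc;

/* Try to become the single hint-bit setter for this buffer. */
static bool
toy_try_start_setting_hints(ToyBufferDesc *buf)
{
	unsigned int prev = atomic_fetch_or(&buf->state, TOY_SETTING_HINTS);

	/* If the bit was already set, another backend holds it; just skip. */
	return (prev & TOY_SETTING_HINTS) == 0;
}

static void
toy_finish_setting_hints(ToyBufferDesc *buf)
{
	atomic_fetch_and(&buf->state, ~TOY_SETTING_HINTS);
}

int
main(void)
{
	ToyBufferDesc buf = {0};

	if (toy_try_start_setting_hints(&buf))
	{
		/* While we hold the bit, a concurrent setter loses and skips. */
		printf("second setter succeeds? %s\n",
			   toy_try_start_setting_hints(&buf) ? "yes" : "no");
		toy_finish_setting_hints(&buf);
	}
	return 0;
}
```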


> How should we think about comparing the distributed cost of the buffer
> header manipulations during index scans vs. the costs of bounce buffers?

Well, the cost of bounce buffers would be borne as long as postgres is up,
whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
really show up. If there are cases where the cost of the more expensive hint
bit logic does show up, it'll get a lot harder to weigh.


So far my prototype uses the path that avoids h

Re: [PATCH] Add native windows on arm64 support

2024-09-24 Thread Andres Freund
On 2024-09-24 16:01:30 -0400, Dave Cramer wrote:
> I have a dmp file with a stack trace if anyone is interested

/me raises his hand




Re: pg_ctl/miscinit: print "MyStartTime" as a long long instead of long to avoid 2038 problem.

2024-09-24 Thread Tom Lane
Nathan Bossart  writes:
> I think we should use INT64_FORMAT here.  That'll choose the right length
> modifier for the platform.  And I don't think we need to cast MyStartTime,
> since it's a pg_time_t (which is just an int64).

Agreed.  However, a quick grep finds half a dozen other places that
are casting MyStartTime to long.  We should fix them all.

Also note that if any of the other places are using translatable
format strings, INT64_FORMAT is problematic in that context, and
"long long" is a better answer for them.

(I've not dug in the history, but I rather imagine that this is all
a hangover from MyStartTime having once been plain "time_t".)

regards, tom lane




Re: Clock-skew management in logical replication

2024-09-24 Thread Nisha Moond
On Mon, Sep 23, 2024 at 4:00 PM Nisha Moond  wrote:
>
> On Fri, Sep 20, 2024 at 7:51 PM Tom Lane  wrote:
> >
> > Nisha Moond  writes:
> > > While considering the implementation of timestamp-based conflict
> > > resolution (last_update_wins) in logical replication (see [1]), there
> > > was a feedback at [2] and the discussion on whether or not to manage
> > > clock-skew at database level.
> >
> > FWIW, I cannot see why we would do anything beyond suggesting that
> > people run NTP.  That's standard anyway on the vast majority of
> > machines these days.  Why would we add complexity that we have
> > to maintain (and document) in order to cater to somebody not doing
> > that?
> >
> > regards, tom lane
>
> Thank you for your response.
>
> I agree with suggesting users to run NTP and we can recommend it in
> the docs rather than introducing additional complexities.
>
> In my research on setting up NTP servers on Linux, I found that
> Chrony[1] is a lightweight and efficient solution for time
> synchronization across nodes. Another reliable option is the classic
> NTP daemon (ntpd)[2], which is also easy to configure and maintain.
> Both Chrony and ntpd can be used to configure a local machine as an
> NTP server for localized time synchronization, or as clients syncing
> from public NTP servers such as 'ntp.ubuntu.com' (default ntp server
> pool for Ubuntu systems) or 'time.google.com'(Google Public NTP).
> For example, on Ubuntu, Chrony is straightforward to install and
> configure[3]. Comprehensive NTP(ntpd) configuration guides are
> available for various Linux distributions, such as Ubuntu[4] and
> RedHat-Linux[5].
>
> Further, I’m exploring options for implementing NTP on Windows systems.
>

Windows platforms provide built-in time synchronization services. As a
client, they allow users to sync system time using internet or public
NTP servers. This can be easily configured by selecting a public NTP
server directly in the Date and Time settings. More details can be
found at [1].

Additionally, Windows servers can be configured as NTP servers for
localized time synchronization within a network, allowing other nodes
to sync with them. Further instructions on configuring an NTP server
on Windows can be found at [2].

[1] 
https://learn.microsoft.com/en-us/windows-server/networking/windows-time-service/how-the-windows-time-service-works
[2] 
https://learn.microsoft.com/en-us/troubleshoot/windows-server/active-directory/configure-authoritative-time-server

Thanks,
Nisha




Re: Primary and standby setting cross-checks

2024-09-24 Thread Noah Misch
On Thu, Aug 29, 2024 at 09:52:06PM +0300, Heikki Linnakangas wrote:
> Currently, if you configure a hot standby server with a smaller
> max_connections setting than the primary, the server refuses to start up:
> 
> LOG:  entering standby mode
> FATAL:  recovery aborted because of insufficient parameter settings
> DETAIL:  max_connections = 10 is a lower setting than on the primary server,
> where its value was 100.

> happen anyway:
> 
> 2024-08-29 21:44:32.634 EEST [668327] FATAL:  out of shared memory
> 2024-08-29 21:44:32.634 EEST [668327] HINT:  You might need to increase
> "max_locks_per_transaction".
> 2024-08-29 21:44:32.634 EEST [668327] CONTEXT:  WAL redo at 2/FD40FCC8 for
> Standby/LOCK: xid 996 db 5 rel 154045
> 2024-08-29 21:44:32.634 EEST [668327] WARNING:  you don't own a lock of type
> AccessExclusiveLock
> 2024-08-29 21:44:32.634 EEST [668327] LOG:  RecoveryLockHash contains entry
> for lock no longer recorded by lock manager: xid 996 database 5 relation
> 154045
> TRAP: failed Assert("false"), File: "../src/backend/storage/ipc/standby.c",

> Granted, if you restart the server, it will probably succeed because
> restarting the server will kill all the other queries that were holding
> locks. But yuck.

Agreed.

> So how to improve this? I see a few options:
> 
> a) Downgrade the error at startup to a warning, and allow starting the
> standby with smaller settings in standby. At least with a smaller
> max_locks_per_transactions. The other settings also affect the size of
> known-assigned XIDs array, but if the CSN snapshots get committed, that will
> get fixed. In most cases there is enough lock memory anyway, and it will be
> fine. Just fix the assertion failure so that the error message is a little
> nicer.
> 
> b) If you run out of lock space, kill running queries, and prevent new ones
> from starting. Track the locks in startup process' private memory until
> there is enough space in the lock manager, and then re-open for queries. In
> essence, go from hot standby mode to warm standby, until it's possible to go
> back to hot standby mode again.

Either seems fine.  Having never encountered actual lock exhaustion from this,
I'd lean toward (a) for simplicity.

> Thoughts, better ideas?

I worry about future code assuming a MaxBackends-sized array suffices for
something.  That could work almost all the time, breaking only when a standby
replays WAL from a server having a larger array.  What could we do now to
catch that future mistake promptly?  As a start, 027_stream_regress.pl could
use low settings on its standby.




Re: Normalize queries starting with SET for pg_stat_statements

2024-09-24 Thread Michael Paquier
On Tue, Sep 24, 2024 at 04:57:28PM +0900, Michael Paquier wrote:
> 0001 is straightforward, and it was I think a mistake not to include
> that from the start when I expanded these tests in the v16 cycle
> (well, time..).  0002 is also quite conservative in the end, and this
> design can be used to easily tune the jumbling patterns from gram.y
> depending on the feedback we'd get.

Applied 0001 for now to expand the tests, with one tweak: the removal
of SET NAMES.  It was proving tricky to use something other than UTF-8,
with the CI complaining on Windows.  True that this could use something
like unaccent and an alternate output in a different file, but I'm not
inclined to take the cost just for this specific query pattern.

The remaining 0002 is attached for now.  I am planning to wrap that
next week after a second lookup, except if there are any comments, of
course.
--
Michael
From 40983157423d93334f30c70def5bf311fcff80f6 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 25 Sep 2024 12:08:04 +0900
Subject: [PATCH v3] Normalize queries starting with SET

---
 src/include/nodes/parsenodes.h| 20 +---
 src/backend/nodes/queryjumblefuncs.c  | 19 +++
 src/backend/parser/gram.y | 24 +++
 contrib/pg_stat_statements/expected/dml.out   |  2 +-
 .../expected/level_tracking.out   |  2 +-
 .../pg_stat_statements/expected/utility.out   | 20 ++--
 contrib/pg_stat_statements/expected/wal.out   |  2 +-
 7 files changed, 70 insertions(+), 19 deletions(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index e62ce1b753..9e76713f2d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2616,11 +2616,25 @@ typedef enum VariableSetKind
 
 typedef struct VariableSetStmt
 {
+	pg_node_attr(custom_query_jumble)
+
 	NodeTag		type;
 	VariableSetKind kind;
-	char	   *name;			/* variable to be set */
-	List	   *args;			/* List of A_Const nodes */
-	bool		is_local;		/* SET LOCAL? */
+	/* variable to be set */
+	char	   *name;
+	/* List of A_Const nodes */
+	List	   *args;
+
+	/*
+	 * True if arguments should be accounted for in query jumbling.  We use a
+	 * separate flag rather than query_jumble_ignore on "args" as several
+	 * grammar flavors of SET rely on a list of values that are parsed
+	 * directly from the grammar's keywords.
+	 */
+	bool		jumble_args;
+	/* SET LOCAL? */
+	bool		is_local;
+	ParseLoc	location pg_node_attr(query_jumble_location);
 } VariableSetStmt;
 
 /* --
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 129fb44709..5e43fd9229 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -57,6 +57,7 @@ static void RecordConstLocation(JumbleState *jstate, int location);
 static void _jumbleNode(JumbleState *jstate, Node *node);
 static void _jumbleA_Const(JumbleState *jstate, Node *node);
 static void _jumbleList(JumbleState *jstate, Node *node);
+static void _jumbleVariableSetStmt(JumbleState *jstate, Node *node);
 
 /*
  * Given a possibly multi-statement source string, confine our attention to the
@@ -352,3 +353,21 @@ _jumbleA_Const(JumbleState *jstate, Node *node)
 		}
 	}
 }
+
+static void
+_jumbleVariableSetStmt(JumbleState *jstate, Node *node)
+{
+	VariableSetStmt *expr = (VariableSetStmt *) node;
+
+	JUMBLE_FIELD(kind);
+	JUMBLE_STRING(name);
+
+	/*
+	 * Account for the list of arguments in query jumbling only if told by the
+	 * parser.
+	 */
+	if (expr->jumble_args)
+		JUMBLE_NODE(args);
+	JUMBLE_FIELD(is_local);
+	JUMBLE_LOCATION(location);
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index b1d4642c59..4aa8646af7 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1647,6 +1647,8 @@ set_rest:
 	n->kind = VAR_SET_MULTI;
 	n->name = "TRANSACTION";
 	n->args = $2;
+	n->jumble_args = true;
+	n->location = -1;
 	$$ = n;
 }
 			| SESSION CHARACTERISTICS AS TRANSACTION transaction_mode_list
@@ -1656,6 +1658,8 @@ set_rest:
 	n->kind = VAR_SET_MULTI;
 	n->name = "SESSION CHARACTERISTICS";
 	n->args = $5;
+	n->jumble_args = true;
+	n->location = -1;
 	$$ = n;
 }
 			| set_rest_more
@@ -1669,6 +1673,7 @@ generic_set:
 	n->kind = VAR_SET_VALUE;
 	n->name = $1;
 	n->args = $3;
+	n->location = @3;
 	$$ = n;
 }
 			| var_name '=' var_list
@@ -1678,6 +1683,7 @@ generic_set:
 	n->kind = VAR_SET_VALUE;
 	n->name = $1;
 	n->args = $3;
+	n->location = @3;
 	$$ = n;
 }
 			| var_name TO DEFAULT
@@ -1686,6 +1692,7 @@ generic_set:
 
 	n->kind = VAR_SET_DEFAULT;
 	n->name = $1;
+	n->location = -1;
 	$$ = n;
 }
 			| var_name '=' DEFAULT
@@ -1694,6 +1701,7 @@ generic_set:
 
 	n->kind = VAR_SET_DEFAULT;
 	n->name = $1;
+	n->location = -1;
 	$$ = n;
 }
 		;
@@ -1706,6 +1714,7 @@ 

Re: Eager aggregation, take 3

2024-09-24 Thread Richard Guo
On Wed, Aug 28, 2024 at 9:01 PM Robert Haas  wrote:
> On Tue, Aug 27, 2024 at 11:57 PM Tender Wang  wrote:
> > I haven't looked at all of them. I just picked a few simple plan tests
> > (e.g. 19.sql, 45.sql).
> > For example, in 19.sql, eager agg pushdown doesn't get a large gain, but a
> > little performance regression.
>
> Yeah, this is one of the things I was worried about in my previous
> reply to Richard. It would be worth Richard, or someone, probing into
> exactly why that's happening. My fear is that we just don't have good
> enough estimates to make good decisions, but there might well be
> another explanation.

Sorry, it took some time to switch back to this thread.

I revisited the part about cost estimates for grouped paths in this
patch, and I found a big issue: the row estimate for a join path could
be significantly inaccurate if there is a grouped join path beneath
it.

The reason is that it is very tricky to set the size estimates for a
grouped join relation.  For a non-grouped join relation, we know that
all its paths have the same rowcount estimate (well, in theory).  But
this is not true for a grouped join relation.  Suppose we have a
grouped join relation for t1/t2 join.  There might be two paths for
it:

Aggregate
  -> Join
       -> Scan on t1
       -> Scan on t2

Or

Join
  -> Scan on t1
  -> Aggregate
       -> Scan on t2

These two paths can have very different rowcount estimates, and we
have no way of knowing which one to set for this grouped join
relation, because we do not know which path would be picked in the
final plan.  This issue can be illustrated with the query below.

create table t (a int, b int, c int);
insert into t select i%10, i%10, i%10 from generate_series(1,1000)i;
analyze t;

set enable_eager_aggregate to on;

explain (costs on)
select sum(t2.c) from t t1 join t t2 on t1.a = t2.a join t t3 on t2.b
= t3.b group by t3.a;
  QUERY PLAN
---
 Finalize HashAggregate  (cost=6840.60..6840.70 rows=10 width=12)
   Group Key: t3.a
   ->  Nested Loop  (cost=1672.00..1840.60 rows=100 width=12)
 Join Filter: (t2.b = t3.b)
 ->  Partial HashAggregate  (cost=1672.00..1672.10 rows=10 width=12)
   Group Key: t2.b
   ->  Hash Join  (cost=28.50..1172.00 rows=10 width=8)
 Hash Cond: (t1.a = t2.a)
 ->  Seq Scan on t t1  (cost=0.00..16.00 rows=1000 width=4)
 ->  Hash  (cost=16.00..16.00 rows=1000 width=12)
   ->  Seq Scan on t t2  (cost=0.00..16.00
rows=1000 width=12)
 ->  Materialize  (cost=0.00..21.00 rows=1000 width=8)
   ->  Seq Scan on t t3  (cost=0.00..16.00 rows=1000 width=8)
(13 rows)

Look at the Nested Loop node:

   ->  Nested Loop  (cost=1672.00..1840.60 rows=100 width=12)

How can a 10-row outer path joining a 1000-row inner path generate
100 rows?  This is because we are using the plan of the first path
described above, and the rowcount estimate of the second path.  What a
kluge!

To address this issue, one solution I’m considering is to recalculate
the row count estimate for a grouped join path using its outer and
inner paths.  While this may seem expensive, it might not be that bad
since we will cache the results of the selectivity calculation.  In
fact, this is already the approach we take for parameterized join
paths (see get_parameterized_joinrel_size).

Any thoughts on this?

Thanks
Richard




Re: Clock-skew management in logical replication

2024-09-24 Thread Michael Paquier
On Fri, Sep 20, 2024 at 10:21:34AM -0400, Tom Lane wrote:
> FWIW, I cannot see why we would do anything beyond suggesting that
> people run NTP.  That's standard anyway on the vast majority of
> machines these days.  Why would we add complexity that we have
> to maintain (and document) in order to cater to somebody not doing
> that?

Agreed.  I am in the same boat as you are here.  I don't think that
the database should be in charge of making decisions like that based on
a clock that may have gone crazy.  Precise clocks are a difficult
problem, for sure, but this patch is just providing a workaround for a
problem that should not be linked to the backend engine by default, and
I agree that we will feel better if we neither maintain this stuff nor
enter this territory.

Making that more pluggable, though, has the merit of letting out-of-core
folks do what they want, even if we may end up with a community
ecosystem that has more solutions than the number of fingers on one
hand.  I've seen multiple ways of solving conflicts across multiple
logical nodes in the past years, some being clock-based, some not, and
some with more internal strictly-monotonic counters to resolve any
conflicts.
--
Michael


signature.asc
Description: PGP signature


Re: Recovering from detoast-related catcache invalidations

2024-09-24 Thread Noah Misch
On Sun, Jan 14, 2024 at 12:14:11PM -0800, Noah Misch wrote:
> On Fri, Jan 12, 2024 at 03:47:13PM -0500, Tom Lane wrote:
> > Oh!  After nosing around a bit more I remembered systable_recheck_tuple,
> > which is meant for exactly this purpose.  So v4 attached.
> 
> systable_recheck_tuple() is blind to heap_inplace_update(), so it's not a
> general proxy for invalidation messages.  The commit for $SUBJECT (ad98fb1)
> doesn't create any new malfunctions, but I expect the systable_recheck_tuple()
> part will change again before the heap_inplace_update() story is over
> (https://postgr.es/m/flat/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com).

Commit f9f47f0 (2024-06-27) addressed inplace updates here.




Re: How to send patch with so many files changes?

2024-09-24 Thread Tony Wayne
On Wed, Sep 25, 2024 at 6:36 AM Michael Paquier  wrote:

> On Wed, Sep 25, 2024 at 06:19:39AM +0530, Tony Wayne wrote:
> > These changes are for core ,I think it would be better to either move
> whole
> > changes to core or contrib as an extension.
>
> Please avoid top-posting.  The community mailing lists use
> bottom-posting, to ease discussions.  See:
> https://en.wikipedia.org/wiki/Posting_style#Bottom-posting
>
> Thanks for the feedback.

> > On Wed, Sep 25, 2024 at 6:09 AM David G. Johnston <
> > david.g.johns...@gmail.com> wrote:
> >> Usually you'd want to use format-patch so your commit message(s) make it
> >> into the artifact.  Especially for something complex/large.
>
> The community wiki has some guidelines about all that:
> https://wiki.postgresql.org/wiki/Submitting_a_Patch
>
> In my experience, it is much easier to sell a feature to the community
> if a patch is organized into independent useful pieces with
> refactoring pieces presented on top of the actual feature.  In order
> to achieve that `git format-patch` is essential because it is possible
> to present a patch set organizing your ideas so that others need to
> spend less time trying to figure out what a patch set is doing when
> doing a review.  format-patch with `git am` is also quite good to
> track the addition of new files or the removal of old files.  Writing
> your ideas in the commit logs can also bring a lot of insight for
> anybody reading your patches.
>
> For simpler and localized changes, using something like git diff that
> can be applied with a simple `patch` command is also fine.  I've done
> plenty of work with patches sent to the lists this
> way for bug fixes.  Of course this is case-by-case, for rather complex
> bug fixes format-patch can still be a huge gain of time when reading
> somebody else's ideas on a specific matter.
> --
> Michael
>
Thanks, I got it 👍.


Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-24 Thread Michael Paquier
On Tue, Sep 24, 2024 at 02:26:08PM -0500, Nathan Bossart wrote:
> I gave this a try and, unsurprisingly, found a bunch of other problems.  I
> hastily hacked together the attached patch that should fix all of them, but
> I'd still like to comb through the code a bit more.  The three catalogs
> with problems are pg_replication_origin, pg_subscription, and
> pg_constraint.

Regression tests don't blow up after applying this patch and the reindex parts.

> pg_constraint has had a TOAST table for a while, and I don't
> think it's unheard of for conbin to be large, so this one is probably worth
> fixing.

Ahh.  That's the tablecmds.c part for the partition detach.

> pg_subscription hasn't had its TOAST table for quite as long, but
> presumably subpublications could be large enough to require out-of-line
> storage.  pg_replication_origin, however, only has one varlena column:
> roname.  Three out of the seven problem areas involve
> pg_replication_origin, but AFAICT that'd only ever be a problem if the name
> of your replication origin requires out-of-line storage.  So... maybe we
> should just remove pg_replication_origin's TOAST table instead...

I'd rather keep it, FWIW.  Contrary to pg_authid it does not imply
problems at the same scale because we would have access to the toast
relation in all the code paths with logical workers or table syncs.
The other one was at early authentication stages.

+   /*
+    * If we might need TOAST access, make sure the caller has set up a valid
+    * snapshot.
+    */
+   Assert(HaveRegisteredOrActiveSnapshot() ||
+          !OidIsValid(heapRel->rd_rel->reltoastrelid) ||
+          !IsNormalProcessingMode());
+

I didn't catch that we could just reuse the opened Relation in these
paths and check for reltoastrelid.  Nice.

It sounds to me that we should be much more proactive in detecting
these failures and add something like that on HEAD.  That's cheap
enough.  As the checks are the same for all these code paths, perhaps
just hide them behind a local macro to reduce the duplication?
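
For instance, a minimal sketch (the macro name is invented here) that
keeps each call site to a single line:

#define AssertHasSnapshotForToast(rel) \
    Assert(HaveRegisteredOrActiveSnapshot() || \
           !OidIsValid((rel)->rd_rel->reltoastrelid) || \
           !IsNormalProcessingMode())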

Not the responsibility of this patch, but the business with
clear_subscription_skip_lsn() with its conditional transaction start
feels messy.  This comes down to the way handles work for 2PC and the
streaming, which may or may not be in a transaction depending on the
state of the upper caller.  Your patch looks right in the way
snapshots are set, as far as I've checked.
--
Michael


signature.asc
Description: PGP signature


Re: AIX support

2024-09-24 Thread Heikki Linnakangas

On 24/09/2024 14:25, Srirama Kucherlapati wrote:

Hi Heikki & team,

Could you please let me know your comments on the previous details?

Attached are the individual patches for AIX and gcc(__sync) routines.


Repeating what I said earlier:


Ok, if we don't need the assembler code at all, that's good. A patch
to introduce AIX support should not change it for non-AIX powerpc
systems though. That might be a good change, but would need to be
justified separately, e.g.  by some performance testing, and should
be a separate patch

--
Heikki Linnakangas
Neon (https://neon.tech)





index_delete_sort: Unnecessary variable "low" is used in heapam.c

2024-09-24 Thread btnakamurakoukil

Hi hackers,

I noticed unnecessary variable "low" in index_delete_sort() 
(/postgres/src/backend/access/heap/heapam.c), patch attached. What do 
you think?


Regards,
Koki Nakamura From c0bf7f8468cb0e4f74c41e434adf70bdbfb3ad6d Mon Sep 17 00:00:00 2001
From: KokiNaka 
Date: Tue, 24 Sep 2024 10:46:09 +0900
Subject: [PATCH] Delete low

---
 src/backend/access/heap/heapam.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f167107257..64953541cd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7940,7 +7940,6 @@ index_delete_sort(TM_IndexDeleteOp *delstate)
 {
 	TM_IndexDelete *deltids = delstate->deltids;
 	int			ndeltids = delstate->ndeltids;
-	int			low = 0;
 
 	/*
 	 * Shellsort gap sequence (taken from Sedgewick-Incerpi paper).
@@ -7956,7 +7955,7 @@ index_delete_sort(TM_IndexDeleteOp *delstate)
 
 	for (int g = 0; g < lengthof(gaps); g++)
 	{
-		for (int hi = gaps[g], i = low + hi; i < ndeltids; i++)
+		for (int hi = gaps[g], i = hi; i < ndeltids; i++)
 		{
 			TM_IndexDelete d = deltids[i];
 			int			j = i;
-- 
2.40.1



Re: Documentation to upgrade logical replication cluster

2024-09-24 Thread Amit Kapila
On Fri, Sep 20, 2024 at 5:46 PM vignesh C  wrote:
>
> I didn’t include a note because each disable/enable statement
> specifies: a) Disable all subscriptions on the node, b) Enable all
> subscriptions on the node. The attached v11 version patch just to show
> the examples with one subscription.
>

The following steps in the bi-directional node upgrade have some problems.

+   
+On node1, create any tables that were created in
+node2 between 
+and now, e.g.:
+
+node1=# CREATE TABLE distributors (did integer PRIMARY KEY, name varchar(40));
+CREATE TABLE
+
+   
+  
+
+  
+   
+Enable all the subscriptions on node2 that are
+subscribing the changes from node1 by using
+ALTER
SUBSCRIPTION ... ENABLE,
+e.g.:
+
+node2=# ALTER SUBSCRIPTION sub1_node1_node2 ENABLE;
+ALTER SUBSCRIPTION
+
+   
+  
+
+  
+   
+Refresh the node2 subscription's publications using
+ALTER
SUBSCRIPTION ... REFRESH PUBLICATION,
+e.g.:
+
+node2=# ALTER SUBSCRIPTION sub1_node1_node2 REFRESH PUBLICATION;
+ALTER SUBSCRIPTION
+
+   

If you are creating missing tables on node-1, won't that node's
subscription need to be refreshed to get the missing data? Also, I
suggest that step 2 in the above steps (enabling subscriptions on
node-2) be moved before creating the table on node-1 and then issuing
a REFRESH command on node-1. The corresponding steps for the other
node's upgrade have similar problems.

-- 
With Regards,
Amit Kapila.




Re: Documentation to upgrade logical replication cluster

2024-09-24 Thread Amit Kapila
On Tue, Sep 24, 2024 at 4:20 PM Amit Kapila  wrote:
>
> On Fri, Sep 20, 2024 at 5:46 PM vignesh C  wrote:
> >
> > I didn’t include a note because each disable/enable statement
> > specifies: a) Disable all subscriptions on the node, b) Enable all
> > subscriptions on the node. The attached v11 version patch just to show
> > the examples with one subscription.
> >
>
> The following steps in the bi-directional node upgrade have some problems.
>

One more point to note is that I am planning to commit this patch only
for HEAD. One could argue for backpatching this to 17 as well, but as
users would be able to upgrade the slots from 17 onwards, I am
inclined to push this to HEAD only.

-- 
With Regards,
Amit Kapila.




Re: not null constraints, again

2024-09-24 Thread Alvaro Herrera
On 2024-Sep-24, jian he wrote:

> static void
> set_attnotnull(List **wqueue, Relation rel, AttrNumber attnum, bool recurse,
>LOCKMODE lockmode)
> {

> What do you think of the above refactor?
> (I intentionally deleted empty new line)

Looks nicer ... but you know what?  After spending some more time with
it, I realized that one caller is dead code (in
AttachPartitionEnsureIndexes) and another caller doesn't need to ask for
recursion, because it recurses itself (in ATAddCheckNNConstraint).  So
that leaves us with a grand total of zero callers that need the
recursion here ... which means we can simplify it to the case that it
only examines a single relation and never recurses.

So I've stripped it down to its bare minimum:

/*
 * Helper to set pg_attribute.attnotnull if it isn't set, and to tell phase 3
 * to verify it.
 *
 * When called to alter an existing table, 'wqueue' must be given so that we
 * can queue a check that existing tuples pass the constraint.  When called
 * from table creation, 'wqueue' should be passed as NULL.
 */
static void
set_attnotnull(List **wqueue, Relation rel, AttrNumber attnum,
               LOCKMODE lockmode)
{
    Oid         reloid = RelationGetRelid(rel);
    HeapTuple   tuple;
    Form_pg_attribute attForm;

    CheckAlterTableIsSafe(rel);

    tuple = SearchSysCacheCopyAttNum(reloid, attnum);
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for attribute %d of relation %u",
             attnum, reloid);
    attForm = (Form_pg_attribute) GETSTRUCT(tuple);
    if (!attForm->attnotnull)
    {
        Relation    attr_rel;

        attr_rel = table_open(AttributeRelationId, RowExclusiveLock);

        attForm->attnotnull = true;
        CatalogTupleUpdate(attr_rel, &tuple->t_self, tuple);

        if (wqueue && !NotNullImpliedByRelConstraints(rel, attForm))
        {
            AlteredTableInfo *tab;

            tab = ATGetQueueEntry(wqueue, rel);
            tab->verify_new_notnull = true;
        }

        CommandCounterIncrement();

        table_close(attr_rel, RowExclusiveLock);
    }

    heap_freetuple(tuple);
}

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: [PATCH] Support Int64 GUCs

2024-09-24 Thread Li Japin
Hi, Aleksander Alekseev

Thanks for updating the patch.

> On Sep 24, 2024, at 17:27, Aleksander Alekseev  
> wrote:
> 
> Hi, Alexander!
> 
>> Thank you for your work on this subject.
> 
> Thanks for your feedback.
> 
>> It doesn't look like these *_age GUCs could benefit from being 64-bit,
>> before 64-bit transaction ids get in.  However, I think there are some
>> better candidates.
>> 
>> autovacuum_vacuum_threshold
>> autovacuum_vacuum_insert_threshold
>> autovacuum_analyze_threshold
>> 
>> This GUCs specify number of tuples before vacuum/analyze.  That could
>> be more than 2^31.  With large tables of small tuples, I can't even
>> say this is always impractical to have values greater than 2^31.
> 
> Sounds good to me. Fixed.

I found that autovacuum_vacuum_threshold, autovacuum_vacuum_insert_threshold
and autovacuum_analyze_threshold are changed to int64 for the relation options;
however, the GUCs are still integers.

```
postgres=# select * from pg_settings where name = 'autovacuum_vacuum_threshold' \gx
-[ RECORD 1 ]---+-------------------------------------------------------------
name            | autovacuum_vacuum_threshold
setting         | 50
unit            |
category        | Autovacuum
short_desc      | Minimum number of tuple updates or deletes prior to vacuum.
extra_desc      |
context         | sighup
vartype         | integer
source          | default
min_val         | 0
max_val         | 2147483647
enumvals        |
boot_val        | 50
reset_val       | 50
sourcefile      |
sourceline      |
pending_restart | f
```

Is there something I missed?

> 
>>> Secondly, DefineCustomInt64Variable() is not test-covered. Turned out
>>> it was not even defined (although declared) in the original patch.
>>> This was fixed in the attached version. Maybe one of the test modules
>>> could use it even if it makes little sense for this particular module?
>>> For instance, test/modules/worker_spi/ could use it for
>>> worker_spi.naptime.
>> 
>> I don't think there are good candidates among existing extension GUCs.
>> I think we could add something for pure testing purposes somewhere in
>> src/test/modules.
> 
> I found a great candidate in src/test/modules/delay_execution:
> 
> ```
>DefineCustomIntVariable("delay_execution.post_planning_lock_id",
>"Sets the advisory lock ID to be
> locked/unlocked after planning.",
> ```
> 
> Advisory lock IDs are bigints [1]. I modified the module to use Int64's.

I checked the delay_execution.post_planning_lock_id parameter, and its vartype
is int64; maybe bigint would be better, see [1].

```
postgres=# select * from pg_settings where name = 'delay_execution.post_planning_lock_id' \gx
-[ RECORD 1 ]---+-----------------------------------------------------------------
name            | delay_execution.post_planning_lock_id
setting         | 0
unit            |
category        | Customized Options
short_desc      | Sets the advisory lock ID to be locked/unlocked after planning.
extra_desc      | Zero disables the delay.
context         | user
vartype         | int64
source          | default
min_val         | 0
max_val         | 9223372036854775807
enumvals        |
boot_val        | 0
reset_val       | 0
sourcefile      |
sourceline      |
pending_restart | f
```

[1] https://www.postgresql.org/docs/current/datatype-numeric.html


--
Regards,
Japin Li




Re: pgbench: Improve result outputs related to failed transactinos

2024-09-24 Thread Yugo NAGATA
On Tue, 24 Sep 2024 19:00:04 +0900 (JST)
Tatsuo Ishii  wrote:

> > I overlooked the "NaN% of total" in per-script results.
> > I think this NaN also should be avoided.
> > 
> > I fixed  the number of transactions in per-script results to include
> > skipped and failed transactions. It prevents to print  "total of NaN%"
> > when any transactions are not successfully  processed. 
> 
> Thanks for the fix. Here is the new run with the v2 patch. The result
> looks good to me.
> 
> src/bin/pgbench/pgbench -p 11002 -c1 -t 1 -f c.sql -f d.sql  
> --failures-detailed -r test
> pgbench (18devel)
> starting vacuum...end.
> transaction type: multiple scripts
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> maximum number of tries: 1
> number of transactions per client: 1
> number of transactions actually processed: 1/1
> number of failed transactions: 0 (0.000%)
> number of serialization failures: 0 (0.000%)
> number of deadlock failures: 0 (0.000%)
> latency average = 2.434 ms
> initial connection time = 2.117 ms
> tps = 410.846343 (without initial connection time)
> SQL script 1: c.sql
>  - weight: 1 (targets 50.0% of total)
>  - 1 transactions (100.0% of total)
>  - number of transactions actually pocessed: 1 (tps = 410.846343)
>  - number of failed transactions: 0 (0.000%)
>  - number of serialization failures: 0 (0.000%)
>  - number of deadlock failures: 0 (0.000%)
>  - latency average = 2.419 ms
>  - latency stddev = 0.000 ms
>  - statement latencies in milliseconds and failures:
>  0.187   0  begin;
>  0.153   0  set transaction isolation level serializable;
>  0.977   0  insert into t1 select max(i)+1,2 from t1;
>  1.102   0  end;
> SQL script 2: d.sql
>  - weight: 1 (targets 50.0% of total)
>  - 0 transactions (0.0% of total)
>  - statement latencies in milliseconds and failures:
>  0.000   0  begin;
>  0.000   0  set transaction isolation level serializable;
>  0.000   0  insert into t1 select max(i)+1,2 from t1;
>  0.000   0  end;
> 
> > Although it breaks the back-compatibility, this seems reasonable
> > modification because not only succeeded transactions but also skips and
> > failures ones are now handled and reported for each script.  Also, the
> > number of transactions actually processed per-script and TPS based on
> > it are now output explicitly in a separate line.
> 
> Okay for me as long as the patch is pushed to master branch.
> 
> A small comment on the comments in the patch: pgindent dislikes some
> of the comment indentation styles. See attached pgindent.txt. Although
> such a small defect would be fixed by committers when a patch gets
> committed anyway, you might want to help committers beforehand.

Thank you for your comments.
I've attached an updated patch to which I applied pgindent.

Regards,
Yugo Nagata


> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


-- 
Yugo NAGATA 
>From 7ecc6ceb56ed10c1cf77bc83f393dbe3515e517e Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Thu, 19 Sep 2024 01:37:54 +0900
Subject: [PATCH v3] pgbench: Improve result outputs related to failed
 transactions

Previously, per-script statistics were never output when all
transactions are failed due to serialization or deadlock errors.
However, it is reasonable to report such information if there
are ones even when there are no successful transaction since
these failed transactions are now objects to be reported.

Meanwhile, if the total number of successful, skipped, and
failed transactions is zero, we don't have to report the number
of failed transactions as similar to the number of skipped
transactions, which avoids to print "NaN%" in lines on failed
transaction reports.

Also, the number of transactions in per-script results now
includes skipped and failed transactions. It prevents to print
"total of NaN%" when any transactions are not successfully
processed. The number of transactions actually processed per-script
and TPS based on it are now output explicitly in a separate line.
---
 src/bin/pgbench/pgbench.c | 81 ++-
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 61618f2e18..7d0d729e86 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6391,6 +6391,13 @@ printResults(StatsData *total,
 			   total->cnt);
 	}
 
+	/*
+	 * Remaining stats are nonsensical if we failed to execute any xacts due
+	 * to others than serialization or deadlock errors
+	 */
+	if (total_cnt <= 0)
+		return;
+
 	printf("number of failed transactions: " INT64_FORMAT " (%.3f%%)\n",
 		   failures, 100.0 * failures / total_cnt);
 
@@ -6412,10 +6419,6 @@ printResults(StatsData *total,
 		printf("total number of retries: " INT64_FORMAT "\n", total->retries);
 	}
 
-	/

Re: index_delete_sort: Unnecessary variable "low" is used in heapam.c

2024-09-24 Thread Daniel Gustafsson
> On 24 Sep 2024, at 10:32, btnakamurakoukil  
> wrote:

> I noticed unnecessary variable "low" in index_delete_sort() 
> (/postgres/src/backend/access/heap/heapam.c), patch attached. What do you 
> think?

That variable does indeed seem to not be used, and hasn't been used since it
was committed in d168b666823.  The question is if it's a left-over from
development which can be removed, or if it should be set and we're missing an
optimization.  Having not read the referenced paper I can't tell, so I'm adding
Peter Geoghegan, who wrote this, for clarification.

--
Daniel Gustafsson





Re: Add llvm version into the version string

2024-09-24 Thread Alastair Turner
On Mon, 23 Sept 2024 at 19:45, Tom Lane  wrote:
...

> Maybe another idea could be a new system view?
>
> => select * from pg_system_version;
>  property | value
> 
>  core version | 18.1
>  architecture | x86_64-pc-linux-gnu
>  word size| 64 bit
>  compiler | gcc (GCC) 12.5.0
>  ICU version  | 60.3
>  LLVM version | 18.1.0
>  ...
>
>
A view like that sounds very useful.


> Adding rows to a view over time seems way less likely to cause
> problems than messing with a string people probably have crufty
> parsing code for.
>
> >> If it's going to be a new separate function, I guess it won't
> >> make much difference to ask either to call the new function or the
> >> llvm-config, right?
>
> I do think that if we can get a version number out of the loaded
> library, that is worth doing.  I don't trust the llvm-config that
> happens to be in somebody's PATH to be the same version that their
> PG is actually built with.
>
>
Since the build and runtime versions may differ for some things like llvm,
libxml2 and the interpreters behind some of the PLs, it may be valuable to
expand the view and show two values - a build time (or configure time)
value and a runtime value for these.
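
For the runtime side of LLVM specifically, asking the loaded library
itself looks feasible; here is a rough standalone sketch, assuming an
LLVM release new enough to expose LLVMGetVersion() in its C API:

#include <stdio.h>
#include <llvm-c/Core.h>

int
main(void)
{
    unsigned    major, minor, patch;

    /* version of the library actually loaded, not of llvm-config */
    LLVMGetVersion(&major, &minor, &patch);
    printf("runtime LLVM version: %u.%u.%u\n", major, minor, patch);
    return 0;
}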

Regards

Alastair


Re: Why don't we consider explicit Incremental Sort?

2024-09-24 Thread Tomas Vondra



On 9/24/24 00:21, David Rowley wrote:
> On Tue, 24 Sept 2024 at 02:01, Tomas Vondra  wrote:
>>
>> On 9/23/24 05:03, Richard Guo wrote:
>>> On Sun, Sep 22, 2024 at 1:38 PM David Rowley  wrote:
 Just looking at the commit message:

> The rationale is based on the assumption that incremental sort is
> always faster than full sort when there are presorted keys, a premise
> that has been applied in various parts of the code.  This assumption
> does not always hold, particularly in cases with a large skew in the
> number of rows within the presorted groups.

 My understanding is that the worst case for incremental sort is the
 same as sort, i.e. only 1 presorted group, which is the same effort to
 sort. Is there something I'm missing?
>>>
>>> I was thinking that when there’s a large skew in the number of rows
>>> per pre-sorted group, incremental sort might underperform full sort.
>>> As mentioned in the comments for cost_incremental_sort, it has to
>>> detect the sort groups, plus it needs to perform tuplesort_reset after
>>> each group.  However, I doubt these factors would have a substantial
>>> impact on the performance of incremental sort.  So maybe in the worst
>>> skewed groups case, incremental sort may still perform similarly to
>>> full sort.
>>>
>>> Another less obvious reason is that in cases of skewed groups, we may
>>> significantly underestimate the cost of incremental sort.  This could
>>> result in choosing a more expensive subpath under the sort.  Such as
>>> the example in [1], we end up with IndexScan->IncrementalSort rather
>>> than SeqScan->FullSort, while the former plan is more expensive to
>>> execute.  However, this point does not affect this patch, because for
>>> a mergejoin, we only consider outerrel's cheapest-total-cost when we
>>> assume that an explicit sort atop is needed, i.e., we do not have a
>>> chance to select which subpath to use in this case.
>>>
>>> [1] 
>>> https://postgr.es/m/CAMbWs49+CXsrgbq0LD1at-5jR=ahhn0ytdy9yvgxasmfndz...@mail.gmail.com
>>>
>>
>> You don't need any skew. Consider this perfectly uniform dataset:
>>
>>  Sort  (cost=127757.34..130257.34 rows=100 width=8)
>>(actual time=186.288..275.813 rows=100 loops=1)
>>Sort Key: a, b
>>Sort Method: external merge  Disk: 17704kB
>>->  Seq Scan on t  (cost=0.00..14425.00 rows=100 width=8)
>> (actual time=0.005..35.989 rows=100 loops=1)
>>  Planning Time: 0.075 ms
>>  Execution Time: 301.143 ms
>>
>> set enable_incremental_sort = on;
> 
>>  Incremental Sort  (cost=1.03..68908.13 rows=100 width=8)
>> (actual time=0.102..497.143 rows=100 loops=1)
>>Sort Key: a, b
>>Presorted Key: a
>>Full-sort Groups: 27039  Sort Method: quicksort
>>  Average Memory: 25kB  Peak Memory: 25kB
>>->  Index Scan using t_a_idx on t
>>(cost=0.42..37244.25 rows=100 width=8)
>> (actual time=0.050..376.403 rows=100 loops=1)
>>  Planning Time: 0.057 ms
>>  Execution Time: 519.301 ms
> 
> Ok, let's first look at the total Seq Scan cost of the first EXPLAIN.
> 14425.00 units and 35.989 milliseconds to execute. That's about 400.81
> units per millisecond. The Index Scan is only being charged 98.94
> units per millisecond of execution.  If the Index Scan was costed the
> same per unit as the Seq Scan, the total Index Scan cost would be
> 150868 which is already more than the Seq Scan plan without even
> adding the Incremental Sort costs on. To me, that seems to be an
> inaccuracy either with the Seq Scan costings coming out too expensive
> or Index Scan coming out too cheap.
> 
> If you think that the Incremental Sort plan shouldn't be chosen
> because the Index Scan costing came out too cheap (or the Seq Scan
> costing was too expensive) then I disagree. Applying some penalty to
> one node type because some other node type isn't costed accurately is
> just not a practice we should be doing. Instead, we should be trying
> our best to cost each node type as accurately as possible.  If you
> think there's some inaccuracy with Incremental Sort, then let's look
> into that. If you want to re-add the penalty because Index Scan
> costing isn't good, let's see if we can fix Index Scan costings.
> 

You're right, this wasn't a good example. I tried to come up with
something quickly, and I didn't realize the extra time comes from the
other node in the plan, not the sort :-(

I vaguely remember there were a couple reports about slow queries
involving incremental sort, but I didn't find any that would show
incremental sort itself being slower. So maybe I'm wrong ...

IIRC the concerns were more about planning - what happens when the
multi-column ndistinct estimates (which are quite shaky) are wrong, etc.
Or how it interacts with LIMIT with hidden correlations, etc.

Sure, those are not problems of incremental sort, it just makes it
easier to hit some of these issues. Of course, that d

Re: [PATCH] Support Int64 GUCs

2024-09-24 Thread Aleksander Alekseev
Hi,

> PFA the updated patch.

It is worth mentioning that v2 should not be merged as is.
In particular, although it changes the following GUCs:

> autovacuum_vacuum_threshold
> autovacuum_vacuum_insert_threshold
> autovacuum_analyze_threshold

... it doesn't affect the code that uses these GUCs, which results in
casting int64s to ints.

I would appreciate a bit more feedback on v2. If the community is fine
with modifying these GUCs I will correct the patch in this respect.

-- 
Best regards,
Aleksander Alekseev




Re: not null constraints, again

2024-09-24 Thread Alvaro Herrera
On 2024-Sep-24, jian he wrote:

> static Oid
> StoreRelNotNull(Relation rel, const char *nnname, AttrNumber attnum,
> bool is_validated, bool is_local, int inhcount,
> bool is_no_inherit)
> {
> Oid            constrOid;
> Assert(attnum > InvalidAttrNumber);
> constrOid =
> CreateConstraintEntry(nnname,
>   RelationGetNamespace(rel),
>   CONSTRAINT_NOTNULL,
>   false,
>   false,
>   is_validated
> 
> }
> is is_validated always true, can we add an Assert on it?

Sure.  FWIW the reason it's a parameter at all, is that the obvious next
patch is to add support for NOT VALID constraints.  I don't want to
introduce support for NOT VALID immediately with the first patch because
I'm sure some wrinkles will appear; but a followup patch will surely
follow shortly.

> in AddRelationNotNullConstraints
> for (int outerpos = 0; outerpos < list_length(old_notnulls); outerpos++)
> {
> }
> CookedConstraint struct already has "int inhcount;"
> can we rely on that, rather than using add_inhcount?
> we can also add an Assert: "Assert(!cooked->is_no_inherit);"

I'm not sure that works, because if your parent has two parents, you
don't want to add two -- you still have only one immediate parent.

I think the best way to check for correctness is to set up a scenario
where you would have that cooked->inhcount=2 (using whatever CREATE
TABLEs are necessary) and then see if ALTER TABLE NO INHERIT reach the
correct count (0) when all [immediate] parents are detached.  But
anyway, keep in mind that inhcount keeps the number of _immediate_
parents, not the number of ancestors.

> /*
>  * Remember primary key index, if any.  We do this only if the index
>  * is valid; but if the table is partitioned, then we do it even if
>  * it's invalid.
>  *
>  * The reason for returning invalid primary keys for foreign tables is
>  * because of pg_dump of NOT NULL constraints, and the fact that PKs
>  * remain marked invalid until the partitions' PKs are attached to it.
>  * If we make rd_pkindex invalid, then the attnotnull flag is reset
>  * after the PK is created, which causes the ALTER INDEX ATTACH
>  * PARTITION to fail with 'column ... is not marked NOT NULL'.  With
>  * this, dropconstraint_internal() will believe that the columns must
>  * not have attnotnull reset, so the PKs-on-partitions can be attached
>  * correctly, until finally the PK-on-parent is marked valid.
>  *
>  * Also, this doesn't harm anything, because rd_pkindex is not a
>  * "real" index anyway, but a RELKIND_PARTITIONED_INDEX.
>  */
> if (index->indisprimary &&
> (index->indisvalid ||
>  relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
> {
> pkeyIndex = index->indexrelid;
> pkdeferrable = !index->indimmediate;
> }
> The comment (especially paragraph "The reason for returning invalid
> primary keys") is overwhelming.
> Can you also add some sql examples into the comments.
> I guess some sql examples, people can understand it more easily?

Ooh, thanks for catching this -- this comment is a leftover from
previous idea that you could have PKs without NOT NULL.  I think it
mostly needs to be removed, and maybe the whole "if" clause put back to
its original form.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"If it is not right, do not do it.
If it is not true, do not say it." (Marcus Aurelius, Meditations)




Re: Compress ReorderBuffer spill files using LZ4

2024-09-24 Thread Tomas Vondra



On 9/23/24 21:58, Julien Tachoires wrote:
> Hi Tomas,
> 
> On Mon, 23 Sept 2024 at 18:13, Tomas Vondra  wrote:
>>
>> Hi,
>>
>> I've spent a bit more time on this, mostly running tests to get a better
>> idea of the practical benefits.
> 
> Thank you for your code review and testing!
> 
>> Firstly, I think there's a bug in ReorderBufferCompress() - it's legal
>> for pglz_compress() to return -1. This can happen if the data is not
>> compressible, and would not fit into the output buffer. The code can't
>> just do elog(ERROR) in this case, it needs to handle that by storing the
>> raw data. The attached fixup patch makes this work for me - I'm not
>> claiming this is the best way to handle this, but it works.
>>
>> FWIW I find it strange the tests included in the patch did not trigger
>> this. That probably means the tests are not quite sufficient.
>>
>>
>> Now, to the testing. Attached are two scripts, testing different cases:
>>
>> test-columns.sh - Table with a variable number of 'float8' columns.
>>
>> test-toast.sh - Table with a single text column.
>>
>> The script always sets up a publication/subscription on two instances,
>> generates certain amount of data (~1GB for columns, ~3.2GB for TOAST),
>> waits for it to be replicated to the replica, and measures how much data
>> was spilled to disk with the different compression methods (off, pglz
>> and lz4). There's a couple more metrics, but that's irrelevant here.
> 
> It would be interesting to run the same tests with zstd: in my early
> testing I found that zstd was able to provide a better compression
> ratio than lz4, but seemed to use more CPU resources/is slower.
> 

Oh, I completely forgot about zstd. I don't think it'd substantially
change the conclusions, though. It might compress better/worse for some
cases, but the overall behavior would remain the same.

I can't test this right now, the test machine is busy with some other
stuff. But it should not be difficult to update the test scripts I
attached and get results yourself. There are a couple of hard-coded paths
that need to be updated, ofc.

>> For the "column" test, it looks like this (this is in MB):
>>
>>   rows    columns    distribution     off    pglz    lz4
>>   ========================================================
>>     10       1000    compressible     778      20      9
>>                      random           778     778     16
>>    100        100    compressible     916     116     62
>>                      random           916     916     67
>>
>> It's very clear that for the "compressible" data (which just copies the
>> same value into all columns), both pglz and lz4 can significantly reduce
>> the amount of data. For 1000 columns it's 780MB -> 20MB/9MB, for 100
>> columns it's a bit less efficient, but still good.
>>
>> For the "random" data (where every column gets a random value, but rows
>> are copied), it's a very different story - pglz does not help at all,
>> while lz4 still massively reduces the amount of spilled data.
>>
>> I think the explanation is very simple - for pglz, we compress each row
>> on it's own, there's no concept of streaming/context. If a row is
>> compressible, it works fine, but when the row gets random, pglz can't
>> compress it at all. For lz4, this does not matter, because with the
>> streaming mode it still sees that rows are just repeated, and so can
>> compress them efficiently.
> 
> That's correct.
> 
>> For TOAST test, the results look like this:
>>
>>   distribution   repeats   toast     off    pglz    lz4
>>   =====================================================
>>   compressible         1   lz4        14       2      1
>>                            pglz       40       4      3
>>                     1000   lz4        32      16      9
>>                            pglz       54      17     10
>>   -----------------------------------------------------
>>   random               1   lz4      3305    3305   3157
>>                            pglz     3305    3305   3157
>>                     1000   lz4      3166    3162   1580
>>                            pglz     3334    3326   1745
>>   -----------------------------------------------------
>>   random2              1   lz4      3305    3305   3157
>>                            pglz     3305    3305   3158
>>                     1000   lz4      3160    3156   3010
>>                            pglz     3334    3326   3172
>>
>> The "repeats" value means how long the string is - it's the number of
>> "md5" hashes added to the string. The number of rows is calculated to
>> keep the total amount of data the same. The "toast" column tracks what
>> compression was used for TOAST, I was wondering if it matters.
>>
>> This time there are three data distributions 

Re: pgbench: Improve result outputs related to failed transactinos

2024-09-24 Thread Tatsuo Ishii
> I overlooked the "NaN% of total" in per-script results.
> I think this NaN also should be avoided.
> 
> I fixed  the number of transactions in per-script results to include
> skipped and failed transactions. It prevents to print  "total of NaN%"
> when any transactions are not successfully  processed. 

Thanks for the fix. Here is the new run with the v2 patch. The result
looks good to me.

src/bin/pgbench/pgbench -p 11002 -c1 -t 1 -f c.sql -f d.sql  
--failures-detailed -r test
pgbench (18devel)
starting vacuum...end.
transaction type: multiple scripts
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1
number of transactions actually processed: 1/1
number of failed transactions: 0 (0.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
latency average = 2.434 ms
initial connection time = 2.117 ms
tps = 410.846343 (without initial connection time)
SQL script 1: c.sql
 - weight: 1 (targets 50.0% of total)
 - 1 transactions (100.0% of total)
 - number of transactions actually pocessed: 1 (tps = 410.846343)
 - number of failed transactions: 0 (0.000%)
 - number of serialization failures: 0 (0.000%)
 - number of deadlock failures: 0 (0.000%)
 - latency average = 2.419 ms
 - latency stddev = 0.000 ms
 - statement latencies in milliseconds and failures:
 0.187   0  begin;
 0.153   0  set transaction isolation level serializable;
 0.977   0  insert into t1 select max(i)+1,2 from t1;
 1.102   0  end;
SQL script 2: d.sql
 - weight: 1 (targets 50.0% of total)
 - 0 transactions (0.0% of total)
 - statement latencies in milliseconds and failures:
 0.000   0  begin;
 0.000   0  set transaction isolation level serializable;
 0.000   0  insert into t1 select max(i)+1,2 from t1;
 0.000   0  end;

> Although it breaks the back-compatibility, this seems reasonable
> modification because not only succeeded transactions but also skips and
> failures ones are now handled and reported for each script.  Also, the
> number of transactions actually processed per-script and TPS based on
> it are now output explicitly in a separate line.

Okay for me as long as the patch is pushed to master branch.

A small comment on the comments in the patch: pgindent dislikes some
of the comment indentation styles. See attached pgindent.txt. Although
such a small defect would be fixed by committers when a patch gets
committed anyway, you might want to help committers beforehand.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
*** /tmp/pgbench.c  2024-09-24 18:42:20.632311240 +0900
--- src/bin/pgbench/pgbench.c   2024-09-24 18:42:51.824299286 +0900
***
*** 6392,6399 
}
  
/*
!* Remaining stats are nonsensical if we failed to execute any xacts
!* due to others than serialization or deadlock errors
 */
if (total_cnt <= 0)
return;
--- 6392,6399 
}
  
/*
!* Remaining stats are nonsensical if we failed to execute any xacts due
!* to others than serialization or deadlock errors
 */
if (total_cnt <= 0)
return;
***
*** 6514,6520 

script_total_cnt));
}
  
!   /* it can be non-zero only if max_tries 
is not equal to one */
if (max_tries != 1)
{
printf(" - number of 
transactions retried: " INT64_FORMAT " (%.3f%%)\n",
--- 6514,6523 

script_total_cnt));
}
  
!   /*
!* it can be non-zero only if max_tries 
is not equal to
!* one
!*/
if (max_tries != 1)
{
printf(" - number of 
transactions retried: " INT64_FORMAT " (%.3f%%)\n",


Re: Normalize queries starting with SET for pg_stat_statements

2024-09-24 Thread Michael Paquier
On Mon, Aug 19, 2024 at 03:28:52PM +0900, Michael Paquier wrote:
> FWIW, I'm OK with hiding the value when it comes to a SET clause in a
> CREATE FUNCTION.  We already hide the contents of SQL queries inside
> the SQL functions when these are queries that can be normalized, so
> there is a kind of thin argument for consistency, or something close
> to that.

This thread was one of the things I wanted to look at for this commit
fest because that's an issue I have had on my stack for some time.  And
here you go with a revised patch set.

First, the TAP test proposed upthread is not required, so let's remove
it and rely on pg_stat_statements.  It is true that there are a lot of
coverage holes in pg_stat_statements with the various flavors of SET
queries.  The TAP test was able to cover a good part of them, but
still missed a few spots.

A second thing is that like you I have settled down to a custom
implementation for VariableSetStmt because we have too many grammar
patterns, some of them with values nested in clauses (SET TIME ZONE is
one example), and we should report the grammar keywords without
hardcoding a location.  Like for the TIME ZONE part, this comes to a
limitation because it is not possible to normalize the case of SET
TIME ZONE 'value' without some refactoring of gram.y.  Perhaps we
could do that in the long term, but I come down to the fact that I'm
also OK with leaving things as they are in the patch, because the
primary case I want to tackle is the SET name = value pattern, and
SET TIME ZONE is just a different flavor of that which can be
translated as well to the most common "name = value" pattern.  A
second case is SET SESSION CHARACTERISTICS AS TRANSACTION with its
list of options.  Contrary to the patch proposed, I don't think that
it is a good idea to hide that arguments may be included in the
jumbling in the custom function, so I have added one field in
VariableSetStmt to do that, and documented why we need the field in
parsenodes.h.  That strikes me as the best balance, and that's going
to be hard to miss each time somebody adds a new grammar for
VariableSetStmt.  That's very unlikely at this stage of the project,
but who knows, people like fancy new features.
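
For illustration only (the new field name below is made up here; the
patch has the exact definition in parsenodes.h), the node ends up
looking roughly like this:

typedef struct VariableSetStmt
{
    NodeTag     type;
    VariableSetKind kind;
    char       *name;           /* variable to be set */
    List       *args;           /* List of A_Const nodes */
    /* whether the arguments should contribute to the query jumbling */
    bool        jumble_args;
    bool        is_local;       /* SET LOCAL? */
} VariableSetStmt;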

Attached are two patches:
- 0001 adds a bunch of tests in pg_stat_statements, to cover the SET
patterns.  (typo in commit message of this patch, will fix later)
- 0002 is the addition of the normalization.  It is possible to see
how the normalization changes things in pg_stat_statements.

0001 is straightforward, and it was I think a mistake not to include
that from the start when I expanded these tests in the v16 cycle
(well, time..).  0002 is also quite conservative in the end, and this
design can be used to easily tune the jumbling patterns from gram.y
depending on the feedback we'd get.
--
Michael
From 919e03ec0d0673183a9266c1ef16893754005c7e Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 24 Sep 2024 16:19:08 +0900
Subject: [PATCH v2 1/2] pg_stta_statements: Expand tests with SET grammar

There are many grammar flavors that depend on the parse node
VariableSetStmt.  This closes the gap in pg_stat_statements by providing
test coverage for all of them (well, a large majority of them, at
least).
---
 .../pg_stat_statements/expected/utility.out   | 100 --
 contrib/pg_stat_statements/sql/utility.sql|  50 -
 2 files changed, 140 insertions(+), 10 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index fa738b5c00..58ba2018c1 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -66,7 +66,8 @@ DROP SERVER server_stats;
 DROP FOREIGN DATA WRAPPER wrapper_stats;
 -- Functions
 CREATE FUNCTION func_stats(a text DEFAULT 'a_data', b text DEFAULT lower('b_data'))
-  RETURNS text AS $$ SELECT $1::text || '_' || $2::text; $$ LANGUAGE SQL;
+  RETURNS text AS $$ SELECT $1::text || '_' || $2::text; $$ LANGUAGE SQL
+  SET work_mem = '256kB';
 DROP FUNCTION func_stats;
 -- Rules
 CREATE TABLE tab_rule_stats (a int, b int);
@@ -106,7 +107,8 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
  1 |0 | CREATE FOREIGN DATA WRAPPER wrapper_stats
  1 |0 | CREATE FOREIGN TABLE foreign_stats (a int) SERVER server_stats
  1 |0 | CREATE FUNCTION func_stats(a text DEFAULT 'a_data', b text DEFAULT lower('b_data'))+
-   |  |   RETURNS text AS $$ SELECT $1::text || '_' || $2::text; $$ LANGUAGE SQL
+   |  |   RETURNS text AS $$ SELECT $1::text || '_' || $2::text; $$ LANGUAGE SQL   +
+   |  |   SET work_mem = '256kB'
  1 |0 | CREATE FUNCTION trigger_func_stats () RETURNS trigger LANGUAGE plpgsql +
|  |   AS $$ BEGIN return OLD; end; $$
  1 |0 | CREATE INDEX pt_stats2_index ON ONLY pt_stats2 (a)
@@ -551,12 +553,26 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t

Re: Add on_error and log_verbosity options to file_fdw

2024-09-24 Thread torikoshia

On 2024-09-20 11:27, Fujii Masao wrote:

Thanks for your comments!


On 2024/09/19 23:16, torikoshia wrote:
-   COPY_LOG_VERBOSITY_DEFAULT = 0, /* logs no additional messages, default */
-   COPY_LOG_VERBOSITY_VERBOSE, /* logs additional messages */
+   COPY_LOG_VERBOSITY_SILENT = -1, /* logs none */
+   COPY_LOG_VERBOSITY_DEFAULT = 0, /* logs no additional messages, default */
+   COPY_LOG_VERBOSITY_VERBOSE, /* logs additional messages */


Why do we need to assign specific numbers like -1 or 0 in this enum 
definition?


CopyFormatOptions is initialized by palloc0() at the beginning of
ProcessCopyOptions().
The reason for assigning specific numbers here is to keep
COPY_LOG_VERBOSITY_DEFAULT, the default value, at 0 and to sort the
elements of the enum according to the amount of logging.


Understood.


BTW CopyFrom() also uses local variable skipped. It isn't reset like 
file_fdw, but using local variable seems not necessary since we can 
use cstate->num_errors here as well.


Sounds reasonable to me.



+   if (cstate->opts.on_error != COPY_ON_ERROR_STOP &&
+   cstate->escontext->error_occurred)
+   {
+   /*
+    * Soft error occurred, skip this tuple and deal with error
+    * information according to ON_ERROR.
+    */
+   if (cstate->opts.on_error == COPY_ON_ERROR_IGNORE)


If COPY_ON_ERROR_IGNORE indicates tuple skipping, shouldn’t we not only reset
error_occurred but also call "pgstat_progress_update_param" and continue within
this block?


I may have misunderstood your comment, but I thought it behaves as you
expect in the code below:


The "on_error == COPY_ON_ERROR_IGNORE" condition isn't needed since
"on_error != COPY_ON_ERROR_STOP" is already checked, and on_error 
accepts

only two values "ignore" and "stop." I assume you added it with
a future option in mind, like "set_to_null" (as discussed in another 
thread).

However, I’m not sure how much this helps such future changes.
So, how about simplifying the code by replacing "on_error != 
COPY_ON_ERROR_STOP"

with "on_error == COPY_ON_ERROR_IGNORE" at the top and removing
the "on_error == COPY_ON_ERROR_IGNORE" check in the middle?
It could improve readability.


Thanks for the explanation and suggestion.
Since there is almost the same code in copyfrom.c, I attached the 0003 patch
to refactor both.



+   for(;;)
+   {
Using "goto" here might improve readability instead of using a "for" 
loop.


Hmm, AFAIU we need to return a slot here before the end of file is 
reached.


```
--src/backend/executor/execMain.c [ExecutePlan()]
    /*
     * if the tuple is null, then we assume there is nothing more to
     * process so we just end the loop...
     */
    if (TupIsNull(slot))
    break;
```

When ignoring errors, we have to keep calling NextCopyFrom() until we
find a non-error tuple or EOF, and to do so, calling NextCopyFrom() in
a for loop seems straightforward.


Replacing the "for" loop using "goto" as follows is possible, but 
seems not so readable because of the upward "goto":


Understood.



Attached are the v4 patches reflecting these comments.


Thanks for updating the patches!

The tab-completion needs to be updated to support the "silent" option?


Yes, updated 0002 patch.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group CorporationFrom 89a97461eab2e14ab84778d93f0b4e91999606df Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Tue, 24 Sep 2024 14:32:54 +0900
Subject: [PATCH v4 1/3] Add log_verbosity to 'silent'

Currently when on_error is set to ignore, COPY logs a NOTICE message
containing the ignored row count.
When using on_error option to file_fdw, it'd be better to omit this
message in some cases, such as frequent access to a malformed file
via the same foreign table.
As a preliminary step to add on_error option to file_fdw, this patch
adds new value 'silent' to log_verbosity and enables to silence
the message.
---
 doc/src/sgml/ref/copy.sgml  | 10 +++---
 src/backend/commands/copy.c |  4 +++-
 src/backend/commands/copyfrom.c |  3 ++-
 src/include/commands/copy.h |  6 --
 src/test/regress/expected/copy2.out |  4 +++-
 src/test/regress/sql/copy2.sql  |  4 
 6 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 1518af8a04..d87684a5be 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -407,6 +407,8 @@ COPY { table_name [ ( verbose, a NOTICE message
   containing the line of the input file and the column name whose input
   conversion has failed is emitted for each discarded row.
+  When it is set to silent, no message is emitted
+  regarding ignored rows.
  
 

@@ -428,9 +430,11 @@ COPY { table_name [ ( 
  
   Specify the amount of messages em

Re: Index AM API cleanup

2024-09-24 Thread Mark Dilger



> On Sep 24, 2024, at 10:50 AM, Peter Eisentraut  wrote:
> 
> Next, I have reviewed patches
> 
> v17-0010-Track-sort-direction-in-SortGroupClause.patch
> v17-0011-Track-scan-reversals-in-MergeJoin.patch
> 
> Both of these seem ok and sensible to me.
> 
> They take the concept of the "reverse" flag that already exists in the 
> affected code and just apply it more consistently throughout the various code 
> layers, instead of relying on strategy numbers as intermediate storage.  This 
> is both helpful for your ultimate goal in this patch series, and it also 
> makes the affected code areas simpler and more consistent and robust.
> 

Thanks for the review!

Yes, I found the existing use of a btree strategy number rather than a boolean 
"reverse" flag made using the code from other index AMs needlessly harder.  I 
am glad you see it the same way.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Increase of maintenance_work_mem limit in 64-bit Windows

2024-09-24 Thread Vladlen Popolitov

David Rowley wrote on 2024-09-24 01:07:

On Tue, 24 Sept 2024 at 02:47, Vladlen Popolitov
 wrote:
I agree, it is better to fix all them together. I also do not like 
this

hack, it will be removed from the patch, if I check and change
all  at once.
I think, it will take about 1 week to fix and test all changes. I will
estimate the total volume of the changes and think, how to group them
in the patch ( I hope, it will be only one patch)


There's a few places that do this:

Size maxBlockSize = ALLOCSET_DEFAULT_MAXSIZE;

/* choose the maxBlockSize to be no larger than 1/16 of work_mem */
while (16 * maxBlockSize > work_mem * 1024L)

I think since maxBlockSize is a Size variable, that the above should
probably be:

while (16 * maxBlockSize > (Size) work_mem * 1024)

Maybe there can be a precursor patch to fix all those to get rid of
the 'L' and cast to the type we're comparing to or assigning to rather
than trying to keep the result of the multiplication as a long.

Yes, that is what I meant when I wrote about the context - in this case
the variable is used in a "Size" context and a cast to the Size type
should be used. That is why I need to check all the places in the code.
I am going to do it during this week.

--
Best regards,

Vladlen Popolitov.




RE: Conflict detection for update_deleted in logical replication

2024-09-24 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada  
wrote:
> 
> On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
>  wrote:
> > > I'm still studying this idea but let me confirm the following scenario.
> > >
> > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > and XIDs and commit LSNs of T2 and T3 are the following:
> > >
> > > Node A
> > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> commit-LSN:1000
> > >
> > > Node B
> > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > commit-LSN:5000
> > >
> > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > latest flush WAL position of Node-A and Node-B are following:
> > >
> > > Node A
> > >   current XID: 300
> > >   latest flush LSN; 3000
> > >
> > > Node B
> > >   current XID: 700
> > >   latest flush LSN: 7000
> > >
> > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > (i.e., the logical replication is delaying for 5 min).
> > >
> > > Consider the following scenario:
> > >
> > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > gets 301 (set as candidate_xmin).
> > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now
> 8000.
> > > 4. The apply worker on Node-A continues applying changes, and
> > > applies the transactions up to remote (commit) LSN 7100.
> > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > 301 (candidate_xmin).
> > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > deleted by T2.
> > >
> > > Here, on Node-B, there might be a transition between LSN 7100 and
> > > 8000 that might require the tuple that is deleted by T2.
> > >
> > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > still depends on when vacuum removes the tuple deleted by T2.
> >
> > I think in this case, no matter we detect "update_delete" or
> > "update_missing", the final data is the same. Because T4's commit
> > timestamp should be later than
> > T2 on node A, so in the case of "update_deleted", it will compare the
> > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > T4 should win, which means we will convert the update into insert and
> > apply. Even if the deleted tuple is deleted and "update_missing" is
> > detected, the update will still be converted into insert and applied. So, 
> > the
> result is the same.
> 
> The "latest_timestamp_wins" is the default resolution method for
> "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> solution method for that.

Right, I think the wiki needs some update.

I think using 'skip' as the default for update_deleted could easily cause data
divergence when the dead tuple was deleted by an old transaction while the
UPDATE has a newer timestamp, like the case you mentioned. It's necessary to
follow the last-update-wins strategy when the incoming update has a later
timestamp, which is to convert the update into an insert.
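
To spell out the intended rule (the names below are only illustrative
pseudo-code, not existing functions):

    /* last-update-wins resolution for update_deleted */
    if (remote_update_committs > local_delete_committs)
        apply_remote_update_as_insert();    /* remote change is newer */
    else
        skip_remote_update();               /* local DELETE is newer */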

Best Regards,
Hou zj


Re: not null constraints, again

2024-09-24 Thread jian he
On Fri, Sep 20, 2024 at 8:08 PM Alvaro Herrera  wrote:
>
> On 2024-Sep-20, jian he wrote:
>
> > about set_attnotnull.
> >
> > we can make set_attnotnull  look less recursive.
> > instead of calling find_inheritance_children,
> > let's just one pass, directly call  find_all_inheritors
> > overall, I think it would be more intuitive.
> >
> > please check the attached refactored set_attnotnull.
> > regress test passed, i only test regress.
>
> Hmm, what do we gain from doing this change?  It's longer in number of
> lines of code, and it's not clear to me that it is simpler.
>


static void
set_attnotnull(List **wqueue, Relation rel, AttrNumber attnum, bool recurse,
               LOCKMODE lockmode)
{
    HeapTuple   tuple;
    Form_pg_attribute attForm;
    bool        changed = false;
    List       *all_oids;
    Relation    thisrel;
    AttrNumber  childattno;
    const char *attrname;
    CheckAlterTableIsSafe(rel);
    attrname = get_attname(RelationGetRelid(rel), attnum, false);
    if (recurse)
        all_oids = find_all_inheritors(RelationGetRelid(rel), lockmode,
                                       NULL);
    else
        all_oids = list_make1_int(RelationGetRelid(rel));
    foreach_oid(reloid, all_oids)
    {
        thisrel = table_open(reloid, NoLock);
        if (reloid != RelationGetRelid(rel))
            CheckAlterTableIsSafe(thisrel);
        childattno = get_attnum(reloid, attrname);
        tuple = SearchSysCacheCopyAttNum(reloid, childattno);
        if (!HeapTupleIsValid(tuple))
            elog(ERROR, "cache lookup failed for attribute %d of relation %s",
                 attnum, RelationGetRelationName(thisrel));
        attForm = (Form_pg_attribute) GETSTRUCT(tuple);
        if (!attForm->attnotnull)
        {
            Relation    attr_rel;
            attr_rel = table_open(AttributeRelationId, RowExclusiveLock);
            attForm->attnotnull = true;
            CatalogTupleUpdate(attr_rel, &tuple->t_self, tuple);
            table_close(attr_rel, RowExclusiveLock);
            if (wqueue && !NotNullImpliedByRelConstraints(thisrel, attForm))
            {
                AlteredTableInfo *tab;
                tab = ATGetQueueEntry(wqueue, thisrel);
                tab->verify_new_notnull = true;
            }
            changed = true;
        }
        if (changed)
            CommandCounterIncrement();
        changed = false;
        table_close(thisrel, NoLock);
    }
}


What do you think of the above refactor?
(I intentionally deleted the empty lines)




Re: Index AM API cleanup

2024-09-24 Thread Peter Eisentraut

Next, I have reviewed patches

v17-0010-Track-sort-direction-in-SortGroupClause.patch
v17-0011-Track-scan-reversals-in-MergeJoin.patch

Both of these seem ok and sensible to me.

They take the concept of the "reverse" flag that already exists in the 
affected code and just apply it more consistently throughout the various 
code layers, instead of relying on strategy numbers as intermediate 
storage.  This is both helpful for your ultimate goal in this patch 
series, and it also makes the affected code areas simpler and more 
consistent and robust.






Re: not null constraints, again

2024-09-24 Thread jian he
static Oid
StoreRelNotNull(Relation rel, const char *nnname, AttrNumber attnum,
                bool is_validated, bool is_local, int inhcount,
                bool is_no_inherit)
{
    Oid         constrOid;
    Assert(attnum > InvalidAttrNumber);
    constrOid =
        CreateConstraintEntry(nnname,
                              RelationGetNamespace(rel),
                              CONSTRAINT_NOTNULL,
                              false,
                              false,
                              is_validated

}
Is is_validated always true? If so, can we add an Assert on it?


In AddRelationNotNullConstraints:
for (int outerpos = 0; outerpos < list_length(old_notnulls); outerpos++)
{
}
The CookedConstraint struct already has "int inhcount;".
Can we rely on that, rather than using add_inhcount?
We can also add an Assert: "Assert(!cooked->is_no_inherit);"

I've put these points into a patch,
please check the attached.




/*
 * Remember primary key index, if any.  We do this only if the index
 * is valid; but if the table is partitioned, then we do it even if
 * it's invalid.
 *
 * The reason for returning invalid primary keys for foreign tables is
 * because of pg_dump of NOT NULL constraints, and the fact that PKs
 * remain marked invalid until the partitions' PKs are attached to it.
 * If we make rd_pkindex invalid, then the attnotnull flag is reset
 * after the PK is created, which causes the ALTER INDEX ATTACH
 * PARTITION to fail with 'column ... is not marked NOT NULL'.  With
 * this, dropconstraint_internal() will believe that the columns must
 * not have attnotnull reset, so the PKs-on-partitions can be attached
 * correctly, until finally the PK-on-parent is marked valid.
 *
 * Also, this doesn't harm anything, because rd_pkindex is not a
 * "real" index anyway, but a RELKIND_PARTITIONED_INDEX.
 */
if (index->indisprimary &&
(index->indisvalid ||
 relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
{
pkeyIndex = index->indexrelid;
pkdeferrable = !index->indimmediate;
}
The comment (especially the paragraph "The reason for returning invalid
primary keys") is overwhelming.
Can you also add some SQL examples to the comment?
I guess with some SQL examples, people can understand it more easily.


AddRelationNotNullConstraints.no-cfbot
Description: Binary data


Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-24 Thread Shayon Mukherjee
Hello,

Regarding GUC implementation for index disabling, I was imagining something
like the attached PATCH. The patch compiles and can be applied for testing.
It's not meant to be production ready, but I am sharing it as a way to get
a sense of the nuts and bolts. It requires more proper test cases and docs,
etc. Example towards the end of the email.

That said, I am still quite torn between GUC setting or having a dedicated
ALTER grammar. My additional thoughts which is mostly a summary of what
David and Peter have already very nicely raised earlier are:

- GUC allows a non-privileged user to disable one or more indexes per
session.

- If we think of the task of disabling indexes temporarily (without
stopping any updates to the index), then it feels more in the territory of
query tuning than index maintenance. In that case, a GUC setting makes
more sense and sits well with the other planner settings like enable_indexscan,
enable_indexonlyscan, and so on.

- At the same time, as David pointed out earlier, a GUC is also a global
setting, and storing the state of whether or not an index can be used is
perhaps better situated along with the other index properties in
pg_index.

- One of my original motivations for the proposal was also that we can
disable an index for _all_ sessions quickly, without impacting index
maintenance, and turn it back on quickly as well. To do so with a GUC, we
would need to do something like the following, if I am not mistaken, which
is not something an unprivileged user would be able to perform, so just
calling it out.

  ALTER USER example_user SET disabled_indexes = 'idx_foo_bar';

- For an ALTER statement, I think an ALTER INDEX makes more sense than
ALTER TABLE, especially since we have the existing ALTER INDEX grammar and
functionality. But let me know if I am missing something here.

- Resetting the plan cache could still be an open question for the GUC
approach. I was wondering if we can reset the plan cache local to the
session when the GUC changes (like in the attached PATCH), and whether that
is enough (a rough sketch follows after this list). This concern doesn't
apply when managing the property in pg_index.

- With a GUC attribute, the state of an index being enabled/disabled won't
be captured in pg_get_indexdef(), and that is likely OK, but maybe that
would need to be made explicit through docs.
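To illustrate the plan-cache point from the list above, here is a minimal
sketch (assuming the GUC machinery from the attached PATCH) of an assign hook
that invalidates the session's cached plans whenever disabled_indexes changes;
the hook name is hypothetical, while ResetPlanCache() is the existing backend
function used by DISCARD PLANS:

    static void
    assign_disabled_indexes(const char *newval, void *extra)
    {
        /*
         * Mark all cached plans in this backend invalid so they get replanned
         * (without the newly disabled indexes) on next use.
         */
        ResetPlanCache();
    }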

Example 1

CREATE TABLE b AS SELECT generate_series(1,1000) AS b;
CREATE INDEX ON b((b%10));
ANALYZE b;
EXPLAIN SELECT DISTINCT b%10 FROM b;

SET disabled_indexes = 'b_expr_idx';

EXPLAIN SELECT DISTINCT b%10 FROM b; -- HashAggregate rows=1

Example 2

CREATE TABLE disabled_index_test(id int PRIMARY KEY, data text);
INSERT INTO disabled_index_test SELECT g, 'data ' || g FROM
generate_series(1, 1000) g;
CREATE INDEX disabled_index_idx1 ON disabled_index_test(data);
EXPLAIN (COSTS OFF) SELECT * FROM disabled_index_test WHERE data = 'data 500';

SET disabled_indexes = 'b_expr_idx, disabled_index_idx1';

EXPLAIN SELECT * FROM disabled_index_test WHERE data = 'data 500'; -- no index is used

Wrapping up...

I am sure there are things I am missing or unintentionally overlooking.
Since this would be a nice feature to have, I'd love some guidance on which
approach seems like a good next step to take. I am happy to work
accordingly on the patch.

Thank you
Shayon

On Tue, Sep 24, 2024 at 12:38 AM Maciek Sakrejda 
wrote:

> If one of the use cases is soft-dropping indexes, would a GUC approach
> still support that? ALTER TABLE?
>


-- 
Kind Regards,
Shayon Mukherjee


v1-0001-Proof-of-Concept-Ability-to-enable-disable-indexe.patch
Description: Binary data


Re: Documentation to upgrade logical replication cluster

2024-09-24 Thread vignesh C
On Tue, 24 Sept 2024 at 16:20, Amit Kapila  wrote:
>
> On Fri, Sep 20, 2024 at 5:46 PM vignesh C  wrote:
> >
> > I didn’t include a note because each disable/enable statement
> > specifies: a) Disable all subscriptions on the node, b) Enable all
> > subscriptions on the node. The attached v11 version patch just to show
> > the examples with one subscription.
> >
>
> The following steps in the bi-directional node upgrade have some problems.
>
> +   
> +On node1, create any tables that were created in
> +node2 between  linkend="circular-cluster-disable-sub-node2"/>
> +and now, e.g.:
> +
> +node1=# CREATE TABLE distributors (did integer PRIMARY KEY, name 
> varchar(40));
> +CREATE TABLE
> +
> +   
> +  
> +
> +  
> +   
> +Enable all the subscriptions on node2 that are
> +subscribing the changes from node1 by using
> +ALTER
> SUBSCRIPTION ... ENABLE,
> +e.g.:
> +
> +node2=# ALTER SUBSCRIPTION sub1_node1_node2 ENABLE;
> +ALTER SUBSCRIPTION
> +
> +   
> +  
> +
> +  
> +   
> +Refresh the node2 subscription's publications 
> using
> + linkend="sql-altersubscription-params-refresh-publication">ALTER
> SUBSCRIPTION ... REFRESH PUBLICATION,
> +e.g.:
> +
> +node2=# ALTER SUBSCRIPTION sub1_node1_node2 REFRESH PUBLICATION;
> +ALTER SUBSCRIPTION
> +
> +   
>
> If you are creating missing tables on node-1, won't that node's
> subscription be refreshed to get the missing data? Also, I suggest
> moving the step-2 in the above steps to enable subscriptions on node-2
> should be moved before creating a table on node-1 and then issuing a
> REFRESH command on node-1. The similar steps for other node's upgrade
> following these steps have similar problems.

Reordered the docs to enable the subscription before creating the
table. For bi-directional replication, a publication refresh is
necessary on both nodes: a) First, refresh the publication on the old
version server to set the newly added tables to a ready state in the
pg_subscription_rel catalog. b) Next, refresh the publication on the
upgraded version server to initiate the initial sync and update the
pg_subscription_rel with the ready state. This change has been
incorporated into the attached v12 version patch.

Regards,
Vignesh
From 8c7f5aa9b7e36a88a1f6eb1243dd04d1e86b78fb Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Mon, 6 May 2024 10:29:01 +0530
Subject: [PATCH v12] Documentation for upgrading logical replication cluster.

Documentation for upgrading logical replication cluster.
---
 doc/src/sgml/glossary.sgml|  10 +
 doc/src/sgml/logical-replication.sgml | 802 ++
 doc/src/sgml/ref/pgupgrade.sgml   | 131 +
 3 files changed, 820 insertions(+), 123 deletions(-)

diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 405fe6dc8b..f54f25c1c6 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1069,6 +1069,16 @@

   
 
+  
+   Logical replication cluster
+   
+
+ A set of publisher and subscriber instances with the publisher instance
+ replicating changes to the subscriber instance.
+
+   
+  
+
   
Log record
 
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index df62eb45ff..1b073053d8 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -2231,6 +2231,808 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
 
  
 
+ 
+  Upgrade
+
+  
+   Migration of logical replication clusters
+   is possible only when all the members of the old logical replication
+   clusters are version 17.0 or later.
+  
+
+  
+   Prepare for publisher upgrades
+
+   
+pg_upgrade attempts to migrate logical
+slots. This helps avoid the need for manually defining the same
+logical slots on the new publisher. Migration of logical slots is
+only supported when the old cluster is version 17.0 or later.
+Logical slots on clusters before version 17.0 will silently be
+ignored.
+   
+
+   
+Before you start upgrading the publisher cluster, ensure that the
+subscription is temporarily disabled, by executing
+ALTER SUBSCRIPTION ... DISABLE.
+Re-enable the subscription after the upgrade.
+   
+
+   
+There are some prerequisites for pg_upgrade to
+be able to upgrade the logical slots. If these are not met an error
+will be reported.
+   
+
+   
+
+ 
+  The new cluster must have
+  wal_level as
+  logical.
+ 
+
+
+ 
+  The new cluster must have
+  max_replication_slots
+  configured to a value greater than or equal to the number of slots
+  present in the old cluster.
+ 
+
+
+ 
+  The output plugins referenced by the slots on the old cluster must be
+  installed in the new PostgreSQL executable directory.
+ 
+
+
+ 
+

Re: Add contrib/pg_logicalsnapinspect

2024-09-24 Thread Bertrand Drouvot
Hi,

On Mon, Sep 23, 2024 at 12:27:27PM +1000, Peter Smith wrote:
> My review comments for v8-0001
> 
> ==
> contrib/pg_logicalinspect/pg_logicalinspect.c
> 
> 1.
> +/*
> + * Lookup table for SnapBuildState.
> + */
> +
> +#define SNAPBUILD_STATE_INCR 1
> +
> +static const char *const SnapBuildStateDesc[] = {
> + [SNAPBUILD_START + SNAPBUILD_STATE_INCR] = "start",
> + [SNAPBUILD_BUILDING_SNAPSHOT + SNAPBUILD_STATE_INCR] = "building",
> + [SNAPBUILD_FULL_SNAPSHOT + SNAPBUILD_STATE_INCR] = "full",
> + [SNAPBUILD_CONSISTENT + SNAPBUILD_STATE_INCR] = "consistent",
> +};
> +
> +/*
> 
> nit - the SNAPBUILD_STATE_INCR made this code appear more complicated
> than it is. Please take a look at the attachment for an alternative
> implementation which includes an explanatory comment. YMMV. Feel free
> to ignore it.

Thanks for the feedback!

I like the comment, so I added it in v9 attached. OTOH I think it's better
to keep SNAPBUILD_STATE_INCR, as those "+1"s are all linked and it would be
easy to miss the one in pg_get_logical_snapshot_info() should we change the
increment in the future.
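For illustration only (this mirrors the idea already in the patch rather than
proposing new code), the reason the "+1"s have to stay in sync is that
SNAPBUILD_START is -1, so every lookup needs the same offset:

    static const char *
    snapbuild_state_desc(SnapBuildState state)
    {
        int     off = state + SNAPBUILD_STATE_INCR;

        Assert(off >= 0 && off < lengthof(SnapBuildStateDesc));
        return SnapBuildStateDesc[off];
    }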

> ==
> src/include/replication/snapbuild.h
> 
> 2.
> + * Please keep SnapBuildStateDesc[] (located in the pg_logicalinspect module)
> + * updated should a change needs to be done in SnapBuildState.
> 
> nit - "...should a change needs to be done" -- the word "needs" is
> incorrect here.
> 
> How about:
> "...if a change needs to be made to SnapBuildState."

Thanks, used this one in v9.

[1]: 
https://www.postgresql.org/message-id/CAJpy0uCppUNdod4F3NaPpMCtrySdw1S0T1d8CA-2c4CX%3DShMOQ%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From ae831c9042244f7da6a78d5e350d9c30672f1cd5 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Wed, 14 Aug 2024 08:46:05 +
Subject: [PATCH v9] Add contrib/pg_logicalinspect

Provides SQL functions that allow to inspect logical decoding components.

It currently allows to inspect the contents of serialized logical snapshots of
a running database cluster, which is useful for debugging or educational
purposes.
---
 contrib/Makefile  |   1 +
 contrib/meson.build   |   1 +
 contrib/pg_logicalinspect/.gitignore  |   4 +
 contrib/pg_logicalinspect/Makefile|  31 ++
 .../expected/logical_inspect.out  |  52 
 contrib/pg_logicalinspect/logicalinspect.conf |   1 +
 contrib/pg_logicalinspect/meson.build |  39 +++
 .../pg_logicalinspect--1.0.sql|  43 +++
 contrib/pg_logicalinspect/pg_logicalinspect.c | 265 ++
 .../pg_logicalinspect.control |   5 +
 .../specs/logical_inspect.spec|  34 +++
 doc/src/sgml/contrib.sgml |   1 +
 doc/src/sgml/filelist.sgml|   1 +
 doc/src/sgml/pglogicalinspect.sgml| 151 ++
 src/backend/replication/logical/snapbuild.c   | 190 +
 src/include/port/pg_crc32c.h  |  16 +-
 src/include/replication/snapbuild.h   |   6 +-
 src/include/replication/snapbuild_internal.h  | 204 ++
 18 files changed, 852 insertions(+), 193 deletions(-)
   7.8% contrib/pg_logicalinspect/expected/
   5.7% contrib/pg_logicalinspect/specs/
  33.1% contrib/pg_logicalinspect/
  13.6% doc/src/sgml/
  16.8% src/backend/replication/logical/
   4.0% src/include/port/
  18.6% src/include/replication/

diff --git a/contrib/Makefile b/contrib/Makefile
index abd780f277..952855d9b6 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -32,6 +32,7 @@ SUBDIRS = \
 		passwordcheck	\
 		pg_buffercache	\
 		pg_freespacemap \
+		pg_logicalinspect \
 		pg_prewarm	\
 		pg_stat_statements \
 		pg_surgery	\
diff --git a/contrib/meson.build b/contrib/meson.build
index 14a8906865..159ff41555 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -46,6 +46,7 @@ subdir('passwordcheck')
 subdir('pg_buffercache')
 subdir('pgcrypto')
 subdir('pg_freespacemap')
+subdir('pg_logicalinspect')
 subdir('pg_prewarm')
 subdir('pgrowlocks')
 subdir('pg_stat_statements')
diff --git a/contrib/pg_logicalinspect/.gitignore b/contrib/pg_logicalinspect/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/contrib/pg_logicalinspect/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/pg_logicalinspect/Makefile b/contrib/pg_logicalinspect/Makefile
new file mode 100644
index 00..55124514d4
--- /dev/null
+++ b/contrib/pg_logicalinspect/Makefile
@@ -0,0 +1,31 @@
+# contrib/pg_logicalinspect/Makefile
+
+MODULE_big = pg_logicalinspect
+OBJS = \
+	$(WIN32RES) \
+	pg_logicalinspect.o
+PGFILEDESC = "pg_logicalinspect - functions to inspect logical decoding components"
+
+EXTENSION = pg_logicalinspect
+DATA = pg_logicalinspect--1.0.sql
+
+EXTRA_INSTALL = contrib/test_decoding
+
+ISOLATION = logical_inspec

Re: Detect buffer underflow in get_th()

2024-09-24 Thread Alexander Kuznetsov

Hello,

is there anything else we can help with or discuss in order to apply this fix?

24.07.2024 18:53, Alexander Kuznetsov wrote:


24.07.2024 18:39, Peter Eisentraut wrote:

If it can't happen in practice, maybe an assertion would be enough?



In practice, the function should not receive non-numeric strings either; 
however, since there is an exception in place for such cases, I thought it 
would be good to add a check for zero-length input in a similar manner.

But of course it's open for discussion and team decision whether this should be 
addressed as an assertion or handled differently.



--
Best regards,
Alexander Kuznetsov




Re: [PATCH] Add native windows on arm64 support

2024-09-24 Thread Dave Cramer
On Tue, 24 Sept 2024 at 14:31, Dave Cramer 
wrote:

>
>
> On Tue, 13 Feb 2024 at 16:28, Dave Cramer 
> wrote:
>
>>
>>
>>
>> On Tue, 13 Feb 2024 at 12:52, Andres Freund  wrote:
>>
>>> Hi,
>>>
>>> On 2024-02-13 12:49:33 -0500, Dave Cramer wrote:
>>> > > I think I might have been on to something - if my human emulation of
>>> a
>>> > > preprocessor isn't wrong, we'd end up with
>>> > >
>>> > > #define S_UNLOCK(lock)  \
>>> > > do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
>>> > >
>>> > > on msvc + arm. And that's entirely insufficient -
>>> _ReadWriteBarrier() just
>>> > > limits *compiler* level reordering, not CPU level reordering.  I
>>> think it's
>>> > > even insufficient on x86[-64], but it's definitely insufficient on
>>> arm.
>>> > >
>>> > In fact ReadWriteBarrier has been deprecated _ReadWriteBarrier |
>>> Microsoft
>>> > Learn
>>> > <
>>> https://learn.microsoft.com/en-us/cpp/intrinsics/readwritebarrier?view=msvc-170
>>> >
>>>
>>> I'd just ignore that, that's just pushing towards more modern stuff
>>> that's
>>> more applicable to C++ than C.
>>>
>>>
>>> > I did try using atomic_thread_fence as per atomic_thread_fence -
>>> > cppreference.com
>>> > 
>>>
>>> The semantics of atomic_thread_fence are, uh, very odd.  I'd just use
>>> MemoryBarrier().
>>>
>>> #define S_UNLOCK(lock)  \
>> do { MemoryBarrier(); (*(lock)) = 0; } while (0)
>>
>> #endif
>>
>> Has no effect.
>>
>> I have no idea if that is what you meant that I should do ?
>>
>> Dave
>>
>
>
> Revisiting this:
>
> Andrew, can you explain the difference between ninja test (which passes)
> and what the build farm does? The buildfarm crashes.
>

I have a dmp file with a stack trace if anyone is interested

Dave

>
> Dave
>


Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-24 Thread Nathan Bossart
On Tue, Sep 24, 2024 at 01:21:45PM +0900, Michael Paquier wrote:
> On Mon, Sep 23, 2024 at 10:50:21AM -0500, Nathan Bossart wrote:
>> I carefully inspected all the code paths this patch touches, and I think
>> I've got all the details right, but I would be grateful if someone else
>> could take a look.
> 
> No objections from here with putting the snapshots pops and pushes
> outside the inner routines of reindex/drop concurrently, meaning that
> ReindexRelationConcurrently(), DefineIndex() and index_drop() are fine
> to do these operations.

Great.  I plan to push 0001 shortly.

> Actually, thinking more...  Could it be better to have some more
> sanity checks in the stack outside the toast code for catalogs with
> toast tables?  For example, I could imagine adding a check in
> CatalogTupleUpdate() so as all catalog accessed that have a toast 
> relation require an active snapshot.  That would make checks more
> aggressive, because we would not need any toast data in a catalog to
> make sure that there is a snapshot set.  This strikes me as something
> we could do better to improve the detection of failures like the one
> reported by Alexander when updating catalog tuples as this can be
> triggered each time we do a CatalogTupleUpdate() when dirtying a
> catalog tuple.  The idea is then to have something before the
> HaveRegisteredOrActiveSnapshot() in the toast internals, for catalogs,
> and we would not require toast data to detect problems.

I gave this a try and, unsurprisingly, found a bunch of other problems.  I
hastily hacked together the attached patch that should fix all of them, but
I'd still like to comb through the code a bit more.  The three catalogs
with problems are pg_replication_origin, pg_subscription, and
pg_constraint.  pg_constraint has had a TOAST table for a while, and I don't
think it's unheard of for conbin to be large, so this one is probably worth
fixing.  pg_subscription hasn't had its TOAST table for quite as long, but
presumably subpublications could be large enough to require out-of-line
storage.  pg_replication_origin, however, only has one varlena column:
roname.  Three out of the seven problem areas involve
pg_replication_origin, but AFAICT that'd only ever be a problem if the name
of your replication origin requires out-of-line storage.  So... maybe we
should just remove pg_replication_origin's TOAST table instead...

-- 
nathan
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d0d1abda58..b8be124555 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -22,6 +22,7 @@
 #include "catalog/index.h"
 #include "catalog/indexing.h"
 #include "executor/executor.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -234,6 +235,14 @@ CatalogTupleInsert(Relation heapRel, HeapTuple tup)
 {
CatalogIndexState indstate;
 
+   /*
+    * If we might need TOAST access, make sure the caller has set up a valid
+    * snapshot.
+    */
+   Assert(HaveRegisteredOrActiveSnapshot() ||
+  !OidIsValid(heapRel->rd_rel->reltoastrelid) ||
+  !IsNormalProcessingMode());
+
CatalogTupleCheckConstraints(heapRel, tup);
 
indstate = CatalogOpenIndexes(heapRel);
@@ -256,6 +265,14 @@ void
 CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
   CatalogIndexState indstate)
 {
+   /*
+    * If we might need TOAST access, make sure the caller has set up a valid
+    * snapshot.
+    */
+   Assert(HaveRegisteredOrActiveSnapshot() ||
+  !OidIsValid(heapRel->rd_rel->reltoastrelid) ||
+  !IsNormalProcessingMode());
+
CatalogTupleCheckConstraints(heapRel, tup);
 
simple_heap_insert(heapRel, tup);
@@ -273,6 +290,14 @@ void
 CatalogTuplesMultiInsertWithInfo(Relation heapRel, TupleTableSlot **slot,
int ntuples, CatalogIndexState indstate)
 {
+   /*
+    * If we might need TOAST access, make sure the caller has set up a valid
+    * snapshot.
+    */
+   Assert(HaveRegisteredOrActiveSnapshot() ||
+  !OidIsValid(heapRel->rd_rel->reltoastrelid) ||
+  !IsNormalProcessingMode());
+
/* Nothing to do */
if (ntuples <= 0)
return;
@@ -315,6 +340,14 @@ CatalogTupleUpdate(Relation heapRel, ItemPointer otid, HeapTuple tup)
CatalogIndexState indstate;
TU_UpdateIndexes updateIndexes = TU_All;
 
+   /*
+    * If we might need TOAST access, make sure the caller has set up a valid
+    * snapshot.
+    */
+   Assert(HaveRegisteredOrActiveSnapshot() ||
+  !OidIsValid(heapRel->rd_rel->reltoastrelid) ||
+  !IsNormalProcessingMode());
+
CatalogTupleCheckConstraints(heapRel, tup);
 
indstate = CatalogOpenIndexes(heapRel);
@@ -339,6 +372,14 @@ Ca

Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-24 Thread Shayon Mukherjee
Thank you for the historical context and the explanation of how it works; I
understand what you were referring to now.

Shayon

> On Sep 24, 2024, at 2:08 PM, Peter Eisentraut  wrote:
> 
> On 23.09.24 22:51, Shayon Mukherjee wrote:
>> I am happy to draft a patch for this as well. I think I have a working
>> idea so far of where the necessary checks might go. However if you don’t
>>  mind, can you elaborate further on how the effect would be similar to
>> enable_indexscan?
> 
> Planner settings like enable_indexscan used to just add a large number 
> (disable_cost) to the estimated plan node costs.  It's a bit more 
> sophisticated in PG17.  But in any case, I imagine "disabling an index" could 
> use the same mechanism.  Or maybe not, maybe the setting would just 
> completely ignore the index.





Re: AIO writes vs hint bits vs checksums

2024-09-24 Thread Noah Misch
On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> So far the AIO patchset has solved this by introducing a set of "bounce
> buffers", which can be acquired and used as the source/target of IO when doing
> it in-place into shared buffers isn't viable.
> 
> I am worried about that solution however, as either acquisition of bounce
> buffers becomes a performance issue (that's how I did it at first, it was hard
> to avoid regressions) or we reserve bounce buffers for each backend, in which
> case the memory overhead for instances with relatively small amount of
> shared_buffers and/or many connections can be significant.

> But: We can address this and improve performance over the status quo! Today we
> determine tuple visiblity determination one-by-one, even when checking the
> visibility of an entire page worth of tuples. That's not exactly free. I've
> prototyped checking visibility of an entire page of tuples at once and it
> indeed speeds up visibility checks substantially (in some cases seqscans are
> over 20% faster!).

Nice!  It sounds like you refactored the relationship between
heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
bit setting upward or the iterate-over-tuples downward.  Is that about right?

> Once we have page-level visibility checks we can get the right to set hint
> bits once for an entire page instead of doing it for every tuple - with that
> in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
> wins.

How did page-level+BM_SETTING_HINTS performance compare to performance of the
page-level change w/o the BM_SETTING_HINTS change?

> Having a page level approach to setting hint bits has other advantages:
> 
> E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set
> on the page and we don't mark a page dirty on hot standby. Which often will
> result in hint bits not persistently set on replicas until the page is frozen.

Nice way to improve that.

> Does this sound like a reasonable idea?  Counterpoints?

I guess the main part left to discuss is index scans or other scan types where
we'd either not do page-level visibility or we'd do page-level visibility
including tuples we wouldn't otherwise use.  BM_SETTING_HINTS likely won't
show up so readily in index scan profiles, but the cost is still there.  How
should we think about comparing the distributed cost of the buffer header
manipulations during index scans vs. the costs of bounce buffers?

Thanks,
nm




Re: Possible null pointer dereference in afterTriggerAddEvent()

2024-09-24 Thread Alexander Kuznetsov

Hello,

is there anything else we can help with or discuss in order to apply this fix?

26.07.2024 12:16, Alexander Kuznetsov wrote:

25.07.2024 20:07, Alvaro Herrera wrote:

Maybe for sanity (and perhaps for Svace compliance) we could do it the
other way around, i.e. by testing events->tail for nullness instead of
events->head, then add the assertion:

    if (events->tail == NULL)
    {
    Assert(events->head == NULL);
    events->head = chunk;
    }
    else
    events->tail->next = chunk;

This way, it's not wholly redundant.

Thanks for your response!
I agree with the proposed changes and have updated the patch accordingly. 
Version 2 is attached.

That said, I'm not sure we actually *need* to change this.

I understand and partly agree. But it appears that with these changes, the 
dereference of a null pointer is impossible even in builds where assertions are 
disabled. Previously, this issue could theoretically occur. Consequently, these 
changes slightly enhance overall security.



--
Best regards,
Alexander Kuznetsov




pg_ctl/miscinit: print "MyStartTime" as a long long instead of long to avoid 2038 problem.

2024-09-24 Thread Max Johnson
Hello  hackers,

I have found an instance of a time overflow with the start time that is written 
in "postmaster.pid". On a 32-bit Linux system, if the system date is past 
01/19/2038, when you start Postgres with `pg_ctl start -D {datadir} ...`,  the 
start time will have rolled back to approximately 1900. This is an instance of 
the "2038 problem". On my system, pg_ctl will not exit if the start time has 
overflowed.

This can be fixed by casting "MyStartTime" to a long long instead of just a 
long in "src/backend/utils/init/miscinit.c". Additionally, in 
"src/bin/pg_ctl/pg_ctl.c", when we read that value from the file, we should use 
"atoll()" instead of "atol()" to ensure we are reading it as a long long.
I have verified that this fixes the start time overflow on my 32-bit arm 
system. My glibc is compiled with 64-bit time_t.
Most systems running Postgres likely aren't 32-bit, but for embedded systems, 
this is important to ensure 2038 compatibility.
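For anyone who wants to reproduce the symptom without a full cluster, here is
a small standalone illustration (assuming an ILP32 target where "long" is 32
bits; this is not part of the patch itself):

    #include <stdio.h>

    int
    main(void)
    {
        long long   start = 2177452800LL;   /* an epoch value in 2039 */

        printf("%ld\n", (long) start);      /* truncated/wrapped on 32-bit */
        printf("%lld\n", start);            /* prints 2177452800 as expected */
        return 0;
    }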

This is a fairly trivial patch, and I do not currently see any issues with 
using long long. I was told on IRC that a regression test is likely not 
necessary for this patch.
I look forward to hearing any feedback. This is my first open-source 
contribution!

Thank you,


Max Johnson

Embedded Linux Engineer I

​

NovaTech, LLC

13555 W. 107th Street | Lenexa, KS 66215 ​

O: 913.451.1880

M: 913.742.4580​

novatechautomation.com | 
NovaTechLinkedIn


NovaTech Automation is Net Zero committed. 
#KeepItCool

Receipt of this email implies compliance with our terms and 
conditions.

From 70ec929f971577e3a78ab32d15b631954286aaf7 Mon Sep 17 00:00:00 2001
From: Max Johnson 
Date: Mon, 23 Sep 2024 13:41:43 -0500
Subject: [PATCH] pg_ctl and miscinit: change type of MyStartTime

The start time variable needs to be 64 bits wide on all platforms in order
to support dates past 01/19/2038. This patch fixes an overflow error on
the start time in "postmaster.pid", which can cause pg_ctl to hang.
Change the value of "MyStartTime" to long long.
Signed-off-by: Max Johnson 
---
 src/backend/utils/init/miscinit.c | 4 ++--
 src/bin/pg_ctl/pg_ctl.c   | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 1f0b3175ae..8669e95817 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1205,10 +1205,10 @@ CreateLockFile(const char *filename, bool amPostmaster,
 	 * both datadir and socket lockfiles; although more stuff may get added to
 	 * the datadir lockfile later.
 	 */
-	snprintf(buffer, sizeof(buffer), "%d\n%s\n%ld\n%d\n%s\n",
+	snprintf(buffer, sizeof(buffer), "%d\n%s\n%lld\n%d\n%s\n",
 			 amPostmaster ? (int) my_pid : -((int) my_pid),
 			 DataDir,
-			 (long) MyStartTime,
+			 (long long) MyStartTime,
 			 PostPortNumber,
 			 socketDir);
 
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index f9e0ee4eee..e138c5b236 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -624,7 +624,7 @@ wait_for_postmaster_start(pgpid_t pm_pid, bool do_checkpoint)
 			 * Allow 2 seconds slop for possible cross-process clock skew.
 			 */
 			pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
-			pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
+			pmstart = atoll(optlines[LOCK_FILE_LINE_START_TIME - 1]);
 			if (pmstart >= start_time - 2 &&
 #ifndef WIN32
 pmpid == pm_pid
-- 
2.34.1



Re: Possible null pointer dereference in afterTriggerAddEvent()

2024-09-24 Thread Alvaro Herrera
On 2024-Sep-24, Alexander Kuznetsov wrote:

> Hello,
> 
> is there anything else we can help with or discuss in order to apply this fix?

I don't think so, it seems a no-brainer to me and there are no
objections.  I'll get it pushed tomorrow.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Linux transformó mi computadora, de una `máquina para hacer cosas',
en un aparato realmente entretenido, sobre el cual cada día aprendo
algo nuevo" (Jaime Salinas)




Re: pgsql: Improve default and empty privilege outputs in psql.

2024-09-24 Thread Christoph Berg
Re: Tom Lane
> Improve default and empty privilege outputs in psql.

I'm sorry to report this when 17.0 has already been wrapped, but this
change is breaking `psql -l` against 9.3-or-earlier servers:

$ /usr/lib/postgresql/17/bin/psql
psql (17rc1 (Debian 17~rc1-1.pgdg+2), Server 9.3.25)
Geben Sie »help« für Hilfe ein.

postgres =# \set ECHO_HIDDEN on
postgres =# \l
/*** ANFRAGE /
SELECT
  d.datname as "Name",
  pg_catalog.pg_get_userbyid(d.datdba) as "Owner",
  pg_catalog.pg_encoding_to_char(d.encoding) as "Encoding",
  'libc' AS "Locale Provider",
  d.datcollate as "Collate",
  d.datctype as "Ctype",
  NULL as "Locale",
  NULL as "ICU Rules",
  CASE WHEN pg_catalog.cardinality(d.datacl) = 0 THEN '(none)' ELSE 
pg_catalog.array_to_string(d.datacl, E'\n') END AS "Access privileges"
FROM pg_catalog.pg_database d
ORDER BY 1;
//

FEHLER:  42883: Funktion pg_catalog.cardinality(aclitem[]) existiert nicht
ZEILE 10:   CASE WHEN pg_catalog.cardinality(d.datacl) = 0 THEN '(none...
  ^
TIP:  Keine Funktion stimmt mit dem angegebenen Namen und den Argumenttypen 
überein. Sie müssen möglicherweise ausdrückliche Typumwandlungen hinzufügen.
ORT:  ParseFuncOrColumn, parse_func.c:299


The psql docs aren't really explicit about which old versions are
still supported, but there's some mention that \d should work back
to 9.2:

  psql works best with servers of the same
   or an older major version.  Backslash commands are particularly likely
   to fail if the server is of a newer version than 
psql
   itself.  However, backslash commands of the \d family 
should
   work with servers of versions back to 9.2, though not necessarily with
   servers newer than psql itself.

\l seems a tad more important even.

Christoph




Re: Add llvm version into the version string

2024-09-24 Thread Joe Conway

On 9/24/24 09:52, Andres Freund wrote:

Hi,

On 2024-09-24 13:53:49 +0100, Alastair Turner wrote:

Since the build and runtime versions may differ for some things like llvm,
libxml2 and the interpreters behind some of the PLs, it may be valuable to
expand the view and show two values - a build time (or configure time)
value and a runtime value for these.


+1

Somewhat orthogonal: I've wondered before whether it'd be useful to have a
view showing the file path and perhaps the soversion of libraries dynamically
loaded into postgres. That's currently hard to figure out over a connection
(which is all a lot of our users have access to).


+1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Improve default and empty privilege outputs in psql.

2024-09-24 Thread Tom Lane
I wrote:
> Yes, that's the expectation.  I'm sure we can think of a more
> backwards-compatible way to test for empty datacl, though.

Looks like the attached should be sufficient.  As penance, I tried
all the commands in describe.c against 9.2 (or the oldest supported
server version), and none of them fail now.

regards, tom lane

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index faabecbc76..6a36c91083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6678,7 +6678,7 @@ printACLColumn(PQExpBuffer buf, const char *colname)
 {
 	appendPQExpBuffer(buf,
 	  "CASE"
-	  " WHEN pg_catalog.cardinality(%s) = 0 THEN '%s'"
+	  " WHEN pg_catalog.array_length(%s, 1) = 0 THEN '%s'"
 	  " ELSE pg_catalog.array_to_string(%s, E'\\n')"
 	  " END AS \"%s\"",
 	  colname, gettext_noop("(none)"),


Re: CSN snapshots in hot standby

2024-09-24 Thread Andres Freund
Hi,

On 2024-08-13 23:13:39 +0300, Heikki Linnakangas wrote:
> I added a tiny cache of the CSN lookups into SnapshotData, which can hold
> the values of 4 XIDs that are known to be visible to the snapshot, and 4
> invisible XIDs. This is pretty arbitrary, but the idea is to have something
> very small to speed up the common cases that 1-2 XIDs are repeatedly looked
> up, without adding too much overhead.
> 
> 
> I did some performance testing of the visibility checks using these CSN
> snapshots. The tests run SELECTs with a SeqScan in a standby, over a table
> where all the rows have xmin/xmax values that are still in-progress in the
> primary.
> 
> Three test scenarios:
> 
> 1. large-xact: one large transaction inserted all the rows. All rows have
> the same XMIN, which is still in progress
> 
> 2. many-subxacts: one large transaction inserted each row in a separate
> subtransaction. All rows have a different XMIN, but they're all
> subtransactions of the same top-level transaction. (This causes the subxids
> cache in the proc array to overflow)
> 
> 3. few-subxacts: All rows are inserted, committed, and vacuum frozen. Then,
> using 10 in separate subtransactions, DELETE the rows, in an interleaved
> fashion. The XMAX values cycle like this "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1,
> 2, 3, 4, 5, ...". The point of this is that these sub-XIDs fit in the
> subxids cache in the procarray, but the pattern defeats the simple 4-element
> cache that I added.

I'd like to see some numbers for a workload with many overlapping top-level
transactions. In contrast to 2), HEAD wouldn't need to do subtrans lookups,
whereas this patch would need to do csn lookups. And a four entry cache
probably wouldn't help very much.


> +/*
> + * Record commit LSN of a transaction and its subtransaction tree.
> + *
> + * xid is a single xid to set status for. This will typically be the top 
> level
> + * transaction ID for a top level commit.
> + *
> + * subxids is an array of xids of length nsubxids, representing 
> subtransactions
> + * in the tree of xid. In various cases nsubxids may be zero.
> + *
> + * commitLsn is the LSN of the commit record.  This is currently never called
> + * for aborted transactions.
> + */
> +void
> +CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
> +  XLogRecPtr commitLsn)
> +{
> + int pageno;
> + int i = 0;
> + int offset = 0;
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + pageno = TransactionIdToPage(xid);  /* get page of parent */
> + for (;;)
> + {
> + int num_on_page = 0;
> +
> + while (i < nsubxids && TransactionIdToPage(subxids[i]) == 
> pageno)
> + {
> + num_on_page++;
> + i++;
> + }

Hm - is there any guarantee / documented requirement that subxids is sorted?


> + CSNLogSetPageStatus(xid,
> + num_on_page, subxids + 
> offset,
> + commitLsn, pageno);
> + if (i >= nsubxids)
> + break;
> +
> + offset = i;
> + pageno = TransactionIdToPage(subxids[offset]);
> + xid = InvalidTransactionId;
> + }
> +}

Hm. Maybe I'm missing something, but what prevents a concurrent transaction from
checking the visibility of a subtransaction between marking the subtransaction
committed and marking the main transaction committed? If subtransaction and
main transaction are on the same page that won't be possible, but if they are
on different ones it does seem possible?

Today XidInMVCCSnapshot() will use pg_subtrans to find the top transaction in
case of a suboverflowed snapshot, but with this patch that's not the case
anymore.  Which afaict will mean that repeated snapshot computations could
give different results for the same query?



Greetings,

Andres Freund




Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-24 Thread Peter Eisentraut

On 23.09.24 22:51, Shayon Mukherjee wrote:

I am happy to draft a patch for this as well. I think I have a working
idea so far of where the necessary checks might go. However if you don’t
  mind, can you elaborate further on how the effect would be similar to
enable_indexscan?


Planner settings like enable_indexscan used to just add a large number 
(disable_cost) to the estimated plan node costs.  It's a bit more 
sophisticated in PG17.  But in any case, I imagine "disabling an index" 
could use the same mechanism.  Or maybe not, maybe the setting would 
just completely ignore the index.
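For context, a rough self-contained sketch of the pre-PG17 mechanism being
referred to (the names mirror the real GUC and constant, but this is
illustrative, not the actual planner code):

    static const double disable_cost = 1.0e10;
    static bool enable_indexscan = true;    /* the GUC */

    static double
    index_scan_startup_cost(double base_cost)
    {
        double  cost = base_cost;

        if (!enable_indexscan)
            cost += disable_cost;   /* path survives, but (almost) never wins */
        return cost;
    }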





Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-24 Thread Peter Eisentraut

On 24.09.24 02:30, David Rowley wrote:

I understand the last discussion went down that route too. For me, it
seems strange that adding some global variable is seen as cleaner than
storing the property in the same location as all the other index
properties.


It's arguably not actually a property of the index, it's a property of 
the user's session.  Like, kind of, the search path is a session 
property, not a property of a schema.



How would you ensure no cached plans are still using the index after
changing the GUC?


Something for the patch author to figure out. ;-)




Re: AIX support

2024-09-24 Thread Bruce Momjian
On Tue, Sep 24, 2024 at 04:09:39PM +0300, Heikki Linnakangas wrote:
> On 24/09/2024 14:25, Srirama Kucherlapati wrote:
> > Hi Heikki & team,
> > 
> > Could you please let me know your comments on the previous details?
> > 
> > Attached are the individual patches for AIX and gcc(__sync) routines.
> 
> Repeating what I said earlier:
> 
> > Ok, if we don't need the assembler code at all, that's good. A patch
> > to introduce AIX support should not change it for non-AIX powerpc
> > systems though. That might be a good change, but would need to be
> > justified separately, e.g.  by some performance testing, and should
> > be a separate patch

Agreed.  Srirama Kucherlapati, you seem to be doing the minimum amount
of work and then repeatedly asking the same questions.  I suggest you
start to take this task more seriously.  I would go back and read the
entire thread, and take the things we have told you more seriously.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"




Re: [PATCH] Support Int64 GUCs

2024-09-24 Thread Nathan Bossart
On Tue, Sep 24, 2024 at 12:27:20PM +0300, Aleksander Alekseev wrote:
>> It doesn't look like these *_age GUCs could benefit from being 64-bit,
>> before 64-bit transaction ids get in.  However, I think there are some
>> better candidates.
>>
>> autovacuum_vacuum_threshold
>> autovacuum_vacuum_insert_threshold
>> autovacuum_analyze_threshold
>>
>> This GUCs specify number of tuples before vacuum/analyze.  That could
>> be more than 2^31.  With large tables of small tuples, I can't even
>> say this is always impractical to have values greater than 2^31.
>
> [...]
> 
> I found a great candidate in src/test/modules/delay_execution:
> 
> ```
> DefineCustomIntVariable("delay_execution.post_planning_lock_id",
> "Sets the advisory lock ID to be
> locked/unlocked after planning.",
> ```
> 
> Advisory lock IDs are bigints [1]. I modified the module to use Int64's.
> 
> I guess it may also answer Nathan's question.

Hm.  I'm not sure I find any of these to be particularly convincing
examples of why we need int64 GUCs.  Yes, the GUCs in question could
potentially be set to higher values, but I've yet to hear of this being a
problem in practice.  We might not want to encourage such high values,
either.

-- 
nathan




AIO writes vs hint bits vs checksums

2024-09-24 Thread Andres Freund
Hi,

Currently we modify pages while just holding a share lock, for hint bit
writes. Writing a buffer out only requires a share lock. Because of that we
can't compute checksums and write out pages in-place, as a concurrent hint bit
write can easily corrupt the checksum.

That's not great, but not awful for our current synchronous buffered IO, as
we only ever have a single page being written out at a time.

However, it becomes a problem even if we just want to write out in chunks
larger than a single page - we'd need to reserve not just one BLCKSZ sized
buffer for this, but make it PG_IOV_MAX * BLCKSZ sized. Perhaps still
tolerable.

With AIO this becomes a considerably bigger issue:
a) We can't just have one write in progress at a time, but many
b) To be able to implement AIO using workers the "source" or "target" memory
   of the IO needs to be in shared memory

Besides that, the need to copy the buffers makes checkpoints with AIO
noticeably slower when checksums are enabled - it's not the checksum but the
copy that's the biggest source of the slowdown.


So far the AIO patchset has solved this by introducing a set of "bounce
buffers", which can be acquired and used as the source/target of IO when doing
it in-place into shared buffers isn't viable.

I am worried about that solution however, as either acquisition of bounce
buffers becomes a performance issue (that's how I did it at first, it was hard
to avoid regressions) or we reserve bounce buffers for each backend, in which
case the memory overhead for instances with relatively small amount of
shared_buffers and/or many connections can be significant.


Which lead me down the path of trying to avoid the need for the copy in the
first place:  What if we don't modify pages while it's undergoing IO?

The naive approach would be to not set hint bits with just a shared lock - but
that doesn't seem viable at all. For performance we rely on hint bits being
set and in many cases we'll only encounter the page in shared mode. We could
implement a conditional lock upgrade to an exclusive lock and do so while
setting hint bits, but that'd obviously be concerning from a concurrency point
of view.

What I suspect we might want instead is something inbetween a share and an
exclusive lock, which is taken while setting a hint bit and which conflicts
with having an IO in progress.

On first blush it might sound attractive to introduce this on the level of
lwlocks. However, I doubt that is a good idea - it'd make lwlock.c more
complicated which would imply overhead for other users, while the precise
semantics would be fairly specific to buffer locking.  A variant of this would
be to generalize lwlock.c to allow implementing different kinds of locks more
easily. But that's a significant project on its own and doesn't really seem
necessary for this specific project.

What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.

Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.

But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples. That's not exactly free. I've
prototyped checking visibility of an entire page of tuples at once and it
indeed speeds up visibility checks substantially (in some cases seqscans are
over 20% faster!).

Once we have page-level visibility checks we can get the right to set hint
bits once for an entire page instead of doing it for every tuple - with that
in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
wins.
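A very rough sketch of that idea (not the actual prototype; the
BufferBeginSetHintBits()/BufferFinishSetHintBits() names are hypothetical
stand-ins for the BM_SETTING_HINTS handshake, and the real code would have to
pass the permission down to SetHintBits()):

    static int
    page_visibility_check(Buffer buffer, Snapshot snapshot,
                          HeapTupleData *tuples, int ntuples, int *visible)
    {
        int     nvisible = 0;
        bool    can_hint = BufferBeginSetHintBits(buffer);  /* false if IO in progress */

        for (int i = 0; i < ntuples; i++)
        {
            /* hint bits may get set inside the check, if permitted */
            if (HeapTupleSatisfiesVisibility(&tuples[i], snapshot, buffer))
                visible[nvisible++] = i;
        }

        if (can_hint)
            BufferFinishSetHintBits(buffer);    /* allow IO to proceed again */

        return nvisible;
    }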


Having a page level approach to setting hint bits has other advantages:

E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set
on the page and we don't mark a page dirty on hot standby. Which often will
result in hint bits not persistently set on replicas until the page is frozen.

Another issue is that we'll often WAL log hint bits for a page (due to hint
bits being set), just to then immediately log another WAL record for the same
page (e.g. for pruning), which is obviously wasteful. With a different
interface we could combine the WAL records for both.

I've not prototyped either, but I'm fairly confident they'd be helpful.


Does this sound like a reasonable idea?  Counterpoints?  If it does sound
reasonable, I'll clean up my pile of hacks into something
semi-

Re: Why mention to Oracle ?

2024-09-24 Thread Marcos Pegoraro
Em dom., 22 de set. de 2024 às 12:49, Roberto Mello 
escreveu:

> If you're volunteering to add a MySQL, SQL Server, Mongo, etc porting to
> the docs, I'm sure it could be a
> nice addition.
>

What if we create a page like https://www.postgresql.org/about/featurematrix/,
but instead of Postgres versions the columns are other vendors?
Every feature would show the Postgres way of doing it and how it differs from
the user's old database.
Feature         | PostgreSQL       | Oracle | SQL Server           | MySQL | Firebird
----------------+------------------+--------+----------------------+-------+---------
SELECT N ROWS   | LIMIT 10         |        | TOP 10               |       | FIRST 10
CONCAT STRINGS  | 'Name: ' || Name |        | 'Name: ' + Name      |       |
REBUILD INDEX   | REINDEX          |        | ALTER INDEX… REBUILD |       |
CURRENT DATE    | CURRENT_DATE     |        | GETDATE              |       |
This is just an example, for sure there would be several tables, for DMLs,
for DDL, for Maintenance ...

What do you think ?

regards
Marcos


Re: pg_ctl/miscinit: print "MyStartTime" as a long long instead of long to avoid 2038 problem.

2024-09-24 Thread Nathan Bossart
On Tue, Sep 24, 2024 at 04:44:41PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> I think we should use INT64_FORMAT here.  That'll choose the right length
>> modifier for the platform.  And I don't think we need to cast MyStartTime,
>> since it's a pg_time_t (which is just an int64).
> 
> Agreed.  However, a quick grep finds half a dozen other places that
> are casting MyStartTime to long.  We should fix them all.

+1

> Also note that if any of the other places are using translatable
> format strings, INT64_FORMAT is problematic in that context, and
> "long long" is a better answer for them.

At a glance, I'm not seeing any translatable format strings that involve
MyStartTime.  But that is good to know...
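For illustration (not the committed patch), the two styles discussed above
would look roughly like this in backend code:

    char    buffer[64];

    /* non-translated output, e.g. the lock file contents */
    snprintf(buffer, sizeof(buffer), INT64_FORMAT "\n", MyStartTime);

    /* translatable message: keep the format string platform-independent */
    ereport(LOG,
            (errmsg("postmaster start time is %lld",
                    (long long) MyStartTime)));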

-- 
nathan




Re: Add contrib/pg_logicalsnapinspect

2024-09-24 Thread shveta malik
On Tue, Sep 24, 2024 at 10:23 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Tue, Sep 24, 2024 at 09:15:31AM +0530, shveta malik wrote:
> > On Fri, Sep 20, 2024 at 12:22 PM Bertrand Drouvot
> >  wrote:
> > >
> > >
> > > Please find attached v8, that:
> > >
> >
> > Thank You for the patch. In one of my tests, I noticed that I got
> > negative checksum:
> >
> > postgres=# SELECT * FROM pg_get_logical_snapshot_meta('0/3481F20');
> >magic|  checksum  | version
> > ++-
> >  1369563137 | -266346460 |   6
> >
> > But pg_crc32c is uint32. Is it because we are getting it as
> > Int32GetDatum(ondisk.checksum) in pg_get_logical_snapshot_meta()?
> > Instead should it be UInt32GetDatum?
>
> Thanks for the testing.
>
> As the checksum could be > 2^31 - 1, then v9 (just shared up-thread) changes 
> it
> to an int8 in the pg_logicalinspect--1.0.sql file. So, to avoid CI failure on
> the 32bit build, then v9 is using Int64GetDatum() instead of UInt32GetDatum().
>

Okay, looks  good,

> > Same goes for below:
> > values[i++] = Int32GetDatum(ondisk.magic);
> > values[i++] = Int32GetDatum(ondisk.magic);
>
> The 2 others field (magic and version) are unlikely to be > 2^31 - 1, so v9 is
> making use of UInt32GetDatum() and keep int4 in the sql file.
>
> > We need to recheck the rest of the fields in the info() function as well.
>
> I think that the pg_get_logical_snapshot_info()'s fields are ok (I did spend 
> some
> time to debug CI failing on the 32bit build for some on them before 
> submitting v1).
>

+ OUT catchange_xip xid[]

One question: what is the xid datatype? Is it also int8? Sorry, I could not
find the correct doc. Since we are putting uint32 values into an Int64, this
also needs to be handled accordingly.

thanks
Shveta




Re: Syncrep and improving latency due to WAL throttling

2024-09-24 Thread Soumyadeep Chakraborty
Hey Tomas,

Shirisha had posted a recent re-attempt here [1] and then we were
graciously redirected here by Jakub.

We took a look at v5-0001-v4.patch and also a brief look at
v5-0002-rework.patch. We feel that it might be worth considering
throttling based on the remote standby to begin with for simplicity (i.e.
wal_flush_after_remote), and then we can add the other knobs incrementally?

Our comments on v5-0001-v4.patch are below:

> +
> + /*
> + * Decide if we need to throttle this backend, so that it does not write
> + * WAL too fast, causing lag against the sync standby (which in turn
> + * increases latency for standby confirmations). We may be holding locks
> + * and blocking interrupts here, so we only make the decision, but the
> + * wait (for sync standby confirmation) happens elsewhere.

Slightly reworded:

* wait (for sync standby confirmation) happens as part of the next
* CHECK_FOR_INTERRUPTS(). See HandleXLogDelayPending() for details on
* why the delay is deferred.

> + *
> + * The throttling is applied only to large transactions (producing more
> + * than wal_throttle_threshold kilobytes of WAL). Throttled backends
> + * can be identified by a new wait event SYNC_REP_THROTTLED.
> + *
> + * Small transactions (by amount of produced WAL) are still subject to
> + * the sync replication, so the same wait happens at commit time.
> + *
> + * XXX Not sure this is the right place for a comment explaining how the
> + * throttling works. This place is way too low level, and rather far from
> + * the place where the wait actually happens.

Perhaps the best course of action is to move the code and the comments
above into an inline function: SetAndCheckWALThrottle().

> +
> +/*
> + * HandleXLogDelayPending
> + * Throttle backends generating large amounts of WAL.
> + *
> + * The throttling is implemented by waiting for a sync replica confirmation 
> for
> + * a convenient LSN position. In particular, we do not wait for the current 
> LSN,
> + * which may be in a partially filled WAL page (and we don't want to write 
> this
> + * one out - we'd have to write it out again, causing write amplification).
> + * Instead, we move back to the last fully WAL page.
> + *
> + * Called from ProcessMessageInterrupts() to avoid syncrep waits in 
> XLogInsert(),
> + * which happens in critical section and with blocked interrupts (so it 
> would be
> + * impossible to cancel the wait if it gets stuck). Also, there may be locks 
> held
> + * and we don't want to hold them longer just because of the wait.
> + *
> + * XXX Andres suggested we actually go back a couple pages, to increase the
> + * probability the LSN was already flushed (obviously, this depends on how 
> much
> + * lag we allow).
> + *
> + * XXX Not sure why we use XactLastThrottledRecEnd and not simply XLogRecEnd?
> + */
> +void
> +HandleXLogDelayPending()
> +{
> + XLogRecPtr lsn;
> +
> + /* calculate last fully filled page */
> + lsn = XactLastThrottledRecEnd - (XactLastThrottledRecEnd % XLOG_BLCKSZ);
> +
> + Assert(wal_throttle_threshold > 0);
> + Assert(backendWalInserted >= wal_throttle_threshold * 1024L);
> + Assert(XactLastThrottledRecEnd != InvalidXLogRecPtr);
> +
> + XLogFlush(lsn);
> + SyncRepWaitForLSN(lsn, false);
> +
> + XLogDelayPending = false;
> +}


(1) Can't we simply call SyncRepWaitForLSN(LogwrtResult.Flush, false); here?
LogwrtResult.Flush will guarantee that we are waiting on something that
has already been flushed or will be flushed soon. Then we wouldn't need
to maintain XactLastThrottledRecEnd, nor call XLogFlush() before calling
SyncRepWaitForLSN(). LogwrtResult can be slightly stale, but does that
really matter here?

(2) Also, to make things more understandable, instead of maintaining a
counter to track the number of WAL bytes written, maybe we should
maintain a LSN pointer called XLogDelayWindowStart. And then in here,
we can assert:

Assert(LogwrtResult.Flush - XLogDelayWindowStart >
wal_throttle_threshold * 1024);

Similarly, we can check the same conditional in SetAndCheckWALThrottle().

After the wait is done and we reset XLogDelayPending = false, we can
perhaps reset XLogDelayWindowStart = LogwrtResult.Flush.

The only downside probably is that if our standby is caught up enough, we may be
repeatedly and unnecessarily invoking HandleXLogDelayPending(), where our
SyncRepWaitForLSN() would be called with an older LSN and it would be a no-op
early exit.
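Putting (1) and (2) together, a minimal sketch of the simplified function
(illustrative only; XLogDelayWindowStart is the hypothetical variable proposed
here, and GetFlushRecPtr() stands in for reading LogwrtResult.Flush):

    void
    HandleXLogDelayPending(void)
    {
        XLogRecPtr  flushed = GetFlushRecPtr(NULL);

        Assert(flushed - XLogDelayWindowStart > wal_throttle_threshold * 1024L);

        /* no explicit XLogFlush() needed: we only wait for already-flushed WAL */
        SyncRepWaitForLSN(flushed, false);

        XLogDelayWindowStart = flushed;
        XLogDelayPending = false;
    }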


> + * XXX Should this be done even if XLogDelayPending is already set? Maybe
> + * that should only update XactLastThrottledRecEnd, withoug incrementing
> + * the pgWalUsage.wal_throttled counter?
> + */
> + backendWalInserted += rechdr->xl_tot_len;
> +
> + if ((synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_WRITE) &&
> + (wal_throttle_threshold > 0) &&
> + (backendWalInserted >= wal_throttle_threshold * 1024L))
> + {
> + XactLastThrottledRecEnd = XactLastRecEnd;
> + InterruptPending = true;
> + XLogDelayPending = true;
> + pgWalUsage.wal_throttled++;
XLogDelayWindowStart =

Re: Add llvm version into the version string

2024-09-24 Thread Dmitry Dolgov
> On Mon, Sep 23, 2024 at 02:45:18PM GMT, Tom Lane wrote:
> Maybe another idea could be a new system view?
>
> => select * from pg_system_version;
>  property | value
> 
>  core version | 18.1
>  architecture | x86_64-pc-linux-gnu
>  word size| 64 bit
>  compiler | gcc (GCC) 12.5.0
>  ICU version  | 60.3
>  LLVM version | 18.1.0
>  ...
>
> Adding rows to a view over time seems way less likely to cause
> problems than messing with a string people probably have crufty
> parsing code for.

Agreed, the idea of a new system view sounds interesting. I'll try to
experiment in this direction, thanks.




RE: AIX support

2024-09-24 Thread Srirama Kucherlapati
Hi Heikki & team,

Could you please let me know your comments on the previous details?

Attached are the individual patches for AIX and gcc(__sync) routines.

Thanks,
Sriram.



0001-AIX-support-revert-the-changes-from-0b16bb8776bb8.v5.patch
Description:  0001-AIX-support-revert-the-changes-from-0b16bb8776bb8.v5.patch


0001-Replace-tas-asm-with-gcc-routines-for-powerPC-implem.patch
Description:  0001-Replace-tas-asm-with-gcc-routines-for-powerPC-implem.patch


RE: AIX support

2024-09-24 Thread Srirama Kucherlapati

Hi Heikki,

As requested earlier, I need some assistance from the Postgres side to identify
a tool or test case for calibrating the performance of the sync routines in
Postgres. I see the following tools for benchmarking:

  *   pgbench https://www.postgresql.org/docs/current/pgbench.html
  *   pg_test_fsync https://www.postgresql.org/docs/current/pgtestfsync.html
  *   pg_test_timing https://www.postgresql.org/docs/current/pgtesttiming.html

Please let me know if these tools are suitable, or suggest any additional
tools we should use for the benchmarking.


  > > Ok, if we don't need the assembler code at all, that's good. A patch to
  > > introduce AIX support should not change it for non-AIX powerpc systems
  > > though. That might be a good change, but would need to be justified
  > > separately, e.g.  by some performance testing, and should be a separate
  > > patch.

  > > If you make no changes to s_lock.h at all, will it work? Why not?

  > With the existing asm code I see some syntax errors being hit.
  > But after reverting the old changes the issues are resolved. Below are the diffs.

  > Let me know if I need to run any perf tools to check the performance of
  > the __sync_lock_test_and_set change.
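
For reference, the generic gcc-builtin spinlock primitives under discussion
boil down to something like the sketch below (a simplified illustration of
the __sync-based fallback, not the exact contents of the attached patch):

typedef int slock_t;

static inline int
tas(volatile slock_t *lock)
{
    /* returns the previous value; non-zero means the lock was already held */
    return __sync_lock_test_and_set(lock, 1);
}

#define S_UNLOCK(lock)  __sync_lock_release(lock)

__sync_lock_test_and_set() acts as an acquire barrier and
__sync_lock_release() as a release barrier, which is what the S_LOCK/S_UNLOCK
contract expects.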

Thanks,
Sriram.


Re: Eager aggregation, take 3

2024-09-24 Thread Richard Guo
On Thu, Sep 5, 2024 at 9:40 AM Tender Wang  wrote:
> 1.  in setup_eager_aggregation(), before calling create_agg_clause_infos(), 
> it does
> some checks if eager aggregation is available. Can we move those checks into 
> a function,
> for example, can_eager_agg(), like can_partial_agg() does?

We can do this, but I'm not sure this would be better.

> 2.  I found that outside of joinrel.c we all use IS_DUMMY_REL,  but in 
> joinrel.c, Tom always uses
> is_dummy_rel(). Other committers use IS_DUMMY_REL.

They are essentially the same: IS_DUMMY_REL() is a macro that wraps
is_dummy_rel().  I think they are interchangeable, and I don’t have a
preference for which one is better.
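
For reference, the relationship is roughly a one-line wrapper:

    #define IS_DUMMY_REL(r) is_dummy_rel(r)

so both spellings end up calling the same code.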

> 3.  The attached patch does not consider FDW when creating a path for 
> grouped_rel or grouped_join.
> Do we need to think about FDW?

We may add support for foreign relations in the future, but for now, I
think we'd better not expand the scope too much until we ensure that
everything is working correctly.

Thanks
Richard




Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest

2024-09-24 Thread Jelte Fennema-Nio
On Fri, 16 Aug 2024 at 07:43, Tom Lane  wrote:
> However, there are other ways to accomplish that.  I liked the
> suggestion of extending the CF webapp with a way to search for entries
> mentioning a particular mail message ID.  I dunno how hard it'd be to
> get it to recognize *any* message-ID from a thread, but surely at
> least the head message ought not be too hard to match.

This is now deployed, so you can find a CF entry based on the email message ID.

A bunch of other improvements also got deployed:
- Improved homepage[1] (now with useful and bookmarkable links at the top)
- More links on the cf entry page[2] (cfbot results, github diff, and
stable link to entry itself)
- Instructions on how to check out a cfbot entry

CFBot traffic lights directly on the CF entry page, and probably on the
commitfest list page, are the next thing I'm planning to work on.

After that I'll take a look at sending opt-in emails.

Another thing that I'm interested in adding is some metric of patch
size, so it's easier to find small patches that are hopefully "easy"
to review. To accommodate multi-patch emails, I'm thinking of showing
the lines changed in the first patch and the lines changed in all
patches together, and possibly highlighting entries where significantly
more lines were deleted than added, so it's easy to spot effective
refactorings.




RE: Conflict detection for update_deleted in logical replication

2024-09-24 Thread Zhijie Hou (Fujitsu)
On Friday, September 20, 2024 11:59 AM Hou, Zhijie/侯 志杰 wrote:
> 
> On Friday, September 20, 2024 10:55 AM Zhijie Hou (Fujitsu)
>  wrote:
> > On Friday, September 20, 2024 2:49 AM Masahiko Sawada
>  wrote:
> > >
> > >
> > > I think that such a time-based configuration parameter would be a
> > > reasonable solution. The current concerns are that it might affect
> > > vacuum performance and lead to a similar bug we had with
> > vacuum_defer_cleanup_age.
> >
> > Thanks for the feedback!
> >
> > I am working on the POC patch and doing some initial performance tests
> > on this idea.
> > I will share the results after finishing.

Here is a POC patch for the vacuum_committs_age idea. The patch adds a GUC,
vacuum_committs_age, to prevent dead rows from being removed if the age of the
deleting transaction (xmax) has not exceeded the vacuum_committs_age threshold.
E.g., it ensures the row is retained if now() - commit_timestamp_of_xmax <
vacuum_committs_age.
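
Conceptually, the retention check is along these lines (a rough sketch, not
the patch's actual code; it assumes vacuum_committs_age is expressed in
seconds, that track_commit_timestamp is enabled, and the function name is
made up):

static bool
dead_tuple_protected_by_committs_age(TransactionId xmax)
{
    TimestampTz xmax_commit_ts;

    if (vacuum_committs_age <= 0)
        return false;           /* feature disabled */

    if (!TransactionIdGetCommitTsData(xmax, &xmax_commit_ts, NULL))
        return false;           /* no commit timestamp available */

    /* retain if now() - commit_timestamp_of_xmax < vacuum_committs_age */
    return !TimestampDifferenceExceeds(xmax_commit_ts,
                                       GetCurrentTimestamp(),
                                       vacuum_committs_age * 1000);
}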

However, please note that the patch is still unfinished due to a few
issues that need to be addressed. For instance, we need to prevent
relfrozenxid/datfrozenxid from being advanced in both aggressive and
non-aggressive vacuum modes. Otherwise, the commit timestamp data is cleaned
up after advancing frozenxid, and we won’t be able to compute the age of a 
tuple.

Additionally, the patch has a noticeable performance impact on vacuum
operations when rows in a table are deleted by multiple transactions. Here are
the results of VACUUMing a table after deleting each row in a separate
transaction (1000 dead rows in total), where the xmax ages of all the dead
tuples have exceeded vacuum_committs_age in the patched tests (see the
attachment for the basic test configuration):

HEAD:   Time: 848.637 ms
patched, SLRU 8MB:  Time: 1423.915 ms
patched, SLRU 1G:   Time: 1310.869 ms

Since we have discussed an alternative approach that can reliably retain dead
tuples without modifying the vacuum process, we plan to shift our focus to
that approach [1]. I am currently working on another POC patch based on it and
will share it later.

[1] 
https://www.postgresql.org/message-id/CAD21AoD%3Dm-YHceYMpsdu0HnGCaezeyVhaCPFxDLHU7aN0wgzqg%40mail.gmail.com

Best Regards,
Hou zj


perftest.conf
Description: perftest.conf


0001-try-to-add-vacuum_committs_age.patch
Description: 0001-try-to-add-vacuum_committs_age.patch