Re: Time to add a Git .mailmap?

2024-11-06 Thread Michael Paquier
On Wed, Nov 06, 2024 at 11:36:42AM +0100, Alvaro Herrera wrote:
> Funnily enough, I'm the only committer that has ever used "Michaël", but
> I'm not the only one to have used the mistaken "Paquiër".  Go figure.

I haven't noticed this one.  That's new :DD
--
Michael




Re: Inconsistent RestrictInfo serial numbers

2024-11-06 Thread Richard Guo
On Tue, Oct 8, 2024 at 8:20 PM Richard Guo  wrote:
>
> I ran into an "ERROR:  variable not found in subplan target lists"
> error, which can be reproduced with the following query.

> To fix, I think we can reset the root->last_rinfo_serial counter after
> generating the additional constant-FALSE RestrictInfo.  Please see
> attached.

I intend to push this patch shortly, barring any objections, so that
it can catch up with the next minor release (Nov. 14).

Thanks
Richard




Re: Commit Timestamp and LSN Inversion issue

2024-11-06 Thread Amit Kapila
On Tue, Nov 5, 2024 at 7:28 PM Jan Wieck  wrote:
>
> >
> > We can't forget CDR completely as this could only be a potential
> > problem in that context. Right now, we don't have any built-in
> > resolution strategies, so this can't impact but if this is a problem
> > then we need to have a solution for it before considering a solution
> > like "last_write_wins" strategy.
>
> I agree that we can't forget about CDR. This is precisely the problem we
> ran into here at pgEdge and why we came up with a solution (attached).
>

I would like to highlight that we need to solve the LSN<->Timestamp
inversion issue not only for resolution strategies like
'last_write_wins' but also for conflict detection. In
particular, while implementing/discussing the patch to detect the
update_deleted conflict type, we came across the race conditions [1]
where the inversion issue discussed here would lead to removing the
required rows before we could detect the conflict. So, +1 to solve
this issue.

> > Now, instead of discussing LSN<->timestamp inversion issue, you
> > started to discuss "last_write_wins" strategy itself which we have
> > discussed to some extent in the thread [2]. BTW, we are planning to
> > start a separate thread as well just to discuss the clock skew problem
> > w.r.t resolution strategies like "last_write_wins" strategy. So, we
> > can discuss clock skew in that thread and keep the focus of this
> > thread LSN<->timestamp inversion problem.
>
> Fact is that "last_write_wins" together with some implementation of
> Conflict free Replicated Data Types (CRDT) is good enough for many real
> world situations. Anything resembling a TPC-B or TPC-C is quite happy
> with it.
>
> The attached solution is minimally invasive because it doesn't move the
> timestamp generation (clock_gettime() call) into the critical section of
> ReserveXLogInsertLocation() that is protected by a spinlock. Instead it
> keeps track of the last commit-ts written to WAL in shared memory and
> simply bumps that by one microsecond if the next one is below or equal.
> There is one extra condition in that code section plus a function call
> by pointer for every WAL record.
>

I think we should avoid calling hook/callback functions while holding a lock
(spinlock or LWLock), as the user may do an expensive operation or
acquire some other locks in those functions, which could lead to
deadlocks or impact concurrency. So, it would be better to
directly call an inline function to perform the required operation.
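
To make that concrete, here is a minimal sketch of the kind of inline
function I have in mind (all names here are hypothetical; the real shared
state and locking would live alongside the WAL-insert code path):

#include <stdint.h>

/* stands in for the backend's TimestampTz (microseconds since epoch) */
typedef int64_t CommitTs;

typedef struct CommitTsShared
{
	CommitTs	last_committs;	/* last commit timestamp written to WAL */
} CommitTsShared;

/* Caller is assumed to already hold the lock protecting 'shared'. */
static inline CommitTs
AdjustCommitTimestamp(CommitTsShared *shared, CommitTs ts)
{
	if (ts <= shared->last_committs)
		ts = shared->last_committs + 1;		/* bump by one microsecond */
	shared->last_committs = ts;
	return ts;
}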

This sounds like a good idea to solve this problem. Thanks for sharing
the patch.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1LKgkjyNKeW5jEhy1%3DuE8z0p7Pdae0rohoJP51eJGd%3Dgg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Add column name to error description

2024-11-06 Thread Vladlen Popolitov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:    not tested

Only tests checked.

I applied the patch (no errors) and built it (no errors).

Tests passed without errors: 
make check
make installcheck
make check-world

Re: Rename Function: pg_postmaster_start_time

2024-11-06 Thread David G. Johnston
On Wed, Nov 6, 2024 at 1:17 PM Maiquel Grassi  wrote:

> > I can agree that pg_postmaster_ has the potential to be confusing to
> > users, but I agree that if we are to do anything it should be an alias
> > while maintaining the old name for compatibility.
> >
> > Looking at similar functions it's clear they don't use the pg_postgres_
> > prefix, like for example pg_conf_load_time.  Should this if so be
> > pg_start_time?
>
> The name pg_conf_load_time doesn't seem that confusing to me. However,
> it doesn't provide clarity on which specific configuration file or which
> configuration was "reloaded,"
>

Since it reloads "all" of them, such specificity would not be warranted.

This is actually a problem for pg_start_time though, especially if it stays
in the table "Session Information Functions" since one would rightly assume
it returns the time this client-session process started, not the postmaster
process.

The idea of using aliases would be a long-term solution, if you agree with
> me, of course, that we don't write PostgreSQL for ourselves, and good
> names should
> always be considered with the end users in mind.
>

Except it doesn't seem like users are confused.  It is just someone in the
ivory tower thinking they could be.  It feels like make-work for a problem
that doesn't actually exist.

So even agreeing with you that naming for comprehension by users is
important, this change doesn't seem necessary.  If they see the name in the
documentation and are confused by the terminology "postmaster", maybe we can
cross-reference the glossary where it is defined.

This is a function that is likely often executed in third-party scripts and
not that often executed manually.  That further increases the burden of
change, especially relative to memorization burden for a user typing it in
manually.

David J.


RE: Rename Function: pg_postmaster_start_time

2024-11-06 Thread Maiquel Grassi
>Except it doesn't seem like users are confused.  It is just someone in the 
>ivory tower thinking they could be.  It feels like make-work for a problem 
>that doesn't actually exist.

Hello. I understand that this type of comment could be interpreted as
discouraging and detrimental to the development of new ideas and
the growth of the community as a whole. I recognize that your
perspective has great value in helping to identify potential situations
that may not be viable. However, I would like to suggest that the way
these issues are expressed publicly could be more careful to avoid
misunderstandings.

Maiquel.


RE: Rename Function: pg_postmaster_start_time

2024-11-06 Thread Maiquel Grassi
>This function has this name since 600da67fbe5e back from 2008.
>Changing that 16 years later will break things.

Certainly, there are more considerations to take into account
than I initially realized. One possibility would be to create an
alias (or synonym) for the function "pg_postmaster_start_time",
naming it "pg_postgres_start_time". This way, the name change
could occur gradually and strategically over the coming years,
without immediate impact. I believe this approach could be
viable to facilitate a future transition without abrupt breaks.
Maiquel



Re: Rename Function: pg_postmaster_start_time

2024-11-06 Thread Daniel Gustafsson
> On 6 Nov 2024, at 20:28, Maiquel Grassi  wrote:
> 
> >This function has this name since 600da67fbe5e back from 2008.
> >Changing that 16 years later will break things.
> 
> Certainly, there are more considerations to take into account
> than I initially realized. One possibility would be to create an
> alias (or synonym) for the function "pg_postmaster_start_time",
> naming it "pg_postgres_start_time". This way, the name change
> could occur gradually and strategically over the coming years,
> without immediate impact. I believe this approach could be
> viable to facilitate a future transition without abrupt breaks. Maiquel

I can agree that pg_postmaster_ has the potential to be confusing to users, but
I agree that if we are to do anything it should be an alias while maintaining the
old name for compatibility.

Looking at similar functions it's clear they don't use the pg_postgres_ prefix,
like for example pg_conf_load_time.  Should this if so be pg_start_time?

--
Daniel Gustafsson





RE: Rename Function: pg_postmaster_start_time

2024-11-06 Thread Maiquel Grassi
>I can agree that pg_postmaster_ has the potential to be confusing to users, but
>I agree that if we are to do anything it should be an alias while maintaining the
>old name for compatibility.
>
>Looking at similar functions it's clear they don't use the pg_postgres_ prefix,
>like for example pg_conf_load_time.  Should this if so be pg_start_time?

The name pg_conf_load_time doesn't seem that confusing to me. However,
it doesn't provide clarity on which specific configuration file or which
configuration was "reloaded," so depending on the user sitting behind the
computer, it could indeed be considered confusing as well. I wouldn't know
how to suggest a better alias for this function, "pg_conf_load_time." I like
to always remind that PostgreSQL is written for users, not for us. The
current names and terms are very clear to us, actually, extremely clear.
For someone here in the hacker community, saying "postgres" or "postmaster"
is irrelevant — we are all more than advanced users of PostgreSQL, but we
are the minority of minorities within the range of users worldwide who use
PostgreSQL. Well, in that sense, I honestly only thought of a better and
less confusing name for "pg_postmaster_start_time." For the other functions
that may have poor names, I haven't done an assessment. I don't have a
suggestion for your point, maybe someone else does. The idea of using
aliases would be a long-term solution, if you agree with me, of course,
that we don't write PostgreSQL for ourselves, and good names should always
be considered with the end users in mind.

Maiquel


Re: index prefetching

2024-11-06 Thread Peter Geoghegan
On Wed, Nov 6, 2024 at 12:25 PM Tomas Vondra  wrote:
> Attached is an updated version of this patch series. The first couple
> parts (adding batching + updating built-in index AMs) remain the same,
> the new part is 0007 which switches index scans to read stream API.

The first thing that I notice about this patch series is that it
doesn't fully remove amgettuple as a concept. That seems a bit odd to
me. After all, you've invented a single page batching mechanism, which
is duplicative of the single page batching mechanism that each
affected index AM has to use already, just to be able to allow the
amgettuple interface to iterate backwards and forwards with a
scrollable cursor (and to make mark/restore work). ISTM that you have
one too many batching interfaces here.

I can think of nothing that makes the task of completely replacing
amgettuple particularly difficult. I don't think that the need to do
the _bt_killitems stuff actually makes this task all that much harder.
It will need to be generalized, too, by keeping track of multiple
BTScanOpaqueData.killedItems[] style states, each of which is
associated with its own page-level currPos state. But that's not
rocket science. (Also don't think that mark/restore support is all
that hard.)

The current way in which _bt_kill_batch() is called from
_bt_steppage() by the patch seems weird to me. You're copying what you
actually know to be the current page's kill items such that
_bt_steppage() will magically do what it does already when the
amgettuple/btgettuple interface is in use, just as we're stepping off
the page. It seems to be working at the wrong level.

Notice that the current way of doing things in your patch means that
your new batching interface tacitly knows about the nbtree batching
interface, and that it too works along page boundaries -- that's the
only reason why it can hook into _bt_steppage like this in the first
place. Things are way too tightly coupled, and the old and new way of
doing things are hopelessly intertwined. What's being abstracted away
here, really?

I suspect that _bt_steppage() shouldn't be calling _bt_kill_batch() at
all -- nor should it even call _bt_killitems(). Things need to be
broken down into smaller units of work that can be reordered, instead.

The first half of the current _bt_steppage() function, which deals with
finishing off the current leaf page, should be moved to some other
function -- let's call it _bt_finishpage. A new callback should be
called as part of the new API when the time comes to tell nbtree that
we're now done with a given leaf page -- that's what this new
_bt_finishpage function is for. All that remains of _bt_steppage() are
the parts that deal with figuring out which page should be visited
next -- the second half of _bt_steppage stays put.

That way stepping to the next page and reading multiple pages can be
executed as eagerly as makes sense -- we don't need to "coordinate"
the heap accesses in lockstep with the leaf page accesses. Maybe you
won't take advantage of this flexibility right away, but ISTM that you
need nominal support for this kind of reordering to make the new API
really make sense.

There are some problems with this scheme, but they seem reasonably
tractable to me. We already have strategies for dealing with the risk
of concurrent TID recycling when _bt_killitems is called with some
maybe-recycled TIDs -- we're already dropping the pin on the leaf page
early in many cases. I've pointed this out many times already (again,
see _bt_drop_lock_and_maybe_pin).

It's true that we're still going to have to hold onto a buffer pin on
leaf pages whose TIDs haven't all been read from the table AM side
yet, unless we know that it's a case where that's safe for other
reasons -- otherwise index-only scans might give wrong answers. But
that other problem shouldn't be confused with the _bt_killitems
problem, just because of the superficial similarity around holding
onto a leaf page pin.

To repeat: it is important that you not conflate the problems on the
table AM side (TID recycle safety for index scans) with the problems
on the index AM side (safely setting LP_DEAD bits in _bt_killitems).
They're two separate problems that are currently dealt with as one
problem on the nbtree side -- but that isn't fundamental. Teasing them
apart seems likely to be helpful here.

> I speculated that with the batching concept it might work better, and I
> think that turned out to be the case. The batching is still the core
> idea, giving the index AM enough control to make kill tuples work (by
> not generating batches spanning multiple leaf pages, or doing something
> smarter). And the read stream leverages that too - the next_block
> callback returns items from the current batch, and the stream is reset
> between batches. This is the same prefetch restriction as with the
> explicit prefetching (done using posix_fadvise), except that the
> prefetching is done by the read stream.

ISTM that the central feature of the new 

Re: define pg_structiszero(addr, s, r)

2024-11-06 Thread Michael Paquier
On Thu, Nov 07, 2024 at 08:10:17AM +1300, David Rowley wrote:
> Did you try with a size where there's a decent remainder, say 124
> bytes? FWIW, one of the cases has 112 bytes, and I think that is
> aligned memory meaning we'll do the first 64 in the SIMD loop and have
> to do 48 bytes in the byte-at-a-time loop. If you had the loop Michael
> mentioned, that would instead be 6 loops of size_t-at-a-time.

See the attached allzeros.c, based on the previous versions exchanged.
And now just imagine a structure like that:
#define BLCKSZ 48
typedef union AlignedBlock
{
chardata[BLCKSZ];
double  force_align_d;
int64_t force_align_i64;
} AlignedBlock;

With this structure, the first byte-at-a-time (char) step is skipped
because the pointer is aligned when allocated, and the second step with
the potential SIMD is skipped because the structure is small enough at 48
bytes.  Hence only the last step would do the allzero check.  Adding a
size_t step to force a loop is going to be more efficient, as proved
upthread:
$ gcc -o allzeros -march=native -O2 allzeros.c
$ ./allzeros
allzeros: done in 118332297 nanoseconds
allzeros_v2: done in 13877745 nanoseconds (8.52677 times faster)

The allzero check is used for pgstat entries, and it could be possible
that some out-of-core code needs to rely on such small-ish sizes, or
even something else when a patch author feels like it.  So let's
optimize it as much as we think we can: that's what this discussion
is about.
--
Michael
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LOOPS 1000
#define BLCKSZ 48

typedef union AlignedBlock
{
	char		data[BLCKSZ];
	double		force_align_d;
	int64_t		force_align_i64;
} AlignedBlock;

static inline bool
isallzeros(const void *ptr, size_t len)
{
	const char *p = (const char *) ptr;
	const char *end = &p[len];
	const char *aligned_end = (const char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));

	while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
	{
		if (p == end)
			return true;
		if (*p++ != 0)
			return false;
	}
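	/* bulk phase: check eight aligned size_t words per iteration */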
	for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
	{
		if ((((size_t *) p)[0] != 0) |
			(((size_t *) p)[1] != 0) |
			(((size_t *) p)[2] != 0) |
			(((size_t *) p)[3] != 0) |
			(((size_t *) p)[4] != 0) |
			(((size_t *) p)[5] != 0) |
			(((size_t *) p)[6] != 0) |
			(((size_t *) p)[7] != 0))
			return false;
	}
	while (p < end)
	{
		if (*p++ != 0)
			return false;
	}
	return true;
}

static inline bool
isallzeros_v2(const void *ptr, size_t len)
{
	const char *p = (const char *) ptr;
	const char *end = &p[len];
	const char *aligned_end = (const char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));

	while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
	{
		if (p == end)
			return true;
		if (*p++ != 0)
			return false;
	}
	for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
	{
		if (((size_t *) p)[0] != 0 |
			((size_t *) p)[1] != 0 |
			((size_t *) p)[2] != 0 |
			((size_t *) p)[3] != 0 |
			((size_t *) p)[4] != 0 |
			((size_t *) p)[5] != 0 |
			((size_t *) p)[6] != 0 |
			((size_t *) p)[7] != 0)
			return false;
	}
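	/* check remaining aligned words one size_t at a time (the extra step discussed upthread) */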
	for (; p < aligned_end; p += sizeof(size_t))
	{
		if (*(size_t *)p != 0)
			return false;
	}
	while (p < end)
	{
		if (*p++ != 0)
			return false;
	}
	return true;
}

#define NANOSEC_PER_SEC 1000000000

// Returns difference in nanoseconds
int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
	int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
	nanosec += (t1->tv_nsec - t2->tv_nsec);

	return nanosec;
}

int main()
{
	AlignedBlock *pagebytes;
	volatile bool result;
	struct timespec start,end;
	int64_t char_time, size_t_time;

	pagebytes = (AlignedBlock *) malloc(sizeof(AlignedBlock));
	memset(pagebytes, 0, sizeof(AlignedBlock));

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);

	for (int i = 0; i < LOOPS; i++)
	{
		result = isallzeros(pagebytes, sizeof(AlignedBlock));
	}

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
	char_time = get_clock_diff(&end, &start);
	printf("allzeros: done in %ld nanoseconds\n", char_time);

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);

	for (int i = 0; i < LOOPS; i++)
	{
		result = isallzeros_v2(pagebytes, sizeof(AlignedBlock));
	}

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
	size_t_time = get_clock_diff(&end, &start);
	printf("allzeros_v2: done in %ld nanoseconds (%g times faster)\n", size_t_time, (double) char_time / size_t_time);

	return 0;
}




Re: define pg_structiszero(addr, s, r)

2024-11-06 Thread Michael Paquier
On Thu, Nov 07, 2024 at 08:05:10AM +1300, David Rowley wrote:
> That might be quite good for small lengths or for use cases where the
> memory is always or almost always zero. The problem is there's no
> early exit when you find the first non-zero which means, for larger
> lengths, reading much more memory. That'll both take longer and
> possibly evict cache lines which might be useful to have in the near
> future.

Didn't know this one either, thanks for the explanation.
--
Michael




Re: per backend I/O statistics

2024-11-06 Thread Michael Paquier
On Wed, Nov 06, 2024 at 01:51:02PM +, Bertrand Drouvot wrote:
> That's not needed, the patch I'm working on stores the proc number in the
> objid field of the key.

Relying on the procnumber for the object ID gets a +1 here.  That
provides an automatic cap on the maximum number of entries that can
exist at once for this new stats kind.
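
For reference, a rough sketch of how such a key could be filled (the stats
kind name below is made up; the patch may well do this differently):

	PgStat_HashKey key;

	key.kind = PGSTAT_KIND_BACKEND;		/* hypothetical new stats kind */
	key.dboid = InvalidOid;
	key.objid = (uint64) MyProcNumber;	/* proc number bounds the entry count */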
--
Michael




Re: New function normal_rand_array function to contrib/tablefunc.

2024-11-06 Thread Andy Fan


Hi,

Looks like I missed some interesting discussions in the recent days about
the pretty neat API names random_array or array_random (I prefer
random_array because of the setseed stuff, as Dean said). These
discussions absolutely enrich my API / documentation design experience.

I'm still not sure if it should be put into core; I agree that
random_array/array_random should be put into core if we go that
way. It looks to me that the cost of putting them into core is just
that it takes more OIDs in initdb, hence making the catalog a bit bigger;
I'm not sure how much that really matters.

As for whether we should support random_array_multidim in the first step, as
a developer I would like to try the multi-dimensional stuff since I have
little experience in this area, and it is possible that it brings some
other interesting experience for me, as has happened to me recently.


> I suggest implementing the function in several steps. First implement
> random_array(min, max, len). It's straightforward and IMO will provide
> the most value to the majority of the users.Then we can either add an
> optional argument or a random_array_multidim() function.

This may double the OID expense; every variant needs 3 OIDs (int, bigint,
numeric).

> This can be
> implemented and discussed as a separate patch. This approach would be
> convenient for the author and the reviewers and also will allow us to
> deliver value to the users sooner.

I appreciate a suggestion like this, Aleksander; I can understand
your kind intention!

If there are no new ideas, I will defer to Dean's final decision, for
efficiency's sake.

-- 
Best Regards
Andy Fan





Re: optimize the value of vacthresh and anlthresh

2024-11-06 Thread Frédéric Yhuel




On 11/4/24 09:30, wenhui qiu wrote:

Hi hackers
     A few days ago, I was looking at the sql server documentation and
found that sql server has optimized the algorithm related to updating
statistics in the 2016 version. I think we can also learn from the
implementation method of sql server to optimize the problem of automatic
vacuum triggered by large tables. The document link:
https://learn.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-ver16
I refer to the sql server implementation method and submit a patch; please
see the attachment. Hope to get the help of expert hackers.


similar idea here : 
https://www.postgresql.org/message-id/6a2ac9b7-6535-4bb1-8274-0647f7c31c82%40dalibo.com





Re: Converting contrib SQL functions to new style

2024-11-06 Thread Ronan Dunklau
On Wednesday, 6 November 2024 at 06:52:16 Central European Standard Time,
Michael Paquier wrote:
> On Tue, Nov 05, 2024 at 08:05:16PM -0500, Tom Lane wrote:
> > No, I don't think so.  For one thing, it would not help existing
> > installations unless they issue "ALTER EXTENSION UPDATE", which
> > people are not likely to do in a minor update.  But also, we don't
> > know of live attacks against these functions with their current
> > definitions, so I don't think this is urgent.
> 
> OK, thanks.

For most of them I agree, but one side effect of the current implementation is
that we have a bug when pg_upgrad'ing if earthdistance is installed:
https://www.postgresql.org/message-id/flat/152106914669.1223.5104148605998271987%40wrigleys.postgresql.org

The new version of earthdistance fixes that problem, so this one would be worth 
a back patch to 16, giving users the opportunity to update earthdistance 
before running pg_upgrade.

Thanks for the tip about the CI, will make sure to have it setup before next 
time.

--
Ronan Dunklau







Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-11-06 Thread Amit Kapila
On Thu, Oct 31, 2024 at 11:05 PM Bruce Momjian  wrote:
>
> Yes, all good suggestions, updated patch attached.
>

LGTM.

-- 
With Regards,
Amit Kapila.




Re: Eager aggregation, take 3

2024-11-06 Thread Richard Guo
On Fri, Nov 1, 2024 at 9:42 PM Robert Haas  wrote:
> On Fri, Nov 1, 2024 at 2:21 AM Richard Guo  wrote:
> > ... an aggregated row from the partial
> > aggregation matches the other side of the join if and only if each row
> > in the partial group does, thereby ensuring that all rows in the same
> > partial group have the same 'destiny'.
>
> Ah, I really like this turn of phrase! I think it's clearer and
> simpler than what I said. And I think it implies that we don't need to
> explicitly deduce surrogate grouping keys. For example if we have  A
> JOIN B JOIN C JOIN D JOIN E JOIN F, grouped by columns from A, we
> don't need to work out surrogate grouping keys for B and then C and
> then D and then E and then F. We can just look at F's join clauses and
> that tells us how to group, independent of anything else.
>
> But is there any hole in that approach? I think the question is
> whether the current row could be used in some way that doesn't show up
> in the join clauses. I can't think of any way for that to happen,
> really. I believe that any outerjoin-delayed quals will show up as
> join clauses, and stuff like ORDER BY and HAVING will happen after the
> aggregation (at least logically) so it should be fine. Windowed
> functions and ordered aggregates may be a blocker to the optimization,
> though: if the window function needs access to the unaggregated rows,
> or even just needs to know how many rows there are, then we'd better
> not aggregate them before the window function runs; and if the
> aggregate is ordered, we can only partially aggregate the data if it
> is already ordered in a way that is compatible with the final, desired
> ordering. Another case we might need to watch out for is RLS. RLS
> wants to apply all the security quals before any non-leakproof
> functions, and pushing down the aggregation might push an aggregate
> function past security quals. Perhaps there are other cases to worry
> about as well; this is all I can think of at the moment.

Yeah, ordered aggregates could be a blocker.  I think it might be best
to prevent the use of eager aggregation if root->numOrderedAggs > 0
for now.

I've been thinking about the window functions case, as Jian He also
mentioned it some time ago.  It seems that the window function's
argument(s), as well as the PARTITION BY expression(s), are supposed
to appear in the GROUP BY clause or be used in an aggregate function.
And window functions are applied after the aggregation.  So it seems
that there is no problem with window functions.  But maybe I'm wrong.

I hadn't considered the RLS case before, but I think you're right.  To
simplify things, maybe for now we can just prevent pushing down the
aggregation if the query applies some RLS policy, by checking
query->hasRowSecurity.
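
In code form, a first cut could be as simple as a guard like the following
in the path that considers eager aggregation (just a sketch; where exactly
it belongs is left open):

	/* skip eager aggregation for the problematic cases discussed above */
	if (root->numOrderedAggs > 0 || root->parse->hasRowSecurity)
		return;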

> But regardless of those kinds of cases, the basic idea that we want
> the partially aggregate rows to join if and only if the unaggregated
> rows would have joined seems exactly correct to me, and that provides
> theoretical justification for deriving the surrogate grouping key
> directly from the join quals. Woot!

Thank you for the confirmation!

Thanks
Richard




Re: Using read stream in autoprewarm

2024-11-06 Thread Nazir Bilal Yavuz
Hi,

On Fri, 1 Nov 2024 at 21:06, Andrey M. Borodin  wrote:
>
> > On 1 Nov 2024, at 12:51, Nazir Bilal Yavuz  wrote:
> >
> > I am not
> > sure whether 'BufferStrategyControl.lastFreeBuffer -
> > BufferStrategyControl.firstFreeBuffer' is safe to use.
>
> Ugh... it will work. But it seems to me too dirty hack. There's no scalable 
> way to know size of a free list.
> Let's just comment that we might read some more buffers if database does not 
> fit into memory?
> Alternatively we can count size of a free list on the start.

I agree that it is too dirty a hack. There is a minor problem with
counting the size of the free list at the start: there may be other
processes filling the buffer pool concurrently, so we can still end
up doing unnecessary I/Os. That said, I believe this approach remains
an improvement.

The first patch includes the comment you suggested, and the second
patch implements counting the size of the free list at the start.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From bcdb55f237945072cb5740392515af701cfdc1d2 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 5 Nov 2024 11:40:14 +0300
Subject: [PATCH v2 1/2] Use read stream in autoprewarm

Instead of reading blocks with ReadBufferExtended(), create read stream
object for each possible case and use it.

This change provides about 10% performance improvement.
---
 contrib/pg_prewarm/autoprewarm.c | 112 +--
 1 file changed, 107 insertions(+), 5 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index fac4051e1aa..8291bf3c427 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -41,6 +41,7 @@
 #include "storage/latch.h"
 #include "storage/lwlock.h"
 #include "storage/procsignal.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/guc.h"
@@ -422,6 +423,68 @@ apw_load_buffers(void)
 		apw_state->prewarmed_blocks, num_elements)));
 }
 
+struct apw_read_stream_private
+{
+	bool		first_block;
+	int			max_pos;
+	int			pos;
+	BlockInfoRecord *block_info;
+	BlockNumber nblocks_in_fork;
+
+};
+
+static BlockNumber
+apw_read_stream_next_block(ReadStream *stream,
+		   void *callback_private_data,
+		   void *per_buffer_data)
+{
+	struct apw_read_stream_private *p = callback_private_data;
+	bool	   *rs_have_free_buffer = per_buffer_data;
+	BlockInfoRecord *old_blk;
+	BlockInfoRecord *cur_blk;
+
+	*rs_have_free_buffer = true;
+
+	/*
+	 * There may still be queued blocks in the stream even when no free
+	 * buffers are available in the buffer pool. This can lead to unnecessary
+	 * I/O operations and buffer evictions. One possible solution is to
+	 * compare the number of free buffers in the buffer pool with the number
+	 * of queued blocks in the stream. However, this approach is considered a
+	 * workaround and would add complexity with minimal benefit, as only a few
+	 * unnecessary I/O operations and buffer evictions are expected.
+	 * Therefore, this solution has not been implemented.
+	 */
+	if (!have_free_buffer())
+	{
+		*rs_have_free_buffer = false;
+		return InvalidBlockNumber;
+	}
+
+	if (p->pos == p->max_pos)
+		return InvalidBlockNumber;
+
+	if (p->first_block)
+	{
+		p->first_block = false;
+		return p->block_info[p->pos++].blocknum;
+	}
+
+	old_blk = &(p->block_info[p->pos - 1]);
+	cur_blk = &(p->block_info[p->pos]);
+
+	if (old_blk->database == cur_blk->database &&
+		old_blk->forknum == cur_blk->forknum &&
+		old_blk->filenumber == cur_blk->filenumber &&
+		cur_blk->blocknum < p->nblocks_in_fork)
+	{
+		p->pos++;
+		return cur_blk->blocknum;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Prewarm all blocks for one database (and possibly also global objects, if
  * those got grouped with this database).
@@ -435,6 +498,9 @@ autoprewarm_database_main(Datum main_arg)
 	BlockNumber nblocks = 0;
 	BlockInfoRecord *old_blk = NULL;
 	dsm_segment *seg;
+	ReadStream *stream = NULL;
+	struct apw_read_stream_private p;
+	bool	   *rs_have_free_buffer;
 
 	/* Establish signal handlers; once that's done, unblock signals. */
 	pqsignal(SIGTERM, die);
@@ -451,13 +517,16 @@ autoprewarm_database_main(Datum main_arg)
 	block_info = (BlockInfoRecord *) dsm_segment_address(seg);
 	pos = apw_state->prewarm_start_idx;
 
+	p.block_info = block_info;
+	p.max_pos = apw_state->prewarm_stop_idx;
+
 	/*
 	 * Loop until we run out of blocks to prewarm or until we run out of free
 	 * buffers.
 	 */
-	while (pos < apw_state->prewarm_stop_idx && have_free_buffer())
+	for (; pos < apw_state->prewarm_stop_idx; pos++)
 	{
-		BlockInfoRecord *blk = &block_info[pos++];
+		BlockInfoRecord *blk = &block_info[pos];
 		Buffer		buf;
 
 		CHECK_FOR_INTERRUPTS();
@@ -470,6 +539,18 @@ autoprewarm_database_main(Datum main_arg)
 			old_blk->database != 0)
 			break;
 
+		/*
+		 * If stream needs to be created again, end it before closing the old
+		 * relation.
+		 */
+		if (stream && (old_blk == NULL ||
+	  

Re: doc issues in event-trigger-matrix.html

2024-11-06 Thread Peter Eisentraut

On 29.10.24 23:33, Michael Paquier wrote:

On Tue, Oct 29, 2024 at 03:53:43PM +0100, Daniel Gustafsson wrote:

+1, I think this is a net improvement.


Agreed.  I have spent some time in the past few years looking
at patches that tweaked this table, and it was always hard to figure
out if it was completely right.


The only thing I would change on top of this is move the reference to section
9.30 under table_rewrite to be at the end after both supporting functions since
the link is relevant to both of them. Something like:

-pg_event_trigger_table_rewrite_oid() (see
-). To discover the reason(s)
-for the rewrite, use the function
-pg_event_trigger_table_rewrite_reason().
+pg_event_trigger_table_rewrite_oid() To discover the reason(s)
+for the rewrite, use the function pg_event_trigger_table_rewrite_reason()
+(see ).


Fine by me to tweak this paragraph like that with the link at the end.


Committed with the suggested changes.





Re: Inconsistent RestrictInfo serial numbers

2024-11-06 Thread Ashutosh Bapat
On Wed, Nov 6, 2024 at 4:10 PM Richard Guo  wrote:
>
> On Tue, Oct 8, 2024 at 8:20 PM Richard Guo  wrote:
> >
> > I ran into an "ERROR:  variable not found in subplan target lists"
> > error, which can be reproduced with the following query.
>
> > To fix, I think we can reset the root->last_rinfo_serial counter after
> > generating the additional constant-FALSE RestrictInfo.  Please see
> > attached.
>
> I intend to push this patch shortly, barring any objections, so that
> it can catch up with the next minor release (Nov. 14).

I didn't get time to respond to your email and create a patch based on
my proposal. Sorry. I may not get time for another week or so. Please
feel free to push the patch, if you feel that it's in a committable
state.

-- 
Best Wishes,
Ashutosh Bapat




Re: doc issues in event-trigger-matrix.html

2024-11-06 Thread Peter Eisentraut

On 30.10.24 13:31, jian he wrote:

On Tue, Oct 29, 2024 at 7:54 PM Peter Eisentraut  wrote:


I made a patch for this.  I have expanded the narrative discussion on
what commands are supported for event triggers, also made a few
corrections/additions there, based on inspecting the source code.  And
then removed the big matrix, which doesn't provide any additional
information, I think.

I think this is sufficient and covers everything.  The only hand-wavy
thing I can see is exactly which ALTER commands trigger the sql_drop
event.  But this was already quite imprecise before, and I think also
not quite correct.  This might need a separate investigation.

In any case, we can use this as a starting point to iterate on the right
wording etc.


hi. I have some minor issue.


  An event trigger fires whenever the event with which it is associated
  occurs in the database in which it is defined.

is possible to rewrite this sentence, two "which" is kind of not easy
to understand?


I couldn't think of anything simpler that wouldn't be weirdly nested in 
some other way.  This wasn't really related to this patch, so I didn't 
touch it.  But suggestions are welcome.



create role alice;
create role bob;
grant alice to bob;

  As an exception, this event does not occur for DDL commands targeting
  shared objects:
  
   databases
   roles
   tablespaces
   parameter privileges
   ALTER SYSTEM
  
  This event also does not occur for commands targeting event triggers
  themselves.


not 100% sure this description
"  roles"
cover case like "grant alice to bob;"
Here "targeting  shared objects" is "role related meta information".
maybe a new item like
roles privileges.


Yeah, I added a clarification in the committed version.





Re: optimize the value of vacthresh and anlthresh

2024-11-06 Thread wenhui qiu
Hi Frédéric
many thanks for your email.
I'll go and see.

On Wed, 6 Nov 2024 at 17:47, Frédéric Yhuel 
wrote:

>
>
> On 11/4/24 09:30, wenhui qiu wrote:
> > Hi hackers
> >  A few days ago, I was looking at the sql server documentation and
> > found that sql server has optimized the algorithm related to updating
> > statistics in the 2016 ,version,I think we can also learn from the
> > implementation method of sql server to optimize the problem of automatic
> > vacuum triggered by large tables,The Document
> > link(https://learn.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-ver16),I
> > refer to the sql server implementation method, submit a patch, please see
> > the attachment, hope to get the help of expert hackers
>
> similar idea here :
>
> https://www.postgresql.org/message-id/6a2ac9b7-6535-4bb1-8274-0647f7c31c82%40dalibo.com
>


Re: define pg_structiszero(addr, s, r)

2024-11-06 Thread Bertrand Drouvot
Hi,

On Wed, Nov 06, 2024 at 01:44:58PM +0900, Michael Paquier wrote:
> Should the last loop check only 1 byte at a time or should this stuff
> include one more step before the last one you wrote to do a couple of
> checks with size_t?  That may matter for areas small enough (len <
> sizeof(size_t) * 8) causing the second step to not be taken, but large
> enough (len > sizeof(size_t)) to apply a couple of size_t checks per
> loop. 

Do you mean add:

"
for (; p < aligned_end; p += sizeof(size_t))
{
	if (*(size_t *) p != 0)
		return false;
}
"

just before the last loop?

If so, I did a few tests and did not see any major improvements. So, I thought
it's simpler to not add more code in this inline function in v7 shared 
up-thread.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: doc: pgevent.dll location

2024-11-06 Thread Peter Eisentraut

On 06.11.24 13:57, Ryohei Takahashi (Fujitsu) wrote:

The dll install paths are changed as follows on Windows.

(1) pgevent.dll
PG16: lib/
PG17: bin/

(2) dll for user (like libpq.dll, libecpg.dll)
PG16: Both in lib/ and bin/
PG17: bin/

(3) contrib dll (like amcheck.dll)
PG16: lib/
PG17: lib/


I understand that Dave says (1) and (2) should exist on bin/ directory
and the paths in PG17 are correct.

So, I think it is correct to update documentation.


I don't have Windows handy to test it out, but looking at the respective 
build system source files, in master, pgevent is built and installed 
like a normal shared library in both meson.build and Makefile, so it 
should end up somewhere in lib.


In src/tools/msvc in REL_16_STABLE, I see some code that appears to 
suggest that it installs in bin.


Again, this is just reading the code, but it seems to be backwards from 
what is claimed earlier.


The statements like "in PGxxx it did this" are not precise enough 
because there are three possible build systems.  We need to know what 
each build system is doing.  Also, the consideration of consistency 
should go in two dimensions: Consistency between versions and 
consistency between build systems.






Re: index_delete_sort: Unnecessary variable "low" is used in heapam.c

2024-11-06 Thread Daniel Gustafsson
> On 5 Nov 2024, at 22:08, Daniel Gustafsson  wrote:
> 
>> On 5 Nov 2024, at 17:40, Fujii Masao  wrote:
>> 
>> On 2024/09/24 21:31, Daniel Gustafsson wrote:
>>>> On 24 Sep 2024, at 10:32, btnakamurakoukil  wrote:
>>>>
>>>> I noticed unnecessary variable "low" in index_delete_sort()
>>>> (/postgres/src/backend/access/heap/heapam.c), patch attached. What do you
>>>> think?
>>> That variable does indeed seem to not be used, and hasn't been used since it
>>> was committed in d168b666823.  The question is if it's a left-over from
>>> development which can be removed, or if it should be set and we're missing 
>>> an
>>> optimization.  Having not read the referenced paper I can't tell so adding
>>> Peter Geoghegan who wrote this for clarification.
>> 
>> It's been about a month without updates. How about removing the unnecessary
>> variable as suggested? We can always add the "missing" logic later if needed.
> 
> Thanks for reviving this, I had admittedly forgotten about this thread.  I
> don't see any reason to not go ahead with the proposed diff.

Done now.

--
Daniel Gustafsson





Re: Eager aggregation, take 3

2024-11-06 Thread Robert Haas
On Wed, Nov 6, 2024 at 3:22 AM Richard Guo  wrote:
> Yeah, ordered aggregates could be a blocker.  I think it might be best
> to prevent the use of eager aggregation if root->numOrderedAggs > 0
> for now.
>
> I've been thinking about the window functions case, as Jian He also
> mentioned it some time ago.  It seems that the window function's
> argument(s), as well as the PARTITION BY expression(s), are supposed
> to appear in the GROUP BY clause or be used in an aggregate function.
> And window functions are applied after the aggregation.  So it seems
> that there is no problem with window functions.  But maybe I'm wrong.
>
> I hadn't considered the RLS case before, but I think you're right.  To
> simplify things, maybe for now we can just prevent pushing down the
> aggregation if the query applies some RLS policy, by checking
> query->hasRowSecurity.

Particularly for the RLS case, I think we should be reluctant to
disable the optimization entirely just because there might be a
problem. We have existing infrastructure to keep security quals from
being applied too late, and I suspect it's mostly applicable to this
situation. Therefore, I suspect it might not be very much work to
allow this optimization even when RLS is in use, as long as it
wouldn't actually cause a violation of the RLS rules. If, upon
investigation, you find some reason why we can't assess accurately
whether pushing down a specific aggregate is a problem, then the
approach that you propose is reasonable, but I think the question
should be investigated. I don't like the idea of giving up on
RLS-using queries completely without even trying to figure out how
difficult it would be to do the right thing.

I have similar but weaker feelings about ordered aggregates. Consider:

explain select t1.id, array_agg(t2.v order by t3.o) from t1, t2, t3
where t1.id = t2.id and t2.id = t3.id group by 1;

We can't partially aggregate t2, but we could partially aggregate t2
join t3. So this case is a lot like:

explain select t1.id, array_agg(t2.v + t3.o) from t1, t2, t3 where
t1.id = t2.id and t2.id = t3.id group by 1;

I don't know whether the patch handles the second case correctly right
now, but that certainly seems like a case that has to work. We must be
able to determine in such a case that the partial aggregate has to be
above the t2-t3 join. And if we can determine that, then why can't
basically the same logic handle the first case? There are certainly
some differences. The first case not only needs the aggregate to be
above the t2-t3 join but also needs the input data to be sorted, so we
don't get the right behavior for ordered aggregates just by using the
contents of the ORDER BY clause to decide at what level the partial
aggregate can be applied. On the other hand, if we're looking at paths
for (T2 JOIN T3) to build paths for PartialAgg(T2 join T3), the
stipulation that we need to use ordered paths or sorting doesn't make
the code very much more complicated. I'm open to the conclusion that
this is too much complexity but I'd rather not dismiss it instantly.

Regarding window functions, you've said a few times now that you don't
see the problem, but the more I think about it, the more obvious it
seems to me that there are BIG problems. Consider this example from
the documentation:

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname)
FROM empsalary;

I get a query plan like this:

 WindowAgg  (cost=83.46..104.37 rows=1200 width=72)
   ->  Sort  (cost=83.37..86.37 rows=1200 width=40)
 Sort Key: depname
 ->  Seq Scan on empsalary  (cost=0.00..22.00 rows=1200 width=40)

Already we see warning signs here. The WindowAgg node needs the input
rows to be ordered, because it's going to average the salary for each
group of rows with the same depname. So we have the same kinds of
issues that we do for ordered aggregates, at the very least. But
window aggregates are not just ordering-sensitive. They are also
empowered to look at other rows in the frame. Consider the following
example:

create table names (n text);
insert into names values ('Tom'), ('Dick'), ('Harry');
select n, lag(n, 1) over () from names;

The result is:

   n   |  lag
-------+-------
 Tom   |
 Dick  | Tom
 Harry | Dick

I think it is pretty obvious that if any form of partial aggregation
had been applied here, it would be impossible to correctly evaluate
lag(). Or am I missing something?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: doc fail about ALTER TABLE ATTACH re. NO INHERIT

2024-11-06 Thread Alvaro Herrera
On 2024-Nov-06, Amit Langote wrote:

> On Tue, Nov 5, 2024 at 9:01 PM Alvaro Herrera  wrote:

> > While doing final review for not-null constraints, I noticed that the
> > ALTER TABLE ATTACH PARTITION have this phrase:
> >
> > If any of the CHECK constraints of the table being attached are marked 
> > NO
> > INHERIT, the command will fail; such constraints must be recreated 
> > without the
> > NO INHERIT clause.

> I think that text might be talking about this case:
> 
> create table parent (a int, constraint check_a check (a > 0))
> partition by list (a);
> create table part1 (a int, constraint check_a check (a > 0) no inherit);
> alter table parent attach partition part1 for values in (1);
> ERROR:  constraint "check_a" conflicts with non-inherited constraint on child 
> table "part1"

Oh, hmm, that makes sense I guess.  Still, while this restriction makes
sense for inheritance, it doesn't IMO for partitioned tables.  I would
even suggest that we drop enforcement of this restriction during ATTACH.

> Perhaps the ATTACH PARTITION text should be changed to make clear
> which case it's talking about, say, like:
> 
> If any of the CHECK constraints of the table being attached are marked
> NO INHERIT, the command will fail if a constraint with the same name
> exists in the parent table; such constraints must be recreated without
> the NO INHERIT clause.

Hmm, not sure about it; I think we're giving too much prominence to a
detail that's not so important that it warrants four extra lines, when
it could be a parenthetical note next to the other mention of those
constraints earlier in that paragraph.  I suggest something like this:

 
  A partition using FOR VALUES uses same syntax for
  partition_bound_spec as
  CREATE TABLE.  
The partition bound specification
  must correspond to the partitioning strategy and partition key of the
  target table.  The table to be attached must have all the same columns
  as the target table and no more; moreover, the column types must also
  match.  Also, it must have all the NOT NULL and
  CHECK constraints of the target table.  Currently
  FOREIGN KEY constraints are not considered.
  UNIQUE and PRIMARY KEY constraints
  from the parent table will be created in the partition, if they don't
  already exist.
  If any of the CHECK constraints of the table being
  attached are marked NO INHERIT, the command will fail;
  such constraints must be recreated without the
  NO INHERIT clause.
 

I suggest we change it to

 
  A partition using FOR VALUES uses same syntax for
  partition_bound_spec as
  CREATE TABLE.  
The partition bound specification
  must correspond to the partitioning strategy and partition key of the
  target table.  The table to be attached must have all the same columns
  as the target table and no more; moreover, the column types must also
  match.  Also, it must have all the NOT NULL and
  CHECK constraints of the target table, not marked
  NO INHERIT.  Currently
  FOREIGN KEY constraints are not considered.
  UNIQUE and PRIMARY KEY constraints
  from the parent table will be created in the partition, if they don't
  already exist.
 

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"You don't solve a bad join with SELECT DISTINCT" #CupsOfFail
https://twitter.com/connor_mc_d/status/1431240081726115845




Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-11-06 Thread Alvaro Herrera
On 2024-Nov-05, Tom Lane wrote:

> I'm trying to write release notes for commits 53af9491a et al,
> and it seems to me that we need to explain how to get out of
> the mess that would be left behind by the old DETACH code.
> There's no hint about that in the commit message :-(

> Clearly, if you have now-inconsistent data, there's little
> help for that but to manually fix the inconsistencies.
> What I am worried about is how to get to a state where you
> have correct catalog entries for the constraint.
> 
> Will ALTER TABLE DROP CONSTRAINT on the now stand-alone table
> work to clean out the old catalog entries for the constraint?

Yes -- as far as I can tell, a DROP CONSTRAINT of the offending
constraint is successful and leaves no unwanted detritus.

Perhaps one more task for me is to figure out a way to get a list of all
the constraints that are broken because of this ... let me see if I can
figure that out.

> I'm worried that it will either fail, or go through but remove
> triggers on the referenced table that we still need for the
> original partitioned table.  If that doesn't work I think we had
> better create a recipe for manually removing the detritus.

As as far as I can see, it works and no triggers are spuriously removed.

> Once the old entries are gone it should be possible to do ALTER TABLE
> ADD CONSTRAINT (with an updated server), and that would validate
> your data.  It's the DROP CONSTRAINT part that worries me.

Yeah, that's correct: adding the constraint again after removing its
broken self detects that there are values violating the RI.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)




Re: pg_upgrade check for invalid databases

2024-11-06 Thread Daniel Gustafsson
> On 1 Nov 2024, at 01:36, Bruce Momjian  wrote:
> On Fri, Oct 25, 2024 at 01:55:57PM +0200, Daniel Gustafsson wrote:

>> In the meantime, the OP has a good point that it's a tad silly that 
>> pg_upgrade
>> fails hard on invalid databases instead of detecting and reporting like how
>> other errors are handled.  The attached adds this check and expands the 
>> report
>> wording to cover it.
> 
> Agreed.

I've applied this part; the discussion on whether or not pg_upgrade should gain
capabilities to skip and/or fix issues should probably be carried over to a new
thread.

--
Daniel Gustafsson





Re: Add missing tab completion for ALTER TABLE ADD COLUMN IF NOT EXISTS

2024-11-06 Thread Karina Litskevich
Hi Kirill,

I looked through your patches. First of all, please note that TAB
completion doesn't have to complete all valid grammatical
constructions. See the comment on the top of tab-complete.in.c:

 * This file implements a somewhat more sophisticated readline "TAB
 * completion" in psql. It is not intended to be AI, to replace
 * learning SQL, or to relieve you from thinking about what you're
 * doing. Also it does not always give you all the syntactically legal
 * completions, only those that are the most common or the ones that
 * the programmer felt most like implementing.

Considering this, I don't think 0001 should be accepted as is. Neither
"IF NOT EXIST" nor "IF EXISTS" appears anywhere in tab-complete.in.c.
Adding it for the only "ALTER TABLE ... ADD COLUMN" case doesn't seem
right. Adding it everywhere looks like lots of work of questionable
utility.

0003 looks good to me. Considering that we already have a completion of
"CREATE TEMP TABLE ... USING" with table access methods, completion of
"CREATE TABLE ..." with USING seems reasonable. It’s strange we didn’t
have this before. Good catch!
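
Just to illustrate the shape such a rule takes in tab-complete.in.c (the
match pattern below is a guess on my part; the actual rule in the patch may
well differ):

	/* hypothetical sketch, not the patch's exact rule */
	else if (TailMatches("CREATE", "TABLE", MatchAny, "(*)"))
		COMPLETE_WITH("AS", "PARTITION BY", "TABLESPACE", "USING", "WITH (");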

I can't judge if 0002 adds useful completion cases, but I noticed that
you add completion of "DROP/ADD ATTRIBUTE ..." with CASCADE/RESTRICT
while ignoring "ALTER/RENAME ATTRIBUTE ..." that can also be followed
by CASCADE/RESTRICT according to the documentation. Is it intentional?

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/




Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY

2024-11-06 Thread Shlok Kyal
Hi Aleksander,

>
> > Here the generated column 'b' is set as REPLICA IDENTITY for table
> > 'testpub_gencol'. When we create publication 'pub_gencol' we do not
> > specify any column list, so column 'b' will not be published.
> > So, the update message generated by the last UPDATE would have NULL
> > for column 'b'.
> >
> > To avoid the issue, we can disallow UPDATE/DELETE on table with
> > unpublished generated column as REPLICA IDENTITY. I have attached a
> > patch for the same.
>
> I don't think this would be a correct fix. Let's say I *don't* have
> any publications:
>
> ```
> =# CREATE TABLE testpub_gencol (a INT, b INT GENERATED ALWAYS AS (a + 1)
> STORED NOT NULL);
> CREATE TABLE
>
> =# CREATE UNIQUE INDEX testpub_gencol_idx ON testpub_gencol (b);
> CREATE INDEX
>
> =# INSERT INTO testpub_gencol (a) VALUES (1);
> INSERT 0 1
>
> =# UPDATE testpub_gencol SET a = 100 WHERE a = 1;
> UPDATE 1
> eax=# SELECT * FROM testpub_gencol ;
>   a  |  b
> -+-
>  100 | 101
> (1 row)
> ```
>
> So far everything works fine. You are saying that when one creates a
> publication UPDATEs should stop working. That would be rather
> surprising behavior for a typical user not to mention that it will
> break the current behavior.
>
> I believe one would expect that both UPDATEs and the publication
> should continue to work. Perhaps we should forbid the creation of a
> publication like this instead. Or alternatively include a generated
> column to the publication list if it's used as a replica identity. Or
> maybe even keep everything as is.
>
> Thoughts?
>

While testing I found that similar behaviors already exist in some
cases, where once we create a publication, UPDATEs might stop working.
For example:
Case1:
postgres=# create table t1(c1 int);
CREATE TABLE
postgres=# insert into t1 values(1);
INSERT 0 1
postgres=# update t1 set c1 = 100 where c1 = 1;
UPDATE 1
postgres=# create publication pub for table t1;
CREATE PUBLICATION
postgres=# update t1 set c1 = 100 where c1 = 1;
ERROR:  cannot update table "t1" because it does not have a replica
identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.

Case2:
postgres=# create table t2(c1 int, c2 int not null);
CREATE TABLE
postgres=# create unique index t2_idx on t2 (c2);
CREATE INDEX
postgres=# alter table t2 replica identity using index t2_idx;
ALTER TABLE
postgres=# insert into t2 values(1,1);
INSERT 0 1
postgres=# update t2 set c1 = 100 where c1 = 1;
UPDATE 1
postgres=# create publication pub2 for table t2 where (c1 > 10);
CREATE PUBLICATION
postgres=# update t2 set c1 = 100 where c1 = 1;
ERROR:  cannot update table "t2"
DETAIL:  Column used in the publication WHERE expression is not part
of the replica identity.

Behaviour with the patch provided in [1] to resolve the issue:
postgres=# create table t3(c1 int, c2 INT GENERATED ALWAYS AS (c1 + 1)
STORED NOT NULL);
CREATE TABLE
postgres=# create unique index t3_idx on t3 (c2);
CREATE INDEX
postgres=# alter table t3 replica identity using index t3_idx;
ALTER TABLE
postgres=# insert into t3 values(1);
INSERT 0 1
postgres=# update t3 set c1 = 100 where c1 = 1;
UPDATE 1
postgres=# create publication pub3 for table t3;
CREATE PUBLICATION
postgres=# update t3 set c1 = 100 where c1 = 1;
ERROR:  cannot update table "t3"
DETAIL:  Column list used by the publication does not cover the
replica identity.

So, I think this behavior would be acceptable. Thoughts?

[1]: 
https://www.postgresql.org/message-id/CANhcyEVw4V2Awe2AB6i0E5AJLNdASShGfdBLbUd1XtWDboymCA%40mail.gmail.com

Thanks and Regards,
Shlok Kyal




Re: define pg_structiszero(addr, s, r)

2024-11-06 Thread Bertrand Drouvot
Hi,

On Wed, Nov 06, 2024 at 12:16:33PM +1300, David Rowley wrote:
> On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
>  wrote:
> > Another option could be to use SIMD instructions to check multiple bytes
> > is zero in a single operation. Maybe just an idea to keep in mind and 
> > experiment
> > if we feel the need later on.
> 
> Could do. I just wrote it that way to give the compiler flexibility to
> do SIMD implicitly.

ohhh, great, thanks!

> That seemed easier than messing around with SIMD
> intrinsics.

I had in mind to use SIMD intrinsics actually when posting the SIMD idea but...

> I guess the compiler won't use SIMD with the single
> size_t-at-a-time version as it can't be certain it's ok to access the
> memory beyond the first zero word. Because I wrote the "if" condition
> using bitwise-OR, there's no boolean short-circuiting, so the compiler
> sees it must be safe to access all the memory for the loop iteration.

that's a better idea! Yeah, I think that now the compiler sees that all
comparisons can be done in parallel and combined with a single OR operation
(so, a good candidate for SIMD optimization).
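
As an illustration of the loop shape being discussed, here is a minimal
hedged sketch (the helper name and the alignment/length assumptions are
mine, not the patch's):

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch only: check 4 size_t words per iteration, combined with bitwise OR
 * so there is no short-circuiting and the compiler can prove that all loads
 * are unconditionally safe (and may auto-vectorize the loop).
 * Assumes "ptr" is size_t-aligned and "len" is a multiple of 4 * sizeof(size_t).
 */
static bool
words_are_all_zeros(const void *ptr, size_t len)
{
    const size_t *p = (const size_t *) ptr;
    const size_t *end = (const size_t *) ((const char *) ptr + len);

    for (; p < end; p += 4)
    {
        if ((p[0] | p[1] | p[2] | p[3]) != 0)
            return false;
    }
    return true;
}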

> If I use -march=native or -march=znver2 on my Zen2 machine, gcc does
> use SIMD operators.  Clang uses some 128-bit registers without
> specifying -march:
> 
> drowley@amd3990x:~$ gcc -O2 allzeros.c -march=native -o allzeros &&
> for i in {1..3}; do ./allzeros; done
> char: done in 1940539 nanoseconds
> size_t: done in 261731 nanoseconds (7.41425 times faster than char)
> size_t * 4: done in 130415 nanoseconds (14.8797 times faster than char)
> size_t * 8: done in 70031 nanoseconds (27.7097 times faster than char)
> char: done in 3030132 nanoseconds
> size_t: done in 477044 nanoseconds (6.35189 times faster than char)
> size_t * 4: done in 123551 nanoseconds (24.5254 times faster than char)
> size_t * 8: done in 68549 nanoseconds (44.2039 times faster than char)
> char: done in 3214037 nanoseconds
> size_t: done in 256901 nanoseconds (12.5108 times faster than char)
> size_t * 4: done in 126017 nanoseconds (25.5048 times faster than char)
> size_t * 8: done in 73167 nanoseconds (43.9274 times faster than char)
> 

Thanks for the tests! Out of curiosity, using gcc 11.4.0 (SIMD instructions not
generated), I get:

$ gcc -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2655385 nanoseconds
size_t: done in 476021 nanoseconds (5.57829 times faster than char)
size_t SIMD DAVID: done in 174816 nanoseconds (15.1896 times faster than char)

or

$ gcc -march=native -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2681146 nanoseconds
size_t: done in 395041 nanoseconds (6.78701 times faster than char)
size_t SIMD DAVID: done in 175608 nanoseconds (15.2678 times faster than char)

=> It's faster than the size_t one.

But of course, it's even faster with SIMD:

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -O2 allzeros_simd.c -o allzeros_simd ; 
./allzeros_simd
char: done in 5318674 nanoseconds
size_t: done in 443591 nanoseconds (11.99 times faster than char)
size_t SIMD DAVID: done in 179650 nanoseconds (29.6058 times faster than char)

or

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_simd.c -o 
allzeros_simd ; ./allzeros_simd
char: done in 5319534 nanoseconds
size_t: done in 426599 nanoseconds (12.4696 times faster than char)
size_t SIMD DAVID: done in 128687 nanoseconds (41.337 times faster than char)

So, I don't see any reason not to use this SIMD approach: please find v7
attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From c609eeb1ee333018a107b15f9c3a1a699a9f661d Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 1 Nov 2024 11:46:29 +
Subject: [PATCH v7] Optimize pg_memory_is_all_zeros()

pg_memory_is_all_zeros() currently does byte-per-byte comparison and so
could incur a performance penalty where multi-byte comparison could be
done instead.

Let's provide an optimized version that divides the checks into three phases for
efficiency:

- Initial alignment (byte per byte comparison)
- Multiple bytes comparison at once (candidate for SIMD optimization)
- Remaining bytes (byte per byte comparison)

Code mainly provided by David Rowley.

In passing, let's make use of this new version in PageIsVerifiedExtended() now
that multi-byte comparison is handled.
---
 src/backend/storage/page/bufpage.c | 13 +
 src/include/utils/memutils.h   | 43 +++---
 2 files changed, 41 insertions(+), 15 deletions(-)
  15.1% src/backend/storage/page/
  84.8% src/include/utils/

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index be6f1f62d2..aa264f61b9 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -89,10 +89,8 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 {
 	PageHeader	p = (PageHeader) page;
 	size_t	  

RE: doc: pgevent.dll location

2024-11-06 Thread Ryohei Takahashi (Fujitsu)
Hi,


Thank you for your reply.

The dll install paths are changed as follows on Windows.

(1) pgevent.dll
PG16: lib/
PG17: bin/

(2) dll for user (like libpq.dll, libecpg.dll)
PG16: Both in lib/ and bin/
PG17: bin/

(3) contrib dll (like amcheck.dll)
PG16: lib/
PG17: lib/


I understand that Dave says (1) and (2) should exist in the bin/ directory
and that the paths in PG17 are correct.

So, I think it is correct to update the documentation.


Regards,
Ryohei Takahashi



logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-06 Thread Tomas Vondra
Hi,

I've been investigating some issues reported by users, related to
logical replication unexpectedly breaking with messages like:

  LOG: invalidating slot "s" because its restart_lsn X/Y exceeds
   max_slot_wal_keep_size

which is pretty confusing, because the system has that GUC set to -1 (so
disabled, there should be no limit). Or a message like this:

  ERROR: requested WAL segment 000...0AA has already been removed

which is a bit less confusing, but still puzzling because replication
slots are meant to prevent exactly this.

I speculated there's some sort of rare race condition, in how we advance
the slot LSN values. I didn't know where to start, so I gave up on
trying to understand the whole complex code. Instead, I wrote a simple
stress test that

1) sets up a cluster (primary + 1+ logical replicas)

2) runs pgbench on the primary, checks replicas/slots

3) randomly restarts the nodes (both primary/replicas) with either fast
  or immediate mode, with random short pauses

4) runs checkpoints in tight while loop

This is based on the observations that in past reports the issues seem
to happen only with logical replication, shortly after reconnect (e.g.
after connection reset etc.). And the slots are invalidated / WAL
removed during a checkpoint, so frequent checkpoints should make it
easier to hit ...

Attached are a couple of scripts running this - they're not particularly
pretty and may need some tweaks to work on your machine (run.sh is the
script to run).

And unfortunately, this started to fail pretty quickly. The short story is
that it's not difficult to get into a situation where restart_lsn for a
slot moves backwards, so something like this can happen:

1) slot restart_lsn moves forward to LSN A

2) checkpoint happens, updates min required LSN for slots, recycles
segments it considers unnecessary (up to LSN A)

3) slot restart_lsn moves backwards to LSN B (where B < A)

4) kaboom

This would perfectly explain all the symptoms I mentioned earlier. The
max_slot_wal_keep_size reference is confusing, but AFAIK it's just based
on the (reasonable) belief that LSN can't move backwards, and so the only
reason for restart_lsn being before min required LSN is that this GUC
kicked in. But the assumption does not hold :-(

Now, why/how can this happen?

I kept adding elog(LOG) messages to all places updating LSNs for a
slot, and asserts that fail if data.restart_lsn moves backwards. See the
attached 0001 patch. An example of a failure (for the walsender backend
that triggered the assert) looks like this:

  344.139 LOG:  starting logical decoding for slot "s1"

  344.139 DETAIL:  Streaming transactions committing after 1/E97EAC30,
   reading WAL from 1/E96FB4D0.

  344.140 LOG:  logical decoding found consistent point at 1/E96FB4D0

  344.140 DETAIL:  Logical decoding will begin using saved snapshot.

  344.140 LOG:  LogicalConfirmReceivedLocation 1/E9865398

  344.140 LOG:  LogicalConfirmReceivedLocation updating
   data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
   candidate_restart_valid 0/0 (from 1/E9865398)
   candidate_restart_lsn 0/0 (from 1/E979D4C8)

  344.145 LOG:  LogicalIncreaseRestartDecodingForSlot
   restart_decoding_lsn 1/E96FB4D0

  344.145 LOG:  LogicalIncreaseRestartDecodingForSlot s1
   candidate_restart_valid_lsn 1/E979D4C8 (0/0)
   candidate_restart_lsn 1/E96FB4D0 (0/0)

  344.147 LOG:  LogicalIncreaseRestartDecodingForSlot
   restart_decoding_lsn 1/E979D4C8

  344.156 LOG:  LogicalIncreaseXminForSlot
   candidate_catalog_xmin 30699
   candidate_xmin_lsn 1/E993AD68 (0/0)
  ...
  344.235 LOG:  LogicalIncreaseRestartDecodingForSlot
   restart_decoding_lsn 1/E9F33AF8

  344.240 LOG:  LogicalConfirmReceivedLocation 1/E9DCCD78

  344.240 LOG:  LogicalConfirmReceivedLocation updating
   data.restart_lsn to 1/E96FB4D0 (from 1/E979D4C8)
   candidate_restart_valid 0/0 (from 1/E979D4C8)
   candidate_restart_lsn 0/0 (from 1/E96FB4D0)

  345.536 LOG:  server process (PID 2518127) was terminated by
signal 6: Aborted

We start decoding at 1/E96FB4D0, and right after startup we get a
confirmation, and LogicalConfirmReceivedLocation updates restart_lsn to
1/E979D4C8.

But then LogicalIncreaseRestartDecodingForSlot comes along, and stores
the restart_decoding_lsn 1/E96FB4D0 (which is the initial restart_lsn)
into candidate_restart_lsn.

And then a little bit later we get another confirmation, we call
LogicalConfirmReceivedLocation which propagates candidate_restart_lsn
into data.restart_lsn.

This is how restart_lsn moves backwards, causing issues. I've reproduced
this on PG14 and current master, but I believe the issue exists since
the introduction of logical replication in 9.4 :-(
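
As context for the assert mentioned earlier, a hedged sketch of the kind of
sanity check involved; the attached 0001 patch is the authoritative version,
and "new_restart_lsn" here is a hypothetical local variable holding the value
about to be written:

    /* Sketch only: fail loudly if the slot's persistent restart_lsn would move backwards. */
    Assert(XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn) ||
           new_restart_lsn >= MyReplicationSlot->data.restart_lsn);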

I'm not claiming this is the only way this can happen, but all the
cases 

Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY

2024-11-06 Thread Aleksander Alekseev
Hi Shlok,

> So, I think this behavior would be acceptable. Thoughts?

That's a fair point, thanks for sharing. Personally I find this
behavior somewhat suboptimal but since we already have it in certain
cases I guess what you propose might be acceptable.

I'm still not entirely happy about breaking the existing behavior in
the discussed case. Not sure what the lesser evil would be - breaking
it or keeping it as is.

Some input from other people on the mailing list would be appreciated.

-- 
Best regards,
Aleksander Alekseev




RE: COPY performance on Windows

2024-11-06 Thread Ryohei Takahashi (Fujitsu)
Hi,


Thank you for your reply.


I don't want to "speed up" the COPY command.
I just want to "prevent a slowdown" compared with PG16.

But anyway, my current analysis is not convincing.
So, I will do more analysis and get back to you.


Regards,
Ryohei Takahashi



Re: per backend I/O statistics

2024-11-06 Thread Bertrand Drouvot
Hi,

On Wed, Nov 06, 2024 at 08:39:07AM +0900, Michael Paquier wrote:
> On Tue, Nov 05, 2024 at 05:37:15PM +, Bertrand Drouvot wrote:
> > I'm starting working on option 2, I think it will be easier to discuss with
> > a patch proposal to look at.
> > 
> > If in the meantime, one strongly disagree with option 2 (means implement a 
> > brand
> > new PGSTAT_KIND_BACKEND and keep PGSTAT_KIND_IO), please let me know.
> 
> Sorry for the late reply, catching up a bit.

No problem at all, thanks for looking at it!

> As you are quoting in [1], you do not expect the backend-io stats and
> the more global pg_stat_io to achieve the same level of consistency as
> the backend stats would be gone at restart, and wiped out when a
> backend shuts down.

Yes.

> So, splitting them with a different stats kind
> feels more natural because it would be possible to control how each
> stat kind behaves depending on the code shutdown and reset paths
> within their own callbacks rather than making the callbacks of
> PGSTAT_KIND_IO more complex than they already are.

Yeah, thanks for sharing your thoughts.

> And pg_stat_io is
> a fixed-numbered stats kind because of the way it aggregates its stats
> with a number states defined at compile-time.
> 
> Is the structure you have in mind different than PgStat_BktypeIO?

Very close.

> Perhaps a split is better anyway with that in mind.

The in-progress patch (not shared yet) is using the following:

"
typedef struct PgStat_Backend
{
   TimestampTz stat_reset_timestamp;
   BackendType bktype;
   PgStat_BktypeIO stats;
} PgStat_Backend;
"

The bktype is used to be able to filter the stats correctly when we display 
them.

> The amount of memory required to store the snapshots of backend-IO
> does not worry me much, TBH, but you are worried about a high turnover
> of connections that could cause a lot of bloat in the backend-IO
> snapshots because of the persistency that these stats would have,
> right?

Not only a high turnover but also a high number of entries created in the hash.

Furthermore, I don't see any use case for relying on stats_fetch_consistency
while querying other backends' stats.

> If possible, supporting snapshots would be
> more consistent with the other stats.

I have in mind to support the snapshots _only_ when querying our own stats. I can
measure the memory impact if we also use them when querying other backends' stats
(though I don't see a use case).

> Just to be clear, I am not in favor of making PgStat_HashKey larger
> than it already is.

That's not needed, the patch I'm working on stores the proc number in the
objid field of the key.
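
For illustration, a hedged sketch of that idea, assuming the existing
PgStat_HashKey layout (kind, dboid, objid); PGSTAT_KIND_BACKEND is the new
stats kind from the in-progress patch and is not yet in the tree:

    PgStat_HashKey key;

    key.kind = PGSTAT_KIND_BACKEND; /* new kind, split from PGSTAT_KIND_IO */
    key.dboid = InvalidOid;         /* not tied to a particular database */
    key.objid = MyProcNumber;       /* the backend's proc number identifies the entry */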

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Logical replication timeout

2024-11-06 Thread Ashutosh Bapat
On Wed, Nov 6, 2024 at 1:07 PM RECHTÉ Marc  wrote:
>
> Hello,
>
> For some unknown reason (probably a very big transaction at the source), we 
> experienced a logical decoding breakdown,
> due to a timeout from the subscriber side (either wal_receiver_timeout or 
> connection drop by network equipment due to inactivity).
>
> The problem is that, due to that failure, the wal_receiver process stops. 
> When the wal_sender is ready to send some data, it finds the connection broken 
> and exits.
> A new wal_sender process is created that restarts from the beginning (restart 
> LSN). This is an endless loop.
>
> Checking the network connection between wal_sender and wal_receiver, we found 
> that no traffic occurs for hours.
>
> We first increased wal_receiver_timeout up to 12h and still got a 
> disconnection on the receiver side:
>
> 2024-10-17 16:31:58.645 GMT [1356203:2] user=,db=,app=,client= ERROR:  
> terminating logical replication worker due to timeout
> 2024-10-17 16:31:58.648 GMT [849296:212] user=,db=,app=,client= LOG:  
> background worker "logical replication worker" (PID 1356203) exited with exit 
> code 1
>
> Then we set this parameter to 0, but then got disconnected by the network (note 
> the slight difference in the message):
>
> 2024-10-21 11:45:42.867 GMT [1697787:2] user=,db=,app=,client= ERROR:  could 
> not receive data from WAL stream: could not receive data from server: 
> Connection timed out
> 2024-10-21 11:45:42.869 GMT [849296:40860] user=,db=,app=,client= LOG:  
> background worker "logical replication worker" (PID 1697787) exited with exit 
> code 1
>
> The message is generated in libpqrcv_receive function 
> (replication/libpqwalreceiver/libpqwalreceiver.c) which calls 
> pqsecure_raw_read (interfaces/libpq/fe-secure.c)
>
> The last message "Connection timed out" is the errno translation from the 
> recv system function:
>
> ETIMEDOUT   Connection timed out (POSIX.1-2001)
>
> When those timeouts occurred, the sender was still busy deleting files from 
> data/pg_replslot/bdcpb21_sene, accumulating more than 6 million small 
> ".spill" files.
> It seems this very long pause is at the cleanup stage, where PG is blindly trying 
> to delete those files.
>
> strace on wal sender show tons of calls like:
>
> unlink("pg_replslot/bdcpb21_sene/xid-2 721 821 917-lsn-439C-0.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-100.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-200.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-300.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-400.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-500.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
>
> This occurs in ReorderBufferRestoreCleanup 
> (backend/replication/logical/reorderbuffer.c).
> The call stack presumes this may probably occur in DecodeCommit or 
> DecodeAbort (backend/replication/logical/decode.c):
>
> unlink("pg_replslot/bdcpb21_sene/xid-2730444214-lsn-43A6-8800.spill") = 
> -1 ENOENT (Aucun fichier ou dossier de ce type)
>  > /usr/lib64/libc-2.17.so(unlink+0x7) [0xf12e7]
>  > /usr/pgsql-15/bin/postgres(ReorderBufferRestoreCleanup.isra.17+0x5d) 
> [0x769e3d]
>  > /usr/pgsql-15/bin/postgres(ReorderBufferCleanupTXN+0x166) [0x76aec6] <=== 
> replication/logical/reorderbuff.c:1480 (mais cette fonction (static) n'est 
> utiliée qu'au sein de ce module ...)
>  > /usr/pgsql-15/bin/postgres(xact_decode+0x1e7) [0x75f217] <=== 
> replication/logical/decode.c:175
>  > /usr/pgsql-15/bin/postgres(LogicalDecodingProcessRecord+0x73) [0x75eee3] 
> <=== replication/logical/decode.c:90, appelle la fonction rmgr.rm_decode(ctx, 
> &buf) = 1 des 6 méthodes du resource manager
>  > /usr/pgsql-15/bin/postgres(XLogSendLogical+0x4e) [0x78294e]
>  > /usr/pgsql-15/bin/postgres(WalSndLoop+0x151) [0x785121]
>  > /usr/pgsql-15/bin/postgres(exec_replication_command+0xcba) [0x785f4a]
>  > /usr/pgsql-15/bin/postgres(PostgresMain+0xfa8) [0x7d0588]
>  > /usr/pgsql-15/bin/postgres(ServerLoop+0xa8a) [0x493b97]
>  > /usr/pgsql-15/bin/postgres(PostmasterMain+0xe6c) [0x74d66c]
>  > /usr/pgsql-15/bin/postgres(main+0x1c5) [0x494a05]
>  > /usr/lib64/libc-2.17.so(__libc_start_main+0xf4) [0x22554]
>  > /usr/pgsql-15/bin/postgres(_start+0x28) [0x494fb8]

I think we need a call to rb->update_progress_txn(rb, txn,
change->lsn) at regular intervals in ReorderBufferRestoreCleanup(),
similar to ReorderBufferProcessTXN(). And maybe at more places where
we have potentially long-running loops.
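
A rough sketch of the idea only (not the actual ReorderBufferRestoreCleanup()
body; remove_one_spill_file() is a hypothetical helper standing in for the
unlink loop): report progress every N removed spill segments so the walsender
keeps sending keepalives instead of going silent for hours.

    XLogRecPtr  cur_lsn;
    int         nremoved = 0;

    for (cur_lsn = txn->first_lsn; cur_lsn <= txn->final_lsn;
         cur_lsn += wal_segment_size)
    {
        remove_one_spill_file(rb, txn, cur_lsn);

        /* every so often, let the output plugin emit a progress/keepalive */
        if ((++nremoved % 1024) == 0)
            rb->update_progress_txn(rb, txn, cur_lsn);
    }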

-- 
Best Wishes,
Ashutosh Bapat




Re: Fix for Extra Parenthesis in pgbench progress message

2024-11-06 Thread Fujii Masao




On 2024/11/02 20:43, Tatsuo Ishii wrote:

>> Hi,
>>
>> I noticed an issue in the pgbench progress message where an extra
>> closing parenthesis )) appears, as shown below:
>>
>> 700 of 1000 tuples (70%) of pgbench_accounts done (elapsed
>> 19.75 s, remaining 8.46 s))
>
> Yeah, annoying.
>
>> This occurs when running commands like pgbench -i -s100 and is caused
>> by leftover characters when using \r with fprintf. I made a patch to
>> address this by adding extra spaces before \r, which clears any
>> remaining characters. While effective, I recognize this solution may
>> not be the most sophisticated.
>
> The patch works perfectly for the case that there is one extra brace
> as shown in your example. However I think it will not work if there
> are two or more extra braces.


Are you suggesting adding more space characters before the carriage return
in the progress reporting line, like this? Since the line includes both
elapsed and remaining times, its total length doesn’t change much.
I think adding five spaces before the carriage return should be enough.
Thoughts?

-   chars = fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " 
tuples (%d%%) of %s done (elapsed %.2f s, remaining %.2f s)   %c",
+   chars = fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " 
tuples (%d%%) of %s done (elapsed %.2f s, remaining %.2f s)%5s%c",
j, total,
(int) ((j * 100) / 
total),
-   table, elapsed_sec, 
remaining_sec, eol);
+   table, elapsed_sec, 
remaining_sec, "", eol);

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Fix for Extra Parenthesis in pgbench progress message

2024-11-06 Thread Tatsuo Ishii
>> The patch works perfectly for the case that there is one extra brace
>> as shown in your example. However I think it will not work if there
>> are two or more extra braces.
> 
> Are you suggesting adding more space characters before the carriage
> return
> in the progress reporting line, like this?

No. I thought about dynamically calculating the number of spaces that need
to be printed, something like the attached patch.

> Since the line includes
> both
> elapsed and remaining times, its total length doesn’t change much.

Maybe. But I am also worried about the case when we would want to
change the log line format in the future. We might introduce this kind
of bug again.  By dynamically calculating the number of necessary
spaces, we don't need to think about the concern.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index ba247f3f85..4c05076dff 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4944,6 +4944,7 @@ initPopulateTable(PGconn *con, const char *table, int64 base,
 	int			n;
 	int64		k;
 	int			chars = 0;
+	int			previous_chars = 0;
 	PGresult   *res;
 	PQExpBufferData sql;
 	char		copy_statement[256];
@@ -5004,10 +5005,18 @@ initPopulateTable(PGconn *con, const char *table, int64 base,
 			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 			double		remaining_sec = ((double) total - j) * elapsed_sec / j;
 
-			chars = fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) of %s done (elapsed %.2f s, remaining %.2f s)%c",
+			chars = fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) of %s done (elapsed %.2f s, remaining %.2f s)",
 			j, total,
 			(int) ((j * 100) / total),
-			table, elapsed_sec, remaining_sec, eol);
+			table, elapsed_sec, remaining_sec);
+			if (previous_chars != 0)
+			{
+int	n_spaces = chars - previous_chars;
+fprintf(stderr, "%*s", n_spaces, "");
+chars += n_spaces;
+			}
+			fprintf(stderr, "%c", eol);
+			previous_chars = chars;
 		}
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))


Re: Changing shared_buffers without restart

2024-11-06 Thread Thomas Munro
On Sat, Oct 19, 2024 at 8:21 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Currently it
> supports only an increase of shared_buffers.

Just BTW in case it is interesting, Palak and I experimented with how
to shrink the buffer pool while PostgreSQL is running, while we were
talking about 13453ee (which it shares infrastructure with).  This
version fails if something is pinned and in the way of the shrink
operation, but you could imagine other policies (wait, cancel it,
...):

https://github.com/macdice/postgres/commit/db26fe0c98476cdbbd1bcf553f3b7864cb142247




Re: Converting contrib SQL functions to new style

2024-11-06 Thread Michael Paquier
On Wed, Nov 06, 2024 at 10:51:29AM +0100, Ronan Dunklau wrote:
> For most of them I agree, but one side effect of the current implementation 
> is 
> that we have a bug when pg_upgrad'ing if earthdistance is installed: 
> https://www.postgresql.org/message-id/flat/
> 152106914669.1223.5104148605998271987%40wrigleys.postgresql.org
> 
> The new version of earthdistance fixes that problem, so this one would be 
> worth 
> a back patch to 16, giving users the opportunity to update earthdistance 
> before running pg_upgrade.

Good point.  Checking all these contrib updates one-by-one is an ant's
work, but I'll see if I can get at least some of them done on HEAD.

> Thanks for the tip about the CI, will make sure to have it setup before next 
> time.

See also "Enabling cirrus-ci in a github repository" in
src/tools/ci/README.
--
Michael


signature.asc
Description: PGP signature


Add html-serve target to autotools and meson

2024-11-06 Thread Tristan Partin
When working on documentation, I find it useful to spawn an HTTP server 
to serve docs for me to consume in a web browser. I went ahead and used 
Python's builtin http.server module, but it could also be a run target 
which just spawned `xdg-open builddir/doc/src/sgml/html/index.html`, 
though I would doubt Windows has xdg-open(1).


The one controversial part of this patch might be using "port" 0, which 
asks the kernel to assign an unused port. I'm not sure if that exists on 
Windows, or is even desirable. Should we have a build option for the 
port? What would a good default be?


--
Tristan Partin
Neon (https://neon.tech)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b53..6f629e02a7 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -130,6 +130,9 @@ html-stamp: stylesheet.xsl postgres-full.xml $(ALL_IMAGES)
 	$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_HTML_FLAGS) $(wordlist 1,2,$^)
 	touch $@
 
+html-serve: html
+	python3 -m http.server --directory '$(top_builddir)/$(subdir)/html' 0
+
 # single-page HTML
 postgres.html: stylesheet-html-nochunk.xsl postgres-full.xml $(ALL_IMAGES)
 	$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_HTML_FLAGS) -o $@ $(wordlist 1,2,$^)
diff --git a/doc/src/sgml/meson.build b/doc/src/sgml/meson.build
index e418de83a7..fa598d8ab1 100644
--- a/doc/src/sgml/meson.build
+++ b/doc/src/sgml/meson.build
@@ -154,6 +154,19 @@ if docs_dep.found()
   alias_target('html', html)
   alias_target('install-html', install_doc_html)
 
+  run_target(
+'html-serve',
+command: [
+  python,
+  '-m',
+  'http.server',
+  '--directory',
+  meson.current_build_dir() / 'html',
+  '0',
+],
+depends: html,
+  )
+
   # build and install multi-page html docs as part of docs target
   docs += html
   installdocs += install_doc_html


FW: Building Postgres 17.0 with meson

2024-11-06 Thread Mark Hill
Srinath is in India I believe and not available currently.   Does anybody have 
any idea why meson
is not finding the paths I'm specifying with the -Dextra_lib_dirs and  
-Dextra_include_dirs?  See below.

Thanks, Mark

From: Mark Hill
Sent: Wednesday, November 6, 2024 10:33 AM
To: 'Srinath Reddy Sadipiralla' 
Subject: RE: Building Postgres 17.0 with meson

Hi Srinath,

Thanks for the tip.

I tried:
meson setup build --prefix=%prefix% --buildtype=release 
-Dextra_lib_dirs=%openssl_directory%,%zlib_directory%,%icu4c_directory% 
-Dextra_include_dirs=%openssl_directory%,%zlib_directory%,%icu4c_directory%

using both the -Dextra_lib_dirs and -Dextra_include_dirs and it's still not 
finding openssl, icu, or zlib.

In the External libraries information toward the end of the output, it says:
External libraries
icu: NO
openssl: NO
zlib   : NO

I removed the other libraries in the output for brevity.

I've checked my variables and they're correct and those libraries are there.

Am I missing an option to meson setup build?

Thanks, Mark

From: Srinath Reddy Sadipiralla <srinath.re...@zohocorp.com>
Sent: Wednesday, November 6, 2024 12:34 AM
To: Mark Hill <mark.h...@sas.com>
Cc: pgsql-hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Building Postgres 17.0 with meson



On Wed, 06 Nov 2024 09:59:37 +0530, Mark Hill <mark.h...@sas.com> wrote:

I'm trying to build Postgres 17.0 and am learning meson as I go.

The build setup command I have so far is:
meson setup build --prefix=%prefix% --buildtype=release -Dssl=auto -Dzlib=auto 
-Dicu=auto

but I believe that cmd expects ssl, zlib, and icu to be installed in default 
locations.

How do I specify to meson that I want it to use the versions of those 3 
software packages that I have built

e.g.  the openssl I want it to use is located in 
D:\Postgres-Builds\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6

and similar locations for icu and zlib?

Thanks, Mark
Specify the libraries' paths in "-Dextra_lib_dirs" during meson setup,
e.g. meson setup build --prefix=%prefix% --buildtype=release 
-Dextra_lib_dirs=D:\Postgres-Builds\OpenSSL\OpenSSL-Install\OpenSSL-3.1.6-wx6

https://www.postgresql.org/docs/current/install-meson.html#CONFIGURE-EXTRA-LIB-DIRS-MESON

Regards,
Srinath Reddy Sadipiralla
Member Technical Staff
ZOHO



Re: UUID v7

2024-11-06 Thread Andrey M. Borodin



> On 5 Nov 2024, at 23:56, Andrey M. Borodin  wrote:
> 
> 

Some more thoughts on this patch version:

0. Comment mentioning nanoseconds, while we do not need to carry anything
/* Convert TimestampTz back and carry nanoseconds. */

1. There's unnecessary &3 in
uuid->data[7] = uuid->data[7] | ((uuid->data[8] >> 6) & 3);

2. Currently we store 0..999 microseconds in 10 bits, so values 1000..1023 are 
unused. We could use them for overflow. That would slightly increase 
non-overflowing capacity when generating more than million UUIDs per second on 
one backend. However, given current performance of our CSPRNG I do not think 
this feature worth code complexity.


Best regards, Andrey Borodin.



Re: Windows meson build

2024-11-06 Thread Andres Freund
Hi,

On 2024-11-05 06:32:51 +, Kohei Harikae (Fujitsu) wrote:
> I do not use pkgconf in my Windows environment.
> In my Windows environment, I could not build the following OSS with meson.
> - 0001 icu
> - 0002 libxml
> - 0003 libxslt
> - 0004 lz4
> - 0005 tcl
> - 0006 zlib
> - 0007 zstd
> 
> In the thread [1], I created a patch like the one in the attached file, and now I can 
> build.
> Would you like to be able to build OSS with Windows meson without using 
> pkgconf?

You can use pkgconf or cmake for the dependencies.  I don't want to add "raw"
dependency handling for every dependency; they each build in too many variants
for that to be a sensible investment of time.


Greetings,

Andres Freund




Re: Add parallel columns for pg_stat_statements

2024-11-06 Thread Michael Paquier
On Wed, Oct 09, 2024 at 08:32:52AM +0900, Michael Paquier wrote:
> Okay, applied 0001 and 0002 then after a second lookup.  I'll spend
> some more time thinking about 0003 and the other threads.

Considered 0003, and I'm still not sure that this is something that
is really required based on the correlations that are now possible with
the number of times a query has been called and the number of
planned/launched workers.

So I am marking the entry as committed.  Let's see about the threads
for the addition of this data at table-level and at the
database-level.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-11-06 Thread Michael Paquier
On Wed, Nov 06, 2024 at 02:48:31PM +0900, Michael Paquier wrote:
> I'm OK with 0002 to add the wait parameter to BackgroundPsql and be
> able to take some actions until a manual wait_connect().  I'll go do
> this one.  Also perhaps 0001 while on it but I am a bit puzzled by the
> removal of the three ok() calls in 037_invalid_database.pl.

0002 has been done as ba08edb06545 after adding a bit more
documentation that was missing.  0001 as well with 70291a3c66ec.  The
original expectation of 037_invalid_database.pl with the banner data
expected in the output was interesting..

Note that 0003 is lacking an EXTRA_INSTALL in the Makefile of
src/test/authentication/, or the test would fail if doing for example
a `make check` in this path.

The following nit is also required in the script for installcheck, to
skip the test if the module is not installed:
if (!$node->check_extension('injection_points'))
{
plan skip_all => 'Extension injection_points not installed';
}

See src/test/modules/test_misc/t/005_timeouts.pl as one example.  (I
know, these are tricky to know about..)

007_injection_points.pl is a name too generic as it could apply in a
lot more places, without being linked to injection points.  How about
something like 007_pre_auth.pl?
--
Michael


signature.asc
Description: PGP signature


Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY

2024-11-06 Thread Amit Kapila
On Wed, Nov 6, 2024 at 5:48 PM Aleksander Alekseev
 wrote:
>
> > So, I think this behavior would be acceptable. Thoughts?
>
> That's a fair point, thanks for sharing. Personally I find this
> behavior somewhat suboptimal but since we already have it in certain
> cases I guess what you propose might be acceptable.
>

This is not suboptimal behavior but a must-have; otherwise, there
is no way we can identify the row to update on the subscriber side.
Also, this is not in certain cases but in all cases for UPDATE/DELETE:
we need REPLICA IDENTITY to be set. See more about REPLICA IDENTITY in
the Alter Table docs [1]. The problem reported by Shlok is that even
though we have a REPLICA IDENTITY defined on a generated column, we
still won't send the required column value (as generated columns are
skipped by default) to the subscriber, which will lead to the ERROR
mentioned below. Now, one can argue that this is not expected from the
user, or ask why the user would have such a setup, but I think we should
fix the problem if it leads to unexpected behavior on the subscriber.

> I'm still not entirely happy about breaking the existing behavior in
> the discussed case. Not sure what the lesser evil would be - breaking
> it or keeping it as is.
>

The current behavior is not acceptable because it would generate an
ERROR as follows on the subscriber:

2024-11-07 10:50:31.381 IST [16260] ERROR:  publisher did not send
replica identity column expected by the logical replication target
relation "public.testpub_gencol"
2024-11-07 10:50:31.381 IST [16260] CONTEXT:  processing remote data
for replication origin "pg_16389" during message type "UPDATE" for
replication target relation "public.testpub_gencol" in transaction
748, finished at 0/176D5D8
2024-11-07 10:50:31.398 IST [6216] LOG:  background worker "logical
replication apply worker" (PID 16260) exited with exit code 1

> Some input from other people on the mailing list would be appreciated.
>

We should fix this in the HEAD and back branches.

[1] - https://www.postgresql.org/docs/devel/sql-altertable.html

-- 
With Regards,
Amit Kapila.




Re: Parallel workers stats in pg_stat_database

2024-11-06 Thread Michael Paquier
On Sat, Oct 12, 2024 at 01:14:54AM +0200, Benoit Lobréau wrote:
> Here is an updated version, I modified it to:
> 
> * have the same wording in the doc and code (planned => to_launch)
* split the declaration from the rest (and have the same code as the parallel
> worker logging patch)

Thanks for the updated patch set.

I've been thinking about this proposal for the two counters with
pg_stat_database in 0001, and I am going to side with the argument
that it sucks to not have this information except if
pg_stat_statements is enabled on an instance.  It would be a different
discussion if PGSS were to be in core, and if that were to happen we
could perhaps remove these counters from pg_stat_database, but there
is no way to be sure if this is going to happen, as well.  And this
information is useful for the GUC settings.

+/*
+ * reports parallel_workers_to_launch and parallel_workers_launched into
+ * PgStat_StatDBEntry
+ */
Perhaps a reword with:
"Notify the stats system about parallel worker information."

+/* pg_stat_get_db_parallel_workers_to_launch*/
[...]
+/* pg_stat_get_db_parallel_workers_launched*/

Incorrect comment format, about which pgindent does not complain..

..  But pgindent complains in execMain.c and pgstat_database.c.  These
are only nits, the patch is fine.  If anybody has objections or
comments, feel free.

Now, I am not really on board with 0002 and 0003 about the tracking of
the maintenance workers, which reflect operations that happen less
often than what 0001 is covering.  Perhaps this would have more
value if autovacuum supported parallel operations, though.
--
Michael


signature.asc
Description: PGP signature


Doc: typo in ddl.sgml

2024-11-06 Thread kasaharatt

Hi.

I noticed what seems to me to be a mistake in the sample code in the 
document.

https://www.postgresql.org/docs/devel/ddl-constraints.html#DDL-CONSTRAINTS-FK

I think it is a typo for user_id instead of author_id.
See below.

Best regards,

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 898b6dd..c0de169 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1269,9 +1269,9 @@ CREATE TABLE users (
 CREATE TABLE posts (
 tenant_id integer REFERENCES tenants ON DELETE CASCADE,
 post_id integer NOT NULL,
-author_id integer,
+user_id integer,
 PRIMARY KEY (tenant_id, post_id),
-FOREIGN KEY (tenant_id, author_id) REFERENCES users ON DELETE SET 
NULL (author_id)
+FOREIGN KEY (tenant_id, user_id) REFERENCES users ON DELETE SET 
NULL (user_id)

 );
 
 Without the specification of the column, the foreign key would also 
set





Re: Fix for Extra Parenthesis in pgbench progress message

2024-11-06 Thread Tatsuo Ishii
>>> The patch works perfectly for the case that there is one extra brace
>>> as shown in your example. However I think it will not work if there
>>> are two or more extra braces.
>> 
>> Are you suggesting adding more space characters before the carriage
>> return
>> in the progress reporting line, like this?
> 
> No. I thought about dynamically calculating spaces needed to be
> printed something like attached patch.
> 
>> Since the line includes
>> both
>> elapsed and remaining times, its total length doesn’t change much.
> 
> Maybe. But I am also worried about the case when we would want to
> change the log line format in the future. We might introduce this kind
> of bug again.  By dynamically calculating the number of necessary
> spaces, we don't need to think about the concern.

Probably my patch is too invasive and not worth the trouble for the
stable branches. What about back-patching your patch to the stable
branches, and applying my patch to the master branch?

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


Re: Doc: typo in ddl.sgml

2024-11-06 Thread David G. Johnston
On Wednesday, November 6, 2024, kasaharatt wrote:
>
>
> I noticed that it seems to me a mistake in the sample code in the document.
> https://www.postgresql.org/docs/devel/ddl-constraints.html#DDL-CONSTRAINTS-FK
>
> I think it is a typo for user_id instead of author_id.


It’s hard to argue it's a typo when three different places need to be changed,
and the example is fully functional as-is.

The example writer intentionally chose to make author_id the FK of the PK
user_id column in order to reduce ambiguity as to which column is on the FK
table and which is on the PK table.  And to emphasize that the columns need
not be named the same in any case.

David J.


Linkify mentions of the primary/subscriber's max_replication_slots

2024-11-06 Thread Tristan Partin
In the replication server configuration documentation[0], there are 
four sections: sending, primary, standby, and subscriber. These sections 
mention various GUCs and how they apply to each server configuration.


The max_replication_slots GUC is mentioned twice, once in the sending 
section[1]:



[...]

Note that this parameter also applies on the subscriber side, but with 
a different meaning.


and another time in the subscriber section[2]:


[...]

Note that this parameter also applies on a sending server, but with 
a different meaning.


We can aid users that read these sections of the docs by adding links to 
the max_replication_slots GUC description which the texts refer to.


Here is a patch which does so.

[0]: https://www.postgresql.org/docs/17/runtime-config-replication.html
[1]: 
https://www.postgresql.org/docs/17/runtime-config-replication.html#GUC-MAX-REPLICATION-SLOTS
[2]: 
https://www.postgresql.org/docs/17/runtime-config-replication.html#GUC-MAX-REPLICATION-SLOTS-SUBSCRIBER

--
Tristan Partin
Neon (https://neon.tech)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d54f904956..34b29fa13e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4488,7 +4488,9 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
 
 
  Note that this parameter also applies on the subscriber side, but with
- a different meaning.
+ a different meaning. See 
+ in  for more
+ details.
 

   
@@ -5215,7 +5217,9 @@ ANY num_sync ( 
+in  for more
+details.

   
  


split func.sgml to separated individual sgml files

2024-11-06 Thread jian he
hi.

move to a new thread.
Since the old thread[1], many things have interacted together.

We are going to split func.sgml into 31 individual sgml files.
The new file names use "func-" as the prefix, and all the func-*.sgml
files are stored in doc/src/sgml/func.
Based on the original func.sgml line numbers, each file has a beginning
and an ending line, which is why the python script is long.


python3 v1-0001-split_func_sgml.py
git apply v1-0001-all-filelist-for-directory-doc-src-sgml-func.patch
execute the above two commands, we should go good to go.

The following is step-by-step logic.

#--step0 get line number info, and validate it.
In func.sgml, we have 62 "sect1" tags; these correspond to the beginning
and ending lines of the new sgml files.
So in func.sgml, we locate and validate them.
Later we use the sed command to do the copy, and we need this line
number information.

#--step1. create the doc/src/sgml/func directory, move func.sgml there.
Create each new empty individual sgml file.

#---step2 construct sed copy commands and execute them.
This will copy func.sgml content from roughly line 52 to line 31655 into the
corresponding new individual sgml files, based on the
line number information.

#-step3 validates that each new file only has 2 "sect1", and also
validates the new files.

[1] https://www.postgresql.org/message-id/flat/CA%2BTgmoZ2F%2BK0j%3D6BOJLD%3DYfpJMdJRXC7sWmtXGRjx1Rq0x8PUA%40mail.gmail.com
import subprocess
import os
import re
top_dir = subprocess.check_output(["git", "rev-parse", "--show-toplevel"])
top_dir = top_dir.decode().rstrip('\n')
top_src_dir = top_dir + "/doc/src/sgml"
#goto doc/src/sgml directory, exit when not exists
try:
os.chdir(top_src_dir)
except FileNotFoundError:
print(f'{top_src_dir} not does not exist. SKIPPED')
quit()

target_file="func.sgml"
#
func_logical="func-logical.sgml"
func_comparison="func-comparison.sgml"
func_math="func-math.sgml"
func_string="func-string.sgml"
func_binarystring="func-binarystring.sgml"
func_bitstring="func-bitstring.sgml"
func_matching="func-matching.sgml"
func_formatting="func-formatting.sgml"
func_datetime="func-datetime.sgml"
func_enum="func-enum.sgml"
func_geometry="func-geometry.sgml"
func_net="func-net.sgml"
func_textsearch="func-textsearch.sgml"
func_uuid="func-uuid.sgml"
func_xml="func-xml.sgml"
func_json="func-json.sgml"
func_sequence="func-sequence.sgml"
func_conditional="func-conditional.sgml"
func_array="func-array.sgml"
func_range="func-range.sgml"
func_aggregate="func-aggregate.sgml"
func_window="func-window.sgml"
func_merge_support ="func-merge-support.sgml"
func_subquery="func-subquery.sgml"
func_comparisons="func-comparisons.sgml"
func_srf="func-srf.sgml"
func_info="func-info.sgml"
func_admin="func-admin.sgml"
func_trigger="func-trigger.sgml"
func_event_triggers="func-event-triggers.sgml"
func_statistics="func-statistics.sgml"
#
func_filenames = list()
func_filenames.append(func_logical)
func_filenames.append(func_comparison)
func_filenames.append(func_math)
func_filenames.append(func_string)
func_filenames.append(func_binarystring)
func_filenames.append(func_bitstring)
func_filenames.append(func_matching)
func_filenames.append(func_formatting)
func_filenames.append(func_datetime)
func_filenames.append(func_enum)
func_filenames.append(func_geometry)
func_filenames.append(func_net)
func_filenames.append(func_textsearch)
func_filenames.append(func_uuid)
func_filenames.append(func_xml)
func_filenames.append(func_json)
func_filenames.append(func_sequence)
func_filenames.append(func_conditional)
func_filenames.append(func_array)
func_filenames.append(func_range)
func_filenames.append(func_aggregate)
func_filenames.append(func_window)
func_filenames.append(func_merge_support )
func_filenames.append(func_subquery)
func_filenames.append(func_comparisons)
func_filenames.append(func_srf)
func_filenames.append(func_info)
func_filenames.append(func_admin)
func_filenames.append(func_trigger)
func_filenames.append(func_event_triggers)
func_filenames.append(func_statistics)
#
func_string_begin_lineno= -1
func_logical_begin_lineno=-1
func_comparison_begin_lineno=-1
func_math_begin_lineno=-1
func_string_begin_lineno=-1
func_binarystring_begin_lineno=-1
func_bitstring_begin_lineno=-1
func_matching_begin_lineno=-1
func_formatting_begin_lineno=-1
func_datetime_begin_lineno=-1
func_enum_begin_lineno=-1
func_geometry_begin_lineno=-1
func_net_begin_lineno=-1
func_textsearch_begin_lineno=-1
func_uuid_begin_lineno=-1
func_xml_begin_lineno=-1
func_json_begin_lineno=-1
func_sequence_begin_lineno=-1
func_conditional_begin_lineno=-1
func_array_begin_lineno=-1
func_range_begin_lineno=-1
func_aggregate_begin_lineno=-1
func_window_begin_lineno=-1
func_merge_support_begin_lineno=-1
func_subquery_begin_lineno=-1
func_comparisons_begin_lineno=-1
func_srf_begin_lineno=-1
func_info_begin_lineno=-1
func_admin_begin_lineno=-1
func_trigger_begin_lineno=-1
func_event_triggers_begin_lineno=-1
func_statistics_begin_lineno=-1
#
func_logical_end_lineno=-1
func_comparison_end_lineno=-1
func_mat

Re: Pgoutput not capturing the generated columns

2024-11-06 Thread Amit Kapila
On Wed, Nov 6, 2024 at 4:18 PM vignesh C  wrote:
>
> The attached v50 version patch has the changes for the same.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Incremental Sort Cost Estimation Instability

2024-11-06 Thread Andrei Lepikhov

On 10/8/24 11:33, Andrei Lepikhov wrote:

On 9/23/24 20:02, Andrei Lepikhov wrote:

On 12/9/2024 12:12, David Rowley wrote:
On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov  

Minor change to make compiler and cfbot happy
Now, this thread looks connected to the [1]. However, it still has 
independent profit, which can be discussed separately.
After the introduction of the em->em_ndistinct cache, I played around 
with the idea of letting estimate_num_groups use this cache. 
In doing so, I found out that we have one more instability case like the 
following:


DROP TABLE IF EXISTS test;
CREATE TABLE test (x int, y int, z int);
INSERT INTO test (x,y,z) (SELECT random()*1E5, random()*2, random() FROM 
generate_series(1,1e4));

VACUUM ANALYZE test;

EXPLAIN SELECT count(*) FROM test WHERE x=y GROUP BY x,y;
EXPLAIN SELECT count(*) FROM test WHERE x=y GROUP BY y,x;

Here, you can see that depending on the initial order of grouping, 
Postgres chooses different columns for grouping. Doing that causes 
different estimations - one of them is definitely wrong:


GroupAggregate  (cost=181.41..182.29 rows=50 width=16)
GroupAggregate  (cost=181.41..181.82 rows=3 width=16)

That happens because, when estimating the number of groups, Postgres 
doesn't consider the EquivalenceClass, which could let it correct the group 
estimation at a low cost.
It could be done inside make_pathkeys_for_sortclauses_extended by 
choosing a column with a lower number of distinct values, but IMO it is 
better to do it at the moment of estimating the number of groups.


Thoughts? Is it a real issue or just an impractical corner case?

The new version of the patch is attached.

[1] 
https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com

--
regards, Andrei LepikhovFrom 5eb884cbbd9c2e356d5d855da46d7e62d101b8b9 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Tue, 29 Oct 2024 08:49:33 +0700
Subject: [PATCH v2 1/4] Stabilise incremental sort cost calculation.

Carefully identify a column/expression that can represent the path key in the
cost calculation of a specific sort operator. Columns may have different numbers
of distinct values. That's why the order of columns in the sort operation may
impact the number of calls to the comparison function.
Sorting has only pathkeys as input for the cost estimation. This patch, instead
of blindly choosing the first equivalence class member, attempts to find the
expression with the smallest ndistinct value.

TODO: Filtering out EC members external to this sort operator is not a big deal.
But in that case it would be necessary to pass the underlying relids to the cost
calculation routine, which would cause an API change. So, here we stay as simple
as possible.

Add into EquivalenceMember the number of distinct values - em_ndistinct.
It may be additionally used later in groups number estimations.
---
 src/backend/optimizer/path/costsize.c | 72 +--
 src/backend/optimizer/path/equivclass.c   |  1 +
 src/include/nodes/pathnodes.h |  2 +
 .../regress/expected/incremental_sort.out | 51 +
 src/test/regress/sql/incremental_sort.sql | 31 
 5 files changed, 152 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 2bb6db1df7..686d5883d1 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -203,6 +203,8 @@ static int32 get_expr_width(PlannerInfo *root, const Node *expr);
 static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
+static EquivalenceMember *identify_sort_ecmember(PlannerInfo *root,
+ EquivalenceClass *ec);
 
 
 /*
@@ -2052,22 +2054,21 @@ cost_incremental_sort(Path *path,
 	 */
 	foreach(l, pathkeys)
 	{
-		PathKey*key = (PathKey *) lfirst(l);
-		EquivalenceMember *member = (EquivalenceMember *)
-			linitial(key->pk_eclass->ec_members);
+		PathKey			   *key = (PathKey *) lfirst(l);
+		EquivalenceMember  *em = identify_sort_ecmember(root, key->pk_eclass);
 
 		/*
 		 * Check if the expression contains Var with "varno 0" so that we
 		 * don't call estimate_num_groups in that case.
 		 */
-		if (bms_is_member(0, pull_varnos(root, (Node *) member->em_expr)))
+		if (bms_is_member(0, pull_varnos(root, (Node *) em->em_expr)))
 		{
 			unknown_varno = true;
 			break;
 		}
 
 		/* expression not containing any Vars with "varno 0" */
-		presortedExprs = lappend(presortedExprs, member->em_expr);
+		presortedExprs = lappend(presortedExprs, em->em_expr);
 
 		if (foreach_current_index(l) + 1 >= presorted_keys)
 			break;
@@ -6604,3 +6605,64 @@ compute_gather_rows(Path *path)
 
 	return clamp_row_est(path->rows * get_parallel_divisor(path));
 }
+
+/*
+ * Find suitable member of the equivalence class.
+ * Passing through the list of EC members find the member with minimum of
+ * distinct 

Re: not null constraints, again

2024-11-06 Thread jian he
sql-altertable.html

   
SET/DROP NOT NULL

 
  These forms change whether a column is marked to allow null
  values or to reject null values.
 
 
  If this table is a partition, one cannot perform DROP
NOT NULL
  on a column if it is marked NOT NULL in the parent
  table.  To drop the NOT NULL constraint from all the
  partitions, perform DROP NOT NULL on the parent
  table.
 
Now this will be slightly inaccurate.

drop table if exists part, part0 cascade;
create table part (a int not null) partition by range (a);
create table part0 (a int not null);
alter table part attach partition part0 for values from (0) to (1000);
alter table ONLY part0 add primary key(a);
alter table part alter column a drop not null;

As the example shows, the not-null constraint on part0 is still there.
That means:

performing DROP NOT NULL on the parent table
will not drop the NOT NULL constraint from all partitions.

So we need to rephrase the following sentence:

To drop the NOT NULL constraint from all the
  partitions, perform DROP NOT NULL on the parent
  table.

to address this kind of corner case?




Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY

2024-11-06 Thread Amit Kapila
On Tue, Nov 5, 2024 at 12:53 PM Shlok Kyal  wrote:
>
> To avoid the issue, we can disallow UPDATE/DELETE on table with
> unpublished generated column as REPLICA IDENTITY. I have attached a
> patch for the same.
>

+CREATE PUBLICATION pub_gencol FOR TABLE testpub_gencol;
+UPDATE testpub_gencol SET a = 100 WHERE a = 1;
+ERROR:  cannot update table "testpub_gencol"
+DETAIL:  Column list used by the publication does not cover the
replica identity.

This is not a correct ERROR message as the publication doesn't have
any column list associated with it. You have added the code to detect
this in the column list code path which I think is not required. BTW,
you also need to consider the latest commit 7054186c4e for this. I
guess you need to keep another flag in PublicationDesc to detect this
and then give an appropriate ERROR.

-- 
With Regards,
Amit Kapila.




Re: Remove an obsolete comment in gistinsert()

2024-11-06 Thread Michael Paquier
On Thu, Nov 07, 2024 at 10:57:08AM +0800, Tender Wang wrote:
> Thanks for reviewing this. I have added it to the 2025-01 commitfest.

Right.  I don't quite see why this comment would apply anymore, and
the commit you are pointing to looks right.  Will fix.
--
Michael


signature.asc
Description: PGP signature


Re: Remove an obsolete comment in gistinsert()

2024-11-06 Thread Tender Wang
Michael Paquier  于2024年11月7日周四 14:11写道:

> On Thu, Nov 07, 2024 at 10:57:08AM +0800, Tender Wang wrote:
> > Thanks for reviewing this. I have added it to the 2025-01 commitfest.
>
> Right.  I don't quite see why this comment would apply anymore, and
> the commit you are pointing to looks right.  Will fix.
>

Thanks for pushing. I have changed the status to committed on commitfest
2025-01


-- 
Thanks,
Tender Wang


Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

2024-11-06 Thread Peter Eisentraut

On 05.11.24 19:19, Robert Haas wrote:

>> 1) EXPLAIN ANALYZE is sometimes very confusing (because there is ANALYZE).
>> Let's rename it to EXPLAIN EXECUTE?
>
> The trouble is that EXPLAIN EXECUTE already means something.
>
> robert.haas=# explain execute foo;
> ERROR:  prepared statement "foo" does not exist
>
> Granted, that would not make it impossible to make EXPLAIN (EXECUTE) a
> synonym for EXPLAIN (ANALYZE), but IMHO it would be pretty confusing
> if EXPLAIN EXECUTE and EXPLAIN (EXECUTE) did different things.


At some point in the past, the idea of renaming EXPLAIN ANALYZE to 
PROFILE was thrown around.  I still kind of like that idea.  You'd have 
to keep the existing syntax around, of course.






Re: magical eref alias names

2024-11-06 Thread Peter Eisentraut

On 06.11.24 20:06, Robert Haas wrote:

> I can make nothing of *TLOCRN* or *TROCRN*, even
> after looking at the relevant source code,


These are from the SQL standard text.  So they are more guidance to the 
implementer than anything else.  I think something had to be put there, 
because erefs are required.  I'm also interested in the discussion you 
raise about that.


(I think they are meant to be something like "table {left|right} operand 
cycle reference name".)






Re: Pgoutput not capturing the generated columns

2024-11-06 Thread Peter Eisentraut

On 07.11.24 05:13, Amit Kapila wrote:

> On Wed, Nov 6, 2024 at 4:18 PM vignesh C wrote:
>>
>> The attached v50 version patch has the changes for the same.


Could you (everybody on this thread) please provide guidance on how this 
feature is supposed to interact with virtual generated columns [0]?  I 
don't think it's reasonably possible to replicate virtual generated 
columns.  I had previously suggested making it more explicit that this 
feature only works for stored generated columns (e.g., name the option 
like that), but I don't see that this was considered.


[0]: 
https://www.postgresql.org/message-id/flat/a368248e-69e4-40be-9c07-6c3b5880b...@eisentraut.org





Re: Doc: typo in ddl.sgml

2024-11-06 Thread kasaharatt

Hi, thanks for your comment.


On Wednesday, November 6, 2024, kasaharatt
 wrote:.


I noticed what seems to me to be a mistake in the sample code in the
documentation.


https://www.postgresql.org/docs/devel/ddl-constraints.html#DDL-CONSTRAINTS-FK

[1]

I think it is a typo for user_id instead of author_id.


It’s hard to argue it's a typo when three different places need to be
changed, and the example is fully functional as-is.

Yeah, you are right.
I mistakenly thought it was syntactically incorrect. The sample code 
works correctly.


Best regards,




Re: not null constraints, again

2024-11-06 Thread jian he
RemoveInheritance
if (copy_con->coninhcount <= 0) /* shouldn't happen */
elog(ERROR, "relation %u has non-inherited constraint \"%s\"",
 RelationGetRelid(child_rel), NameStr(copy_con->conname));
dropconstraint_internal
if (childcon->coninhcount <= 0) /* shouldn't happen */
elog(ERROR, "relation %u has non-inherited constraint \"%s\"",
 childrelid, NameStr(childcon->conname));

The RemoveInheritance error can be triggered (see below), and the one in
dropconstraint_internal probably can be too.
Does that mean the error messages should use RelationGetRelationName
rather than the plain "relation %u"?


drop table if exists inh_parent,inh_child1,inh_child2;
create table inh_parent(f1 int not null no inherit);
create table inh_child1(f1 int not null no inherit);
alter table inh_child1 inherit inh_parent;
alter table inh_child1 NO INHERIT inh_parent;
ERROR:  relation 26387 has non-inherited constraint "inh_child1_f1_not_null"

sql-altertable.html
INHERIT parent_table
This form adds the target table as a new child of the
specified parent table.
Subsequently, queries against the parent will include records
of the target
table.  To be added as a child, the target table must already
contain all the
same columns as the parent (it could have additional columns,
too).  The columns
must have matching data types, and if they have NOT NULL
constraints in the
parent then they must also have NOT NULL constraints in the child.

"
The columns must have matching data types, and if they have NOT NULL
constraints in the
 parent then they must also have NOT NULL constraints in the child.
"
For the above sentence, we need to add some text explaining how
NOT NULL constraints with the NO INHERIT property are handled
for the child table and the parent table.


drop table if exists inh_parent,inh_child1,inh_child2;
create table inh_parent(f1 int not null no inherit);
create table inh_child1(f1 int);
alter table inh_child1 inherit inh_parent;
alter table inh_child1 NO INHERIT inh_parent;
ERROR:  1 unmatched constraints while removing inheritance from
"inh_child1" to "inh_parent"

So now we cannot "uninherit" inh_child1 from inh_parent?
I am not sure this is the expected behavior.




Re: UUID v7

2024-11-06 Thread Masahiko Sawada
On Wed, Nov 6, 2024 at 10:14 AM Andrey M. Borodin  wrote:
>
>
>
> > On 5 Nov 2024, at 23:56, Andrey M. Borodin  wrote:
> >
> > 
>
> Some more thoughts on this patch version:
>
> 0. Comment mentioning nanoseconds, while we do not need to carry anything
> /* Convert TimestampTz back and carry nanoseconds. */
>
> 1. There's unnecessary &3 in
> uuid->data[7] = uuid->data[7] | ((uuid->data[8] >> 6) & 3);
>
> 2. Currently we store 0..999 microseconds in 10 bits, so values 1000..1023 
> are unused. We could use them for overflow. That would slightly increase 
> non-overflowing capacity when generating more than a million UUIDs per second 
> on one backend. However, given the current performance of our CSPRNG I do not 
> think this feature is worth the code complexity.
>

While using only a 10-bit microsecond value makes the implementation simple,
I'm not sure 10 bits is enough to generate UUIDs at microsecond
granularity without losing monotonicity. Since the 10-bit microseconds are
used as-is in the rand_a space, only 1000 UUIDs can be generated per
millisecond without losing monotonicity.

For example, in my environment, it took 1808 milliseconds to generate
1 million UUIDs. This is about 553 UUIDs generated per millisecond. As
UUID generation performance improves, I think 10 bits will not be
enough.

=# select count(uuidv7()) from generate_series(1, 1_000_000);
  count
---------
 1000000
(1 row)

Time: 1808.734 ms
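
To make the space constraint concrete, here is a rough sketch (mine, not the
patch's code; the helper name and exact bit placement are assumptions) of how
a 0..999 microsecond fraction could occupy the top 10 bits of the 12-bit
rand_a field in the RFC 9562 layout:

#include <stdint.h>

/*
 * Sketch only: pack the millisecond timestamp and a 0..999 microsecond
 * fraction into a UUIDv7.  With 10 bits for the fraction, at most 1000
 * distinct values fit per millisecond, which is the limit discussed above.
 */
static void
uuidv7_set_time(unsigned char uuid[16], uint64_t unix_ts_ms,
                unsigned usec, unsigned rand2)
{
    /* 48-bit millisecond timestamp, big-endian */
    for (int i = 0; i < 6; i++)
        uuid[i] = (unsigned char) (unix_ts_ms >> (40 - 8 * i));

    /* version 7 in the high nibble, top 4 of the 10 fraction bits below it */
    uuid[6] = (unsigned char) (0x70 | ((usec >> 6) & 0x0F));

    /* low 6 fraction bits, plus 2 random bits to fill out rand_a */
    uuid[7] = (unsigned char) (((usec & 0x3F) << 2) | (rand2 & 0x03));
}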

I found a similar comment from Sergey Prokhorenko[1]. He also mentioned:

> 4) Microsecond timestamp fraction subtracts 10 bits from random data, which 
> increases the risk of collision. In the counter, almost all bits are 
> initialized with a random number, which reduces the risk of collision.

I feel that it's better to switch to Method 1 or 2 with a 12-bit or
larger counter space.

Regards,

[1] 
https://www.postgresql.org/message-id/305478845.5279532.1712440778735%40mail.yahoo.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Windows meson build

2024-11-06 Thread Peter Eisentraut

On 05.11.24 07:32, Kohei Harikae (Fujitsu) wrote:

I do not use pkgconf in my Windows environment.
In my Windows environment, I could not build the following OSS with meson.
- 0001 icu
- 0002 libxml
- 0003 libxslt
- 0004 lz4
- 0005 tcl
- 0006 zlib
- 0007 zstd

In the thread at [1], I created a patch like the one in the attached file, and now I can 
build.
Would it be possible to build these libraries with meson on Windows without using pkgconf?


The problem I see with these patches is that they create a separate code 
path for Windows.  Right now, all platforms use the same code path for 
discovering the dependencies, which is nice for maintainability.


So I suggest the solution here is for you to install pkgconf and use it.






Re: define pg_structiszero(addr, s, r)

2024-11-06 Thread David Rowley
On Thu, 7 Nov 2024 at 07:34, Peter Eisentraut  wrote:
> Speaking of which, couldn't you just use
>
>  pg_popcount(ptr, len) == 0

That might be quite good for small lengths or for use cases where the
memory is always or almost always zero. The problem is there's no
early exit when you find the first non-zero which means, for larger
lengths, reading much more memory. That'll both take longer and
possibly evict cache lines which might be useful to have in the near
future.

David




Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2024-11-06 Thread Peter Geoghegan
On Mon, Nov 4, 2024 at 4:58 PM Matthias van de Meent
 wrote:
> This is a review on v11, not the latest v13. I suspect most comments
> still apply, but I haven't verified this.

v11 is indeed quite similar to v13, so this shouldn't really matter.

> I'm a bit concerned about the additional operations that are being
> added to the scan. Before this patch, the amount of work in the
> "horizontal" portion of the scan was limited to user-supplied
> scankeys, so O(1) even when the index condition is only (f < 7). But,
> with this patch, we're adding work for (a=, b=, c=, etc.) for every
> tuple in the scan.

There's no question that there are still some cases where this cannot
possibly pay for itself. And that just isn't acceptable -- no
arguments here.

> As these new "skip array" keys are primarily useful for inter-page
> coordination (by determining if we want to start a primitive scan to
> skip to a different page and which value range that primitive scan
> would search for, or continue on to the next sibling), can't we only
> apply the "skip array" portion of the code at the final tuple we
> access on this page?

I plan on doing something like that. I'll need to.

AFAICT I only need to avoid wasting CPU cycles here -- there are no
notable regressions from performing excessive amounts of index
descents, as far as I can tell. And so I plan on doing this without
fundamentally changing anything about the current design.

In particular, I want the performance to remain robust in cases where
the best strategy varies significantly as the scan progresses -- even
when we need to strongly favor skipping at first, and then strongly
favor staying/not skipping later on (all during the same individual
index scan). I really like that the current design gets that part
right.

> While this section already defines some things about index scans which
> seem btree-specific, I don't think we should add more references to
> btree scan internals in a section about bitmaps and bitmap index
> scans.

This section of the docs discusses the trade-off between multiple
single column indexes, and fewer multi-column indexes. How could skip
scan not be relevant to such a discussion? One of the main benefits of
skip scan is that it'll allow users to get by with fewer indexes.

> > +++ b/src/backend/access/nbtree/nbtree.c
> [...]
> > -slock_tbtps_mutex;/* protects above variables, 
> > btps_arrElems */
> > +LWLockbtps_lock;/* protects above variables, 
> > btps_arrElems */
>
> Why is this changed to LWLock, when it's only ever acquired exclusively?

In general one should never do more than an extremely small, tightly
controlled amount of work with a spinlock held. It's now possible that
we'll allocate memory with the lock held -- doing that with a spinlock
held is an obvious no-no. We really need to use an LWLock for this
stuff now, on general principle.

> > +btestimateparallelscan(Relation rel, int nkeys, int norderbys)
>
> I notice you're using DatumSerialize. Are there reasons why we
> wouldn't want to use heap_fill_tuple, which generally produces much
> more compact output?

heap_fill_tuple doesn't support the notion of -inf and +inf scan key
sentinel values. Plus I'm inclined to use DatumSerialize because it's
more or less designed for this kind of problem.

> Also, I think you can limit the space usage to BLCKSZ in total,
> because a full index tuple can't be larger than 1/3rd of a block; and
> for skip scans we'll only have known equality bounds for a prefix of
> attributes available in the index tuples, and a single (?)
> index-produced dynamic attribute we want to skip ahead of. So, IIUC,
> at most we'll have 2 index tuples' worth of data, or 2/3 BLCKSZ.
> Right?

Possibly, but who wants to take a chance? The scheme you're describing
only saves memory when there's 3 skip arrays, which is fairly unlikely
in general.

I think that the approach taken to serializing the array keys should
be as conservative as possible. It's not particularly likely that
we'll want to do a parallel skip scan. It's rather hard to test those
code paths.

> I needed to look up what this 'cons up' thing is, as it wasn't
> something that I'd seen before. It also seems used exclusively in
> btree code, and only after the array keys patch, so I think it'd be
> better in general to use 'construct' here.

FWIW I wasn't the first person to use the term in the nbtree code.

I think you're right, though. It is a needlessly obscure term that is
only known to Lisp hackers. I'll fix it.

> > +++ b/src/backend/access/nbtree/nbtcompare.c
>
> The changes here are essentially 6x the same code, but for different
> types. What do you think about the attached
> 0001-Deduplicate[...].patch.txt, which has the same effect but with
> only 1 copy of the code checked in?

Reminds me of the approach taken by this extension:

https://github.com/petere/pguint

I do find the need to write so much boilerplate code for B-Tree
opclasses annoying. I

Re: Changing shared_buffers without restart

2024-11-06 Thread Vladlen Popolitov
Hi

I tried to apply the patches, but failed. I suspect the problem is CRLF line 
endings in the patch files. After manually converting v1-0001 and v1-0002 from 
CRLF to LF, those patches applied, but that did not help for v1-0003 through 
v1-0005: they also have other errors during the patch process. Could you check 
the patch files and resubmit them in the correct format?

The new status of this patch is: Waiting on Author


magical eref alias names

2024-11-06 Thread Robert Haas
Hi,

When a query references a normal relation, the RangeTblEntry we
construct sets rte->eref->aliasname to either the user-specified alias
name, if there is one, or otherwise to the table name. But if there's
neither a user-specified alias name nor a table name, then we make up
a name. I have two related questions about this. First, why do we do
it at all? Second, why do we assign the names as we do?

Let me start with the second question. After some surveying of the
source code, here's a (possibly-incomplete) list of eref names that we
fabricate: old, new, *SELECT*, ANY_subquery, *MERGE*, *RESULT*,
*SELECT*, excluded, unnamed_subquery, unnamed_join, *GROUP*, *TLOCRN*,
*TROCRN*, *SELECT* %d where %d is an integer, *VALUES*, xmltable,
json_table. Of these, old, new, and excluded seem to make total sense
-- in certain contexts, the user can actually use those names in SQL
expressions to refer to stuff. But the rest, to my knowledge, can't be
referenced within the query, so I suppose they're just for display
purposes. But that seems like an odd choice, because these artificial
names are (1) not unique, which means that if they are referenced
anywhere such references are ambiguous; (2) not consistent in terms of
capitalization and punctuation, which is questionable for something
whose primary purpose is to inform the user; and (3) sometimes
incomprehensible -- I can make nothing of *TLOCRN* or *TROCRN*, even
after looking at the relevant source code, and I wouldn't know what
the distinction is between *SELECT* and *SELECT* %d without looking at
the source code.

And that brings me to the first question, which is why we're even
making up these names at all (with the exceptions of new, old, and
excluded, which serve a clear purpose). If we needed them to identify
things, that would make sense, but since they're neither unique nor
comprehensible, they're not really any good for that. And it seems
downright confusing at best, and a source of bugs at worst, to mix
together user-supplied names and system-generated names in such a way
that the one can't be easily distinguished from the other. We even
have regression tests verifying that if the user explicitly enters
unnamed_join or unnamed_subquery as an alias name, it doesn't break
anything due to confusion with identical alias names that might be
generated internally, which seems like good evidence for the
proposition that I'm not the first person to worry about this being
bug-prone. Granted, there are some cases where these names make their
way into EXPLAIN output. For example, this query from the regression
tests:

select * from int4_tbl o where (f1, f1) in
  (select f1, generate_series(1,50) / 10 g from int4_tbl i group by f1);

produces EXPLAIN output that includes this:

 ->  Subquery Scan on "ANY_subquery"
   Output: "ANY_subquery".f1, "ANY_subquery".g
   Filter: ("ANY_subquery".f1 = "ANY_subquery".g)

However, that seems to be a minority position. Many of the
system-generated eref names do not appear in EXPLAIN output, at least
not in our regression tests. Furthermore, EXPLAIN itself does
post-processing of the relations that appear in the output to set the
final names that are displaced (see set_rtable_names()), so if we
didn't insert names at parse time, we'd still have an opportunity to
make up something for display purposes. You could argue that the
results would be worse, but the current results aren't especially
amazing so I'm not sure I believe that. Perhaps with some elbow grease
they would even be better.

So overall I'm just confused here. Is this just a bunch of old cruft
that has never been cleaned up or standardized, or does it serve some
valuable purpose that is not obvious to me?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: define pg_structiszero(addr, s, r)

2024-11-06 Thread David Rowley
On Thu, 7 Nov 2024 at 00:40, Bertrand Drouvot
 wrote:
> Do you mean add:
>
> "
> for (; p < aligned_end; p += sizeof(size_t))
> {
>if (*(size_t *)p != 0)
>return false;
> }
> "
>
> just before the last loop?
>
> If so, I did a few tests and did not see any major improvements. So, I thought
> it's simpler to not add more code in this inline function in v7 shared 
> up-thread.

Did you try with a size where there's a decent remainder, say 124
bytes? FWIW, one of the cases has 112 bytes, and I think that is
aligned memory meaning we'll do the first 64 in the SIMD loop and have
to do 48 bytes in the byte-at-a-time loop. If you had the loop Michael
mentioned, that would instead be 6 loops of size_t-at-a-time.
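
To illustrate the shape being discussed, here is a minimal sketch (mine, not
the proposed pg_structiszero()) that handles unaligned head bytes, an aligned
size_t-at-a-time middle, and the remaining tail bytes, with an early exit
throughout:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static inline bool
region_is_zero(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = p + len;
    const unsigned char *aligned_end = (const unsigned char *)
        ((uintptr_t) end & ~(uintptr_t) (sizeof(size_t) - 1));

    /* bytes until p reaches size_t alignment (or the end) */
    while (p < end && ((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
    {
        if (*p++ != 0)
            return false;
    }

    /* size_t-at-a-time over the aligned middle, exiting on the first non-zero */
    for (; p < aligned_end; p += sizeof(size_t))
    {
        if (*(const size_t *) p != 0)
            return false;
    }

    /* remaining tail bytes */
    for (; p < end; p++)
    {
        if (*p != 0)
            return false;
    }
    return true;
}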

David




Re: Commit Timestamp and LSN Inversion issue

2024-11-06 Thread Jan Wieck

On 11/6/24 06:23, Amit Kapila wrote:


I think we avoid calling hook/callback functions after holding a lock
(spinlock or LWLock) as the user may do an expensive operation or
acquire some other locks in those functions which could lead to
deadlocks or impact the concurrency. So, it would be better to
directly call an inline function to perform the required operation.


This is a valid concern. The reason why I kept it as a hook is because 
ReserveXLogInsertLocation() has no knowledge that this is allocating WAL 
space for a commit record. Only the caller does. We certainly need to be 
extremely careful what any such hook function is doing. Acquiring 
additional locks is definitely not acceptable. But I am not sure we 
should burden this function with specialized knowledge about what it is 
reserving WAL space for.
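
For what it's worth, a minimal sketch of the bump logic under discussion could
look like the following; the type and field names are assumptions of mine, not
the patch's, and the caller is assumed to hold whatever lock protects the
shared state:

#include <stdint.h>

typedef int64_t TimestampTz;        /* microseconds, as in PostgreSQL */

typedef struct CommitTsShmemState
{
    TimestampTz last_committs;      /* highest commit timestamp written so far */
} CommitTsShmemState;

/*
 * Nudge a commit timestamp forward by one microsecond whenever it would
 * otherwise be less than or equal to the previously written one, so that
 * commit-timestamp order cannot run backwards relative to WAL order.
 */
static TimestampTz
bump_commit_ts(CommitTsShmemState *shared, TimestampTz xact_time)
{
    if (xact_time <= shared->last_committs)
        xact_time = shared->last_committs + 1;
    shared->last_committs = xact_time;
    return xact_time;
}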



Regards, Jan





optimize file transfer in pg_upgrade

2024-11-06 Thread Nathan Bossart
For clusters with many relations, the file transfer step of pg_upgrade can
take the longest.  This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations.  However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name.  Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

The attached proof-of-concept patches implement this "catalog-swap" mode
for demonstration purposes.  I tested this mode on a cluster with 200
databases, each with 10,000 tables with 1,000 rows and 2 unique constraints
apiece.  Each database also had 10,000 sequences.  The test used 96 jobs.

  pg_upgrade --link --sync-method syncfs  -->  10m 23s (~5m linking)
  pg_upgrade --catalog-swap   -->  5m 32s (~30s linking)

While these results are encouraging, there are a couple of interesting
problems to manage.  First, in order to move the data directory from the
old cluster to the new cluster, we will have first moved the new cluster's
data directory (full of files created by pg_restore) aside.  After the file
transfer stage, this directory will be filled with useless empty files that
should eventually be deleted.  Furthermore, none of these files will have
been synchronized to disk (outside of whatever the kernel has done in the
background), so pg_upgrade's data synchronization step can take a very long
time, even when syncfs() is used (so long that pg_upgrade can take even
longer than before).  After much testing, the best way I've found to deal
with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files.  If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.
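
To make the "sync everything except the data files" idea concrete, here is a
rough sketch of an action function in the shape walkdir() expects; the
purely-numeric-name heuristic and the decision to skip such files are my
assumptions for illustration, not necessarily what the patch would do:

#include <ctype.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Guess whether a file name looks like a relation data file: purely
 * numeric, optionally with a ".N" segment suffix. */
static bool
looks_like_datafile(const char *path)
{
    const char *base = strrchr(path, '/');

    base = base ? base + 1 : path;
    if (!isdigit((unsigned char) *base))
        return false;
    for (const char *p = base; *p; p++)
    {
        if (!isdigit((unsigned char) *p) && *p != '.')
            return false;
    }
    return true;
}

/* walkdir() action: fsync everything except relation data files. */
static int
fsync_unless_datafile(const char *fname, bool isdir)
{
    int         fd;
    int         rc;

    if (!isdir && looks_like_datafile(fname))
        return 0;               /* assumed durable already, or rebuilt later */
    fd = open(fname, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}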

Another interesting problem is that pg_upgrade currently doesn't transfer
the sequence data files.  Since v10, we've restored these via pg_restore.
I believe this was originally done for the introduction of the pg_sequence
catalog, which changed the format of sequence tuples.  In the new
catalog-swap mode I am proposing, this means we need to transfer all the
pg_restore-generated sequence data files.  If there are many sequences, it
can be difficult to determine which transfer mode and synchronization
method will be faster.  Since sequence tuple modifications are very rare, I
think the new catalog-swap mode should just use the sequence data files
from the old cluster whenever possible.

There are a couple of other smaller trade-offs with this approach, too.
First, this new mode complicates rollback if, say, the machine loses power
during file transfer.  IME the vast majority of failures happen before this
step, and it should be relatively simple to generate a script that will
safely perform the required rollback steps, so I don't think this is a
deal-breaker.  Second, this mode leaves around a bunch of files that users
would likely want to clean up at some point.  I think the easiest way to
handle this is to just put all these files in the old cluster's data
directory so that the cleanup script generated by pg_upgrade also takes
care of them.

Thoughts?

-- 
nathan
>From f800010296b1749b57e0fe3dcde010cc2ba41973 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 5 Nov 2024 15:59:51 -0600
Subject: [PATCH v1 1/8] Export walkdir().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to swap catalog files
between database directories during pg_upgrade.
---
 src/common/file_utils.c | 5 +
 src/include/common/file_utils.h | 3 +++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 398fe1c334..3f488bf5ec 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -48,9 +48,6 @@
 #ifdef PG_FLUSH_DATA_WORKS
 static int pre_sync_fname(const char *fname, bool isdir);
 #endif
-static void walkdir(const char *path,
-   int (*action) (const char *fname, bool 
isdir),
-   bool process_symlinks);
 
 #ifdef HAVE_SYNCFS
 
@@ -268,7 +265,7 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod 
sync_method)
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
  */
-static void
+void
 walkdir(const char *path,
int (*action) (const char *fname, bool isdir),
bool process_symlinks)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index e4339fb7b6..5a9519acfe 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -3

Re: define pg_structiszero(addr, s, r)

2024-11-06 Thread Peter Eisentraut

On 05.11.24 16:03, Bertrand Drouvot wrote:

On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:

On Tue, 5 Nov 2024 at 06:39, Ranier Vilela  wrote:

I think we can add a small optimization to this last patch [1].


I think if you want to make it faster, you could partially unroll the
inner-most loop, like:

// size_t * 4
for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
{
 if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 | ((size_t *)
p)[2] != 0 | ((size_t *) p)[3] != 0)
 return false;
}


Another option could be to use SIMD instructions to check multiple bytes
is zero in a single operation. Maybe just an idea to keep in mind and experiment
if we feel the need later on.


Speaking of which, couldn't you just use

pg_popcount(ptr, len) == 0

?





Re: FW: Building Postgres 17.0 with meson

2024-11-06 Thread Robert Haas
On Wed, Nov 6, 2024 at 2:59 PM Mark Hill  wrote:
> Srinath is in India I believe and not available currently.   Does anybody 
> have any idea why meson
>
> is not finding the paths I’m specifying with the -Dextra_lib_dirs and  
> -Dextra_include_dirs?  See below.

I am not an expert on this topic, but do use these options on macOS
and they do work for me. A typical 'meson setup' invocation for me is:

meson setup $HOME/pgsql $HOME/pgsql-meson -Dcassert=true -Ddebug=true
-Dextra_include_dirs=/opt/local/include
-Dextra_lib_dirs=/opt/local/lib -Dprefix=$HOME/install/dev

In your example, you set
extra_lib_dirs=%openssl_directory%,%zlib_directory%,%icu4c_directory%.
Those aren't actually pathnames, so something (your shell, for
instance) would have to be substituting the actual correct pathnames.
Maybe that isn't happening, or maybe the pathnames that are being
substituted are not the right ones, or maybe they don't contain the
files that meson thinks they ought to contain.

One thing that you could do is have a look at the meson log file. On
my machine, that shows up inside the directory where I run 'meson
setup' at meson-logs/meson-logs.txt. I am not sure if it's different
on Windows. If you look in that file for strings like icu or openssl
or zlib, you may be able to understand what is going wrong.

As far as "Srinath is in India I believe and not available currently,"
while that is probably true, I'm not sure that's really the issue.
Your original email didn't give the exact command you ran, the exact
output you got, or any error messages or log files. Your follow-up
email didn't include any of that either, except material quoted from
another email that you apparently sent off-list, and even that wasn't
nearly as detailed as many posts that you will see here. Also, see
also the "Email etiquette mechanics" section of
https://wiki.postgresql.org/wiki/Mailing_Lists which suggests keeping
replies on-list and not top-posting.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Rename Function: pg_postmaster_start_time

2024-11-06 Thread Tom Lane
Daniel Gustafsson  writes:
> Looking at similar functions it's clear they don't use the pg_postgres_ 
> prefix,
> like for example pg_conf_load_time.  If so, should this be pg_start_time?

No, because it's not clear what start time you're talking about;
aside from the postmaster, this could plausibly refer to the
session start, transaction start, or query start time.

Like David, I'm not seeing a good argument for changing anything
here.

regards, tom lane




Re: Popcount optimization using AVX512

2024-11-06 Thread Nathan Bossart
rebased

-- 
nathan
>From d0fb7e0e375f7b76d4df90910c21e9448dd3b380 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v3 1/1] use __attribute__((target(...))) for AVX-512 stuff

---
 config/c-compiler.m4 |  64 +-
 configure| 167 +++
 configure.ac |  17 +--
 meson.build  |  21 ++--
 src/Makefile.global.in   |   5 -
 src/include/c.h  |  10 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  12 +-
 src/port/meson.build |   7 +-
 src/port/pg_popcount_avx512.c|  86 +-
 src/port/pg_popcount_avx512_choose.c | 102 
 11 files changed, 183 insertions(+), 312 deletions(-)
 delete mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..c7eb896f14 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -700,20 +700,22 @@ undefine([Ac_cachevar])dnl
 # Check if the compiler supports the XSAVE instructions using the _xgetbv
 # intrinsic function.
 #
-# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
-# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+# If the intrinsics are supported, sets pgac_xsave_intrinsics.
 AC_DEFUN([PGAC_XSAVE_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
-  [return _xgetbv(0) & 0xe0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl
+AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include 
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("xsave")))
+#endif
+static int xsave_test(void)
+{
+  return _xgetbv(0) & 0xe0;
+}],
+  [return xsave_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_XSAVE="$1"
   pgac_xsave_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
@@ -725,29 +727,29 @@ undefine([Ac_cachevar])dnl
 # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
 # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
 #
-# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
-# -mavx512bw).  If the intrinsics are supported, sets
-# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics.
 AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
-  [const char buf@<:@sizeof(__m512i)@:>@;
-   PG_INT64_TYPE popcnt = 0;
-   __m512i accum = _mm512_setzero_si512();
-   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, 
(const __m512i *) buf);
-   const __m512i cnt = _mm512_popcnt_epi64(val);
-   accum = _mm512_add_epi64(accum, cnt);
-   popcnt = _mm512_reduce_add_epi64(accum);
-   /* return computed value, to prevent the above being optimized away */
-   return popcnt == 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include 
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("avx512vpopcntdq","avx512bw")))
+#endif
+static int popcount_test(void)
+{
+  const char buf@<:@sizeof(__m512i)@:>@;
+  PG_INT64_TYPE popcnt = 0;
+  __m512i accum = _mm512_setzero_si512();
+  const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 
0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+  const __m512i cnt = _mm512_popcnt_epi64(val);
+  accum = _mm512_add_epi64(accum, cnt);
+  popcnt = _mm512_reduce_add_epi64(accum);
+  return (int) popcnt;
+}],
+  [return popcount_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_POPCNT="$1"
   pgac_avx512_popcnt_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
diff --git a/configure b/configure
index 6e256b417b..3a7332f834 100755
--- a/configure
+++ b/configure
@@ -647,9 +647,6 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
-PG_POPCNT_OBJS
-CFLAGS_POPCNT
-CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17272,185 +17269,103 @@ fi
 
 # Check for XSAVE intrinsics
 #
-CFLAGS_XSAVE=""
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
-$as_echo_n "ch

RE: Commit Timestamp and LSN Inversion issue

2024-11-06 Thread Zhijie Hou (Fujitsu)
On Tuesday, November 5, 2024 9:59 PM Jan Wieck  wrote:
> 
> Hi Hackers,
> 
> On 9/5/24 01:39, Amit Kapila wrote:
> >
> > We can't forget CDR completely as this could only be a potential
> > problem in that context. Right now, we don't have any built-in
> > resolution strategies, so this can't impact but if this is a problem
> > then we need to have a solution for it before considering a solution
> > like "last_write_wins" strategy.
> 
> I agree that we can't forget about CDR. This is precisely the problem we ran 
> into
> here at pgEdge and why we came up with a solution (attached).
> 
> > Now, instead of discussing LSN<->timestamp inversion issue, you
> > started to discuss "last_write_wins" strategy itself which we have
> > discussed to some extent in the thread [2]. BTW, we are planning to
> > start a separate thread as well just to discuss the clock skew problem
> > w.r.t resolution strategies like "last_write_wins" strategy. So, we
> > can discuss clock skew in that thread and keep the focus of this
> > thread LSN<->timestamp inversion problem.
> 
> Fact is that "last_write_wins" together with some implementation of Conflict
> free Replicated Data Types (CRDT) is good enough for many real world
> situations. Anything resembling a TPC-B or TPC-C is quite happy with it.
> 
> The attached solution is minimally invasive because it doesn't move the
> timestamp generation (clock_gettime() call) into the critical section of
> ReserveXLogInsertLocation() that is protected by a spinlock. Instead it keeps
> track of the last commit-ts written to WAL in shared memory and simply bumps
> that by one microsecond if the next one is below or equal.
> There is one extra condition in that code section plus a function call by 
> pointer
> for every WAL record. In the unlikely case of encountering a stalled or
> backwards running commit-ts, the expensive part of recalculating the CRC of
> the altered commit WAL-record is done later, after releasing the spinlock. I 
> have
> not been able to measure any performance impact on a machine with 2x
> Xeon-Silver (32 HT cores).
> 
> This will work fine until we have systems that can sustain a commit rate of 
> one
> million transactions per second or higher for more than a few milliseconds.

Thanks for the patch! I am reading the patch and noticed a few minor things.

1.
+   /*
+* This is a local transaction. Make sure that the xact_time
+* higher than any timestamp we have seen thus far.
+*
+* TODO: This is not postmaster restart safe. If the local
+* system clock is further behind other nodes than it takes
+* for the postmaster to restart (time between it stops
+* accepting new transactions and time when it becomes ready
+* to accept new transactions), local transactions will not
+* be bumped into the future correctly.
+*/

The TODO section mentions other nodes, but I believe the patch currently does
not handle timestamps for other nodes. Should we either remove
this part or add a brief explanation to clarify the relationship with other
nodes?

2.
+/*
+ * Hook function to be called while holding the WAL insert spinlock.
+ * to adjust commit timestamps via Lamport clock if needed.
+ */

The second line seems like it can be improved:
"to adjust commit timestamps .." => "It adjusts commit timestamps ..."


Best Regards,
Hou zj




Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

2024-11-06 Thread Michael Christofides
>
> I'm not against analyze = on turning buffers on by default. However, I
> think it would be quite painful to fix the tests if it were on without
> analyze.
>

This would be amazing. I find BUFFERS especially helpful for
giving developers a clearer idea of why their queries are slow (especially
once converted to KB/MB/GB/TB).

> The trouble is that EXPLAIN EXECUTE already means something.

I completely agree with this. So -1 from me on the first suggestion.

> Let's focus on item 2.

+1 from me on item 2. I'd go further and have VERBOSE flip most other
parameters to on (or to their default for non-booleans), unless specified
otherwise. Specifically SUMMARY, BUFFERS, SETTINGS, WAL,
SERIALIZE (TEXT), and MEMORY. Although I do think it would be best if
BUFFERS and SERIALIZE were ON and TEXT by default respectively with
ANALYZE, which may reduce/remove the need for them to be affected by
VERBOSE.

> If the VERBOSE option turns information about BUFFERS on
> and off, and the BUFFERS option does the same thing, what happens if I
> say EXPLAIN (VERBOSE ON, BUFFERS OFF)? Is it different if I say
> EXPLAIN (BUFFERS OFF, VERBOSE ON)?

I'd expect this to work like other parameters that have dependencies, for
example both EXPLAIN (ANALYZE, SUMMARY OFF) and EXPLAIN
(SUMMARY OFF, ANALYZE) exclude the SUMMARY, even though it is
on by default with ANALYZE. So users could turn off anything they don't
want, if needed.

> I'm not very happy with the current situation. I agree that EXPLAIN
has gotten a bit too complicated.

I agree. In the past 6 versions, 5 new parameters have been added.
SETTINGS in v12, WAL in v13, GENERIC_PLAN in v16, SERIALIZE in
v17, and MEMORY in v17. It feels like we should have some easier way
to get everything. Currently, we need to specify: EXPLAIN (ANALYZE,
VERBOSE, BUFFERS, SETTINGS, WAL, SERIALIZE, MEMORY).

> If you enable an option that adds an extra line of
> output for every node and there are 100 nodes in the query plan, that
> is a whole lot of additional clutter.

This is a fair point, but I think it is worth it in the case of BUFFERS.
The
other parameter that adds a line per node is WAL, but the others don't
add much clutter.

Many people use tools these days to help read plans (I work on one,
so have some biased opinions of course). Tools help folks calculate
timings and spot bottlenecks, so once you're using a tool to read a plan,
more information is often beneficial for minimal overhead.

> This is not likely to fly for compatibility reasons.

I'd be interested to hear more on this front too. One issue is that folks
with auto_explain.log_verbose = on would get extra output in their logs,
but I strongly suspect I'm missing some more important things.

> the fresh SERIALIZE option was discussed in
>
https://www.postgresql.org/message-id/flat/ca0adb0e-fa4e-c37e-1cd7-91170b18cae1%40gmx.de
> (2023-2024, 17)

I noticed in this thread Tom was against SERIALIZE being on by default
with ANALYZE, "because it would silently render EXPLAIN outputs from
different versions quite non-comparable." I'm not sure I agree with the
silently part, as the output from 17+ would include the serialization
details,
but again perhaps I'm missing something important.

> Ready to do legwork here.

Same here.

—
Michael Christofides
Founder, pgMustard


Re: further #include cleanup (IWYU)

2024-11-06 Thread Peter Eisentraut
Seeing no further comments (or any easy alternatives), I have committed 
this last patch as is.



On 28.10.24 10:45, Peter Eisentraut wrote:

On 20.10.24 11:53, Alvaro Herrera wrote:

On 2024-Oct-20, Peter Eisentraut wrote:
diff --git a/src/bin/pg_dump/pg_backup_utils.c b/src/bin/pg_dump/pg_backup_utils.c

index a0045cf5e58..80715979a1a 100644
--- a/src/bin/pg_dump/pg_backup_utils.c
+++ b/src/bin/pg_dump/pg_backup_utils.c
@@ -13,7 +13,9 @@
   */
  #include "postgres_fe.h"
+#ifdef WIN32
  #include "parallel.h"
+#endif
  #include "pg_backup_utils.h"


This seems quite weird and I think it's just because exit_nicely() wants
to do _endthreadex().  Maybe it'd be nicer to add a WIN32-specific
on_exit_nicely_list() callback that does that in parallel.c, and do away
with the inclusion of parallel.h in pg_backup_utils.c entirely?


I was thinking the same thing.  But maybe that should be a separate 
project.



diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index a09247fae47..78e91f6e2dc 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -63,7 +63,9 @@
  #include "fe_utils/string_utils.h"
  #include "parallel.h"
  #include "pg_backup_utils.h"
+#ifdef WIN32
  #include "port/pg_bswap.h"
+#endif


This looks really strange, but then parallel.c seems to have embedded
its own portability layer within itself.


The reason for this one is that pgpipe() uses pg_hton16() and 
pg_hton32().  We could use htons() and htonl() here instead.  That would 
effectively revert that part of commit 0ba99c84e8c.









Re: Time to add a Git .mailmap?

2024-11-06 Thread Alvaro Herrera
On 2024-Nov-06, Michael Paquier wrote:

> And now I'm going to ask how you figured out about the ë in my name,
> because it's right ;)

Hah, I got it from Fabien Coelho,
https://postgr.es/m/alpine.DEB.2.10.1512240729160.17411@sto
and then searched around to confirm that it was correct.

Funnily enough, I'm the only committer that has ever used "Michaël", but
I'm not the only one to have used the mistaken "Paquiër".  Go figure.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense."(Paul Thomas)




Re: New GUC autovacuum_max_threshold ?

2024-11-06 Thread wenhui qiu
Hi frederic.yhuel

> Thank you. FWIW, I would prefer a sub-linear growth, so maybe something
> like this

>   vacthresh = Min(vac_base_thresh + vac_scale_factor * reltuples,
>   vac_base_thresh + vac_scale_factor * pow(reltuples, 0.7) * 100);

>   This would give :

>   * 386M (instead of 5.1 billion currently) for a 25.6 billion tuples
table ;
>   * 77M for a 2.56 billion tuples table (Robert's example) ;
>   * 15M (instead of 51M currently) for a 256M tuples table ;
>   * 3M (instead of 5M currently) for a 25.6M tuples table.
> The other advantage is that you don't need another GUC.
Agreed, we just need to change the calculation formula, but I prefer this
formula because it produces a smoother value.

vacthresh = (float4) fmin(vac_base_thresh + vac_scale_factor * reltuples,
                          vac_base_thresh + vac_scale_factor * log2(reltuples) * 1);
or
vacthresh = (float4) fmin(vac_base_thresh + (vac_scale_factor * reltuples),
                          sqrt(1000.0 * reltuples));
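
For reference, a small standalone sketch (using vac_base_thresh = 50 and
vac_scale_factor = 0.2 purely for illustration) reproduces the sub-linear
numbers quoted above:

#include <math.h>
#include <stdio.h>

#define VAC_BASE_THRESH     50.0
#define VAC_SCALE_FACTOR    0.2

/* Sub-linear threshold as proposed up-thread. */
static double
vacthresh(double reltuples)
{
    double      linear = VAC_BASE_THRESH + VAC_SCALE_FACTOR * reltuples;
    double      sublinear = VAC_BASE_THRESH +
        VAC_SCALE_FACTOR * pow(reltuples, 0.7) * 100;

    return fmin(linear, sublinear);
}

int
main(void)
{
    double      sizes[] = {25.6e6, 256e6, 2.56e9, 25.6e9};

    for (int i = 0; i < 4; i++)
        printf("%.3g tuples -> threshold of about %.3g\n",
               sizes[i], vacthresh(sizes[i]));
    return 0;                   /* prints roughly 3M, 15M, 77M, 386M */
}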

Frédéric Yhuel  wrote on Mon, Aug 12, 2024 at 21:41:

>
>
> On 8/7/24 23:39, Nathan Bossart wrote:
> > I've attached a new patch to show roughly what I think this new GUC
> should
> > look like.  I'm hoping this sparks more discussion, if nothing else.
> >
>
> Thank you. FWIW, I would prefer a sub-linear growth, so maybe something
> like this:
>
> vacthresh = Min(vac_base_thresh + vac_scale_factor * reltuples,
> vac_base_thresh + vac_scale_factor * pow(reltuples, 0.7) * 100);
>
> This would give :
>
> * 386M (instead of 5.1 billion currently) for a 25.6 billion tuples table ;
> * 77M for a 2.56 billion tuples table (Robert's example) ;
> * 15M (instead of 51M currently) for a 256M tuples table ;
> * 3M (instead of 5M currently) for a 25.6M tuples table.
>
> The other advantage is that you don't need another GUC.
>
> > On Tue, Jun 18, 2024 at 12:36:42PM +0200, Frédéric Yhuel wrote:
> >> By the way, I wonder if there were any off-list discussions after
> Robert's
> >> conference at PGConf.dev (and I'm waiting for the video of the conf).
> >
> > I don't recall any discussions about this idea, but Robert did briefly
> > mention it in his talk [0].
> >
> > [0] https://www.youtube.com/watch?v=RfTD-Twpvac
> >
>
> Very interesting, thanks!
>
>
>


Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-11-06 Thread Alvaro Herrera
On 2024-Nov-06, Alvaro Herrera wrote:

> Perhaps one more task for me is to figure out a way to get a list of all
> the constraints that are broken because of this ... let me see if I can
> figure that out.

It's gotta be something like this,

SELECT conrelid::regclass AS "constrained table",
       conname as constraint, confrelid::regclass AS "references"
FROM pg_constraint
WHERE contype = 'f' and conparentid = 0 AND
      (SELECT count(*) FROM pg_constraint p2
        WHERE conparentid = pg_constraint.oid) <>
      (SELECT count(*) FROM pg_inherits
        WHERE inhparent = pg_constraint.conrelid
           OR inhparent = pg_constraint.confrelid);

Essentially, top-level constraints should have as many child
constraints as each partitioned table has direct partitions.  Ideally
here you'd get an empty set, but you won't if the DETACH problem has
occurred.  I'll test this further later or maybe tomorrow as time
allows.

A quick test rig for this is:

create table pk (a int primary key) partition by list (a);
create table pk1 partition of pk for values in (1);
create table pk2367 partition of pk for values in (2, 3, 6, 7) partition by 
list (a);
create table pk67 partition of pk2367 for values in (6, 7) partition by list 
(a);
create table pk2 partition of pk2367 for values in (2);
create table pk3 partition of pk2367 for values in (3);
create table pk6 partition of pk67 for values in (6);
create table pk7 partition of pk67 for values in (7);
create table pk45 partition of pk for values in (4, 5) partition by list (a);
create table pk4 partition of pk45 for values in (4);
create table pk5 partition of pk45 for values in (5);

create table fk (a int references pk) partition by list (a);
create table fk1 partition of fk for values in (1);
create table fk2367 partition of fk for values in (2, 3, 6, 7) partition by 
list (a);
create table fk67 partition of fk2367 for values in (6, 7) partition by list 
(a);
create table fk2 partition of fk2367 for values in (2);
create table fk3 partition of fk2367 for values in (3);
create table fk6 partition of fk67 for values in (6);
create table fk7 partition of fk67 for values in (7);
create table fk45 partition of fk for values in (4, 5) partition by list (a);
create table fk4 partition of fk45 for values in (4);
create table fk5 partition of fk45 for values in (5);

alter table fk detach partition fk2367;



Before the fix, you get

 constrained table │ constraint │ references 
───┼┼
 fk2367│ fk_a_fkey  │ pk
(1 fila)

which means you need to
ALTER TABLE fk2367 DROP CONSTRAINT fk_a_fkey;

and then put it back.  Maybe it'd be better to have the query emit the
commands to drop and reconstruct the FK?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: SQL:2011 application time

2024-11-06 Thread Paul Jungwirth

On 11/4/24 13:16, Sam Gabrielsson wrote:
Foreign key violation errors are incorrectly raised in a few cases for a temporal foreign key with 
default ON UPDATE NO ACTION. The test is based on the committed v39 patches (used a snapshot version of 
PG18 devel available from PGDG).


Thank you for the report! I confirmed that this is a problem. In ri_restrict we fail if any fk 
records still match the being-changed pk, but for temporal keys, if you're merely shrinking the pk range, 
fk references could still wind up being valid (if you're only shrinking it a little). So we need to 
do more work.


In the temporal NO ACTION case something similar to this (though with appropriate locks) could 
perhaps be tested in ri_restrict (when ri_Check_Pk_Match returns false):


   SELECT 1
   FROM (SELECT range_agg(pkperiodatt) AS r
   FROM 
   WHERE pkatt1 = $1 [AND ...]
   AND pkperiodatt && $n) AS pktable,
     (SELECT fkperiodatt AS r
   FROM 
   WHERE fkatt1 = $1 [AND ...]
   AND fkperiodatt && $n) AS fktable
   WHERE NOT fktable.r <@ pktable.r


This solution looks like it will work to me. Basically: find FKs that still match the PK, but only 
fail if they are no longer covered.


IIRC for RESTRICT it is *correct* to reject the change, so we would want to keep the old SQL there, 
and only update it for NO ACTION.


I'll work on a fix and submit another set of patches.

Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: Remove an obsolete comment in gistinsert()

2024-11-06 Thread Tender Wang
Aleksander Alekseev  wrote on Tue, Nov 5, 2024 at 22:08:

> Hi Tender,
>
> > While learning the GIST code, I found an obsolete comment in gistinsert().
> >
> > itup = gistFormTuple(giststate, r,
> >   values, isnull, true /* size is
> currently bogus */ );
>
> Thanks for reporting. I agree that this is an oversight of commit 1f7ef54.
>
> The commit changed the signature of gistFormTuple():
>
> ```
>  IndexTuple
>  gistFormTuple(GISTSTATE *giststate, Relation r,
> - Datum attdata[], int datumsize[], bool isnull[])
> + Datum attdata[], bool isnull[], bool newValues)
> ```
>
> ... but left the comment for the `datumsize` argument:
>
> ```
> itup = gistFormTuple(&buildstate->giststate, index,
> -   values, NULL /* size is currently bogus */, isnull);
> +   values, isnull, true /* size is currently bogus */);
> ```
>
> I checked the rest of gistFormTuple() calls and also looked for other
> comments like this. There seems to be only one call like this to fix.
>

Thanks for reviewing this. I have added it to the 2025-01 commitfest.

-- 
Thanks,
Tender Wang