Re: Should we document how column DEFAULT expressions work?

2024-11-01 Thread Bruce Momjian
On Fri, Jul  5, 2024 at 05:11:22PM -0400, Bruce Momjian wrote:
> Wow, I see that now:
> 
>   test=> SELECT 'now()'::timestamptz;
>            timestamptz
>   -------------------------------
>    2024-07-05 17:04:33.457915-04
> 
> If I remove the 'now()' mention in the docs, patch attached, I am
> concerned people will be confused whether it is the removal of the
> single quotes or the use of "()" which causes insert-time evaluation,
> and they might try 'now()'.

> diff --git a/doc/src/sgml/ref/create_table.sgml 
> b/doc/src/sgml/ref/create_table.sgml
> index f19306e7760..4d47248fccf 100644
> --- a/doc/src/sgml/ref/create_table.sgml
> +++ b/doc/src/sgml/ref/create_table.sgml
> @@ -888,6 +888,13 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal, REM
>match the data type of the column.
>   
>  
> + <para>
> +  Note, a string that returns a volatile result once cast to a data
> +  type, like <literal>'now'::timestamptz</literal>, is evaluated at
> +  table creation time, while <literal>now()::timestamptz</literal>
> +  (without quotes) is evaluated at data insertion time.
> + </para>
> +
>   
>The default expression will be used in any insert operation that
>does not specify a value for the column.  If there is no default

It seems we never came to an agreed-upon documentation addition to warn
users about this.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"




Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread Japin Li
On Fri, 01 Nov 2024 at 21:47, David Rowley  wrote:
> On Fri, 1 Nov 2024 at 20:49, Michael Paquier  wrote:
>>
>> On Fri, Nov 01, 2024 at 07:44:22AM +, Bertrand Drouvot wrote:
>> > Worth to add a comment as to why pg_memory_is_all_zeros() should not
>> > be used here?
>>
>> I would not add one in bufpage.c, documenting that where
>> pg_memory_is_all_zeros() is defined may be more adapted.
>
> The thought of having to write a comment to warn people not to use it
> for performance-critical things makes me think it might be better just
> to write a more optimal version of the function so we don't need to
> warn people. I looked around at the callers of the function I saw the
> following numbers of bytes being used for the length: 8192 (the one in
> question), 88, 32 and 112.
>
> I don't know how performance-critical the final three of those are,
> but I imagine all apart from the 32-byte one might be better with a
> non-inlined and more optimised version of the function. The problem
> with inlining the optimised version is that it's more code to inline.
>
+1

Is there a possible overflow?

+   const char *end = &p[len];

How about using the MAXALIGN64 macro here?

When handling the aligned part, is it possible to handle multiple values
(such as 4 or 8 words) in one iteration?

It might be faster; however, I haven't tested it.
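
For illustration, the kind of thing I have in mind (an untested sketch, not
a patch; the function name is made up):

static inline bool
memory_is_all_zeros_unrolled(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = p + len;

    /* byte by byte until the pointer is size_t-aligned */
    while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
    {
        if (p == end)
            return true;
        if (*p++ != 0)
            return false;
    }

    /* four size_t words per iteration while enough bytes remain */
    while ((size_t) (end - p) >= 4 * sizeof(size_t))
    {
        const size_t *w = (const size_t *) p;

        if ((w[0] | w[1] | w[2] | w[3]) != 0)
            return false;
        p += 4 * sizeof(size_t);
    }

    /* remaining bytes, byte by byte */
    while (p < end)
    {
        if (*p++ != 0)
            return false;
    }

    return true;
}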


-- 
Regards,
Japin Li




Re: Using read_stream in index vacuum

2024-11-01 Thread Kirill Reshke
Hi!

On Fri, 25 Oct 2024 at 17:01, Andrey M. Borodin  wrote:
>
>
>
> > On 25 Oct 2024, at 00:55, Rahila Syed  wrote:
> >
> > While writing this email, I realized I evicted buffers for the table
> > and not the index. I will perform the test again. However,
> > I would like to know your opinion on whether this looks like
> > a valid test.
>
> Well, yes, kind of, you need to drop caches from the index. And, perhaps, you 
> can have more indexes. You also can disable autovacuum and just restart 
> postgres instead of iterating through buffer caches.
>
> I've asked Thomas about performance implications and he told me that 
> converting stuff to the streamlined API is not expected to have better 
> performance. It's needed to have decent performance when DIRECT_IO is 
> involved.
>
> Thanks!
>
>
> Best regards, Andrey Borodin.
>

I noticed CI failure for this patch. This does not look like a flap.

[0] https://cirrus-ci.com/task/4527545259917312
[1] 
https://api.cirrus-ci.com/v1/artifact/task/4527545259917312/log/src/test/modules/test_misc/tmp_check/log/regress_log_007_vacuum_btree

-- 
Best regards,
Kirill Reshke




Re: New function normal_rand_array function to contrib/tablefunc.

2024-11-01 Thread Japin Li
On Fri, 01 Nov 2024 at 09:33, Dean Rasheed  wrote:
> On Wed, 16 Oct 2024 at 08:43, Andy Fan  wrote:
>>
>>  Thanks for the detailed feedback! Here is the rebased version.
>>
>
> I took another look at this and I think it's in reasonable shape.
>
> I'm attaching an update, rebasing it on top of 9be4e5d293.
>
> Also it was missing a required update to the meson.build file --
> that's the immediate cause of the other cfbot failures.
>
> The rest is just cosmetic tidying up, fixing indentation, tweaking
> comments, and the like. I also hacked on the docs a bit -- the
> synopsis only listed one of the new function signatures for some
> reason. After fixing that, I think it's sufficient to just list one
> usage example.
>

LGTM, except that there is a warning when applying the patch.

Applying: tablefunc: Add rand_array() functions.
.git/rebase-apply/patch:475: trailing whitespace.
  rand_array
warning: 1 line adds whitespace errors.

The rest looks good to me.

-- 
Regards,
Japin Li




Re: Separate memory contexts for relcache and catcache

2024-11-01 Thread Andres Freund
Hi,

On 2024-10-29 15:00:02 -0700, Jeff Davis wrote:
> On Wed, 2024-04-03 at 16:12 +0300, Melih Mutlu wrote:
> > Rebased. PSA.
>
> Thank you. I missed your patch and came up with a similar patch over
> here:
>
> https://www.postgresql.org/message-id/flat/78599c442380ddb5990117e281a4fa65a74231af.ca...@j-davis.com
>
> I closed my thread and we can continue this one.
>
> One difference is that I tried to capture almost all uses of
> CacheMemoryContext so that it would become just a parent context
> without many allocations of its own.

I'm a bit worried about the increase in "wasted" memory we might end up with
when creating one aset for *everything*. Just splitting out Relcache and
CatCache isn't a big deal from that angle, they're always used reasonably
much. But creating a bunch of barely used contexts does have the potential for
lots of memory being wasted at the end of a page and on freelists.  It might
be ok as far as what you proposed in the above email goes; I haven't analyzed
that in depth yet.

> I agree with others that we should look at changing the initial size or
> type of the contexts, but that should be a separate commit.

It needs to be done close together though, otherwise we'll increase the
new-connection-memory-usage of postgres measurably.


I've previously proposed creating a type of memory context that's intended for
places where we never expect to allocate much which allocates from either a
superior memory context or just from the system allocator and tracks memory
via linked lists.  That'd allow us to use fairly granular memory contexts with
low overhead, which we e.g. could use to actually create each catcache &
relcache entry in its own context.

One concern that was voiced about that idea was that it'd perform badly if
such a context did end up being used hotly - I'm not sure that's a real
problem, but we could address it by switching to a different allocation scheme
once a certain size is reached.

Greetings,

Andres Freund




Re: Separate memory contexts for relcache and catcache

2024-11-01 Thread Jeff Davis
On Fri, 2024-11-01 at 15:19 -0400, Andres Freund wrote:
> I'm a bit worried about the increase in "wasted" memory we might end
> up when
> creating one aset for *everything*. Just splitting out Relcache and
> CatCache
> isn't a big deal from that angle, they're always used reasonably
> much. But
> creating a bunch of barely used contexts does have the potential for
> lots of
> memory being wasted at the end of a page and on freelists.  It might
> be ok as
> far was what you proposed in the above email, I haven't analyzed that
> in depth
> yet.

Melih raised similar concerns. The new contexts that my patch created
were CatCacheContext, RelCacheContext, SPICacheContext,
PgOutputContext, PlanCacheContext, TextSearchCacheContext, and
TypCacheContext.

Those are all created lazily, so you need to at least be using the
relevant feature before it has any cost (with the exception of the
first two).
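
For concreteness, the lazy-creation pattern looks roughly like this (an
illustrative sketch, not the exact code in my patch; using
ALLOCSET_SMALL_SIZES here is just one way to keep the initial footprint
small):

static MemoryContext SPICacheContext = NULL;

static MemoryContext
GetSPICacheContext(void)
{
    /* created only on first use, so unused features pay nothing */
    if (unlikely(SPICacheContext == NULL))
        SPICacheContext = AllocSetContextCreate(CacheMemoryContext,
                                                "SPICacheContext",
                                                ALLOCSET_SMALL_SIZES);
    return SPICacheContext;
}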

> > I agree with others that we should look at changing the initial
> > size or
> > type of the contexts, but that should be a separate commit.
> 
> It needs to be done close together though, otherwise we'll increase
> the
> new-connection-memory-usage of postgres measurably.

I don't have a strong opinion here; that was a passing comment. But I'm
curious: why would it increase the per-connection memory usage much to
just have a couple of new memory contexts?

> I've previously proposed creating a type of memory context that's
> intended for
> places where we never expect to allocate much which allocates from
> either a
> superior memory context or just from the system allocator and tracks
> memory
> via linked lists.

Why not just use ALLOCSET_SMALL_SIZES?


Regards,
Jeff Davis





Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-11-01 Thread Jacob Champion
Hi all,

Here's a v4, with a separate wait event for each location. (I could
use some eyes on the specific phrasing I've chosen for each one.)

On Sun, Sep 1, 2024 at 5:10 PM Michael Paquier  wrote:
> Could it be more transparent to use a "startup" or
> "starting"" state instead that gets also used by pgstat_bestart() in
> the case of the patch where !pre_auth?

Done. (I've used "starting".)

On Mon, Sep 9, 2024 at 10:30 PM Michael Paquier  wrote:
> A new category would be more adapted.  IPC is not adapted because are
> not waiting for another server process.  Perhaps just use a new
> "Authentication" class, as in "The server is waiting for an
> authentication operation to complete"?

Added a new "Auth" class (to cover both authn and authz during
startup), plus documentation.

On Wed, Sep 11, 2024 at 4:42 PM Michael Paquier  wrote:
> Setting up twice the structure as the patch proposes is kind of
> a weird concept, but it feels to me that we should split that and set
> the fields in the pre-auth step and ignore the irrelevant ones, then
> complete the rest in a second step.

The more I look at this, the more uneasy I feel about the goal. Best I
can tell, the pre-auth step can't ignore irrelevant fields, because
they may contain junk from the previous owner of the shared memory. So
if we want to optimize, we can only change the second step to skip
fields that were already filled in by the pre-auth step.

That has its own problems: not every backend type uses the pre-auth
step in the current patch. Which means a bunch of backends that don't
benefit from the two-step initialization nevertheless have to either
do two PGSTAT_BEGIN_WRITE_ACTIVITY() dances in a row, or else we
duplicate a bunch of the logic to make sure they maintain the same
efficient code path as before.

Finally, if we're okay with all of that, future maintainers need to be
careful about which fields get copied in the first (preauth) step, the
second step, or both. GSS, for example, can be set up during transport
negotiation (first step) or authentication (second step), so we have
to duplicate the logic there. SSL is currently first-step-only, I
think -- but are we sure we want to hardcode the assumption that cert
auth can't change any of those parameters after the transport has been
established? (I've been brainstorming ways we might use TLS 1.3's
post-handshake CertificateRequest, for example.)

So before I commit to this path, I just want to double-check that all
of the above sounds good and non-controversial. :)

--

In the meantime, is anyone willing and able to commit 0001 and/or 0002?

Thanks!
--Jacob
1:  8f9315949e = 1:  64289b97e5 BackgroundPsql: handle empty query results
2:  7f404f5ee8 ! 2:  18a9531a25 Test::Cluster: let background_psql() work 
asynchronously
@@ src/test/perl/PostgreSQL/Test/Cluster.pm: connection.
  
  =cut
 @@ src/test/perl/PostgreSQL/Test/Cluster.pm: sub background_psql
-   '-');
+   '-XAtq', '-d', $psql_connstr, '-f', '-');
  
$params{on_error_stop} = 1 unless defined $params{on_error_stop};
 +  $params{wait} = 1 unless defined $params{wait};
3:  a7a7551d09 ! 3:  c8071f91d8 pgstat: report in earlier with 
STATE_AUTHENTICATING
@@ Metadata
 Author: Jacob Champion 
 
  ## Commit message ##
-pgstat: report in earlier with STATE_AUTHENTICATING
+pgstat: report in earlier with STATE_STARTING
 
-Add pgstat_bestart_pre_auth(), which reports an 'authenticating' state
-while waiting for backend initialization and client authentication to
+Add pgstat_bestart_pre_auth(), which reports a 'starting' state while
+waiting for backend initialization and client authentication to
 complete. Since we hold a transaction open for a good amount of that,
 and some authentication methods call out to external systems, having a
 pg_stat_activity entry helps DBAs debug when things go badly wrong.
 
+ ## doc/src/sgml/monitoring.sgml ##
+@@ doc/src/sgml/monitoring.sgml: postgres   27093  0.0  0.0  30096  2752 ? 
   Ss   11:34   0:00 postgres: ser
+Current overall state of this backend.
+Possible values are:
+
++
++ 
++  starting: The backend is in initial startup. 
Client
++  authentication is performed during this phase.
++ 
++
+ 
+ 
+   active: The backend is executing a query.
+
  ## src/backend/utils/activity/backend_status.c ##
 @@ src/backend/utils/activity/backend_status.c: static int 
localNumBackends = 0;
  static MemoryContext backendStatusSnapContext;
@@ src/backend/utils/activity/backend_status.c: pgstat_bestart(void)
  #endif
  
 -  lbeentry.st_state = STATE_UNDEFINED;
-+  lbeentry.st_state = pre_auth ? STATE_AUTHENTICATING : STATE_UNDEFINED;
++  lbeentry.st_state = pre_auth ? STATE_START

Re: Using Expanded Objects other than Arrays from plpgsql

2024-11-01 Thread Michel Pelletier
On Fri, Nov 1, 2024 at 3:20 PM Tom Lane  wrote:

> Michel Pelletier  writes:
>
> Well, you shouldn't be using PERFORM.  Not only does it not do the
> right thing, but it's not optimized for expanded objects at all:
> they'll get flattened both on the way into the statement and on
> the way out.  Try it with
>
>  graph := set_element(graph, 1, 1, 1);
>  RETURN nvals(graph);
>

Ah my bad, you mentioned that already and I missed it, here's the two
backtraces with the assignment:

https://gist.githubusercontent.com/michelp/20b917686149d482be2359569845f232/raw/ca8349ae4b0469674b4b2c77f34c5a02553583a9/gistfile1.txt


>
> regards, tom lane
>


Re: Using Expanded Objects other than Arrays from plpgsql

2024-11-01 Thread Michel Pelletier
On Fri, Nov 1, 2024 at 3:27 PM Tom Lane  wrote:

> Michel Pelletier  writes:
>
> Here is a v1 patch series that does the first part of what we've been
> talking about, which is to implement the new optimization rule for
> the case of a single RHS reference to the target variable.  I'm
> curious to know if it helps you at all by itself.  You'll definitely
> also need commit 534d0ea6c, so probably applying these on our git
> master branch would be the place to start.
>

I'll apply these tonight and get back to you asap.  There are many
functions in my API that take only one expanded RHS argument, so I'll look
for some cases where your changes reduce expansions when I run my tests.

Thank you!

-Michel


PG does not support one function of its extension pg_hint_plan

2024-11-01 Thread 李奇隆
Hi, all hackers:

In the GitHub repository for PostgreSQL’s pg_hint_plan extension, there is an 
issue where the generated join order does not match the assigned join order. 
After reviewing the source code, I found that this inconsistency with input 
hints is due to PostgreSQL’s implementation and is not a bug in pg_hint_plan.

PostgreSQL with pg_hint_plan supports disabling certain operators (e.g., hash 
join, seq scan) by setting pg parameters like “set enable_hashjoin = false”. 
This setting causes PostgreSQL to add a high disable_cost (e.g., 1e10) to the 
estimated cost of the hash join operator, effectively preventing the planner 
from selecting hash joins due to the inflated cost. Additionally, pg_hint_plan 
supports enforcing specific join orders. To do this, pg_hint_plan disables all 
join algorithms when it encounters inconsistent join orders, by adding the 
disable_cost to each join operator. As a result, only the assigned join order 
will be selected. This is the mechanism behind pg_hint_plan.

Let's now take an example from the GitHub issue to demonstrate this problem:

Here is a query with pg_hint:

/*+
Leading((rt (it ((n (chn (mc (mi (t (ci an)) cn
HashJoin(ci an)
NestLoop(ci an t)
NestLoop(ci an t mi)
NestLoop(ci an t mi mc)
NestLoop(ci an t mi mc chn)
NestLoop(ci an t mi mc chn n)
NestLoop(ci an t mi mc chn n cn)
NestLoop(ci an t mi mc chn n cn it)
NestLoop(ci an t mi mc chn n cn it rt)
*/
EXPLAIN (FORMAT TEXT)
SELECT MIN(n.name) AS voicing_actress,
   MIN(t.title) AS voiced_movie
FROM aka_name AS an,
 char_name AS chn,
 cast_info AS ci,
 company_name AS cn,
 info_type AS it,
 movie_companies AS mc,
 movie_info AS mi,
 name AS n,
 role_type AS rt,
 title AS t
WHERE ci.note IN ('(voice)',
  '(voice: Japanese version)',
  '(voice) (uncredited)',
  '(voice: English version)')
  AND cn.country_code ='[us]'
  AND it.info = 'release dates'
  AND mc.note IS NOT NULL
  AND (mc.note LIKE '%(USA)%'
   OR mc.note LIKE '%(worldwide)%')
  AND mi.info IS NOT NULL
  AND (mi.info LIKE 'Japan:%200%'
   OR mi.info LIKE 'USA:%200%')
  AND n.gender ='f'
  AND n.name LIKE '%Ang%'
  AND rt.role ='actress'
  AND t.production_year BETWEEN 2005 AND 2009
  AND t.id = mi.movie_id
  AND t.id = mc.movie_id
  AND t.id = ci.movie_id
  AND mc.movie_id = ci.movie_id
  AND mc.movie_id = mi.movie_id
  AND mi.movie_id = ci.movie_id
  AND cn.id = mc.company_id
  AND it.id = mi.info_type_id
  AND n.id = ci.person_id
  AND rt.id = ci.role_id
  AND n.id = an.person_id
  AND ci.person_id = an.person_id
  AND chn.id = ci.person_role_id;

The hint specifies a join order (rt (it ((n (chn (mc (mi (t (ci an)) cn))), 
but the generated join order is (rt (it ((n ((mc (mi ((ci an) t))) chn)) cn))). 
Here, PostgreSQL generates sub-join order ((ci an) t) instead of the assigned 
sub-join order (t (ci an)), and ((mc (mi ((ci an) t))) chn) instead of (chn (mc 
(mi ((ci an) t. This discrepancy arises because PostgreSQL estimates 
operator costs in two phases. In the first phase, it filters out paths that are 
obviously suboptimal based on estimated costs; however, it does not factor in 
disable_cost for disabled operators in this phase, only doing so in the second 
phase. As a result, while  (t (ci an)) would use a regular nested loop join 
with a sequential scan on t,  ((ci an) t) uses an index-based nested loop join 
with an index scan on t, which is significantly faster. Consequently, (t (ci 
an)) is filtered out after the first phase of cost estimation. The same 
reasoning applies to (chn (mc (mi ((ci an) t.

In the following example, by forcing PostgreSQL to access relations t and chn 
with a sequential scan, PostgreSQL generates the assigned join order. This is 
because forcing a sequential scan for t and chn prevents PostgreSQL from 
considering index-based nested loop joins for them.

/*+
SeqScan(t) SeqScan(chn)
Leading((rt (it ((n (chn (mc (mi (t (ci an)) cn
HashJoin(ci an)
NestLoop(ci an t)
NestLoop(ci an t mi)
NestLoop(ci an t mi mc)
NestLoop(ci an t mi mc chn)
NestLoop(ci an t mi mc chn n)
NestLoop(ci an t mi mc chn n cn)
NestLoop(ci an t mi mc chn n cn it)
NestLoop(ci an t mi mc chn n cn it rt)
*/
EXPLAIN (FORMAT TEXT)
SELECT MIN(n.name) AS voicing_actress,
   MIN(t.title) AS voiced_movie
FROM aka_name AS an,
 char_name AS chn,
 cast_info AS ci,
 company_name AS cn,
 info_type AS it,
 movie_companies AS mc,
 movie_info AS mi,
 name AS n,
 role_type AS rt,
 title AS t
WHERE ci.note IN ('(voice)',
  '(voice: Japanese version)',
  '(voice) (uncredited)',
  '(voice: English version)')
  AND cn.country_code ='[us]'
  AND it.info = 'release dates'
  AND mc.note IS NOT NULL
  AND (mc.note LIKE '%(USA)%'
   OR mc.note LIKE '%(worldwide)%')
  AND mi.info IS NOT NULL
  AND (mi.info LIKE 'Japan:%200%'
   OR mi.info LIKE 'USA:%

Re: PG does not support one function of its extension pg_hint_plan

2024-11-01 Thread Bruce Momjian
On Fri, Nov  1, 2024 at 11:13:09AM +0800, 李奇隆 wrote:
> HI, all hackers:
> 
> 
> In the GitHub repository for PostgreSQL’s pg_hint_plan extension, there is an
> issue where the generated join order does not match the assigned join order.
> After reviewing the source code, I found that this inconsistency with input
> hints is due to PostgreSQL’s implementation and is not a bug in pg_hint_plan.

Just to clarify, the bug is not in pg_hint_plan but in the fact that the
Postgres server ignores "disable_cost of disabled operators in the
initial phase of cost estimation," right?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"




Re: Using read stream in autoprewarm

2024-11-01 Thread Andrey M. Borodin



> On 1 Nov 2024, at 12:51, Nazir Bilal Yavuz  wrote:
> 
> I am not
> sure whether 'BufferStrategyControl.lastFreeBuffer -
> BufferStrategyControl.firstFreeBuffer' is safe to use.

Ugh... it will work. But it seems to me to be too dirty a hack. There's no
scalable way to know the size of a free list.
Let's just comment that we might read some more buffers if the database does
not fit into memory?
Alternatively, we can count the size of the free list at the start.


Best regards, Andrey Borodin.



Re: AIO writes vs hint bits vs checksums

2024-11-01 Thread Andres Freund
Hi,

On 2024-10-30 09:58:30 -0400, Andres Freund wrote:
> On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> > Could we put the overhead on the FlushBuffer()
> > instead?
> >
> > How about something like this:
> >
> > To set hint bits:
> >
> > 1. Lock page in SHARED mode.
> > 2. Read BM_IO_IN_PROGRESS
> > 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> > 4. Unlock page
> >
> > To flush a buffer:
> >
> > 1. Lock page in SHARED mode
> > 2. Set BM_IO_IN_PROGRESS
> > 3. Read the lock count on the buffer lock, to see if we're the only locker.
> > 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
> > immediately downgrade back to share mode.
> > 5. calculate CRC, flush the buffer
> > 6. Clear BM_IO_IN_PROGRESS and unlock page.
> >
> > This goes back to the idea of adding LWLock support for this, but the amount
> > of changes could be pretty small. The missing operation we don't have today
> > is reading the share-lock count on the lock in step 3. That seems simple to
> > add.
>
> I've played around with a bunch of ideas like this. There are two main
> reasons I didn't like them that much in the end:
>
> 1) The worst case latency impacts seemed to make them not that
>interesting. A buffer that is heavily contended with share locks might not
>get down to zero share lockers for quite a while. That's not a problem for
>individual FlushBuffer() calls, but it could very well add up to a decent
>sized delay for something like a checkpoint that has to flush a lot of
>buffers.
>
>Waiting for all pre-existing share lockers is easier said than done. We
>don't record the acquisition order anywhere and a share-lock release won't
>wake anybody if the lockcount doesn't reach zero. Changing that wouldn't
>exactly be free and the cost would be born by all lwlock users.
>
> 2) They are based on holding an lwlock. But it's actually quite conceivable
>that we'd want to set something hint-bit-like without any lock, as we
>e.g. currently do for freespace/. That would work with something like the
>approach I chose here, but not if we rely on lwlocks.
>
>E.g. with a bit of work we could actually do sequential scans without
>blocking concurrent buffer modifications by carefully ordering when
>pd_lower is modified and making sure that tuple header fields are written
>in a sensible order.

I still don't like this idea a whole lot - but perhaps we could reduce the
overhead of my proposal some, to get closer to yours. When setting hint bits
for many tuples on a page the overhead of my approach is negligible, but when
doing it for individual tuples it's a bit less negligible.

We can reduce the efficiency difference substantially by adding a bufmgr.c API
that sets hints on a page. That function can set the hint bit while holding the
buffer header lock, and therefore doesn't need to set BM_SETTING_HINTS and
thus also doesn't need to do resowner.c accounting.
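
Roughly the shape of such a helper, just to illustrate the idea - the name,
signature and the exact interlock are hand-waved here, and local buffers are
ignored for brevity:

/*
 * Set hint bits in a 16-bit flags word on a buffered page.  Because the
 * update happens under the buffer header spinlock, no BM_SETTING_HINTS
 * flag and no resowner accounting are needed.
 */
void
BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 bits)
{
    BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
    uint32      buf_state;

    buf_state = LockBufHdr(bufHdr);
    /* skip if the page is being written out right now */
    if (!(buf_state & BM_IO_IN_PROGRESS))
        *ptr |= bits;
    UnlockBufHdr(bufHdr, buf_state);
}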

To see the worst case overhead, I
a) disabled the "batch" optimization
b) disabled checksums, as they would otherwise hide small efficiency
   differences
c) used an unlogged table

and measured the performance difference for a previously-unhinted sequential
scan of a narrow table that immediately discards all tuples due to OFFSET -
afaict the worst case for the proposed new behaviour.

Previously this was 30.8% slower than master. Now it's only 1.9% slower.

With the batch optimization enabled, the patchset is 7.5% faster.


I also looked at the performance impact on scans that cannot use the batched
approach. The worst case I could think of was a large ordered indexscan of a
previously unhinted table.

For an IOS, the performance difference is a slowdown of 0.65%.

But the difference being so small is partially just due to IOS being
considerably slower than a plain index scan when all tuples need to be fetched
from the table (we need to address that...). Forcing a non-IOS index scan
using enable_indexonlyscan, I get a slowdown of 5.0%.


Given that this is an intentionally pathological workload - there's no IO /
the workload is CPU bound, yet the data is only ever accessed once, with a
query that doesn't ever actually look at the data - I'm quite happy with that.

As soon as I make it a query that actually uses the data, the difference
vanishes in the noise.

Greetings,

Andres Freund




Re: Count and log pages set all-frozen by vacuum

2024-11-01 Thread Alastair Turner
On Thu, 31 Oct 2024 at 18:51, Melanie Plageman 
wrote:

>
> Would it also be useful to have the number set all-visible? It seems
> like if we added a new line prefixed with visibility map, it ought to
> include all-visible and all-frozen then.
> Something like this:
>  visibility map: %u pages set all-visible, %u pages set all-frozen.
>
> I find it more confusing to say "up to X may have been removed from
> the table." It's unclear what that means -- especially since we
> already have "pages removed" in another part of the log message.
>

Yeah, on looking at it again, that does seem to make things worse.

We actually could call visibilitymap_count() at the beginning of the
> vacuum and then log the difference between that and the results of
> calling it after finishing the vacuum. We currently call it after
> truncating the table anyway. That won't tell you how many pages were
> set all-frozen by this vacuum, as pages could have been unfrozen by
> DML that occurred after the page was vacuumed. It might be useful in
> addition to the line about the visibility map.
>
> This is somewhat in conflict with Robert and Peter's points about how
> autovacuum logging should be about what this vacuum did. But, we do
> have lines that talk about the before and after values:
>
> new relfrozenxid: 748, which is 3 XIDs ahead of previous value
>
> So, we could do something like:
> visibility map before: %u pages all-visible, %u pages all-frozen
> visibility map after: %u pages all-visible, %u pages all-frozen
> or
> visibility map after: %u pages all-visible (%u more than before), %u
> pages all-frozen (%u more than before)
>
> I still prefer adding how many pages were set all-frozen by this vacuum,
> though.
>
I also like the idea of showing how many pages were set all-frozen by this
vacuum (which meets Robert's requirement for figuring out if a vacuum
operation did anything useful). The values for pages marked all visible and
all frozen can fluctuate for a number of reasons, even, as you point out,
from concurrent activity during the vacuum. This is different from
relfrozenxid which is a kind of high water mark. So I think the output
styles can reasonably be different.

visibility map: %u pages all-visible (%u marked by this operation), %u
pages all-frozen (%u marked by this operation)

seems to support everyone's requirements
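
For concreteness, the line could be emitted along these lines (illustrative
only; the counter names are invented, not taken from the patch):

appendStringInfo(&buf,
                 _("visibility map: %u pages all-visible (%u marked by this operation), %u pages all-frozen (%u marked by this operation)\n"),
                 new_rel_allvisible, vm_pages_set_all_visible,
                 new_rel_allfrozen, vm_pages_set_all_frozen);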


Re: Eager aggregation, take 3

2024-11-01 Thread Robert Haas
On Fri, Nov 1, 2024 at 2:21 AM Richard Guo  wrote:
> ... an aggregated row from the partial
> aggregation matches the other side of the join if and only if each row
> in the partial group does, thereby ensuring that all rows in the same
> partial group have the same 'destiny'.

Ah, I really like this turn of phrase! I think it's clearer and
simpler than what I said. And I think it implies that we don't need to
explicitly deduce surrogate grouping keys. For example if we have  A
JOIN B JOIN C JOIN D JOIN E JOIN F, grouped by columns from A, we
don't need to work out surrogate grouping keys for B and then C and
then D and then E and then F. We can just look at F's join clauses and
that tells us how to group, independent of anything else.

But is there any hole in that approach? I think the question is
whether the current row could be used in some way that doesn't show up
in the join clauses. I can't think of any way for that to happen,
really. I believe that any outerjoin-delayed quals will show up as
join clauses, and stuff like ORDER BY and HAVING will happen after the
aggregation (at least logically) so it should be fine. Windowed
functions and ordered aggregates may be a blocker to the optimization,
though: if the window function needs access to the unaggregated rows,
or even just needs to know how many rows there are, then we'd better
not aggregate them before the window function runs; and if the
aggregate is ordered, we can only partially aggregate the data if it
is already ordered in a way that is compatible with the final, desired
ordering. Another case we might need to watch out for is RLS. RLS
wants to apply all the security quals before any non-leakproof
functions, and pushing down the aggregation might push an aggregate
function past security quals. Perhaps there are other cases to worry
about as well; this is all I can think of at the moment.

But regardless of those kinds of cases, the basic idea that we want
the partially aggregate rows to join if and only if the unaggregated
rows would have joined seems exactly correct to me, and that provides
theoretical justification for deriving the surrogate grouping key
directly from the join quals. Woot!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Changing shared_buffers without restart

2024-11-01 Thread Dmitry Dolgov
> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
>
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> changing shared memory mapping layout. Any feedback is appreciated.

It was pointed out to me, that earlier this year there was a useful
discussion about similar matters "PGC_SIGHUP shared_buffers?" [1]. From
what I see the patch series falls into the "re-map" category in that
thread.

[1]: 
https://www.postgresql.org/message-id/flat/CA%2BTgmoaGCFPhMjz7veJOeef30%3DKdpOxgywcLwNbr-Gny-mXwcg%40mail.gmail.com




Re: Add missing tab completion for ALTER TABLE ADD COLUMN IF NOT EXISTS

2024-11-01 Thread Kirill Reshke
On Fri, 1 Nov 2024 at 15:43, Kirill Reshke  wrote:
>
> Hi hackers!
>
> I have been working on cloudberry/greenplum tab completion features,
> and I spotted that the postgres version of psql does not tab-complete
> ALTER TABLE ADD COLUMN IF NOT EXISTS pattern.
> So, I propose to support this.
>
> --
> Best regards,
> Kirill Reshke

Hi!
I found another pattern which has no tab completion.

-- 
Best regards,
Kirill Reshke


v2-0002-Enhance-tab-completion-to-ALTER-TYPE-ADD-DROP-ATT.patch
Description: Binary data


v2-0001-Add-missing-psql-tab-completion-for-som-ALTER-TAB.patch
Description: Binary data


Re: Collation & ctype method table, and extension hooks

2024-11-01 Thread Jeff Davis
On Fri, 2024-11-01 at 14:08 +0100, Andreas Karlsson wrote:
> > Agreed -- a lot of work has gone into optimizing the regex code,
> > and we
> > don't want a perf regression there. But I'm also not sure exactly
> > which
> > kinds of tests I should be running for that.
> 
> I think we should at least try to find the worst case to see how big
> the 
> performance hit for that is. And then after that try to figure out a 
> more typical case benchmark.

What I had in mind was:

  * a large table with a single ~100KiB text field
  * a scan with a case insensitive regex that uses some character
classes

Does that sound like a worst case?

> The painful part was mostly just a reference to that without a
> catalog 
> table where new providers can be added we would need to add
> collations 
> for our new custom provider on some already existing provider and
> then 
> do for example some pattern matching on the name of the new
> collation. 
> Really ugly but works.

To add a catalog table for the locale providers, the main challenge is
around the database default collation and, relatedly, initdb. Do you
have some ideas around that?

Regards,
Jeff Davis





Re: Considering fractional paths in Append node

2024-11-01 Thread Nikita Malakhov
Hi!

I've checked set_subquery_pathlist(), and would rather say that it has
a different purpose and I'd better not compare these functions directly.

generate_orderedappend_paths() was introduced 6 years ago, and the
commit that considers tuple_fraction, authored by Tomas Vondra,
was suggested by Tom Lane himself. Also, generate_orderedappend_paths()
is directly used in add_paths_to_append_rel(), so I don't see any
reasons not to use tuple_fraction for the same purpose as it is used
in generate_orderedappend_paths().

Maybe let's wait and see what other reviewers say?

Thank you for your opinion and interest!

I've corrected the failing test and created a commitfest entry for the patch:
https://commitfest.postgresql.org/51/5361/

Please check.

On Tue, Oct 29, 2024 at 2:38 AM Andy Fan  wrote:

> Nikita Malakhov  writes:
>
> > Hi,
> >
> > Andy, thank you, I've checked this thread out along with run-time
> > partition pruning.
> I'm not sure of the relationship between this topic and run-time
> partition pruning..
>
> > I've spent some time hovering over the tuple_fraction field usage and
> would disagree
> > with you on this topic - it is already used on the RelOptInfo level
> later on, in
> > generate_orderedappend_paths()
>
> Looks you are right that root->tuple_fraction has been used on RelOptInfo
> level in the generate_orderedappend_paths(). But we also tried to
> use not in the RelOptInfo level such as set_subquery_pathlist. See..
>
> """
> /*
>  * We can safely pass the outer tuple_fraction down to the subquery if the
>  * outer level has no joining, aggregation, or sorting to do. Otherwise
>  * we'd better tell the subquery to plan for full retrieval. (XXX This
>  * could probably be made more intelligent ...)
>  */
> """
>
> I'm not sure the "more intelligent" approach would be to just use it directly.
>
> So I'm not saying we can't do this, just that the facts are:
> (a).  root->tuple_fraction is not exactly the same as RelOptInfo's
> tuple_fraction.
> (b).  We have used root->tuple_fraction in RelOptInfo in some cases and
> also tried to not use it in some other cases (and only use it under some
> situations similar to what I did before).
>
> It looks like different committers have different opinions on this.
>
> --
> Best Regards
> Andy Fan
>
>
>
>

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


v2-0001-append-limit-v1.patch
Description: Binary data


Re: ActiveState Perl is not valid anymore to build PG17 on the Windows 10/11 platforms, So Documentation still suggesting it should be updated

2024-11-01 Thread Bruce Momjian
On Mon, Oct 28, 2024 at 01:07:16PM +0100, Daniel Gustafsson wrote:
> > On 17 Oct 2024, at 00:35, Bruce Momjian  wrote:
> > On Tue, Jul 16, 2024 at 08:23:11AM -0400, Andrew Dunstan wrote:
> 
> >> See  https://postgr.es/m/4acddcd4-1c08-44e7-ba60-cab102259...@dunslane.net
> >> 
> >> I agree we should fix the docco.
> > 
> > I have written the attached patch to make these changes.
> 
> +1 for applying backpatched to at least 17 but possibly further down judging 
> by
> the linked threads.

I added a URL for Magicsplat Tcl and backpatched to 12.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"




Re: Having problems generating a code coverage report

2024-11-01 Thread Aleksander Alekseev
Hi everyone,

> I upgraded to Meson 1.6.0 and Lcov 2.0-1. Unfortunately it doesn't work:
> [...]
> I didn't investigate further.

Lcov 2.1 and 2.2beta don't work either.

-- 
Best regards,
Aleksander Alekseev




Re: Collation & ctype method table, and extension hooks

2024-11-01 Thread Andreas Karlsson

On 10/26/24 12:42 AM, Jeff Davis wrote:

On Thu, 2024-10-24 at 10:05 +0200, Andreas Karlsson wrote:

Why is there no pg_locale_builtin.c?


Just that it would be a fairly small file, but I'm fine with doing
that.


I think adding such a small file would make life easier for people new 
to the collation part of the code base. It would be a nice symmetry 
between collation providers and where code for them can be found.



Why are casemap and ctype_methods not the same struct? They seem very
closely related.


The code impact was in fairly different places, so it seemed like a
nice way to break it out. I could combine them, but it would be a
fairly large patch.


For me combining them would make the intention of the code easier to 
understand since aren't the casemap functions just a set of "ctype_methods"?



This commit makes me tempted to handle the ctype_is_c logic for
character classes also in callbacks and remove the if in functions
like
pg_wc_ispunct(). But this is something that would need to be
benchmarked.


That's a good idea. The reason collate_is_c is important is because
there are quite a few caller-specific optimizations, but that doesn't
seem to be true of ctype_is_c.


Yeah, that was my thought too, but I have not confirmed it.


I wonder if the bitmask idea isn't terrible for the branch predictor
and
that me may want one function per character class, but this is yet
again
something we need to benchmark.


Agreed -- a lot of work has gone into optimizing the regex code, and we
don't want a perf regression there. But I'm also not sure exactly which
kinds of tests I should be running for that.


I think we should at least try to find the worst case to see how big the 
performance hit for that is. And then after that try to figure out a 
more typical case benchmark.



= v6-0011-Introduce-hooks-for-creating-custom-pg_locale_t.patch

Looks good but seems like a quite painful API to use.


How is it painful and can we make it better?


The painful part was mostly just a reference to the fact that, without a
catalog table where new providers can be added, we would need to add
collations for our new custom provider under some already existing provider
and then do, for example, some pattern matching on the name of the new
collation. Really ugly, but it works.


I am thinking of implementing ICU4x as an external extension to try out 
the hook, but for the in-core contrib module we likely want to use 
something which does not require an external dependency. Or what do you 
think?


Andreas





Re: Count and log pages set all-frozen by vacuum

2024-11-01 Thread Masahiko Sawada
On Fri, Nov 1, 2024 at 5:12 AM Alastair Turner  wrote:
>
> On Thu, 31 Oct 2024 at 18:51, Melanie Plageman  
> wrote:
>>
>>
>> Would it also be useful to have the number set all-visible? It seems
>> like if we added a new line prefixed with visibility map, it ought to
>> include all-visible and all-frozen then.
>> Something like this:
>>  visibility map: %u pages set all-visible, %u pages set all-frozen.
>>
>> I find it more confusing to say "up to X may have been removed from
>> the table." It's unclear what that means -- especially since we
>> already have "pages removed" in another part of the log message.
>
>
> Yeah, on looking at it again, that does seem to make things worse.
>
>> We actually could call visibilitymap_count() at the beginning of the
>> vacuum and then log the difference between that and the results of
>> calling it after finishing the vacuum. We currently call it after
>> truncating the table anyway. That won't tell you how many pages were
>> set all-frozen by this vacuum, as pages could have been unfrozen by
>> DML that occurred after the page was vacuumed. It might be useful in
>> addition to the line about the visibility map.
>>
>> This is somewhat in conflict with Robert and Peter's points about how
>> autovacuum logging should be about what this vacuum did. But, we do
>> have lines that talk about the before and after values:
>>
>> new relfrozenxid: 748, which is 3 XIDs ahead of previous value
>>
>> So, we could do something like:
>> visibility map before: %u pages all-visible, %u pages all-frozen
>> visibility map after: %u pages all-visible, %u pages all-frozen
>> or
>> visibility map after: %u pages all-visible (%u more than before), %u
>> pages all-frozen (%u more than before)
>>
>> I still prefer adding how many pages were set all-frozen by this vacuum, 
>> though.
>>
> I also like the idea of showing how many pages were set all-frozen by this 
> vacuum (which meets Robert's requirement for figuring out if a vacuum 
> operation did anything useful).

+1

> visibility map: %u pages all-visible (%u marked by this operation), %u pages 
> all-frozen (%u marked by this operation)

Having "marked by this operation" twice seems to be redundant. How
about something like the output below?

visibility map: %u pages set all-visible (%u pages total), %u pages
set all-frozen (%u pages total)

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Count and log pages set all-frozen by vacuum

2024-11-01 Thread Robert Haas
On Fri, Nov 1, 2024 at 12:23 PM Masahiko Sawada  wrote:
> Having "marked by this operation" twice seems to be redundant. How
> about something like the output below?
>
> visibility map: %u pages set all-visible (%u pages total), %u pages
> set all-frozen (%u pages total)

For me, the meaning of that isn't clear.

I think this is the wrong direction, anyway. If someone says "hey, we
should add X to the output" and someone else says "we should add Y
instead," it doesn't follow that the right thing to do is to add both.
I happen to think that the right answer is X, both because of my
understanding of the purpose of this log message, and also because X
is in the service of Melanie's larger goal and Y is not. But I also
feel like bikeshedding the patch that somebody should have written
instead of reviewing the one they actually wrote is to be avoided. Of
course, sometimes there's no getting around the fact that the person
chose to do something that didn't really make sense, and then it's
reasonable to suggest alternatives. But here, what was actually done
does make sense and is the first choice of some people. What is
proposed can be added now, provided that it actually gets some review,
and the other thing can be added later if someone wants to do the work
and if no problems are discovered, but it isn't Melanie's job to add
data that isn't needed for her project.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Inquiry on Future Plans for Enhancements to PostgreSQL MVCC Model and Vacuum Process

2024-11-01 Thread Heikki Linnakangas

On 01/11/2024 14:56, 송영욱 wrote:

Hi hackers,

I’m currently studying the internals of PostgreSQL, specifically the 
vacuum process.


I've learned that PostgreSQL’s MVCC model has had issues in the past, 
such as XID wraparound and table bloating problems. It seems that these 
issues may have been mitigated by implementing features like autovacuum.


However, from my understanding, the current MVCC model seems to have 
more negative effects than positive ones. Is there any plan to modify 
the MVCC model in future versions?


There are no ongoing efforts for a big rewrite. If we were writing 
PostgreSQL from scratch, we'd probably do many things differently, but 
it's very difficult to change something so ingrained as MVCC. Even if 
you come up with a scheme that works better for 95% of users, that last 
5% of users will be unhappy, which is bad. That means either maintaining 
two systems forever, or throwing some users under the bus.


There are always efforts further to mitigate the issues though. See this 
thread 
https://www.postgresql.org/message-id/caj7c6tnd0bcnwu1smxtsfewk4xjgbep343vf+t+gq-a5s5h...@mail.gmail.com 
for an effort to switch to 64-bit XIDs, for example. That largely 
eliminates XID wraparound.


There was a notable project called zheap a few years ago to replace the 
heap implementation with a new table AM 
(https://wiki.postgresql.org/wiki/Zheap). AFAIK no one is working on 
that at the moment however. Alexander Korotkov's OrioleDb is also worth 
mentioning. It's an even bigger rewrite with various pros and cons 
however, which means it's unlikely to replace PostgreSQL's current MVCC.


My personal favorite in this area is TED 
(https://www.postgresql.org/message-id/55511D1F.7050902%40iki.fi). 
That's a pretty modest change to how MVCC currently works, so it won't 
solve all the issues, but it's a small enough change that I think 
there's a better chance of getting it done. However, it's just an idea; 
there has been zero work on actually implementing it.


At the end of the day, this is an open source project, so the future 
depends on what people decide to work on and submit patches for. The 
above are just the notable efforts that I'm aware of.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Changing the default random_page_cost value

2024-11-01 Thread Greg Sabino Mullane
On Thu, Oct 31, 2024 at 1:43 PM Bruce Momjian  wrote:

> > I am not a fan of this patch.  I don't see why _removing_ the
> magnetic
> > part helps because you then have no logic for any 1.2 was chosen.
> >
> > Okay, but we have no documented logic on why 4.0 was chosen either. :)
>
> Uh, we do, and it is in the docs:
>
> Random access to mechanical disk storage is normally much more
> expensive
> than four times sequential access.  However, a lower default is
> used
> (4.0) because the majority of random accesses to disk, such as
> indexed
> reads, are assumed to be in cache.  The default value can be
> thought of
> as modeling random access as 40 times slower than sequential, while
> expecting 90% of random reads to be cached.
>
> You surely saw this when you created the patch and removed the text.
>

Yes, I did, but there is no documentation to support those numbers. Is it
really 40 times slower? Which hard disks, in what year? Where did the 90%
come from, and is that really an accurate average for production systems
with multiple caching layers? I know Tom ran actual tests many years ago,
but at the end of the day, all of these numbers will vary wildly depending
on a host of factors, so we have to make some educated guesses.
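
(For reference, the arithmetic behind the quoted paragraph is simply

    random_page_cost = 40 * (1 - 0.90) = 4.0

so the default follows entirely from those two guessed inputs; change either
one and the 4.0 goes with it.)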

I guess my point was that my 1.2 (based on many years of tuning many client
systems) seems no less imprecise than 4.0 (based on actual tests many years
ago, with hardware that has changed a lot), for a default value that people
should tune for their specific system anyway.

Maybe these new tests people are asking for in this thread to determine a
better default rpc for SSDs can also help us determine a better rpc for HDs
as well.

Cheers,
Greg


Re: UUID v7

2024-11-01 Thread Andrey M. Borodin



> On 31 Oct 2024, at 23:04, Stepan Neretin  wrote:
> 
> 
> Firstly, I'd like to discuss the increased_clock_precision variable, which
> currently divides the timestamp into milliseconds and nanoseconds. However,
> this approach only approximates the extra bits for sub-millisecond
> precision, leading to imprecise timestamps in high-frequency UUID
> generation.
No, the timestamp is taken in nanoseconds; we keep a precision of 1/4096 of a 
ms. If you observe precision loss anywhere, let me know.

> 
> To address this issue, we could consider using a more accurate method for
> calculating the timestamp. For instance, we could utilize a higher
> resolution clock or implement a more precise algorithm to ensure accurate
> timestamps.

That's what we do.

> 
> Additionally, it would be beneficial to add validation checks for the
> interval argument. These checks could verify that the input interval is
> within reasonable bounds and that the calculated timestamp is accurate.
> Examples of checks could include verifying if the interval is too small,
> too large, or exceeds the maximum possible number of milliseconds and
> nanoseconds in a timestamp.

timestamptz_pl_interval() is already doing this.

> What do you think about these suggestions? Let me know your thoughts!

Thanks a lot for reviewing the patch!


> On 1 Nov 2024, at 10:33, Masahiko Sawada  wrote:
> 
> On Thu, Oct 31, 2024 at 9:53 PM Andrey M. Borodin  
> wrote:
>> 
>> 
>> 
>>> On 1 Nov 2024, at 03:00, Masahiko Sawada  wrote:
>>> 
>>> Therefore, if the
>>> system clock moves backward due to NTP, we cannot guarantee
>>> monotonicity and sortability. Is that right?
>> 
>> Not exactly. Monotonicity is ensured for a given backend. We make sure that 
>> timestamp is advanced at least for ~250ns forward on each UUID generation. 
>> 60 bits of time are unique and ascending for a given backend.
>> 
> 
> Thank you for your explanation. I now understand this code guarantees
> the monotonicity:
> 
> +/* minimum amount of ns that guarantees step of increased_clock_precision */
> +#define SUB_MILLISECOND_STEP (1000000/4096 + 1)
> +   ns = get_real_time_ns();
> +   if (previous_ns + SUB_MILLISECOND_STEP >= ns)
> +   ns = previous_ns + SUB_MILLISECOND_STEP;
> +   previous_ns = ns;
> 
> 
> I think that one of the most important parts in UUIDv7 implementation
> is which method (1, 2, or 3 described in RFC 9562) we use to guarantee
> the monotonicity. The current patch employs method 3 with the
> assumption that 12 bits of sub-millisecond information is available on
> most of the systems we support. However, as far as I tested, on MacOS,
> values returned by  clock_gettime(CLOCK_REALTIME) are only microsecond
> precision, meaning that we could waste some randomness. Has this point
> been considered?
> 

There was a thread "What is a typical precision of gettimeofday()?" [0]
There we found out that routines of instr_time.h are precise enough. On my 
machine (MacBook Air M3) I do not observe significant differences between 
CLOCK_MONOTONIC_RAW and CLOCK_REALTIME in pg_test_timing results.

CLOCK_MONOTONIC_RAW
x4mmm@x4mmm-osx bin % ./pg_test_timing 
Testing timing overhead for 3 seconds.
Per loop time including overhead: 15.30 ns
Histogram of timing durations:
  < us   % of total      count
     1     98.47856  193113929
     2      1.52039    2981452
     4      0.00025        485
     8      0.00062       1211
    16      0.00012        237
    32      0.00004         79
    64      0.00002         30
   128      0.00000          8
   256      0.00000          5
   512      0.00000          3
  1024      0.00000          1
  2048      0.00000          2

CLOCK_REALTIME
x4mmm@x4mmm-osx bin % ./pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 15.04 ns
Histogram of timing durations:
  < us   % of total      count
     1     98.49709  196477842
     2      1.50268    2997479
     4      0.00007        130
     8      0.00012        238
    16      0.00005         91
    32      0.00000          4
    64      0.00000          1

Thanks!


[0] 
https://www.postgresql.org/message-id/flat/212C2E24-32CF-400E-982E-A446AB21E8CC%40yandex-team.ru#c89fa36829b2003147b6ce72170b5342






Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread Bertrand Drouvot
Hi,

On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> On Fri, 1 Nov 2024 at 20:49, Michael Paquier  wrote:
> >
> > On Fri, Nov 01, 2024 at 07:44:22AM +, Bertrand Drouvot wrote:
> > > Worth to add a comment as to why pg_memory_is_all_zeros() should not
> > > be used here?
> >
> > I would not add one in bufpage.c, documenting that where
> > pg_memory_is_all_zeros() is defined may be more adapted.
> 
> The thought of having to write a comment to warn people not to use it
> for performance-critical things makes me think it might be better just
> to write a more optimal version of the function so we don't need to
> warn people.

Yeah, that's probably a good idea to write a more elaborate function.

> I looked around at the callers of the function I saw the
> following numbers of bytes being used for the length: 8192 (the one in
> question), 88, 32 and 112.
> 
> I don't know how performance-critical the final three of those are,
> but I imagine all apart from the 32-byte one might be better with a
> non-inlined and more optimised version of the function. The problem
> with inlining the optimised version is that it's more code to inline.

I agree that's more code to inline and contains multiple loops and branches.

For the last 3 callers, I think that a non-inlined version would still be
"cheap" compared to what leads to those code paths (stats increments).

> I've attached what I thought a more optimal version might look like in
> case anyone thinks making it better is a good idea.
> 

Thanks for the proposal!

I like the idea; I think it's worth adding a few comments, something like:

1 ===

+  while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)

Add a comment like "Checks bytes, byte by byte, until the pointer is aligned"?

2 ===

+  for (; p < aligned_end; p += sizeof(size_t))

Add a comment like "Multiple bytes comparison(s) at once"?

3 ===

+  while (p < end)
+  {

Add a comment like "Compare remaining bytes, byte by byte"?
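
Putting those together, a sketch of how the function could read with the
comments in place (this is my reading of the attached proposal, so the
attachment itself may differ in details):

static bool
pg_memory_is_all_zeros(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = &p[len];
    const unsigned char *aligned_end =
        (const unsigned char *) ((uintptr_t) end & ~(sizeof(size_t) - 1));

    /* Check bytes, byte by byte, until the pointer is aligned */
    while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
    {
        if (p == end)
            return true;
        if (*p++ != 0)
            return false;
    }

    /* Multiple bytes comparison(s) at once */
    for (; p < aligned_end; p += sizeof(size_t))
    {
        if (*(const size_t *) p != 0)
            return false;
    }

    /* Compare remaining bytes, byte by byte */
    while (p < end)
    {
        if (*p++ != 0)
            return false;
    }

    return true;
}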

4 ===

Out of curiosity I did test your proposal and it performs well (see [0]) for
the PageIsVerifiedExtended() case.


[0]: https://godbolt.org/z/Mdaxz5W7c

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: PG does not support one function of its extension pg_hint_plan

2024-11-01 Thread Tom Lane
Bruce Momjian  writes:
> On Fri, Nov  1, 2024 at 11:13:09AM +0800, 李奇隆 wrote:
>> In the GitHub repository for PostgreSQL’s pg_hint_plan extension, there is an
>> issue where the generated join order does not match the assigned join order.
>> After reviewing the source code, I found that this inconsistency with input
>> hints is due to PostgreSQL’s implementation and is not a bug in pg_hint_plan.

> Just to clarify, the bug is not in pg_hint_plan but in the fact that the
> Postgres server ignores "disable_cost of disabled operators in the
> initial phase of cost estimation," right?

We have never promised anything about supporting pg_hint_plan.

Having said that, this analysis is all obsolete in the wake of
commit e22253467.  Somebody (not me) would need to look into
whether a similar effect still exists with the new model for
disabling plan types.

Also, there's a highly relevant thread over at

https://www.postgresql.org/message-id/flat/CA%2BTgmoZY%2BbaV-T-5ifDn6P%3DL%3DaV-VkVBrPmi0TQkcEq-5Finww%40mail.gmail.com

It would probably be better to bring any conclusions to that
thread instead of starting a new one.

regards, tom lane




Re: Doc: typo in config.sgml

2024-11-01 Thread Bruce Momjian
On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
> On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
> > On 15.10.24 23:51, Bruce Momjian wrote:
> > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
> > > > Bruce Momjian  writes:
> > > > > Well, we can only use Latin-1, so the idea is that we will be explicit
> > > > > about specifying Latin-1 only as HTML entities, rather than letting
> > > > > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML 
> > > > > files
> > > > > if desired.
> > > > 
> > > > That policy would cause substantial problems with contributor names
> > > > in the release notes.  I agree with Peter that we don't need this.
> > > > Catching otherwise-invisible characters seems sufficient.
> > > 
> > > Uh, why can't we use HTML entities going forward?  Is that harder?
> > 
> > I think the question should be the other way around.  The entities are a
> > historical workaround for when encoding support and rendering support was
> > poor.  Now you can just type in the characters you want as is, which seems
> > nicer.
> 
> Yes, that does make sense, and if we fully supported Unicode, we could
> ignore all of this.

Patch applied to master --- no new UTF8 restrictions.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"




Re: Making error message more user-friendly with spaces in a URI

2024-11-01 Thread Greg Sabino Mullane
>
> $ psql "postgres://localhost:5432/postgres?application_name=a b"
> psql: error: trailing data found: "a b"
>

This works fine for me, and sets a space in the application_name string as
expected. Do you have a different example?

Cheers,
Greg


Re: Having problems generating a code coverage report

2024-11-01 Thread Aleksander Alekseev
Hi,

> Thanks for the hint. I'm using Meson 1.3.2. Although it's not ancient
> (Feb 2024) there is 1.6.0 available. I'll try upgrading and let you
> know the results.

I upgraded to Meson 1.6.0 and Lcov 2.0-1. Unfortunately it doesn't work:

```
genhtml: ERROR: no data for line:864, TLA:UNC,
file:/home/eax/projects/c/postgresql/src/bin/psql/psqlscanslash.l
(use "genhtml --ignore-errors unmapped ..." to bypass this error)
Traceback (most recent call last):
  File "/home/eax/.venv/bin/meson", line 8, in 
sys.exit(main())
 ^^
  File "/home/eax/.venv/lib/python3.12/site-packages/mesonbuild/mesonmain.py",
line 293, in main
return run(sys.argv[1:], launcher)
   ^^^
  File "/home/eax/.venv/lib/python3.12/site-packages/mesonbuild/mesonmain.py",
line 281, in run
return run_script_command(args[1], args[2:])
   ^
  File "/home/eax/.venv/lib/python3.12/site-packages/mesonbuild/mesonmain.py",
line 222, in run_script_command
return module.run(script_args)
   ^^^
  File 
"/home/eax/.venv/lib/python3.12/site-packages/mesonbuild/scripts/coverage.py",
line 208, in run
return coverage(options.outputs, options.source_root,
   ^^
  File 
"/home/eax/.venv/lib/python3.12/site-packages/mesonbuild/scripts/coverage.py",
line 148, in coverage
subprocess.check_call([genhtml_exe,
  File "/usr/lib/python3.12/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['genhtml', '--prefix',
'/home/eax/projects/c/postgresql/build', '--prefix',
'/home/eax/projects/c/postgresql', '--output-directory',
'/home/eax/projects/c/postgresql/build/meson-logs/coveragereport',
'--title', 'Code coverage', '--legend', '--show-details',
'--branch-coverage',
'/home/eax/projects/c/postgresql/build/meson-logs/coverage.info']'
returned non-zero exit status 1.
ninja: build stopped: subcommand failed.
```

If I add `--ignore-errors unmapped` as suggested and execute:

```
genhtml --ignore-errors unmapped --prefix
/home/eax/projects/c/postgresql/build --prefix
/home/eax/projects/c/postgresql --output-directory
/home/eax/projects/c/postgresql/build/meson-logs/coveragereport
--title 'Code coverage' --legend --show-details --branch-coverage
/home/eax/projects/c/postgresql/build/meson-logs/coverage.info
```

... I get:

```
...
Processing file src/pl/plpython/plpy_util.c
  lines=33 hit=22 functions=4 hit=4 branches=14 hit=5
Processing file src/include/catalog/index.h
  lines=10 hit=10 functions=2 hit=2
genhtml: ERROR: duplicate merge record src/include/catalog
```

I didn't investigate further.

-- 
Best regards,
Aleksander Alekseev




Inquiry on Future Plans for Enhancements to PostgreSQL MVCC Model and Vacuum Process

2024-11-01 Thread 송영욱
Hi hackers,

I’m currently studying the internals of PostgreSQL, specifically the vacuum
process.

I've learned that PostgreSQL’s MVCC model has had issues in the past, such
as XID wraparound and table bloating problems. It seems that these issues
may have been mitigated by implementing features like autovacuum.

However, from my understanding, the current MVCC model seems to have more
negative effects than positive ones. Is there any plan to modify the MVCC
model in future versions?

I’m still learning about PostgreSQL, so there may be some inaccuracies in
my understanding. If there are, please let me know.

Thank you


Re: Allowing pg_recvlogical to create temporary replication slots

2024-11-01 Thread Torsten Förtsch
On Wed, Oct 30, 2024 at 9:01 AM Peter Eisentraut 
wrote:

> On 27.10.24 13:37, Torsten Förtsch wrote:
> > The attached patch enables pg_recvlogical to create a temporary slot.
>
> I think you should explain a bit why you want to do that, what use case
> or scenario this is meant to address.
>

In my particular case I want to filter for messages sent by
pg_logical_emit_message(). I don't care much about getting ALL the changes.
If the replication slot disappears and a few changes are lost it does not
matter. So, a temporary rep slot makes more sense than creating one and
then having to make sure it is not retaining wal forever later.

I can imagine this also as a tool to monitor changes for a while and then
simply disconnect without the need to remove the slot.

Why am I interested in pg_logical_emit_message()? We have an application
with relatively complicated transactions involving multiple tables. Some of
them use pg_notify(). We also have synchronous replication. Sometimes I see
lock avalanches that can be traced to the "AccessExclusiveLock on object 0
of class 1262 of database 0". This lock is taken during commit when the
notification is inserted in the queue. After that the backend waits for the
confirmation by the sync replica. So, this lock effectively forces all
transactions that send notifications to commit in sequence.

Now, if the application uses pg_logical_emit_message() instead, then I
think there is no equivalent lock. I understand the semantics are a bit
different (timing) but close enough for my use case. Another advantage of
pg_logical_emit_message() is the ability to send them even if the
transaction is aborted.

I was in the process of experimenting with this idea and found that
pg_recvlogical can:
- only create the slot or
- create the slot and immediately use it
- try to create the slot and if the slot is already there use it

So, why not also allow it to create a temporary slot?


Re: Time to add a Git .mailmap?

2024-11-01 Thread Peter Eisentraut

On 01.11.24 12:53, Alvaro Herrera wrote:

On 2024-Oct-31, Daniel Gustafsson wrote:


When looking at our Git tree for a recent conference presentation I happened to
notice that we have recently gained duplicate names in the shortlog.  Not sure
if we care enough to fix that with a .mailmap, but if we do the attached diff
makes sure that all commits are accounted for a single committer entry.


LGTM.  I'd also add this line while at it:

Peter Eisentraut  

This takes care of all the duplicate "identities" in the history AFAICT.


I'm not sure if this is a good use of the mailmap feature.  If someone 
commits under  for a while and then later as 
, and the mailmap maps everything to the most 
recent one, that seems kind of misleading or unfair?  The examples on 
the gitmailmap man page all indicate that this feature is to correct 
accidental variations or obvious mistakes, but not to unify everything 
to the extent that it alters the historical record.






Re: proposal: schema variables

2024-11-01 Thread Laurenz Albe
On Tue, 2024-10-29 at 08:16 +0100, Pavel Stehule wrote:
> again, necessary rebase

I have started looking at patch 5, and I have some questions and comments.

- The commit message is headed "memory cleaning after DROP VARIABLE", but
  the rest of the commit message speaks of sinval messages.  These two
  things are independent, aren't they?  And both lead to the need to validate
  the variables, right?

  Then this code comment would for example be wrong:

 /* true after accepted sinval message */
 static bool needs_validation = false;

  It also becomes "true" after DROP VARIABLE, right?
  I am happy to fix the comment, but I want to understand the patch first.

- I see that the patch adds cleanup of invalid session variable to each
  COMMIT.  Is that a good idea?  I'd expect that it is good enough to clean
  up whenever session variables are accessed.
  Calling remove_invalid_session_variables() during each COMMIT will affect
  all transactions, and I don't see the benefit.

  Also, do we need to call it during pg_session_variables()?

Yours,
Laurenz Albe




Announcing Release 18 of the PostgreSQL Buildfarm client

2024-11-01 Thread Andrew Dunstan
I have pushed  Release 18 of the PostgreSQL Buildfarm client

In addition to numerous minor tweaks and bug fixes, the following changes
are made:

   - many improvements in the collection of log files
   - accommodate change in data checksum default in cross version upgrade
   testing
   - increase default PGCTLTIMEOUT to 180s
   - improve "safe" environment set, including all in the build_env
   configuration settings
   - on script timeout, don't just fail but send the result to the server
   (Note: the timeout mechanism doesn't work on Windows, and has been disabled
   there)
   - set the default timeout to 4 hours.
   - remove redundant TestDecoding module
   - add a module for running "make dist"
   - improve detection of failures when running with meson
   - add README files to the set ignored by the sample trigger_exclude
   setting.


The release can be downloaded from

https://github.com/PGBuildFarm/client-code/releases/tag/REL_18 or
https://buildfarm.postgresql.org/downloads

Upgrading is highly recommended.


cheers


andrew


Re: Adding NetBSD and OpenBSD to Postgres CI

2024-11-01 Thread Andres Freund
Hi,

Thanks for the patch!


On 2024-11-01 12:17:00 +0300, Nazir Bilal Yavuz wrote:
> I made these tasks triggered manually like MinGW task to save CI credits
> but a related line is commented out for now to trigger CFBot.

Oh, I need to pick my patch which allows repo-level config of which tasks run
back up. Then we can enable these by default for cfbot, but not for individual
repos...


> +task:
> +  depends_on: SanityCheck
> +  # trigger_type: manual
> +
> +  env:
> +# Below are experimentally derived to be a decent choice.
> +CPUS: 2
> +BUILD_JOBS: 8
> +TEST_JOBS: 8
> +
> +CIRRUS_WORKING_DIR: /home/postgres/postgres

Why do you need to set that?


> +CCACHE_DIR: /tmp/ccache_dir
> +
> +PATH: /usr/sbin:$PATH
> +
> +# Postgres interprets LANG as a 'en_US.UTF-8' but it is 'C', then

What does "Postgres interprets LANG as a 'en_US.UTF-8'" mean?


> +# Postgres tries to set 'LC_COLLATE' to 'en_US.UTF-8' but it is not
> +# changeable. Initdb fails because of that. So, LANG is forced to be 'C'.
> +LANG: "C"
> +LC_ALL: "C"

> +  matrix:
> +- name: NetBSD - 10 - Meson
> +  only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || 
> $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*netbsd.*'
> +  env:
> +IMAGE_FAMILY: pg-ci-netbsd-postgres
> +INCLUDE_DIRS: -Dextra_lib_dirs=/usr/pkg/lib 
> -Dextra_include_dirs=/usr/pkg/include
> +  <<: *netbsd_task_template
> +
> +- name: OpenBSD - 7 - Meson
> +  only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || 
> $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*openbsd.*'
> +  env:
> +IMAGE_FAMILY: pg-ci-openbsd-postgres
> +INCLUDE_DIRS: -Dextra_include_dirs=/usr/local/include 
> -Dextra_lib_dirs=/usr/local/lib
> +UUID: -Duuid=e2fs

Shouldn't something be added to PKG_CONFIG_PATH / --pkg-config-path?


For other OSs we have stanzas like

  setup_additional_packages_script: |
#apt-get update
#DEBIAN_FRONTEND=noninteractive apt-get -y install ...

it'd be good to have something similar for openbsd/netbsd, given that most
won't be as familiar with that.



> +  <<: *openbsd_task_template
> +  sysinfo_script: |
> +locale
> +id
> +uname -a
> +ulimit -a -H && ulimit -a -S
> +env
> +
> +  ccache_cache:
> +folder: $CCACHE_DIR
> +
> +  create_user_script: |
> +useradd postgres
> +chown -R postgres:users /home/postgres
> +mkdir -p ${CCACHE_DIR}
> +chown -R postgres:users ${CCACHE_DIR}
> +
> +  # -Duuid=bsd is not set since 'bsd' uuid option
> +  # is not working on NetBSD & OpenBSD. See
> +  # 
> https://www.postgresql.org/message-id/17358-89806e7420797...@postgresql.org
> +  # And other uuid options are not available on NetBSD.
> +  configure_script: |
> +su postgres <<-EOF
> +  meson setup \
> +--buildtype debug \

I suspect it'd be good to either add -Og to cflags (as done in a bunch of
other tasks) or to use debugoptimized, given that the tests on these machines
are fairly slow.


> +-Dcassert=true -Dssl=openssl ${UUID} \
> +-DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
> +${INCLUDE_DIRS} \
> +build
> +EOF

Should probably enable injection points.


> +  build_script: su postgres -c 'ninja -C build -j${BUILD_JOBS}'
> +  upload_caches: ccache
> +
> +  test_world_script: |
> +su postgres <<-EOF
> +  ulimit -c unlimited
> +  # Otherwise tests will fail on OpenBSD, due to the lack of enough 
> processes.
> +  ulimit -p 256
> +  meson test $MTEST_ARGS --num-processes ${TEST_JOBS}
> +EOF
> +
> +  on_failure:
> +<<: *on_failure_meson
> +
> +

Right now you don't seem to be collecting core files - but you're still
enabling them via ulimit -c unlimited.  At least we shouldn't use ulimit -c
unlimited without collecting core files, but it'd probably be better to add
support for collecting core files. Shouldn't be too hard.


Greetings,

Andres Freund




Re: Inval reliability, especially for inplace updates

2024-11-01 Thread Noah Misch
On Thu, Oct 31, 2024 at 09:20:52PM -0700, Noah Misch wrote:
> Here, one of the autovacuum workers had the guilty stack trace, appearing at
> the end of this message.  heap_inplace_update_and_unlock() calls
> CacheInvalidateHeapTupleInplace() while holding BUFFER_LOCK_EXCLUSIVE on a
> buffer of pg_class.  CacheInvalidateHeapTupleInplace() may call
> CatalogCacheInitializeCache(), which opens the cache's rel.  If there's not a
> valid relcache entry for the catcache's rel, we scan pg_class to make a valid
> relcache entry.  The ensuing hang makes sense.
> 
> Tomorrow, I'll think more about fixes.  Two that might work:
> 
> 1. Call CacheInvalidateHeapTupleInplace() before locking the buffer.  Each
>time we need to re-find the tuple, discard the previous try's inplace
>invals and redo CacheInvalidateHeapTupleInplace().  That's because
>concurrent activity may have changed cache key fields like relname.

Attached.  With this, I got no hangs in 1.9h of your test procedure.  Without
the patch, I got fourteen hangs in the same amount of time.

> 2. Add some function that we call before locking the buffer.  Its charter is
>to ensure PrepareToInvalidateCacheTuple() won't have to call
>CatalogCacheInitializeCache().

The existing InitCatalogCachePhase2() could satisfy that charter.  I liked
this a little less than (1), for two reasons.  First, confirming that it
avoids deadlocks requires thinking through more of the catcache and relcache
procedures.  Second, InitCatalogCachePhase2() would init unnecessary caches.

>I think nothing resets catcache to the
>extent that CatalogCacheInitializeCache() must happen again, so this should
>suffice regardless of concurrent sinval traffic, debug_discard_caches, etc.

The comment at CatalogCacheFlushCatalog() confirms that.

> What else is worth considering?  Any preferences among those?

3. Let a process take LW_SHARED if it already holds LW_EXCLUSIVE.

That would change outcomes well beyond inplace update, and it feels like a
bandage for doing things in a suboptimal order.  (1) or (2) would be an
improvement even if we did (3), since they would reduce I/O done while holding
BUFFER_LOCK_EXCLUSIVE.


This was a near miss to having a worst-in-years regression in a minor release,
so I'm proposing this sequence:

- Revert from non-master branches commits 8e7e672 (inplace180, "WAL-log
  inplace update before revealing it to other sessions.") and 243e9b4
  (inplace160, "For inplace update, send nontransactional invalidations.").

- Back-patch inplace230-index_update_stats-io-before-buflock to harden commit
  a07e03f (inplace110, "Fix data loss at inplace update after heap_update()").

- Push attached inplace240 to master.

- Make the commitfest entry a request for review of v17 inplace160+inplace240.
  After some amount of additional review and master bake time, the reverted
  patches would return to non-master branches.

If someone agrees or if nobody objects by 2024-11-02T15:00+, I'll make it
so.  That's not much time, but I want to minimize buildfarm members hanging
and maximize inplace230 bake time before the release wrap.


Less urgently, we likely should add defense in depth against adding
rarely-reached LWLock deadlocks.  Perhaps one or both of:

- Assert-fail if entering a conditional CatalogCacheInitializeCache() caller
  (e.g. SearchCatCacheInternal()) while holding a catalog buffer lock.

- elog(ERROR) instead of sleeping in LWLockAcquire() if the current process is
  the current lock holder.
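
As a rough illustration of the second idea, a guard of roughly this shape
near the top of LWLockAcquire() could do it, since LWLockHeldByMe() already
tracks what the current backend holds (sketch only; placement and message are
illustrative, not a worked-out patch):

    /*
     * Sketch: refuse to wait on an LWLock this backend already holds,
     * instead of sleeping into an undetected self-deadlock.
     */
    if (LWLockHeldByMe(lock))
        elog(ERROR, "cannot wait on an LWLock already held by this backend");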

Thanks,
nm
Author: Noah Misch 
Commit: Noah Misch 

Fix inplace update buffer self-deadlock.

A CacheInvalidateHeapTuple* callee might call
CatalogCacheInitializeCache(), which needs a relcache entry.  Acquiring
a valid relcache entry might scan pg_class.  Hence, to prevent
undetected LWLock self-deadlock, CacheInvalidateHeapTuple* callers must
not hold BUFFER_LOCK_EXCLUSIVE on buffers of pg_class.  Move the
CacheInvalidateHeapTupleInplace() before the BUFFER_LOCK_EXCLUSIVE.
Commit 243e9b40f1b2dd09d6e5bf91ebf6e822a2cd3704 introduced this
regression.  No back-patch, since I've reverted the culprit from
non-master branches.

Reported by Alexander Lakhin.  Reviewed by FIXME.

Discussion: 
https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec41145...@gmail.com

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1748eaf..9a31cdc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6214,6 +6214,17 @@ heap_inplace_lock(Relation relation,
 
Assert(BufferIsValid(buffer));
 
+   /*
+* Construct shared cache inval if necessary.  Because we pass a tuple
+* version without our own inplace changes or inplace changes other
+* sessions complete while we wait for locks, inplace update mustn't
+* change catcache lookup keys.  But we aren't bothering with index
+* updates either, so t

Re: Doc: typo in config.sgml

2024-11-01 Thread Tatsuo Ishii
> Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
> LATIN1 characters we had with HTML entities, so there are none
> currently.
> 
> I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
> so I added a cron job on my server to alert me when non-ASCII characters
> appear.

So you convert LATIN1 characters to HTML entities so that it's easier
to detect non-LATIN1 characters in the SGML docs? If my understanding
is correct, this can also be achieved by using a tool like:

iconv -t ISO-8859-1 -f UTF-8 release-17.sgml 

If there are some non-LATIN1 characters in release-17.sgml,
it will complain like:

iconv: illegal input sequence at position 175

An advantage of this is that we don't need to convert each LATIN1
character to an HTML entity, which makes the SGML file authors' lives a
little bit easier.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Consider pipeline implicit transaction as a transaction block

2024-11-01 Thread Michael Paquier
On Thu, Oct 31, 2024 at 03:32:39PM +0900, Michael Paquier wrote:
> @Tom added in CC: Is there a specific reason why CheckTransactionBlock()
> did not include a check based on XACT_FLAGS_PIPELINING when it got
> introduced in 20432f873140, while IsInTransactionBlock() considers it?

This also makes LOCK able to work flawlessly within pipelines as far
as I can see, as an error would be reported on HEAD.  That's nice. 

The behavior of the first command is interesting, still this saves
from the checks after any follow-up commands, and this is intended
from what I can see in postgres.c, because the xact flag is only set
once the first command completes.

Now, here is a fancy case: SAVEPOINT and its two brothers.  An error
would be reported on HEAD if attempting a SAVEPOINT, complaining that
we are not in a transaction block.  The patch causes a different, more
confusing, failure:
FATAL:  DefineSavepoint: unexpected state STARTED

This is a bit user-unfriendly.  I am not sure I see the point of
supporting savepoints in this context, so perhaps we should just issue
a cleaner error when XACT_FLAGS_PIPELINING is set?  Reporting
that we are not in a transaction block, while, well, we are in an
implicit transaction block because of the use of pipelines is
confusing.  The new error is actually worse.
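
For illustration, the kind of check meant here could look roughly like this
near the top of DefineSavepoint() and friends (a sketch only; the errcode and
message are placeholders, not a finished patch):

    if (MyXactFlags & XACT_FLAGS_PIPELINING)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("SAVEPOINT cannot be used inside a pipeline")));
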
--
Michael




Re: Using Expanded Objects other than Arrays from plpgsql

2024-11-01 Thread Tom Lane
Michel Pelletier  writes:
> Here's two backtraces from gdb from this function:
> CREATE OR REPLACE FUNCTION test2(graph matrix)
> RETURNS bigint LANGUAGE plpgsql AS
> $$
> BEGIN
> perform set_element(graph, 1, 1, 1);
> RETURN nvals(graph);
> end;
> $$;

Well, you shouldn't be using PERFORM.  Not only does it not do the
right thing, but it's not optimized for expanded objects at all:
they'll get flattened both on the way into the statement and on
the way out.  Try it with

 graph := set_element(graph, 1, 1, 1);
 RETURN nvals(graph);

regards, tom lane




Re: Doc: typo in config.sgml

2024-11-01 Thread Tatsuo Ishii
Hi Bruce,

> On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
>> On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
>> > On 15.10.24 23:51, Bruce Momjian wrote:
>> > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
>> > > > Bruce Momjian  writes:
>> > > > > Well, we can only use Latin-1, so the idea is that we will be 
>> > > > > explicit
>> > > > > about specifying Latin-1 only as HTML entities, rather than letting
>> > > > > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML 
>> > > > > files
>> > > > > if desired.
>> > > > 
>> > > > That policy would cause substantial problems with contributor names
>> > > > in the release notes.  I agree with Peter that we don't need this.
>> > > > Catching otherwise-invisible characters seems sufficient.
>> > > 
>> > > Uh, why can't we use HTML entities going forward?  Is that harder?
>> > 
>> > I think the question should be the other way around.  The entities are a
>> > historical workaround for when encoding support and rendering support was
>> > poor.  Now you can just type in the characters you want as is, which seems
>> > nicer.
>> 
>> Yes, that does make sense, and if we fully supported Unicode, we could
>> ignore all of this.
> 
> Patch applied to master --- no new UTF8 restrictions.

I thought the conclusion of the discussion was allowing to use LATIN1
(or UTF-8 encoded LATIN1) characters in SGML files without converting
them to HTML entities. Your patch seems to do opposite.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=641a5b7a1447954076728f259342c2f9201bb0b5

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Separate memory contexts for relcache and catcache

2024-11-01 Thread Andres Freund
Hi,

On 2024-11-01 14:47:37 -0700, Jeff Davis wrote:
> On Fri, 2024-11-01 at 15:19 -0400, Andres Freund wrote:
> > I'm a bit worried about the increase in "wasted" memory we might end
> > up when
> > creating one aset for *everything*. Just splitting out Relcache and
> > CatCache
> > isn't a big deal from that angle, they're always used reasonably
> > much. But
> > creating a bunch of barely used contexts does have the potential for
> > lots of
> > memory being wasted at the end of a page and on freelists.  It might
> > be ok as
> > far was what you proposed in the above email, I haven't analyzed that
> > in depth
> > yet.
>
> Melih raised similar concerns. The new contexts that my patch created
> were CatCacheContext, RelCacheContext, SPICacheContext,
> PgOutputContext, PlanCacheContext, TextSearchCacheContext, and
> TypCacheContext.
>
> Those are all created lazily, so you need to at least be using the
> relevant feature before it has any cost (with the exception of the
> first two).

Well, you can't get very far without using at least CatCacheContext,
RelCacheContext, PlanCacheContext, TypCacheContext. The others are indeed much
more specific and not really worth worrying about.


> > > I agree with others that we should look at changing the initial
> > > size or
> > > type of the contexts, but that should be a separate commit.
> >
> > It needs to be done close together though, otherwise we'll increase
> > the
> > new-connection-memory-usage of postgres measurably.
>
> I don't have a strong opinion here; that was a passing comment. But I'm
> curious: why it would increase the per-connection memory usage much to
> just have a couple new memory contexts?

"much" is maybe too strong. But the memory usage in a new connection is fairly
low, it doesn't take a large increase to be noticeable percentage-wise. And
given how much people love having poolers full of idle connections, it shows
up in aggregate.


> > I've previously proposed creating a type of memory context that's
> > intended for
> > places where we never expect to allocate much which allocates from
> > either a
> > superior memory context or just from the system allocator and tracks
> > memory
> > via linked lists.
>
> Why not just use ALLOCSET_SMALL_SIZES?

That helps some, but not *that* much. You still end up with a bunch of partially
filled blocks. Here's e.g. an excerpt with your patch applied:

 name                         │ ident                             │ type     │ level │ path         │ total_bytes │ total_nblocks │ free_bytes │ free_chunks │ used_bytes
──────────────────────────────┼───────────────────────────────────┼──────────┼───────┼──────────────┼─────────────┼───────────────┼────────────┼─────────────┼───────────
 CacheMemoryContext           │ (null)                            │ AllocSet │     2 │ {1,19}       │        8192 │             1 │       7952 │           0 │        240
 TypCacheContext              │ (null)                            │ AllocSet │     3 │ {1,19,28}    │        8192 │             1 │       4816 │           0 │       3376
 search_path processing cache │ (null)                            │ AllocSet │     3 │ {1,19,29}    │        8192 │             1 │       5280 │           7 │       2912
 CatCacheContext              │ (null)                            │ AllocSet │     3 │ {1,19,30}    │      262144 │             6 │      14808 │           0 │     247336
 RelCacheContext              │ (null)                            │ AllocSet │     3 │ {1,19,31}    │      262144 │             6 │       8392 │           2 │     253752
 relation rules               │ pg_backend_memory_contexts        │ AllocSet │     4 │ {1,19,31,34} │        8192 │             4 │       3280 │           1 │       4912
 index info                   │ manyrows_pkey                     │ AllocSet │     4 │ {1,19,31,35} │        2048 │             2 │        864 │           1 │       1184
 index info                   │ pg_statistic_ext_relid_index      │ AllocSet │     4 │ {1,19,31,36} │        2048 │             2 │        928 │           1 │       1120
 index info                   │ pg_class_tblspc_relfilenode_index │ AllocSet │     4 │ {1,19,31,37} │        2048 │             2 │        440 │           1 │       1608

(this is a tiny bit misleading, as "search_path processing cache" was just
moved)

You can quickly see that the various contexts have a decent amount of free
space, in some cases most of their space.

We've already been more aggressive about using separate contexts for indexes,
and in aggregate that memory usage shows up:

postgres[1088243][1]=# SELECT count(*), sum(total_bytes) as total_bytes, 
sum(total_nblocks) as total_nblocks, sum(free_bytes) free_bytes, 
sum(free_chunks) as free_chunks, sum(used_bytes) used_bytes FROM 
p

Re: Using Expanded Objects other than Arrays from plpgsql

2024-11-01 Thread Tom Lane
Michel Pelletier  writes:
> That all sounds great, and it sounds like my prosupport function just needs
> to return true, or set some kind of flag saying aliasing is ok.  I'd like
> to help as much as possible, but some of that reparenting stuff was pretty
> deep for me, other than being a quick sanity check case, is there anything
> I can do to help?

I wasn't expecting you to write the code, but if you can test it and
see how much it helps your use-case, that would be great.

Here is a v1 patch series that does the first part of what we've been
talking about, which is to implement the new optimization rule for
the case of a single RHS reference to the target variable.  I'm
curious to know if it helps you at all by itself.  You'll definitely
also need commit 534d0ea6c, so probably applying these on our git
master branch would be the place to start.

regards, tom lane

From e207d9d377a75181f5bf655ffb81ef53328d3ea3 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Wed, 30 Oct 2024 15:57:35 -0400
Subject: [PATCH v1 1/3] Preliminary refactoring.

This short and boring patch simply moves the responsibility for
initializing PLpgSQL_expr.target_param into plpgsql parsing,
rather than doing it at first execution of the expr as before.
This doesn't save anything in terms of runtime, since the work was
trivial and done only once per expr anyway.  But it makes the info
available during parsing, which will be useful for the next step.

Likewise set PLpgSQL_expr.func during parsing.  According to the
comments, this was once impossible; but it's certainly possible
since we invented the plpgsql_curr_compile variable.  Again, this
saves little runtime, but it seems far cleaner conceptually.

While at it, I reordered stuff in struct PLpgSQL_expr to make it
clearer which fields are filled when, and merged some duplicative
code in pl_gram.y.

Discussion: https://postgr.es/m/CACxu=vjakfnsyxoosnw1wegsao5u_v1xybacfvj14wgjv_p...@mail.gmail.com
---
 src/pl/plpgsql/src/pl_exec.c | 27 ---
 src/pl/plpgsql/src/pl_gram.y | 65 
 src/pl/plpgsql/src/plpgsql.h | 31 +
 3 files changed, 62 insertions(+), 61 deletions(-)

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 86c5bd324a..ce7cdb6458 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -4174,12 +4174,6 @@ exec_prepare_plan(PLpgSQL_execstate *estate,
 	SPIPlanPtr	plan;
 	SPIPrepareOptions options;
 
-	/*
-	 * The grammar can't conveniently set expr->func while building the parse
-	 * tree, so make sure it's set before parser hooks need it.
-	 */
-	expr->func = estate->func;
-
 	/*
 	 * Generate and save the plan
 	 */
@@ -5016,21 +5010,7 @@ exec_assign_expr(PLpgSQL_execstate *estate, PLpgSQL_datum *target,
 	 * If first time through, create a plan for this expression.
 	 */
 	if (expr->plan == NULL)
-	{
-		/*
-		 * Mark the expression as being an assignment source, if target is a
-		 * simple variable.  (This is a bit messy, but it seems cleaner than
-		 * modifying the API of exec_prepare_plan for the purpose.  We need to
-		 * stash the target dno into the expr anyway, so that it will be
-		 * available if we have to replan.)
-		 */
-		if (target->dtype == PLPGSQL_DTYPE_VAR)
-			expr->target_param = target->dno;
-		else
-			expr->target_param = -1;	/* should be that already */
-
 		exec_prepare_plan(estate, expr, 0);
-	}
 
 	value = exec_eval_expr(estate, expr, &isnull, &valtype, &valtypmod);
 	exec_assign_value(estate, target, value, isnull, valtype, valtypmod);
@@ -6282,13 +6262,6 @@ setup_param_list(PLpgSQL_execstate *estate, PLpgSQL_expr *expr)
 		 * that they are interrupting an active use of parameters.
 		 */
 		paramLI->parserSetupArg = (void *) expr;
-
-		/*
-		 * Also make sure this is set before parser hooks need it.  There is
-		 * no need to save and restore, since the value is always correct once
-		 * set.  (Should be set already, but let's be sure.)
-		 */
-		expr->func = estate->func;
 	}
 	else
 	{
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 8182ce28aa..5431977d69 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -64,6 +64,10 @@ static	bool			tok_is_keyword(int token, union YYSTYPE *lval,
 static	void			word_is_not_variable(PLword *word, int location);
 static	void			cword_is_not_variable(PLcword *cword, int location);
 static	void			current_token_is_not_variable(int tok);
+static	PLpgSQL_expr	*make_plpgsql_expr(const char *query,
+		   RawParseMode parsemode);
+static	void			expr_is_assignment_source(PLpgSQL_expr *expr,
+  PLpgSQL_datum *target);
 static	PLpgSQL_expr	*read_sql_construct(int until,
 			int until2,
 			int until3,
@@ -529,6 +533,10 @@ decl_statement	: decl_varname decl_const decl_datatype decl_collate decl_notnull
 	 errmsg("variable \"%s\" must have a default value, since it's declared NOT NULL",
 		

Re: Doc: typo in config.sgml

2024-11-01 Thread Bruce Momjian
On Sat, Nov  2, 2024 at 07:27:00AM +0900, Tatsuo Ishii wrote:
> > On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
> >> On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
> >> > On 15.10.24 23:51, Bruce Momjian wrote:
> >> > > On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
> >> > > > Bruce Momjian  writes:
> >> > > > > Well, we can only use Latin-1, so the idea is that we will be 
> >> > > > > explicit
> >> > > > > about specifying Latin-1 only as HTML entities, rather than letting
> >> > > > > non-Latin-1 creep in as UTF8.  We can exclude certain UTF8 or SGML 
> >> > > > > files
> >> > > > > if desired.
> >> > > > 
> >> > > > That policy would cause substantial problems with contributor names
> >> > > > in the release notes.  I agree with Peter that we don't need this.
> >> > > > Catching otherwise-invisible characters seems sufficient.
> >> > > 
> >> > > Uh, why can't we use HTML entities going forward?  Is that harder?
> >> > 
> >> > I think the question should be the other way around.  The entities are a
> >> > historical workaround for when encoding support and rendering support was
> >> > poor.  Now you can just type in the characters you want as is, which 
> >> > seems
> >> > nicer.
> >> 
> >> Yes, that does make sense, and if we fully supported Unicode, we could
> >> ignore all of this.
> > 
> > Patch applied to master --- no new UTF8 restrictions.
> 
> I thought the conclusion of the discussion was allowing to use LATIN1
> (or UTF-8 encoded LATIN1) characters in SGML files without converting
> them to HTML entities. Your patch seems to do opposite.
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=641a5b7a1447954076728f259342c2f9201bb0b5

Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
LATIN1 characters we had with HTML entities, so there are none
currently.

I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
so I added a cron job on my server to alert me when non-ASCII characters
appear.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"




Re: UUID v7

2024-11-01 Thread Masahiko Sawada
On Fri, Nov 1, 2024 at 10:33 AM Andrey M. Borodin  wrote:
>
>
>
> > On 31 Oct 2024, at 23:04, Stepan Neretin  wrote:
> >
> >
> > Firstly, I'd like to discuss the increased_clock_precision variable, which
> > currently divides the timestamp into milliseconds and nanoseconds. However,
> > this approach only approximates the extra bits for sub-millisecond
> > precision, leading to imprecise timestamps in high-frequency UUID
> > generation.
> No, timestamp is taken in nanoseconds, we keep precision of 1/4096 of ms. If 
> you observe precision loss anywhere let me know.
>
> >
> > To address this issue, we could consider using a more accurate method for
> > calculating the timestamp. For instance, we could utilize a higher
> > resolution clock or implement a more precise algorithm to ensure accurate
> > timestamps.
>
> That's what we do.
>
> >
> > Additionally, it would be beneficial to add validation checks for the
> > interval argument. These checks could verify that the input interval is
> > within reasonable bounds and that the calculated timestamp is accurate.
> > Examples of checks could include verifying if the interval is too small,
> > too large, or exceeds the maximum possible number of milliseconds and
> > nanoseconds in a timestamp.
>
> timestamptz_pl_interval() is already doing this.
>
> > What do you think about these suggestions? Let me know your thoughts!
>
> Thanks a lot for reviewing the patch!
>
>
> > On 1 Nov 2024, at 10:33, Masahiko Sawada  wrote:
> >
> > On Thu, Oct 31, 2024 at 9:53 PM Andrey M. Borodin  
> > wrote:
> >>
> >>
> >>
> >>> On 1 Nov 2024, at 03:00, Masahiko Sawada  wrote:
> >>>
> >>> Therefore, if the
> >>> system clock moves backward due to NTP, we cannot guarantee
> >>> monotonicity and sortability. Is that right?
> >>
> >> Not exactly. Monotonicity is ensured for a given backend. We make sure 
> >> that timestamp is advanced at least for ~250ns forward on each UUID 
> >> generation. 60 bits of time are unique and ascending for a given backend.
> >>
> >
> > Thank you for your explanation. I now understand this code guarantees
> > the monotonicity:
> >
> > +/* minimum amount of ns that guarantees step of increased_clock_precision 
> > */
> > +#define SUB_MILLISECOND_STEP (1000000/4096 + 1)
> > +   ns = get_real_time_ns();
> > +   if (previous_ns + SUB_MILLISECOND_STEP >= ns)
> > +   ns = previous_ns + SUB_MILLISECOND_STEP;
> > +   previous_ns = ns;
> >
> >
> > I think that one of the most important parts in UUIDv7 implementation
> > is which method (1, 2, or 3 described in RFC 9562) we use to guarantee
> > the monotonicity. The current patch employs method 3 with the
> > assumption that 12 bits of sub-millisecond information is available on
> > most of the systems we support. However, as far as I tested, on MacOS,
> > values returned by  clock_gettime(CLOCK_REALTIME) are only microsecond
> > precision, meaning that we could waste some randomness. Has this point
> > been considered?
> >
>
> There was a thread "What is a typical precision of gettimeofday()?" [0]
> There we found out that routines of instr_time.h are precise enough. On my 
> machine (MacBook Air M3) I do not observe significant differences between 
> CLOCK_MONOTONIC_RAW and CLOCK_REALTIME in pg_test_timing results.
>
> CLOCK_MONOTONIC_RAW
> x4mmm@x4mmm-osx bin % ./pg_test_timing
> Testing timing overhead for 3 seconds.
> Per loop time including overhead: 15.30 ns
> Histogram of timing durations:
>    < us   % of total      count
>       1     98.47856  193113929
>       2      1.52039    2981452
>       4      0.00025        485
>       8      0.00062       1211
>      16      0.00012        237
>      32      0.00004         79
>      64      0.00002         30
>     128      0.00000          8
>     256      0.00000          5
>     512      0.00000          3
>    1024      0.00000          1
>    2048      0.00000          2
>
> CLOCK_REALTIME
> x4mmm@x4mmm-osx bin % ./pg_test_timing
> Testing timing overhead for 3 seconds.
> Per loop time including overhead: 15.04 ns
> Histogram of timing durations:
>    < us   % of total      count
>       1     98.49709  196477842
>       2      1.50268    2997479
>       4      0.00007        130
>       8      0.00012        238
>      16      0.00005         91
>      32      0.00000          4
>      64      0.00000          1

I applied the patch shared on that thread[1] to measure nanoseconds
and changed instr_time.h to use CLOCK_REALTIME even on macOS. Here is
the results on my machine (macOS 14.7, M1 Pro):

Testing timing overhead for 3 seconds.
Per loop time including overhead: 18.61 ns
Histogram of timing durations:
   <= ns   % of total  running %      count
       0      98.1433    98.1433  158212921
       1       0.0000    98.1433          0
       3       0.0000    98.1433          0
       7       0.0000    98.1433          0
      15       0.0000    98.1433          0
      31       0.0000    98.1433          0
      63       0.0000    98.1433

Re: New "raw" COPY format

2024-11-01 Thread Masahiko Sawada
On Wed, Oct 30, 2024 at 4:54 AM Joel Jacobson  wrote:
>
> On Wed, Oct 30, 2024, at 09:14, Joel Jacobson wrote:
> > $ psql -f bench_result.sql
>
> Oops, I realized I benchmarked a debug build,
> reran the benchmark with `meson setup build --buildtype=release`,
> and also added benchmarking of HEAD:
>
> [plot image showing time distribution by format and version]
>
>
> For those unable to see the image, here is the same data as a text table:
>
> select format, version, min(time_ms), round(avg(time_ms),2) avg, 
> max(time_ms), round(stddev(time_ms)) as stddev from bench group by 1,2 order 
> by 1,2;
>
> format | version |   min    |   avg   |   max    | stddev
> -------+---------+----------+---------+----------+--------
> csv    | HEAD    | 2862.537 | 2895.91 | 2968.420 |     27
> csv    | v16     | 2834.663 | 2869.02 | 2926.489 |     28
> csv    | v17     | 2958.570 | 3002.71 | 3056.949 |     31
> raw    | v16     | 1697.899 | 1723.48 | 1798.020 |     22
> raw    | v17     | 1720.251 | 1753.56 | 1811.399 |     28
> text   | HEAD    | 2442.845 | 2476.60 | 2543.767 |     33
> text   | v16     | 2358.062 | 2389.06 | 2472.382 |     32
> text   | v17     | 2517.778 | 2552.65 | 2621.161 |     30
> (8 rows)
>
> I think the improvement for the 'text' format between HEAD and v16 could be 
> just noise,
> possibly due to different binary layout when compiling.
>
> v17 (single function) seems slower than v16 (separate functions), also in 
> release build.
>

Here are review comments on v17 patches:

+data values. Unlike in other formats, the delimiter in
+raw format can be any string, including multi-byte
+characters. If no DELIMITER is specified, the entire
+input or output is treated as a single data value.

As I mentioned in a separate email, if we use the OS default EOL as
the default EOL in raw format, it would not be necessary to allow it
to be multiple characters. I think it's worth considering.

---
 * copyfromparse.c
 *  Parse CSV/text/binary format for COPY FROM.

We need to update here as well.

--
- * 4. CopyReadAttributesText/CSV() function takes the input line from
+ * 4. CopyReadAttributesText/CSV/Raw() function takes the input line from

I think we don't have CopyReadAttributesRaw() function now.

---
I think it would be better to explain how to parse data in raw mode,
especially which steps in the pipeline we skip, in the comment at the
top of copyfromparse.c.

---
-   if (cstate->opts.format == COPY_FORMAT_CSV)
+   if (cstate->opts.format == COPY_FORMAT_RAW)
{
-   /*
-* If character is '\r', we may need to look ahead below.  Force
-* fetch of the next character if we don't already have it.  We
-* need to do this before changing CSV state, in case '\r' is also
-* the quote or escape character.
-*/
-   if (c == '\r')
+   if (delim_len == 0)
{
-   IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
+   /* When reading entire file, consume all remaining
bytes at once */
+   input_buf_ptr = copy_buf_len;
+   continue;
}
+   else

I think the code becomes much simpler if we do 'continue' at the
end of the 'if (cstate->opts.format == COPY_FORMAT_RAW)' branch, instead
of doing 'if (cstate->opts.format == COPY_FORMAT_RAW) {} else ...'.
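
For illustration, roughly this shape, based only on the hunk quoted above
(an untested sketch, not the actual patch code):

    if (cstate->opts.format == COPY_FORMAT_RAW)
    {
        if (delim_len == 0)
        {
            /* When reading entire file, consume all remaining bytes at once */
            input_buf_ptr = copy_buf_len;
        }
        else
        {
            /* ... delimiter handling for raw mode ... */
        }
        continue;   /* raw mode needs no per-character processing below */
    }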

---
@@ -1461,6 +1536,7 @@ CopyReadLineText(CopyFromState cstate)
return result;
 }

+
 /*
  * Return decimal value for a hexadecimal digit
  */
@@ -1937,7 +2013,6 @@ endfield:
return fieldno;
 }

-
 /*
  * Read a binary attribute
  */

Unnecessary line addition and removal.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Count and log pages set all-frozen by vacuum

2024-11-01 Thread Masahiko Sawada
On Fri, Nov 1, 2024 at 9:41 AM Robert Haas  wrote:
>
> On Fri, Nov 1, 2024 at 12:23 PM Masahiko Sawada  wrote:
> > Having "marked by this operation" twice seems to be redundant. How
> > about something like the output below?
> >
> > visibility map: %u pages set all-visible (%u pages total), %u pages
> > set all-frozen (%u pages total)
>
> For me, the meaning of that isn't clear.
>
> I think this is the wrong direction, anyway. If someone says "hey, we
> should add X to the output" and someone else says "we should add Y
> instead," it doesn't follow that the right thing to do is to add both.
> I happen to think that the right answer is X, both because X of my
> understanding of the purpose of this log message, and also because X
> is in the service of Melanie's larger goal and Y is not. But I also
> feel like bikeshedding the patch that somebody should have written
> instead of reviewing the one they actually wrote is to be avoided. Of
> course, sometimes there's no getting around the fact that the person
> chose to do something that didn't really make sense, and then it's
> reasonable to suggest alternatives. But here, what was actually done
> does make sense and is the first choice of some people. What is
> proposed can be added now, provided that it actually gets some review,
> and the other thing can be added later if someone wants to do the work
> and if no problems are discovered, but it isn't Melanie's job to add
> data that isn't needed for her project.

Agreed.

I think we agreed with what the patches proposed by Melanie do, so
let's focus on these patches on this thread. We can add other
information later if we need.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Allowing pg_recvlogical to create temporary replication slots

2024-11-01 Thread Torsten Förtsch
This is another version of the patch. It now includes tests.


On Fri, Nov 1, 2024 at 1:42 PM Torsten Förtsch 
wrote:

> On Wed, Oct 30, 2024 at 9:01 AM Peter Eisentraut 
> wrote:
>
>> On 27.10.24 13:37, Torsten Förtsch wrote:
>> > The attached patch enables pg_recvlogical to create a temporary slot.
>>
>> I think you should explain a bit why you want to do that, what use case
>> or scenario this is meant to address.
>>
>
> In my particular case I want to filter for messages sent by
> pg_logical_emit_message(). I don't care much about getting ALL the changes.
> If the replication slot disappears and a few changes are lost it does not
> matter. So, a temporary rep slot makes more sense than creating one and
> then having to make sure it is not retaining wal forever later.
>
> I can imagine this also as a tool to monitor changes for a while and then
> simply disconnect without the need to remove the slot.
>
> Why am I interested in pg_logical_emit_message()? We have an application
> with relatively complicated transactions involving multiple tables. Some of
> them use pg_notify(). We also have synchronous replication. Sometimes I see
> lock avalanches that can be traced to the "AccessExclusiveLock on object 0
> of class 1262 of database 0". This lock is taken during commit when the
> notification is inserted in the queue. After that the backend waits for the
> confirmation by the sync replica. So, this lock effectively forces all
> transactions that send notifications to commit in sequence.
>
> Now, if the application uses pg_logical_emit_message() instead, then I
> think there is no equivalent lock. I understand the semantics are a bit
> different (timing) but close enough for my use case. Another advantage of
> pg_logical_emit_message() is the ability to send them even if the
> transaction is aborted.
>
> I was in the process of experimenting with this idea and found that
> pg_recvlogical can:
> - only create the slot or
> - create the slot and immediately use it
> - try to create the slot and if the slot is already there use it
>
> So, why not also allow it to create a temporary slot?
>
From 82d6256fb621e2dc75a67603bef5b511cfdf0e27 Mon Sep 17 00:00:00 2001
From: Torsten Foertsch 
Date: Fri, 1 Nov 2024 13:56:43 +0100
Subject: [PATCH v2] Allow pg_recvlogical to create temp slots

With this patch pg_recvlogical can be called with the
--temporary-slot option together with --create-slot and --start.

If called that way, the created slot exists only for the duration
of the connection. If the connection is dropped and reestablished
by pg_recvlogical, a new temp slot by the same name is created.

If the slot exists and --if-not-exists is not passed, then
pg_recvlogical fails. If the slot exists and --if-not-exists is
given, the slot will be used.

In addition a few tests have been added for previously untested
options.

Discussion: https://www.postgresql.org/message-id/CAKkG4_%3DoMpa-AXhw9m044ZH5YdneNFTp6WxG_kEPA0cTkfiMNQ%40mail.gmail.com
---
 src/bin/pg_basebackup/pg_recvlogical.c| 33 --
 src/bin/pg_basebackup/t/030_pg_recvlogical.pl | 65 +++
 2 files changed, 92 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 3db520ed38..4a131cca6d 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -50,6 +50,7 @@ static int	fsync_interval = 10 * 1000; /* 10 sec = default */
 static XLogRecPtr startpos = InvalidXLogRecPtr;
 static XLogRecPtr endpos = InvalidXLogRecPtr;
 static bool do_create_slot = false;
+static bool slot_is_temporary = false;
 static bool slot_exists_ok = false;
 static bool do_start_slot = false;
 static bool do_drop_slot = false;
@@ -104,6 +105,7 @@ usage(void)
 	printf(_("  -s, --status-interval=SECS\n"
 			 " time between status packets sent to server (default: %d)\n"), (standby_message_timeout / 1000));
 	printf(_("  -S, --slot=SLOTNAMEname of the logical replication slot\n"));
+	printf(_("  --temporary-slot   the slot created exists until the connection is dropped\n"));
 	printf(_("  -t, --two-phaseenable decoding of prepared transactions when creating a slot\n"));
 	printf(_("  -v, --verbose  output verbose messages\n"));
 	printf(_("  -V, --version  output version information, then exit\n"));
@@ -216,7 +218,7 @@ StreamLogicalLog(void)
 	char	   *copybuf = NULL;
 	TimestampTz last_status = -1;
 	int			i;
-	PQExpBuffer query;
+	PQExpBuffer query = NULL;
 	XLogRecPtr	cur_record_lsn;
 
 	output_written_lsn = InvalidXLogRecPtr;
@@ -227,10 +229,24 @@ StreamLogicalLog(void)
 	 * Connect in replication mode to the server
 	 */
 	if (!conn)
+	{
 		conn = GetConnection();
-	if (!conn)
-		/* Error message already written in GetConnection() */
-		return;
+		if (!conn)
+			/* Error message already written in GetConnection() */
+			return;
+
+		/* Recreate a replication slot. */
+		if (do_create_slot && slot_is_temp

Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread David Rowley
On Fri, 1 Nov 2024 at 19:27, Michael Paquier  wrote:
> Under gcc -O2 or -O3, the single-byte check or the 8-byte check don't
> make a difference.  Please see the attached (allzeros.txt) for a quick
> check if you want to check by yourself.  With 1M iterations, both
> average around 3ms for 1M iterations on my laptop (not the fastest
> thing around).
>
> Under -O0, though, the difference is noticeable:
> - 1-byte check: 3.52s for 1M iterations, averaging one check at
> 3.52ns.
> - 8-byte check: 0.46s for 1M iterations, averaging one check at
> 0.46ns.
>
> Even for that, I doubt that this is going to be noticeable in
> practice, still the difference exists.

The reason you're not seeing the slowdown with -O2 and -O3 is because
your compiler didn't think there was anything to do so didn't emit the
code you were trying to benchmark.  Try looking at allzeros.s after
doing "gcc allzeros.c -S -O2".

I've attached an updated version for you to try. I used a volatile
bool and assigned the function result to it to prevent the compiler
from optimising out the test.

$ gcc allzeros.c -O2 -o allzeros
$ ./allzeros
char: done in 1607800 nanoseconds
size_t: done in 208800 nanoseconds (7.70019 times faster)

$ gcc allzeros.c -O3 -o allzeros
$ ./allzeros
char: done in 1584500 nanoseconds
size_t: done in 225700 nanoseconds (7.02038 times faster)

David
#include 
#include 
#include 
#include 
#include 
#include 

#define BLCKSZ 8192
#define LOOPS 1000

static inline bool
allzeros_char(const void *ptr, size_t len)
{
const char *p = (const char *) ptr;

for (size_t i = 0; i < len; i++)
{
if (p[i] != 0)
return false;
}
return true;
}

static inline bool
allzeros_size_t(const void *ptr, size_t len)
{
const size_t *p = (const size_t *) ptr;

for (size_t i = 0; i < len / sizeof(size_t); i++)
{
if (p[i] != 0)
return false;
}
return true;
}

#define NANOSEC_PER_SEC 1000000000

// Returns difference in nanoseconds
int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
nanosec += (t1->tv_nsec - t2->tv_nsec);

return nanosec;
}

int main()
{
size_t pagebytes[BLCKSZ] = {0};
volatile bool result;
struct timespec start,end;
int64_t char_time, size_t_time;

clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);

for (int i = 0; i < LOOPS; i++)
{
result = allzeros_char(pagebytes, BLCKSZ);
}

clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
char_time = get_clock_diff(&end, &start);
printf("char: done in %ld nanoseconds\n", char_time);

clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);

for (int i = 0; i < LOOPS; i++)
{
result = allzeros_size_t(pagebytes, BLCKSZ);
}

clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
size_t_time = get_clock_diff(&end, &start);
printf("size_t: done in %ld nanoseconds (%g times faster)\n", 
size_t_time, (double) char_time / size_t_time);   

return 0;
}


Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.

2024-11-01 Thread Amit Langote
Hi,

On Wed, Oct 23, 2024 at 10:48 PM Tender Wang  wrote:
> I think the root cause of this thread and [1] are same.  We don't use the 
> Partition Key collation but column's
> collation to fill the RelOptInfo partexprs field in 
> set_baserel_partition_key_exprs().
> If the Partition Key definition is same as column definition,  which most 
> times is,
> that will be ok. But if it's not, this thread issue will arise.
>
> As far as I know, only partition pruning logic considers not only call 
> equal(), but also check collation match.
> Other codes only call equal() to check if the exprs match the partition key.
> For example, in this thread case, match_expr_to_partition_keys() think the 
> expr match the partition key:
> if (equal(lfirst(lc), expr))
> return cnt;
>
> Although We can fix this issue like [1], I think why not directly use the 
> partkey->partcollation[cnt], which its value is
> same with pg_partitioned_table's partcollation. I tried this to fix [1], but 
> at that time, I was unsure if it was the correct fix.

I think it would be better to keep RelOptInfo.partexprs unchanged in
these fixes.  I haven't been able to come up with a way to "assign"
the correct collation to partition keys that are not simple column
references.  Checking PartitionScheme.partcollation separately is
enough to fix these bugs and aligns with the style of existing code,
such as match_clause_to_partition_key() in partprune.c.
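
For illustration, the separate collation check could look roughly like this
in match_expr_to_partition_keys(), assuming the RelOptInfo "rel" and its
PartitionScheme are in scope (a sketch, not an actual patch):

    /*
     * Sketch: besides the expression matching the partition key, require
     * the partitioning collation to match the expression's collation
     * (irrelevant for non-collatable types).
     */
    if (equal(lfirst(lc), expr) &&
        (!OidIsValid(rel->part_scheme->partcollation[cnt]) ||
         rel->part_scheme->partcollation[cnt] == exprCollation(expr)))
        return cnt;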


--
Thanks, Amit Langote




Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread Michael Paquier
On Fri, Nov 01, 2024 at 08:21:50PM +1300, David Rowley wrote:
> My vote is to just revert this usage of the function. Anything more
> elaborate would need to check pointer alignment before using any types
> larger than char. The previous code does not need to do that because
> the page is going to be at least MAXALIGNed.

Fine, here you go.  The attached reverts back this part in bufpage.c
to what it was in 49d6c7d8daba.
--
Michael
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 5ee1e58cd4..be6f1f62d2 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -89,8 +89,10 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 {
 	PageHeader	p = (PageHeader) page;
 	size_t	   *pagebytes;
+	int			i;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
+	bool		all_zeroes = false;
 	uint16		checksum = 0;
 
 	/*
@@ -124,9 +126,18 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 	}
 
 	/* Check all-zeroes case */
+	all_zeroes = true;
 	pagebytes = (size_t *) page;
+	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
+	{
+		if (pagebytes[i] != 0)
+		{
+			all_zeroes = false;
+			break;
+		}
+	}
 
-	if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t
+	if (all_zeroes)
 		return true;
 
 	/*




Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread Bertrand Drouvot
Hi,

On Fri, Nov 01, 2024 at 04:36:45PM +0900, Michael Paquier wrote:
> On Fri, Nov 01, 2024 at 08:21:50PM +1300, David Rowley wrote:
> > My vote is to just revert this usage of the function. Anything more
> > elaborate would need to check pointer alignment before using any types
> > larger than char. The previous code does not need to do that because
> > the page is going to be at least MAXALIGNed.
> 
> Fine, here you go.  The attached reverts back this part in bufpage.c
> to what it was in 49d6c7d8daba.

Thanks! Worth to add a comment as to why pg_memory_is_all_zeros() should not
be used here?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread Michael Paquier
On Fri, Nov 01, 2024 at 07:44:22AM +, Bertrand Drouvot wrote:
> Thanks!

Cool, will fix that in a bit.

> Worth to add a comment as to why pg_memory_is_all_zeros() should not
> be used here?

I would not add one in bufpage.c, documenting that where
pg_memory_is_all_zeros() is defined may be more adapted.
--
Michael




Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread Michael Paquier
On Fri, Nov 01, 2024 at 06:33:51AM +, Bertrand Drouvot wrote:
> We could try to write a more elaborate version of pg_memory_is_all_zeros(), 
> but
> as it looks like there is only one use case, then it's probably better to not
> implement (revert) this change here and "just" add a comment as to why 
> pg_memory_is_all_zeros()
> should not be used here, thoughts?
> 
> [0]: https://godbolt.org/z/xqnW4MPY5

Note that the two printf() calls make the code less optimized.

Anyway, I see the following from bufpage.s for these lines under -O2:

1) On HEAD at 07e9e28b56db:
.LVL306:
   .loc 3 201 23 is_stmt 1 discriminator 1 view .LVU547
   cmpq    $1024, %rbx    <- Yep, that's wrong.
   jne .L417

2) On HEAD at 49d6c7d8daba:
.LVL299:
   .loc 1 131 16 is_stmt 0 discriminator 1 view .LVU524
   cmpq    $8192, %rbx
   je  .L419

3) With the patch sent at [1]:
.LVL306:
   .loc 3 201 23 is_stmt 1 discriminator 1 view .LVU545
   cmpq    $8192, %rbx
   jne .L417

So it does not matter one way or another for 2) or 3), does it?

[1]: https://www.postgresql.org/message-id/zyr02ofhiwg1h...@paquier.xyz
--
Michael




Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread David Rowley
On Fri, 1 Nov 2024 at 19:33, Bertrand Drouvot
 wrote:
> Agree, I did a quick test (see [0]) and it looks like it's significantly 
> slower
> using the new inline function.
>
> We could try to write a more elaborate version of pg_memory_is_all_zeros(), 
> but
> as it looks like there is only one use case, then it's probably better to not
> implement (revert) this change here and "just" add a comment as to why 
> pg_memory_is_all_zeros()
> should not be used here, thoughts?

My vote is to just revert this usage of the function. Anything more
elaborate would need to check pointer alignment before using any types
larger than char. The previous code does not need to do that because
the page is going to be at least MAXALIGNed.

David




Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread David Rowley
On Fri, 1 Nov 2024 at 20:14, Michael Paquier  wrote:
> 2) On HEAD at 49d6c7d8daba:
> .LVL299:
>.loc 1 131 16 is_stmt 0 discriminator 1 view .LVU524
>    cmpq    $8192, %rbx
>je  .L419
>
> 3) With the patch sent at [1]:
> .LVL306:
>.loc 3 201 23 is_stmt 1 discriminator 1 view .LVU545
>    cmpq    $8192, %rbx
>jne .L417
>
> So it does not matter one way or another for 2) or 3), does it?

The patch in [1] will fix the bug. But I'm still concerned about the
performance implications of moving to byte-at-a-time processing. There
are about 8 times more instructions expected to do the same work.

David

> [1]: https://www.postgresql.org/message-id/zyr02ofhiwg1h...@paquier.xyz




Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.

2024-11-01 Thread Tender Wang
Amit Langote wrote on Fri, Nov 1, 2024 at 15:27:

> Hi,
>
> On Wed, Oct 23, 2024 at 10:48 PM Tender Wang  wrote:
> > I think the root cause of this thread and [1] is the same.  We don't use
> > the partition key collation but the column's collation to fill the
> > RelOptInfo partexprs field in set_baserel_partition_key_exprs().
> > If the partition key definition is the same as the column definition,
> > which it usually is, that will be OK. But if it's not, the issue in this
> > thread will arise.
> >
> > As far as I know, only the partition pruning logic not only calls
> > equal() but also checks for a collation match.
> > Other code only calls equal() to check whether the exprs match the
> > partition key.
> > For example, in this thread's case, match_expr_to_partition_keys() thinks
> > the expr matches the partition key:
> > if (equal(lfirst(lc), expr))
> > return cnt;
> >
> > Although we can fix this issue like [1], why not directly use
> > partkey->partcollation[cnt], whose value is the same as
> > pg_partitioned_table's partcollation? I tried this to fix [1], but at
> > that time I was unsure whether it was the correct fix.
>
> I think it would be better to keep RelOptInfo.partexprs unchanged in
> these fixes.  I haven't been able to come up with a way to "assign"
> the correct collation to partition keys that are not simple column
> references.  Checking PartitionScheme.partcollation separately is
> enough to fix these bugs and aligns with the style of existing code,
> such as match_clause_to_partition_key() in partprune.c.
>

Agreed. I can't figure out another solution.


-- 
Thanks,
Tender Wang


RE: Pgoutput not capturing the generated columns

2024-11-01 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for rebasing the patch! Before reviewing it deeply, I want to confirm the
specification. Based on the documentation, my understanding is as follows:

- If publish_generated_columns is false, the publication won't replicate 
generated columns
- If publish_generated_columns is true, the behavior on the subscriber depends 
on the table column:
  - If it is a generated column even on the subscriber, it causes an ERROR.
  - If it is a regular column, the generated value is replicated.
  - If the column is missing, it causes an ERROR.

However, the test below implies that generated columns can be replicated even if
publish_generated_columns is false. I'm not sure...

```
# Verify that incremental replication of generated columns occurs
# when they are included in the column list, regardless of the
# publish_generated_columns option.
$result =
  $node_subscriber->safe_psql('postgres', "SELECT * FROM tab3 ORDER BY a");
is( $result, qq(|2
|4
|6
|8),
'tab3 incremental replication, when publish_generated_columns=false');
```

Also, I've tested the case where both pub and sub have the generated column, but
the ERROR seemed strange to me.

```
test_pub=# CREATE TABLE gencoltable (a int PRIMARY KEY, b int GENERATED ALWAYS 
AS (a * 2) STORED);
test_pub=# CREATE PUBLICATION pub FOR TABLE gencoltable(a, b) WITH 
(publish_generated_columns = true);
test_pub=# INSERT INTO gencoltable (a) VALUES (generate_series(1, 10));

test_sub=# CREATE TABLE gencoltable (a int PRIMARY KEY, b int GENERATED ALWAYS 
AS (a * 2) STORED);
test_sub=# CREATE SUBSCRIPTION sub CONNECTION ... PUBLICATION pub;

-> ERROR: logical replication target relation "public.gencoltable" is missing 
replicated column: "b"
```

The attribute exists on the sub, but it was reported as a missing column. I think
we should instead report something like "the generated column on the publisher
cannot be replicated to the generated column on the subscriber".

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: small pg_combinebackup improvements

2024-11-01 Thread Bertrand Drouvot
Hi,

On Thu, Oct 31, 2024 at 12:06:25PM -0400, Robert Haas wrote:
> On Thu, Oct 31, 2024 at 11:41 AM Bertrand Drouvot
>  wrote:
> > I'm not sure about 0001 but I think 0002 deserves a back patch as I think 
> > it fits
> > into the "low-risk fixes" category [0].
> 
> I'm inclined to back-patch both, then. We might have more small fixes
> and they'll be less risky to back-patch if we back-patch them all.

Yeah, that makes perfect sense. +1 to back-patching both, then.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.

2024-11-01 Thread jian he
Just a quick reply while testing v4-0001.
The tests are copied from the first 40 lines of src/test/regress/sql/partition_aggregate.sql.

drop table if exists pagg_tab;
CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY
LIST(c collate "C");
CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000',
'0001', '0002', '0003', '0004');
CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005',
'0006', '0007', '0008');
CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009',
'0010', '0011');
INSERT INTO pagg_tab SELECT (i % 20), (i % 30), to_char(i % 12,
'FM0000'), i % 30 FROM generate_series(0, 2999) i;
ANALYZE pagg_tab;

EXPLAIN (COSTS OFF, settings)
SELECT a, sum(b), avg(b), count(*), max(b) FROM pagg_tab GROUP BY a;


 QUERY PLAN

---
 Finalize HashAggregate
   Group Key: pagg_tab.a
   ->  Append
 ->  Partial HashAggregate
   Group Key: pagg_tab.a
   ->  Seq Scan on pagg_tab_p1 pagg_tab
 ->  Partial HashAggregate
   Group Key: pagg_tab_1.a
   ->  Seq Scan on pagg_tab_p2 pagg_tab_1
 ->  Partial HashAggregate
   Group Key: pagg_tab_2.a
   ->  Seq Scan on pagg_tab_p3 pagg_tab_2
 Settings: enable_partitionwise_aggregate = 'on',
enable_partitionwise_join = 'on', max_parallel_workers_per_gather =
'0', enable_incremental_sort = 'off'



drop table if exists pagg_tab;
CREATE TABLE pagg_tab (a text, b int, c text, d int) PARTITION BY
LIST(c collate "C");
CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000',
'0001', '0002', '0003', '0004');
CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005',
'0006', '0007', '0008');
CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009',
'0010', '0011');
INSERT INTO pagg_tab SELECT (i % 20)::text, (i % 30), to_char(i % 12,
'FM0000'), i % 30 FROM generate_series(0, 2999) i;
ANALYZE pagg_tab;
EXPLAIN (COSTS OFF, settings)
SELECT a, sum(b), avg(b), count(*), max(b) FROM pagg_tab GROUP BY a;


 QUERY PLAN

---
 HashAggregate
   Group Key: pagg_tab.a
   ->  Append
 ->  Seq Scan on pagg_tab_p1 pagg_tab_1
 ->  Seq Scan on pagg_tab_p2 pagg_tab_2
 ->  Seq Scan on pagg_tab_p3 pagg_tab_3
 Settings: enable_partitionwise_aggregate = 'on',
enable_partitionwise_join = 'on', max_parallel_workers_per_gather =
'0', enable_incremental_sort = 'off'


it seems "PARTITION BY LIST(c collate "C");" collation compare with
"GROUP BY a;".
set collation_incompatible returned true.
make it cannot do PARTITIONWISE_AGGREGATE_PARTIAL.

but here "group by a", "a" is text data type, we can still do
PARTITIONWISE_AGGREGATE_PARTIAL
?




Re: Using read stream in autoprewarm

2024-11-01 Thread Nazir Bilal Yavuz
Hi,

Thanks for looking into this!

On Thu, 31 Oct 2024 at 21:18, Andrey M. Borodin  wrote:
>
> > On 8 Aug 2024, at 11:32, Nazir Bilal Yavuz  wrote:
> >
> > Any feedback would be appreciated.
>
> I've taken a look at the patch. It seems to me that you add new block
> numbers to the read stream until you have buffers. So when there are no more
> buffers you will still have some queued blocks.
> Maybe you can change the logic so that the number of free buffers must be
> enough to allocate all blocks in the look-ahead distance?

I see what you mean. When the have_free_buffer() function returns
false in the callback, there are still queued blocks in the stream
although there are no free buffers in the buffer pool. I think the
best way to solve this is to get the number of free buffers in the
buffer pool by 'BufferStrategyControl.lastFreeBuffer -
BufferStrategyControl.firstFreeBuffer' and then compare it with
'stream->pending_read_nblocks'. When 'stream->pending_read_nblocks ==
number_of_free_buffers_in_buffer_pool', end the stream. The problem
with that is that stream->pending_read_nblocks isn't public; also, I am not
sure whether 'BufferStrategyControl.lastFreeBuffer -
BufferStrategyControl.firstFreeBuffer' is safe to use.
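
Just to make the idea concrete, here is a rough sketch; the struct, the
estimate_free_buffers() helper, and the field names are all assumptions made
up for this example, and it is not a working patch:

/*
 * Read-stream block-number callback that stops queueing blocks once the
 * blocks already queued (but not yet pinned by the main loop) would use up
 * the estimated number of free buffers.  The main prewarm loop is assumed
 * to decrement queued_unconsumed each time read_stream_next_buffer()
 * hands it a buffer.
 */
static BlockNumber
apw_stream_next_block(ReadStream *stream,
                      void *callback_private_data,
                      void *per_buffer_data)
{
    ApwStreamState *s = (ApwStreamState *) callback_private_data;

    if (s->pos >= s->nblocks)
        return InvalidBlockNumber;      /* nothing left to prewarm */

    /* Don't queue more blocks than the free buffers that remain. */
    if (s->queued_unconsumed >= estimate_free_buffers())
        return InvalidBlockNumber;      /* end the stream early */

    s->queued_unconsumed++;
    return s->block_info[s->pos++].blocknum;
}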

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: define pg_structiszero(addr, s, r)

2024-11-01 Thread David Rowley
On Fri, 1 Nov 2024 at 20:49, Michael Paquier  wrote:
>
> On Fri, Nov 01, 2024 at 07:44:22AM +, Bertrand Drouvot wrote:
> > Worth adding a comment as to why pg_memory_is_all_zeros() should not
> > be used here?
>
> I would not add one in bufpage.c, documenting that where
> pg_memory_is_all_zeros() is defined may be more adapted.

The thought of having to write a comment to warn people not to use it
for performance-critical things makes me think it might be better just
to write a more optimal version of the function so we don't need to
warn people. I looked around at the callers of the function and saw the
following numbers of bytes being used for the length: 8192 (the one in
question), 88, 32, and 112.

I don't know how performance-critical the final three of those are,
but I imagine all apart from the 32-byte one might be better with a
non-inlined and more optimised version of the function. The problem
with inlining the optimised version is that it's more code to inline.

I've attached what I thought a more optimal version might look like in
case anyone thinks making it better is a good idea.
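
For readers without the attachment, a rough, self-contained sketch of that
kind of optimisation (this is not the attached patch; the helper name and
the byte-wise prologue/epilogue around an aligned word-wise loop are just
illustrative assumptions) could look like this:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if the 'len' bytes starting at 'ptr' are all zero. */
static bool
memory_is_all_zeros(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = p + len;

    /* byte-at-a-time until the pointer is suitably aligned for size_t */
    while (p < end && ((uintptr_t) p % sizeof(size_t)) != 0)
    {
        if (*p++ != 0)
            return false;
    }

    /* word-at-a-time over the aligned middle part */
    while ((size_t) (end - p) >= sizeof(size_t))
    {
        if (*(const size_t *) p != 0)
            return false;
        p += sizeof(size_t);
    }

    /* byte-at-a-time tail */
    while (p < end)
    {
        if (*p++ != 0)
            return false;
    }

    return true;
}

int
main(void)
{
    char        buf[8192] = {0};

    return memory_is_all_zeros(buf, sizeof(buf)) ? 0 : 1;
}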

David


pg_memory_is_all_zeros_speedup.patch
Description: Binary data


Re: Improve the efficiency of _bt_killitems.

2024-11-01 Thread feichanghong


> On Nov 1, 2024, at 16:24, Heikki Linnakangas  wrote:
> 
> On 01/11/2024 09:19, feichanghong wrote:
>> Hi hackers,
>> In the _bt_killitems function, the following logic is present: we search to 
>> the right for an index item that matches the heap TID and attempt to mark it 
>> as dead. If that index item has already been marked as dead by other 
>> concurrent processes, we will continue
>> searching. However, there should not be any more matching index items
>> on the current page.
> Why could there not be more matching items on the page?
> 
> Are you assuming a unique index? Even then it's not right; you can have 
> multiple index entries point to different heap tuples with the same key, as 
> long as they're not visible at the same time. For example, if you UPDATE or 
> DELETE+INSERT a row.

Maybe I didn't describe it clearly. What I meant to say is that there shouldn't
be multiple index items on the current page pointing to the same heap TID
(ctid), rather than to the same index key.

Best Regards,
Fei Changhong



Re: Using read stream in autoprewarm

2024-11-01 Thread Nazir Bilal Yavuz
Hi,

Thanks for looking into this!

On Thu, 31 Oct 2024 at 22:02, Stepan Neretin  wrote:
>
> At first quick look it looks good. I will take a closer look at it
> tomorrow. Could you please let me know about the performance tests and
> graphics?

Sorry, but I didn't understand what you mean by performance tests and
graphics. Do you want something other than the information in the first
email [1]?

[1] 
postgr.es/m/CAN55FZ3n8Gd%2BhajbL%3D5UkGzu_aHGRqnn%2BxktXq2fuds%3D1AOR6Q%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Improving tracking/processing of buildfarm test failures

2024-11-01 Thread Alexander Lakhin

Hello hackers,

Please take a look at the October report on buildfarm failures:
# SELECT br, count(*) FROM failures WHERE dt >= '2024-10-01' AND
 dt < '2024-11-01' GROUP BY br;
REL_12_STABLE: 9
REL_13_STABLE: 9
REL_14_STABLE: 19
REL_15_STABLE: 25
REL_16_STABLE: 12
REL_17_STABLE: 14
master: 109
-- Total: 197
(Counting test failures only, excluding indent-check, Configure, Build
errors.)

# SELECT COUNT(*) FROM (SELECT DISTINCT issue_link FROM failures WHERE
 dt >= '2024-10-01' AND dt < '2024-11-01');
22

# SELECT issue_link, count(*) FROM failures WHERE dt >= '2024-10-01' AND
 dt < '2024-11-01' GROUP BY issue_link ORDER BY 2 DESC LIMIT 6;
https://www.postgresql.org/message-id/d63a3295-cac1-4a8e-9de1-0ebab996d...@eisentraut.org:
 54
-- Fixed

https://www.postgresql.org/message-id/caa4ek1k_ikmjekqsof9ssdau4s8_cu6n15rp13-j4cmskn-...@mail.gmail.com:
 23
-- Fixed

https://www.postgresql.org/message-id/362289.1730241...@sss.pgh.pa.us: 11
-- Will be fixed soon

https://www.postgresql.org/message-id/2480333.1729784...@sss.pgh.pa.us: 6
-- Fixed

https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986b...@gmail.com:
 5

https://www.postgresql.org/message-id/c638873f-8b1e-4770-ba49-5a0b3e140...@iki.fi:
 4
-- Fixed

# SELECT count(*) FROM failures WHERE dt >= '2024-10-01' AND
 dt < '2024-11-01' AND issue_link IS NULL; -- Unsorted/unhelpful failures
74

Short-lived failures:
165 (+ 11 from 362289.1730241...@sss.pgh.pa.us)

Best regards,
Alexander




Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.

2024-11-01 Thread Amit Langote
On Fri, Nov 1, 2024 at 5:08 PM jian he  wrote:
> just a quick reply while testing v4-0001.
> tests copy from src/test/regress/sql/partition_aggregate.sql first 40 lines.
>
> drop table if exists pagg_tab;
> CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY
> LIST(c collate "C");
> CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000',
> '0001', '0002', '0003', '0004');
> CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005',
> '0006', '0007', '0008');
> CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009',
> '0010', '0011');
> INSERT INTO pagg_tab SELECT (i % 20), (i % 30), to_char(i % 12,
> 'FM0000'), i % 30 FROM generate_series(0, 2999) i;
> ANALYZE pagg_tab;
>
> EXPLAIN (COSTS OFF, settings)
> SELECT a, sum(b), avg(b), count(*), max(b) FROM pagg_tab GROUP BY a;
>
>
>  QUERY PLAN
>
> ---
>  Finalize HashAggregate
>Group Key: pagg_tab.a
>->  Append
>  ->  Partial HashAggregate
>Group Key: pagg_tab.a
>->  Seq Scan on pagg_tab_p1 pagg_tab
>  ->  Partial HashAggregate
>Group Key: pagg_tab_1.a
>->  Seq Scan on pagg_tab_p2 pagg_tab_1
>  ->  Partial HashAggregate
>Group Key: pagg_tab_2.a
>->  Seq Scan on pagg_tab_p3 pagg_tab_2
>  Settings: enable_partitionwise_aggregate = 'on',
> enable_partitionwise_join = 'on', max_parallel_workers_per_gather =
> '0', enable_incremental_sort = 'off'
>
>
>
> drop table if exists pagg_tab;
> CREATE TABLE pagg_tab (a text, b int, c text, d int) PARTITION BY
> LIST(c collate "C");
> CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000',
> '0001', '0002', '0003', '0004');
> CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005',
> '0006', '0007', '0008');
> CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009',
> '0010', '0011');
> INSERT INTO pagg_tab SELECT (i % 20)::text, (i % 30), to_char(i % 12,
> 'FM0000'), i % 30 FROM generate_series(0, 2999) i;
> ANALYZE pagg_tab;
> EXPLAIN (COSTS OFF, settings)
> SELECT a, sum(b), avg(b), count(*), max(b) FROM pagg_tab GROUP BY a;
>
>
>  QUERY PLAN
>
> ---
>  HashAggregate
>Group Key: pagg_tab.a
>->  Append
>  ->  Seq Scan on pagg_tab_p1 pagg_tab_1
>  ->  Seq Scan on pagg_tab_p2 pagg_tab_2
>  ->  Seq Scan on pagg_tab_p3 pagg_tab_3
>  Settings: enable_partitionwise_aggregate = 'on',
> enable_partitionwise_join = 'on', max_parallel_workers_per_gather =
> '0', enable_incremental_sort = 'off'
>
>
> it seems "PARTITION BY LIST(c collate "C");" collation compare with
> "GROUP BY a;".
> set collation_incompatible returned true.
> make it cannot do PARTITIONWISE_AGGREGATE_PARTIAL.
>
> but here "group by a", "a" is text data type, we can still do
> PARTITIONWISE_AGGREGATE_PARTIAL
> ?

Good catch.  Looks like I added a bug in group_by_has_partkey() --
collation_incompatible should be set only when a grouping expression
matches a partition key.

-- 
Thanks, Amit Langote


v5-0002-Disallow-partitionwise-join-when-collation-doesn-.patch
Description: Binary data


v5-0001-Disallow-partitionwise-grouping-when-collation-do.patch
Description: Binary data


Adding NetBSD and OpenBSD to Postgres CI

2024-11-01 Thread Nazir Bilal Yavuz
Hi,

NetBSD and OpenBSD Postgres CI images are generated [1] but their tasks are
not added to the upstream Postgres yet. The attached patch adds NetBSD and
OpenBSD tasks to the Postgres CI.

I made these tasks manually triggered, like the MinGW task, to save CI credits,
but the related line is commented out for now so that CFBot triggers them.

CPU: 2 and TEST_JOBS: 8 are chosen based on manual tests.

╔══════════════════════╦════════╦═════════╗
║      CI Run Time     ║        ║         ║
║   (Only Test Step)   ║ NetBSD ║ OpenBSD ║
║ (in minutes:seconds) ║        ║         ║
╠══════════════════════╬════════╬═════════╣
║ CPU: 2, TJ: 4        ║  13:18 ║  17:07  ║
╠══════════════════════╬════════╬═════════╣
║ CPU: 2, TJ: 6        ║  11:01 ║  16:23  ║
╠══════════════════════╬════════╬═════════╣
║ CPU: 2, TJ: 8        ║  10:14 ║  15:41  ║
╠══════════════════════╬════════╬═════════╣
║ CPU: 4, TJ: 4        ║  11:46 ║  16:03  ║
╠══════════════════════╬════════╬═════════╣
║ CPU: 4, TJ: 6        ║  09:56 ║  14:59  ║
╠══════════════════════╬════════╬═════════╣
║ CPU: 4, TJ: 8        ║  10:02 ║  15:09  ║
╚══════════════════════╩════════╩═════════╝

Any kind of feedback would be appreciated.

[1] https://github.com/anarazel/pg-vm-images

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 18acef23ac64ec8906d05d19f47f097c96ef3b45 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 1 Nov 2024 12:02:29 +0300
Subject: [PATCH v1] Add NetBSD and OpenBSD tasks to the Postgres CI

NetBSD and OpenBSD Postgres CI images are generated [1] but their tasks
are not added to the upstream Postgres yet. This patch adds them.

Note: These tasks will be triggered manually to save CI credits but a
related line is commented out for now to trigger CFBot.

[1] https://github.com/anarazel/pg-vm-images
---
 .cirrus.tasks.yml | 82 +++
 .cirrus.yml   | 10 ++
 2 files changed, 92 insertions(+)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 90cb95c8681..6c57584bdf9 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -214,6 +214,88 @@ task:
 cores_script: src/tools/ci/cores_backtrace.sh freebsd /tmp/cores
 
 
+task:
+  depends_on: SanityCheck
+  # trigger_type: manual
+
+  env:
+# Below are experimentally derived to be a decent choice.
+CPUS: 2
+BUILD_JOBS: 8
+TEST_JOBS: 8
+
+CIRRUS_WORKING_DIR: /home/postgres/postgres
+CCACHE_DIR: /tmp/ccache_dir
+
+PATH: /usr/sbin:$PATH
+
+# Postgres interprets LANG as a 'en_US.UTF-8' but it is 'C', then
+# Postgres tries to set 'LC_COLLATE' to 'en_US.UTF-8' but it is not
+# changeable. Initdb fails because of that. So, LANG is forced to be 'C'.
+LANG: "C"
+LC_ALL: "C"
+
+  matrix:
+- name: NetBSD - 10 - Meson
+  only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*netbsd.*'
+  env:
+IMAGE_FAMILY: pg-ci-netbsd-postgres
+INCLUDE_DIRS: -Dextra_lib_dirs=/usr/pkg/lib -Dextra_include_dirs=/usr/pkg/include
+  <<: *netbsd_task_template
+
+- name: OpenBSD - 7 - Meson
+  only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*openbsd.*'
+  env:
+IMAGE_FAMILY: pg-ci-openbsd-postgres
+INCLUDE_DIRS: -Dextra_include_dirs=/usr/local/include -Dextra_lib_dirs=/usr/local/lib
+UUID: -Duuid=e2fs
+  <<: *openbsd_task_template
+
+  sysinfo_script: |
+locale
+id
+uname -a
+ulimit -a -H && ulimit -a -S
+env
+
+  ccache_cache:
+folder: $CCACHE_DIR
+
+  create_user_script: |
+useradd postgres
+chown -R postgres:users /home/postgres
+mkdir -p ${CCACHE_DIR}
+chown -R postgres:users ${CCACHE_DIR}
+
+  # -Duuid=bsd is not set since 'bsd' uuid option
+  # is not working on NetBSD & OpenBSD. See
+  # https://www.postgresql.org/message-id/17358-89806e7420797...@postgresql.org
+  # And other uuid options are not available on NetBSD.
+  configure_script: |
+su postgres <<-EOF
+  meson setup \
+--buildtype debug \
+-Dcassert=true -Dssl=openssl ${UUID} \
+-DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
+${INCLUDE_DIRS} \
+build
+EOF
+
+  build_script: su postgres -c 'ninja -C build -j${BUILD_JOBS}'
+  upload_caches: ccache
+
+  test_world_script: |
+su postgres <<-EOF
+  ulimit -c unlimited
+  # Otherwise tests will fail on OpenBSD, due to the lack of enough processes.
+  ulimit -p 256
+  meson test $MTEST_ARGS --num-processes ${TEST_JOBS}
+EOF
+
+  on_failure:
+<<: *on_failure_meson
+
+
 # configure feature flags, shared between the task running the linux tests and
 # the CompilerWarnings task
 LINUX_CONFIGURE_FEATURES: &LINUX_CONFIGURE_FEATURES >-
diff --git a/.cirrus.yml b/.cirrus.yml
index a83129ae46d..33c6e481d74 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -52,6 +52,16 @@ default_freebsd_task_template: &freebsd_task_template
 PLATFORM: freebsd
   <<: *cirrus

Re: New function normal_rand_array function to contrib/tablefunc.

2024-11-01 Thread Dean Rasheed
On Wed, 16 Oct 2024 at 08:43, Andy Fan  wrote:
>
>  Thanks for the detailed feedback! Here is the rebased version.
>

I took another look at this and I think it's in reasonable shape.

I'm attaching an update, rebasing it on top of 9be4e5d293.

Also it was missing a required update to the meson.build file --
that's the immediate cause of the other cfbot failures.

The rest is just cosmetic tidying up, fixing indentation, tweaking
comments, and the like. I also hacked on the docs a bit -- the
synopsis only listed one of the new function signatures for some
reason. After fixing that, I think it's sufficient to just list one
usage example.

Regards,
Dean
From 2a14be071dd2e721e768fdbaa57b096d509773aa Mon Sep 17 00:00:00 2001
From: Dean Rasheed 
Date: Wed, 30 Oct 2024 08:41:41 +
Subject: [PATCH v3] tablefunc: Add rand_array() functions.

These functions return sets of random-length arrays, containing
uniformly distributed random values of type integer, bigint, or
numeric.

Andy Fan, reviewed by Jim Jones, Japin Li, and Dean Rasheed.

Discussion: https://postgr.es/m/87plssezpc@163.com
---
 contrib/tablefunc/Makefile|   2 +-
 contrib/tablefunc/expected/tablefunc.out  |  66 
 contrib/tablefunc/meson.build |   1 +
 contrib/tablefunc/sql/tablefunc.sql   |  16 ++
 contrib/tablefunc/tablefunc--1.0--1.1.sql |  19 +++
 contrib/tablefunc/tablefunc.c | 197 ++
 contrib/tablefunc/tablefunc.control   |   2 +-
 doc/src/sgml/tablefunc.sgml   |  66 
 src/backend/utils/adt/arrayfuncs.c|   1 +
 src/tools/pgindent/typedefs.list  |   1 +
 10 files changed, 369 insertions(+), 2 deletions(-)
 create mode 100644 contrib/tablefunc/tablefunc--1.0--1.1.sql

diff --git a/contrib/tablefunc/Makefile b/contrib/tablefunc/Makefile
index 191a3a1d38..f0c67308fd 100644
--- a/contrib/tablefunc/Makefile
+++ b/contrib/tablefunc/Makefile
@@ -3,7 +3,7 @@
 MODULES = tablefunc
 
 EXTENSION = tablefunc
-DATA = tablefunc--1.0.sql
+DATA = tablefunc--1.0.sql tablefunc--1.0--1.1.sql
 PGFILEDESC = "tablefunc - various functions that return tables"
 
 REGRESS = tablefunc
diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index ddece79029..9c3f7529d5 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -13,6 +13,72 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
 SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 ERROR:  number of rows cannot be negative
 --
+-- rand_array()
+-- use setseed() to get stable results
+--
+SELECT setseed(0);
+ setseed 
+-
+ 
+(1 row)
+
+SELECT pg_typeof(rand_array), * FROM rand_array(10, 0, 3, 50::int, 80::int);
+ pg_typeof | rand_array 
+---+
+ integer[] | {63,71,66}
+ integer[] | {64}
+ integer[] | {54}
+ integer[] | {72,64,60}
+ integer[] | {75}
+ integer[] | {53,73}
+ integer[] | {69}
+ integer[] | {74,67}
+ integer[] | {65}
+ integer[] | {73}
+(10 rows)
+
+SELECT pg_typeof(rand_array), * FROM rand_array(10, 0, 3, 50::bigint, 80::bigint);
+ pg_typeof | rand_array 
+---+
+ bigint[]  | {}
+ bigint[]  | {59,53}
+ bigint[]  | {72}
+ bigint[]  | {80,79}
+ bigint[]  | {71}
+ bigint[]  | {80,80}
+ bigint[]  | {61,64}
+ bigint[]  | {62,76,80}
+ bigint[]  | {}
+ bigint[]  | {}
+(10 rows)
+
+SELECT pg_typeof(rand_array), * FROM rand_array(10, 0, 3, 50::numeric, 80::numeric);
+ pg_typeof | rand_array 
+---+
+ numeric[] | {77,66,73}
+ numeric[] | {51,53}
+ numeric[] | {65,54}
+ numeric[] | {72}
+ numeric[] | {55,72}
+ numeric[] | {}
+ numeric[] | {55,70,64}
+ numeric[] | {75}
+ numeric[] | {}
+ numeric[] | {}
+(10 rows)
+
+-- negative number of tuples
+SELECT pg_typeof(rand_array), * FROM rand_array(-1, 0, 3, 50::int, 80::int);
+ERROR:  number of rows cannot be negative
+-- invalid length bounds
+SELECT pg_typeof(rand_array), * FROM rand_array(10, -1, 3, 50::int, 80::int);
+ERROR:  minlen must be greater than or equal to zero
+SELECT pg_typeof(rand_array), * FROM rand_array(10, 3, 0, 50::int, 80::int);
+ERROR:  maxlen must be greater than or equal to minlen
+-- invalid value bounds
+SELECT pg_typeof(rand_array), * FROM rand_array(10, 0, 3, 80::int, 50::int);
+ERROR:  lower bound must be less than or equal to upper bound
+--
 -- crosstab()
 --
 CREATE TABLE ct(id int, rowclass text, rowid text, attribute text, val text);
diff --git a/contrib/tablefunc/meson.build b/contrib/tablefunc/meson.build
index dccf3b3758..f794163ecc 100644
--- a/contrib/tablefunc/meson.build
+++ b/contrib/tablefunc/meson.build
@@ -18,6 +18,7 @@ contrib_targets += tablefunc
 
 install_data(
   'tablefunc--1.0.sql',
+  'tablefunc--1.0--1.1.sql',
   'tablefunc.control',
   kwargs: contrib_data_args,
 )
diff --git a/contrib/tablefunc/sql/tablefunc.sql b/contrib/tablefunc/sql/tablefunc.sql
index 0fb8e40de2..f83ccb7ec4 100644
--- a/contrib/tablef

Re: Improve the efficiency of _bt_killitems.

2024-11-01 Thread Heikki Linnakangas

On 01/11/2024 09:19, feichanghong wrote:

> Hi hackers,
>
> In the _bt_killitems function, the following logic is present: we search
> to the right for an index item that matches the heap TID and attempt to
> mark it as dead. If that index item has already been marked as dead by
> other concurrent processes, we will continue searching. However, there
> should not be any more matching index items on the current page.

Why could there not be more matching items on the page?

Are you assuming a unique index? Even then it's not right; you can have
multiple index entries pointing to different heap tuples with the same key,
as long as they're not visible at the same time. For example, if you
UPDATE or DELETE+INSERT a row.


--
Heikki Linnakangas
Neon (https://neon.tech)





Add missing tab completion for ALTER TABLE ADD COLUMN IF NOT EXISTS

2024-11-01 Thread Kirill Reshke
Hi hackers!

I have been working on cloudberry/greenplum tab completion features,
and I spotted that the postgres version of psql does not tab-complete
the ALTER TABLE ... ADD COLUMN IF NOT EXISTS pattern.
So, I propose supporting this.

-- 
Best regards,
Kirill Reshke


v1-0001-Add-missing-psql-tab-completion-for-som-ALTER-TAB.patch
Description: Binary data


Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.

2024-11-01 Thread Amit Langote
Hi,

On Fri, Nov 1, 2024 at 11:39 AM Tender Wang  wrote:
> Amit Langote wrote on Thu, Oct 31, 2024 at 21:09:
>> On Wed, Oct 30, 2024 at 9:36 PM Junwang Zhao  wrote:
>> > On Wed, Oct 30, 2024 at 11:58 AM jian he  
>> > wrote:
> In have_partkey_equi_join()
> ...
> if (exprs_known_equal(root, expr1, expr2, btree_opfamily))
> {
>  Oid partcoll2 = rel1->part_scheme->partcollation[ipk];
>  
> I think we should use rel2 here, not rel1.

Hmm, yeah, this should be fixed.  Though, this is not a problem
because both rel1 and rel2 would be pointing to the same
PartitionScheme, because otherwise we wouldn't be in
have_partkey_equi_join().  See this in build_joinrel_partition_info():

/*
 * We can only consider this join as an input to further partitionwise
 * joins if (a) the input relations are partitioned and have
 * consider_partitionwise_join=true, (b) the partition schemes match, and
 * (c) we can identify an equi-join between the partition keys.  Note that
 * if it were possible for have_partkey_equi_join to return different
 * answers for the same joinrel depending on which join ordering we try
 * first, this logic would break.  That shouldn't happen, though, because
 * of the way the query planner deduces implied equalities and reorders
 * the joins.  Please see optimizer/README for details.
 */
if (outer_rel->part_scheme == NULL || inner_rel->part_scheme == NULL ||
!outer_rel->consider_partitionwise_join ||
!inner_rel->consider_partitionwise_join ||
outer_rel->part_scheme != inner_rel->part_scheme ||
!have_partkey_equi_join(root, joinrel, outer_rel, inner_rel,
sjinfo->jointype, restrictlist))
{
Assert(!IS_PARTITIONED_REL(joinrel));
return;
}

I've changed the condition to match only partcoll1 and exprcoll1, and
if they match, Assert that partcoll2 and exprcoll2 match too.  That's
because partcoll1 and partcoll2 would be the same as they are from the
same PartitionScheme and expr1 and expr2 are known equal() so their
collations should match too.

-- 
Thanks, Amit Langote




Improve the efficiency of _bt_killitems.

2024-11-01 Thread feichanghong
Hi hackers,

In the _bt_killitems function, the following logic is present: we search to the
right for an index item that matches the heap TID and attempt to mark it as
dead. If that index item has already been marked as dead by other concurrent
processes, we will continue searching. However, there should not be any more
matching index items on the current page.

```
while (offnum <= maxoff)
{
...
/*
* Mark index item as dead, if it isn't already.  Since this
* happens while holding a buffer lock possibly in shared mode,
* it's possible that multiple processes attempt to do this
* simultaneously, leading to multiple full-page images being 
sent
* to WAL (if wal_log_hints or data checksums are enabled), which
* is undesirable.
*/
if (killtuple && !ItemIdIsDead(iid))
{
/* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break;  /* out of inner search loop */
}
offnum = OffsetNumberNext(offnum);
}
```

Perhaps we should exit the current loop immediately when killtuple is set,
stopping the search to the right. This could improve efficiency, especially
when the index item to be killed is the first one on the current page:

```
diff --git a/src/backend/access/nbtree/nbtutils.c 
b/src/backend/access/nbtree/nbtutils.c
index b4ba51357a5..529a3083165 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -4298,11 +4298,14 @@ _bt_killitems(IndexScanDesc scan)
  * to WAL (if wal_log_hints or data checksums are enabled), which
  * is undesirable.
  */
-if (killtuple && !ItemIdIsDead(iid))
+if (killtuple)
 {
-/* found the item/all posting list items */
-ItemIdMarkDead(iid);
-killedsomething = true;
+if (!ItemIdIsDead(iid))
+{
+/* found the item/all posting list items */
+ItemIdMarkDead(iid);
+killedsomething = true;
+}
 break;  /* out of inner search loop */
 }
 offnum = OffsetNumberNext(offnum);
```


Best Regards,
Fei Changhong



Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.

2024-11-01 Thread Amit Langote
Hi,

On Fri, Nov 1, 2024 at 2:39 PM jian he  wrote:
> On Thu, Oct 31, 2024 at 9:09 PM Amit Langote  wrote:
> >
> >
> > I think we should insist that the join key collation and the partition
> > collation are exactly the same and refuse to match them if they are
> > not.
> >
> > +   {
> > +   Oid colloid =  exprCollation((Node *) expr);
> > +
> > +   if ((partcoll != colloid) &&
> > +OidIsValid(colloid) &&
> > +!get_collation_isdeterministic(colloid))
> > +   *coll_incompatiable = true;
> >
> > I am not quite sure what is the point of checking whether or not the
> > expression collation is deterministic after confirming that it's not
> > the same as partcoll.
> >
> > Attached 0002 is what I came up with.  One thing that's different from
> > what Jian proposed is that match_expr_to_partition_keys() returns -1
> > (expr not matched to any key) when the collation is also not matched
> > instead of using a separate output parameter for that.
> >
> I was thinking that
> CREATE TABLE part_tab (c text collate "POSIX") PARTITION BY LIST(c collate "C");
> could maybe still do a partitionwise join.
> Requiring the join key collation and the partition key collation to be the
> same would sure make things easy.

I think that's maybe ok to do as a new feature (use partitionwise join
even if collations differ but are both deterministic?), but we should
take a more restrictive approach in a bug fix that is to be
back-patched.

> about 0002.
> Similar to PartCollMatchesExprColl in match_clause_to_partition_key
> I think we can simply do the following:
> no need to hack match_expr_to_partition_keys.
>
> @@ -2181,6 +2181,9 @@ have_partkey_equi_join(PlannerInfo *root,
> RelOptInfo *joinrel,
> if (ipk1 != ipk2)
> continue;
>
> +   if (rel1->part_scheme->partcollation[ipk1] !=
> opexpr->inputcollid)
> +   return false;

Yes, looks like that should be enough, thanks.

I've updated the patch.  I've added another test case to test the new
collation matching code in the code block of have_partkey_equi_join()
that pairs partition keys via equivalence class.

Adding Ashutosh to cc, as the original author of this code, to get his
thoughts on these fixes.

-- 
Thanks, Amit Langote


v4-0002-Disallow-partitionwise-join-when-collation-doesn-.patch
Description: Binary data


v4-0001-Disallow-partitionwise-grouping-when-collation-do.patch
Description: Binary data


Re: Improve the efficiency of _bt_killitems.

2024-11-01 Thread Heikki Linnakangas

On 01/11/2024 10:41, feichanghong wrote:




> On Nov 1, 2024, at 16:24, Heikki Linnakangas  wrote:
>> On 01/11/2024 09:19, feichanghong wrote:
>>> Hi hackers,
>>> In the _bt_killitems function, the following logic is present: we
>>> search to the right for an index item that matches the heap TID and
>>> attempt to mark it as dead. If that index item has already been marked
>>> as dead by other concurrent processes, we will continue searching.
>>> However, there should not be any more matching index items on the
>>> current page.
>>
>> Why could there not be more matching items on the page?
>>
>> Are you assuming a unique index? Even then it's not right; you can
>> have multiple index entries pointing to different heap tuples with the
>> same key, as long as they're not visible at the same time. For
>> example, if you UPDATE or DELETE+INSERT a row.
>
> Maybe I didn't describe it clearly. What I meant to say is that there
> shouldn't be multiple index items on the current page pointing to the
> same heap TID (ctid), rather than to the same index key.


Ah, gotcha. That makes sense.

There's more we could do here in a similar vein:

If the TID was found in a posting list, but not all of the items were 
dead, 'killtuple' is set to false so we will still continue the search.


Then there's this comment:


/*
 * Don't bother advancing the outermost loop's int iterator to
 * avoid processing killed items that relate to the same
 * offnum/posting list tuple.  This micro-optimization hardly
 * seems worth it.  (Further iterations of the outermost loop
 * will fail to match on this same posting list's first heap
 * TID instead, so we'll advance to the next offnum/index
 * tuple pretty quickly.)
 */


I think we should do that micro-optimization. It's a single line of code
and a single instruction to advance the 'i' variable, AFAICS. While I agree
with the comment that it doesn't matter much performance-wise, it seems
simpler to just do it than to explain why we don't.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Adding NetBSD and OpenBSD to Postgres CI

2024-11-01 Thread Peter Eisentraut

On 01.11.24 10:17, Nazir Bilal Yavuz wrote:
> NetBSD and OpenBSD Postgres CI images are generated [1] but their tasks
> are not added to the upstream Postgres yet. The attached patch adds
> NetBSD and OpenBSD tasks to the Postgres CI.
>
> I made these tasks manually triggered, like the MinGW task, to save CI
> credits, but the related line is commented out for now so that CFBot
> triggers them.


This seems useful to me.  It would add some more testability for 
LibreSSL for example.


I noticed that neither the existing FreeBSD task nor the new OpenBSD one
finds the bsd_auth.h header.  I thought this would be good for getting more
testing of that code.  Do you know why that is?






Proposal to remove message length constant from F/B protocol document

2024-11-01 Thread Tatsuo Ishii
Currently we document the message length as a constant like
Int32(5) for some message types in the frontend/backend protocol
document [1].

ReadyForQuery (B)

Byte1('Z')

Identifies the message type. ReadyForQuery is sent whenever the backend 
is ready for a new query cycle.
Int32(5)

Length of message contents in bytes, including self.
:
:

I think this may mislead some implementers of the protocol, i.e., they
may write code like this to save the work of interpreting the message
length field (this is backend-side code):

if (message_char == 'Z')
{
read_rest_of_message(5);
:
:
}

instead of:

if (message_char == 'Z')
{
read_message_length_field;
convert_it_to_host_order;
read_rest_of_message(n);
:
:
}

I think the former code is not good because we might change the
message length in the future when protocol extensions are implemented
for this particular message type, and the message length may become a
different length or even variable-length. So my proposal is changing:

Int32(5)

to:
Int32
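
For what it's worth, here is a rough sketch in plain C of the generic,
length-driven way of reading a message; read_n() is just an assumed helper
that loops over read(), and this is only an illustration, not code from any
existing driver:

#include <arpa/inet.h>      /* ntohl */
#include <stdint.h>
#include <unistd.h>

/* Read exactly 'len' bytes from 'fd' into 'buf'; returns 0 on success. */
int
read_n(int fd, void *buf, size_t len)
{
    char   *p = buf;

    while (len > 0)
    {
        ssize_t n = read(fd, p, len);

        if (n <= 0)
            return -1;          /* error or unexpected EOF */
        p += n;
        len -= (size_t) n;
    }
    return 0;
}

/*
 * Read one message generically: the type byte, then the Int32 length
 * (which includes itself), then the remaining payload.  No assumption is
 * made about the length being a constant such as 5.
 */
int
read_message(int fd, char *type, char *payload, size_t payload_size)
{
    uint32_t    netlen;
    uint32_t    len;

    if (read_n(fd, type, 1) != 0)
        return -1;
    if (read_n(fd, &netlen, 4) != 0)
        return -1;
    len = ntohl(netlen);
    if (len < 4 || len - 4 > payload_size)
        return -1;              /* malformed or larger than caller's buffer */
    return read_n(fd, payload, len - 4);
}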

In the same sense we may want to change the sentence at the very
beginning of the section, particularly this: "the message format is
defined so that the message end can be found without reference to the
byte count".

  Notice that although each message includes a byte count at the
  beginning, the message format is defined so that the message end can
  be found without reference to the byte count. This aids validity
  checking. (The CopyData message is an exception, because it forms
  part of a data stream; the contents of any individual CopyData
  message cannot be interpretable on their own.)

[1] https://www.postgresql.org/docs/current/protocol-message-formats.html

What do you think?

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Time to add a Git .mailmap?

2024-11-01 Thread Alvaro Herrera
On 2024-Oct-31, Daniel Gustafsson wrote:

> When looking at our Git tree for a recent conference presentation I happened 
> to
> notice that we have recently gained duplicate names in the shortlog.  Not sure
> if we care enough to fix that with a .mailmap, but if we do the attached diff
> makes sure that all commits are accounted for a single committer entry.

LGTM.  I'd also add this line while at it:

Peter Eisentraut  

This takes care of all the duplicate "identities" in the history AFAICT.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
 no me acuerdo." (Augusto Pinochet a una corte de justicia)