Re: datfrozenxid > relfrozenxid w/ crash before XLOG_HEAP_INPLACE
> On 20 Jun 2024, at 06:29, Noah Misch wrote:
> > This resolves the last inplace update defect known to me.

That’s a huge amount of work, thank you! Do I get it right that inplace updates are catalog-specific, and that the other OOM corruptions [0] and standby corruptions [1] are not related to this fix? Both cases we observed on regular tables. Or might that be an effect of vacuum deepening the corruption after observing a wrong datfrozenxid?

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47
[1] https://www.postgresql.org/message-id/flat/CAFj8pRBEFMxxFSCVOSi-4n0jHzSaxh6Ze_cZid5eG%3Dtsnn49-A%40mail.gmail.com
Re: Pgoutput not capturing the generated columns
On Wed, 19 Jun 2024 at 21:43, Peter Eisentraut wrote:
>
> On 19.06.24 13:22, Shubham Khanna wrote:
> > All the comments are handled.
> >
> > The attached Patch contains all the suggested changes.
>
> Please also take a look at the proposed patch for virtual generated
> columns [0] and consider how that would affect your patch. I think your
> feature can only replicate *stored* generated columns. So perhaps the
> documentation and terminology in your patch should reflect that.

This patch cannot handle virtual generated columns because it stores NULL values for them. Besides the documentation, the generated initial sync command should also be changed to sync data only for stored generated columns, omitting virtual ones. I suggest treating these changes as a separate patch (0003) for future merging or as a separate commit, depending on the order of patch acceptance.

Regards, Vignesh
Re: Remove distprep
On 16.06.24 21:34, Noah Misch wrote:
> On Thu, Oct 05, 2023 at 05:46:46PM +0200, Peter Eisentraut wrote:
> > --- a/src/backend/Makefile
> > +++ b/src/backend/Makefile
> > $(top_builddir)/src/include/storage/lwlocknames.h: storage/lmgr/lwlocknames.h
> > -	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
> > -	  cd '$(dir $@)' && rm -f $(notdir $@) && \
> > -	  $(LN_S) "$$prereqdir/$(notdir $<)" .
> > +	rm -f '$@'
> > +	$(LN_S) ../../backend/$< '$@'
> >
> > $(top_builddir)/src/include/utils/wait_event_types.h: utils/activity/wait_event_types.h
> > -	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
> > -	  cd '$(dir $@)' && rm -f $(notdir $@) && \
> > -	  $(LN_S) "$$prereqdir/$(notdir $<)" .
> > +	rm -f '$@'
> > +	$(LN_S) ../../backend/$< '$@'
>
> These broke the https://www.postgresql.org/docs/17/installation-platform-notes.html#INSTALLATION-NOTES-MINGW build, where LN_S='cp -pR'. On other platforms, "make LN_S='cp -pR'" reproduces this. Reverting the above lines fixes things. The buildfarm has no coverage for that build scenario (fairywren uses Meson).

Is it just these two instances? Commit 721856ff24b contains a few more hunks that change something about LN_S. Are those ok?
Re: ON ERROR in json_query and the like
Hi,

On Mon, Jun 17, 2024 at 9:47 PM Chapman Flack wrote:
> On 06/17/24 02:20, Amit Langote wrote:
> >>> Apparently, the functions expect JSONB so that a cast is implied
> >>> when providing TEXT. However, the errors during that cast are
> >>> not subject to the ON ERROR clause.
> >>>
> >>> 17beta1=# SELECT JSON_QUERY('invalid', '$' NULL ON ERROR);
> >>> ERROR: invalid input syntax for type json
> >>> DETAIL: Token "invalid" is invalid.
> >>> CONTEXT: JSON data, line 1: invalid
> >>>
> >>> Oracle DB and Db2 (LUW) both return NULL in that case.
>
> I wonder, could prosupport rewriting be used to detect that the first
> argument is supplied by a cast, and rewrite the expression to apply the
> cast 'softly'? Or would that behavior be too magical?

I don't think prosupport rewriting can be used, because JSON_QUERY() is not an ordinary function call that a support function could be attached to. We could possibly use "runtime coercion" for context_item so that the coercion errors can be "caught", which is how we coerce the jsonpath result to the RETURNING type. For now, I'm inclined to simply document the limitation that errors when coercing string arguments to json are always thrown.

--
Thanks, Amit Langote
Re: suspicious valgrind reports about radixtree/tidstore on arm64
On Thu, Jun 20, 2024 at 8:12 AM Masahiko Sawada wrote:
> On Thu, Jun 20, 2024 at 7:54 AM Tom Lane wrote:
>
> I agree with radixtree-fix-proposed.patch. Even if we zero more fields
> in the node it would not add noticeable overheads.

+1 in general, although I'm slightly concerned about this part:

> > (The RT_NODE_48 case could be optimized a bit if we cared
> > to swap the order of its slot_idxs[] and isset[] arrays;
> > then the initial zeroing could just go up to slot_idxs[].

-		memset(n48->isset, 0, sizeof(n48->isset));
+		memset(n48, 0, offsetof(RT_NODE_48, children));
 		memset(n48->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n48->slot_idxs));

I was a bit surprised that neither gcc 14 nor clang 18 can figure out that they can skip zeroing the slot index array since it's later filled in with "invalid index", so they actually zero out 272 bytes before re-initializing 256 of those bytes. It may not matter in practice, but it's also not free, and trivial to avoid.

> > I don't know if there's any reason why the current order
> > is preferable.)
>
> IIUC there is no particular reason for the current order in RT_NODE_48.

Yeah. I found that simply swapping them enables clang to avoid double-initialization, but gcc still can't figure it out and must be told to stop at slot_idxs[]. I'd prefer to do it that way and document that slot_idxs is purposefully the last member of the fixed part of the struct. If that's agreeable I'll commit it that way tomorrow unless someone beats me to it.
Re: suspicious valgrind reports about radixtree/tidstore on arm64
John Naylor writes: > On Thu, Jun 20, 2024 at 8:12 AM Masahiko Sawada wrote: >> IIUC there is no particular reason for the current order in RT_NODE_48. > Yeah. I found that simply swapping them enables clang to avoid > double-initialization, but gcc still can't figure it out and must be > told to stop at slot_idxs[]. I'd prefer to do it that way and document > that slot_idxs is purposefully the last member of the fixed part of > the struct. WFM. > If that's agreeable I'll commit it that way tomorrow > unless someone beats me to it. I was going to push it, but feel free. regards, tom lane
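To make the proposed layout concrete, here is a minimal, self-contained sketch (the names below are simplified stand-ins, not the actual RT_NODE_48 definition from the radixtree.h template): with slot_idxs[] kept as the last member of the fixed part, the initial zeroing stops at offsetof(..., slot_idxs), and the 256 slot-index bytes are written only once, with the invalid marker.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT   256		/* stand-in for the node's slot limit */
#define INVALID_SLOT 0xFF		/* stand-in for RT_INVALID_SLOT_IDX */

typedef struct node48_sketch
{
	uint64_t	isset[SLOT_COUNT / 64];	/* which child slots are in use */
	uint8_t		slot_idxs[SLOT_COUNT];	/* deliberately last in the fixed part */
	/* the variable-size children[] array would follow here */
} node48_sketch;

static inline void
node48_init(node48_sketch *n48)
{
	/* zero only the members that precede slot_idxs ... */
	memset(n48, 0, offsetof(node48_sketch, slot_idxs));
	/* ... then mark every slot index invalid, with no redundant zeroing */
	memset(n48->slot_idxs, INVALID_SLOT, sizeof(n48->slot_idxs));
}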
Re: [HACKERS] make async slave to wait for lsn to be replayed
Hi,

I looked through the patch and have some comments.

== L68:
+Recovery Procedures

It looks somewhat confusing and appears as if the section is intended to explain how to perform recovery. Since this is the first built-in procedure, I'm not sure how this section should be written. However, the section immediately above is named "Recovery Control Functions", so "Recovery Synchronization Functions" would align better with the naming of the upper section. (I don't believe we need to be so strict about the distinction between functions and procedures here.)

It looks strange that the procedure signature includes the return type.

== L93:
+If timeout is not specified or zero, this
+procedure returns once WAL is replayed upto
+target_lsn.
+If the timeout is specified (in
+milliseconds) and greater than zero, the procedure waits until the
+server actually replays the WAL upto target_lsn or
+until the given time has passed. On timeout, an error is emitted.

The first sentence should mention the main functionality. Following precedents, it might be better to use something like "Waits until recovery surpasses the specified LSN. If no timeout is specified or it is set to zero, this procedure waits indefinitely for the LSN. If the timeout is specified (in milliseconds) and is greater than zero, the procedure waits until the LSN is reached or the specified time has elapsed. On timeout, or if the server is promoted before the LSN is reached, an error is emitted."

The detailed explanation that follows the above seems somewhat too verbose to me, as other functions don't have such detailed examples.

== L484
/*
+ * Set latches for processes, whose waited LSNs are already replayed. This
+ * involves spinlocks. So, we shouldn't do this under a spinlock.
+ */

Here, I'm not quite sure what specifically spinlock (or mutex?) is referring to. However, more importantly, shouldn't we explain that it is okay not to use any locks at all, rather than mentioning that spinlocks should not be used here? I found a similar case around freelist.c:238, which is written as follows.

> * Not acquiring ProcArrayLock here which is slightly icky. It's
> * actually fine because procLatch isn't ever freed, so we just can
> * potentially set the wrong process' (or no process') latch.
> */
> SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);

= L518
+void
+WaitForLSN(XLogRecPtr targetLSN, int64 timeout)

This function is only called within the same module. I'm not sure if we need to expose it. If we do, the name should probably be more specific.

I'm not quite sure if the division of functionality between this function and its only caller is appropriate. As a possible refactoring, we could have WaitForLSN() just return the result as [reached, timedout, promoted] and delegate prerequisite checks and error reporting to the SQL function.

= L524
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSN);

Seeing this assertion, I feel that the name "waitLSN" is a bit obscure. How about renaming it to "waitLSNStates"?

= L527
+ /* Should be only called by a backend */
+ Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);

This is somewhat excessive; it causes a server crash where ending with an error would suffice. By the way, if I execute "CALL pg_wal_replay_wait('0/0')" on a logical walsender, the server crashes. The condition doesn't seem appropriate.

= L561
+errdetail("Recovery ended before replaying the target LSN %X/%X; last replay LSN %X/%X.",

I don't think we need "the" before "target" in the above message.

= L565
+ if (timeout > 0)
+ {
+ 	delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+ 	latch_events |= WL_TIMEOUT;
+ 	if (delay_ms <= 0)
+ 		break;

"timeout" is immutable in the function. Therefore, we can calculate "latch_events" before entering the loop. By the way, the name 'latch_events' seems a bit off. Latch is a kind of event the function can wait for. Therefore, something like wait_events might be more appropriate.

= L567
+ delay_ms = (endtime - GetCurrentTimestamp()) / 1000;

We can use TimestampDifferenceMilliseconds() here.

L578
+ if (rc & WL_LATCH_SET)
+ 	ResetLatch(MyLatch);

I think we usually reset latches unconditionally after returning from WaitLatch(), even when waiting for timeouts.

= L756
+{ oid => '16387', descr => 'wait for LSN with timeout',

The description seems a bit off. While timeout is mentioned, the more important characteristic that the LSN is the replay LSN is not included.

= L791
+ * WaitLSNProcInfo
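For context, a minimal usage sketch of the procedure under review, based on the behaviour described in the quoted documentation (the LSN value here is just an example, e.g. one obtained from pg_current_wal_insert_lsn() on the primary):

-- On a standby, wait up to one second for recovery to replay past the given LSN.
CALL pg_wal_replay_wait('0/306EE20', 1000);

-- With no timeout (or a timeout of zero) the call waits indefinitely.
CALL pg_wal_replay_wait('0/306EE20');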
Re: AIX support
On 19/06/2024 17:55, Srirama Kucherlapati wrote:

+/* Commenting for XLC
+ * "IBM XL C/C++ for AIX, V12.1" miscompiles, for 32-bit, some inline
+ * expansions of ginCompareItemPointers() "long long" arithmetic. To take
+ * advantage of inlining, build a 64-bit PostgreSQL.
+#if defined(__ILP32__) && defined(__IBMC__)
+#define PG_FORCE_DISABLE_INLINE
+#endif
+ */

This seems irrelevant.

+ * Ordinarily, we'd code the branches here using GNU-style local symbols, that
+ * is "1f" referencing "1:" and so on. But some people run gcc on AIX with
+ * IBM's assembler as backend, and IBM's assembler doesn't do local symbols.
+ * So hand-code the branch offsets; fortunately, all PPC instructions are
+ * exactly 4 bytes each, so it's not too hard to count.

Could you use GCC assembler to avoid this?

@@ -662,6 +666,21 @@ tas(volatile slock_t *lock)
 #if !defined(HAS_TEST_AND_SET)	/* We didn't trigger above, let's try here */

+#if defined(_AIX)	/* AIX */
+/*
+ * AIX (POWER)
+ */
+#define HAS_TEST_AND_SET
+
+#include <sys/atomic_op.h>
+
+typedef int slock_t;
+
+#define TAS(lock)	_check_lock((slock_t *) (lock), 0, 1)
+#define S_UNLOCK(lock)	_clear_lock((slock_t *) (lock), 0)
+#endif	/* _AIX */
+
+
 /* These are in sunstudio_(sparc|x86).s */
 #if defined(__SUNPRO_C) && (defined(__i386) || defined(__x86_64__) || defined(__sparc__) || defined(__sparc))

What CPU/compiler/OS configuration is this for, exactly? Could we rely on the GCC-provided __sync_lock_test_and_set() builtin function instead?

+# Allow platforms with buggy compilers to force restrict to not be
+# used by setting $FORCE_DISABLE_RESTRICT=yes in the relevant
+# template.

Surely we don't need that anymore? Or is the compiler still buggy?

Do you still care about 32-bit binaries on AIX? If not, let's make that the default in configure or a check for it, and remove the instructions on building 32-bit binaries from the docs.

Please try hard to remove any changes from the diff that are not absolutely necessary.

- Heikki
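For illustration, a hedged sketch of the alternative Heikki raises: a test-and-set spinlock built on GCC's __sync atomic builtins rather than AIX's _check_lock()/_clear_lock(). This is only a sketch of the idea, not proposed s_lock.h wording:

#define HAS_TEST_AND_SET

typedef int slock_t;

/* __sync_lock_test_and_set() returns the previous value, so a nonzero
 * result means the lock was already held; that matches the TAS() convention. */
#define TAS(lock)		__sync_lock_test_and_set((lock), 1)

/* The matching release builtin, which has release-barrier semantics. */
#define S_UNLOCK(lock)	__sync_lock_release(lock)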
Re: ON ERROR in json_query and the like
Hi, On Mon, Jun 17, 2024 at 10:07 PM Markus Winand wrote: > > On 17.06.2024, at 08:20, Amit Langote wrote: > > Agree that the documentation needs to be clear about this. I'll update > > my patch at [1] to add a note next to table 9.16.3. SQL/JSON Query > > Functions. > > Considering another branch of this thread [1] I think the > "Supported Features” appendix of the docs should mention that as well. > > The way I see it is that the standards defines two overloaded > JSON_QUERY functions, of which PostgreSQL will support only one. > In case of valid JSON, the implied CAST makes it look as though > the second variant of these function was supported as well but that > illusion totally falls apart once the JSON is not valid anymore. > > I think it affects the following feature IDs: > > - T821, Basic SQL/JSON query operators > For JSON_VALUE, JSON_TABLE and JSON_EXISTS > - T828, JSON_QUERY > > Also, how hard would it be to add the functions that accept > character strings? Is there, besides the effort, any thing else > against it? I’m asking because I believe once released it might > never be changed — for backward compatibility. Hmm, I'm starting to think that adding the implied cast to json wasn't such a great idea after all, because it might mislead the users to think that JSON parsing is transparent (respects ON ERROR), which is what you are saying, IIUC. I'm inclined to push the attached patch which puts back the restriction to allow only jsonb arguments, asking users to add an explicit cast if necessary. -- Thanks, Amit Langote v1-0001-SQL-JSON-Only-allow-arguments-of-jsonb-type.patch Description: Binary data
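To illustrate what the proposed restriction would mean in practice (a sketch, not text from the patch): callers pass jsonb explicitly, and any parse failure comes from the cast itself rather than being softened by ON ERROR.

-- Works: the argument is already jsonb.
SELECT JSON_QUERY('{"a": [1, 2, 3]}'::jsonb, '$.a' NULL ON ERROR);

-- The cast itself fails before JSON_QUERY runs, so ON ERROR cannot intercept it.
SELECT JSON_QUERY('invalid'::jsonb, '$' NULL ON ERROR);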
Re: pg_combinebackup --clone doesn't work
On 6/20/24 07:55, Peter Eisentraut wrote:
> The pg_combinebackup --clone option currently doesn't work at all. Even
> on systems where it should it be supported, you'll always get a "file
> cloning not supported on this platform" error.
>
> The reason is this checking code in pg_combinebackup.c:
>
> #if (defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)) || \
>     (defined(__linux__) && defined(FICLONE))
>
>         if (opt.dry_run)
>             pg_log_debug("would use cloning to copy files");
>         else
>             pg_log_debug("will use cloning to copy files");
>
> #else
>         pg_fatal("file cloning not supported on this platform");
> #endif
>
> The problem is that this file does not include the appropriate OS header
> files that would define COPYFILE_CLONE_FORCE or FICLONE, respectively.
>
> The same problem also exists in copy_file.c. (That one has the right
> header file for macOS but still not for Linux.)
>

Seems like my bug, I guess :-( Chances are the original patches had the include, but it got lost during refactoring or something. Anyway, will fix shortly.

> This should be pretty easy to fix up, and we should think about some
> ways to refactor this to avoid repeating all these OS-specific things a
> few times. (The code was copied from pg_upgrade originally.)
>

Yeah. The ifdef forest got rather hard to navigate.

> But in the short term, how about some test coverage? You can exercise
> the different pg_combinebackup copy modes like this:
>
> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
> index 83f385a4870..7e8dd024c82 100644
> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> @@ -848,7 +848,7 @@ sub init_from_backup
>     }
>
>     local %ENV = $self->_get_env();
> -   my @combineargs = ('pg_combinebackup', '-d');
> +   my @combineargs = ('pg_combinebackup', '-d', '--clone');
>     if (exists $params{tablespace_map})
>     {
>         while (my ($olddir, $newdir) = each %{ $params{tablespace_map} })
>

For ad hoc testing? Sure, but that won't work on platforms without the clone support, right?

> We could do something like what we have for pg_upgrade, where we can use
> the environment variable PG_TEST_PG_UPGRADE_MODE to test the different
> copy modes. We could turn this into a global option. (This might also
> be useful for future work to use file cloning elsewhere, like in CREATE
> DATABASE?)
>

Yeah, this sounds like a good way to do this. Is there a good reason to have multiple different variables, one for each tool, or should we have a single PG_TEST_COPY_MODE affecting all the places?

> Also, I think it would be useful for consistency if pg_combinebackup had
> a --copy option to select the default mode, like pg_upgrade does.
>

I vaguely recall this might have been discussed in the thread about adding cloning to pg_combinebackup, but I don't recall the details why we didn't adopt the pg_upgrade way. But we can revisit that, IMO.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
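A hedged sketch of the kind of includes the report implies are missing (the exact headers and guards in the eventual fix may differ): FICLONE comes from the Linux kernel headers and COPYFILE_CLONE_FORCE from macOS's copyfile.h, so the #if test above can only succeed if those headers are pulled in first.

#ifdef HAVE_COPYFILE
#include <copyfile.h>			/* defines COPYFILE_CLONE_FORCE on macOS */
#endif
#ifdef __linux__
#include <sys/ioctl.h>
#include <linux/fs.h>			/* defines the FICLONE ioctl */
#endif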
Re: SQL/JSON query functions context_item doc entry and type requirement
On Thu, Jun 20, 2024 at 1:03 AM David G. Johnston wrote: > On Wed, Jun 19, 2024 at 8:29 AM jian he wrote: >> >> On Mon, Jun 17, 2024 at 9:05 PM Chapman Flack wrote: >> > >> > Hi, >> > >> > On 06/17/24 02:43, Amit Langote wrote: >> > > context_item expression can be a value of >> > > any type that can be cast to jsonb. This includes types >> > > such as char, text, bpchar, >> > > character varying, and bytea (with >> > > ENCODING UTF8), as well as any domains over these types. >> > >> > Reading this message in conjunction with [0] makes me think that we are >> > really talking about a function that takes a first parameter of type jsonb, >> > and behaves exactly that way (so any cast required is applied by the system >> > ahead of the call). Under those conditions, this seems like an unusual >> > sentence to add in the docs, at least until we have also documented that >> > tan's argument can be of any type that can be cast to double precision. >> > >> >> I guess it would be fine to add an unusual sentence to the docs. >> >> imagine a function: array_avg(anyarray) returns anyelement. >> array_avg calculate an array's elements's avg. like >> array('{1,2,3}'::int[]) returns 2. >> but array_avg won't make sense if the input argument is a date array. >> so mentioning in the doc: array_avg can accept anyarray, but anyarray >> cannot date array. >> seems ok. > > > There is existing wording for this: > > "The expression can be of any JSON type, any character string type, or bytea > in UTF8 encoding." > > If you add this sentence to the paragraph the link that already exists, which > simply points the reader to this sentence, becomes redundant and should be > removed. I've just posted a patch in the other thread [1] to restrict context_item to be of jsonb type, which users would need to ensure by adding an explicit cast if needed. I think that makes this clarification unnecessary. > As for table 9.16.3 - it is unwieldy already. Lets try and make the core > syntax shorter, not longer. We already have precedence in the subsequent > json_table section - give each major clause item a name then below the table > define the syntax and meaning for those names. Unlike in that section - > which probably should be modified too - context_item should have its own > description line. I had posted a patch a little while ago at [1] to render the syntax a bit differently with each function getting its own syntax synopsis. Resending it here; have addressed Jian He's comments. -- Thanks, Amit Langote [1] https://www.postgresql.org/message-id/CA%2BHiwqF2Z6FATWQV6bG9NeKYf%3D%2B%2BfOgmdbYc9gWSNJ81jfqCuA%40mail.gmail.com v1-0001-SQL-JSON-Various-fixes-to-SQL-JSON-query-function.patch Description: Binary data
Re: Conflict Detection and Resolution
On Wed, Jun 19, 2024 at 2:51 PM Ashutosh Bapat wrote: > > On Wed, Jun 19, 2024 at 12:03 PM Amit Kapila wrote: >> >> > I doubt that it would be that simple. The application will have to >> > intervene and tell one of the employees that their reservation has failed. >> > It looks natural that the first one to reserve the room should get the >> > reservation, but implementing that is more complex than resolving a >> > conflict in the database. In fact, mostly it will be handled outside >> > database. >> > >> >> Sure, the application needs some handling but I have tried to explain >> with a simple way that comes to my mind and how it can be realized >> with db involved. This is a known conflict detection method but note >> that I am not insisting to have "earliest_timestamp_wins". Even, if we >> want this we can have a separate discussion on this and add it later. >> > > It will be good to add a minimal set of conflict resolution strategies to > begin with, while designing the feature for extensibility. I imagine the > first version might just detect the conflict and throw error or do nothing. > That's already two simple conflict resolution strategies with minimal > efforts. We can add more complicated ones incrementally. > Agreed, splitting the work into multiple patches would help us to finish the easier ones first. I have thought to divide it such that in the first patch, we detect conflicts like 'insert_exists', 'update_differ', 'update_missing', and 'delete_missing' (the definition of each could be found in the initial email [1]) and throw an ERROR or write them in LOG. Various people agreed to have this as a separate committable work [2]. This can help users to detect and monitor the conflicts in a better way. I have intentionally skipped update_deleted as it would require more infrastructure and it would be helpful even without that. In the second patch, we can implement simple built-in resolution strategies like apply and skip (which can be named as remote_apply and keep_local, see [3][4] for details on these strategies) with ERROR or LOG being the default strategy. We can allow these strategies to be configured at the global and table level. In the third patch, we can add monitoring capability for conflicts and resolutions as mentioned by Jonathan [5]. Here, we can have stats like how many conflicts of a particular type have happened. In the meantime, we can keep discussing and try to reach a consensus on the timing-related resolution strategy like 'last_update_wins' and the conflict strategy 'update_deleted'. If we agree on the above, some of the work, especially the first one, could even be discussed in a separate thread. Thoughts? [1] - https://www.postgresql.org/message-id/CAJpy0uD0-DpYVMtsxK5R%3DzszXauZBayQMAYET9sWr_w0CNWXxQ%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAD21AoAa6JzqhXY02uNUPb-aTozu2RY9nMdD1%3DTUh%2BFpskkYtw%40mail.gmail.com [3] - https://www.postgresql.org/message-id/CAJpy0uD0-DpYVMtsxK5R%3DzszXauZBayQMAYET9sWr_w0CNWXxQ%40mail.gmail.com [4] - https://github.com/2ndquadrant/pglogical?tab=readme-ov-file#conflicts [5] - https://www.postgresql.org/message-id/1eb9242f-dcb6-45c3-871c-98ec324e03ef%40postgresql.org -- With Regards, Amit Kapila.
Re: 回复: An implementation of multi-key sort
On Fri, Jun 14, 2024 at 6:20 PM Yao Wang wrote:
>
> hi Tomas,
>
> So many thanks for your kind response and detailed report. I am working
> on locating issues based on your report/script and optimizing code, and
> will update later.

Hi,

This is an interesting proof-of-concept! Given the above, I've set this CF entry to "waiting on author".

Also, I see you've added Heikki as a reviewer. I'm not sure how others see it, but I consider a "reviewer" in the CF app to be someone who has volunteered to be responsible for helping move the patch forward. If there is a name in the reviewer column, it may discourage others from doing review. It can also happen that people ping reviewers to ask "There's been no review for X months -- are you planning on looking at this?", and it's not great if that message is a surprise.

Note that we prefer not to top-post in emails since it makes our web archive more difficult to read.

Thanks, John
How about add counted_by attribute for flexible-array?
Hi, hackers

While reading [1], I thought the "counted_by" attribute might also be valuable for PostgreSQL. The 'counted_by' attribute is used on flexible array members. The argument for the attribute is the name of the field member in the same structure that holds the count of elements in the flexible array. This information can be used to improve the results of the array bound sanitizer and the '__builtin_dynamic_object_size' builtin [2]. It was introduced in Clang-18 [3] and will soon be available in GCC-15.

[1] https://embeddedor.com/blog/2024/06/18/how-to-use-the-new-counted_by-attribute-in-c-and-linux/
[2] https://reviews.llvm.org/D148381
[3] https://godbolt.org/z/5qKsEhG8o

--
Regards, Japin Li
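A small, self-contained example of the attribute as described in the linked article (the pg_counted_by wrapper macro is illustrative, not an existing PostgreSQL macro):

#include <stdlib.h>

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

#if __has_attribute(counted_by)
#define pg_counted_by(member) __attribute__((counted_by(member)))
#else
#define pg_counted_by(member)
#endif

struct int_vector
{
	int		count;							/* number of valid elements */
	int		values[] pg_counted_by(count);	/* flexible array bounded by count */
};

static struct int_vector *
make_vector(int n)
{
	struct int_vector *v = malloc(sizeof(struct int_vector) + n * sizeof(int));

	if (v != NULL)
		v->count = n;	/* the sanitizer and __builtin_dynamic_object_size can now see the bound */
	return v;
}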
Re: Speed up collation cache
On Sat, Jun 15, 2024 at 6:46 AM Jeff Davis wrote: > Attached is a patch to use simplehash.h instead, which speeds things up > enough to make them fairly close (from around 15% slower to around 8%). +#define SH_HASH_KEY(tb, key) hash_uint32((uint32) key) For a static inline hash for speed reasons, we can use murmurhash32 here, which is also inline.
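For illustration, a sketch of the suggested change (assuming the surrounding simplehash instantiation from the patch stays the same; only the hash-function define differs):

#include "common/hashfn.h"		/* murmurhash32() is a static inline here */

-#define SH_HASH_KEY(tb, key)	hash_uint32((uint32) key)
+#define SH_HASH_KEY(tb, key)	murmurhash32((uint32) key)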
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Hi Tom, Is there anything that could be back-patched with reasonable effort ? -- Hannu On Mon, Jun 17, 2024 at 6:37 PM Daniel Gustafsson wrote: > > > On 17 Jun 2024, at 16:56, Tom Lane wrote: > > Daniel Gustafsson writes: > > >> I wonder if this will break any tools/scripts in prod which relies on the > >> previous (faulty) behaviour. It will be interesting to see if anything > >> shows > >> up on -bugs. Off the cuff it seems like a good idea judging by where we > >> are > >> and what we can fix with it. > > > > Considering that SHARED_DEPENDENCY_INITACL has existed for less than > > two months, it's hard to believe that any outside code has grown any > > dependencies on it, much less that it couldn't be adjusted readily. > > Doh, I was thinking about it backwards, clearly not a worry =) > > >> I wonder if it's worth reverting passing the owner ID for v17 and > >> revisiting > >> that in 18 if we work on recording the ID. Shaving a few catalog lookups > >> is > >> generally worthwhile, doing them without needing the result for the next > >> five > >> years might bite us. > > > > Yeah, that was the direction I was leaning in, too. I'll commit the > > revert of that separately, so that un-reverting it shouldn't be too > > painful if we eventually decide to do so. > > Sounds good. > > -- > Daniel Gustafsson >
confusing valgrind report about tuplestore+wrapper_handler (?) on 32-bit arm
Hi,

While running valgrind on 32-bit ARM (rpi5 with debian), I got this really strange report:

==25520== Use of uninitialised value of size 4
==25520==    at 0x94A550: wrapper_handler (pqsignal.c:108)
==25520==    by 0x4D7826F: ??? (sigrestorer.S:64)
==25520==  Uninitialised value was created by a heap allocation
==25520==    at 0x8FB780: palloc (mcxt.c:1340)
==25520==    by 0x913067: tuplestore_begin_common (tuplestore.c:289)
==25520==    by 0x91310B: tuplestore_begin_heap (tuplestore.c:331)
==25520==    by 0x3EA717: ExecMaterial (nodeMaterial.c:64)
==25520==    by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464)
==25520==    by 0x3EF73F: ExecProcNode (executor.h:274)
==25520==    by 0x3F0637: ExecMergeJoin (nodeMergejoin.c:703)
==25520==    by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464)
==25520==    by 0x3C47DB: ExecProcNode (executor.h:274)
==25520==    by 0x3C4D4F: fetch_input_tuple (nodeAgg.c:561)
==25520==    by 0x3C8233: agg_retrieve_direct (nodeAgg.c:2364)
==25520==    by 0x3C7E07: ExecAgg (nodeAgg.c:2179)
==25520==    by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464)
==25520==    by 0x3A5EC3: ExecProcNode (executor.h:274)
==25520==    by 0x3A8FBF: ExecutePlan (execMain.c:1646)
==25520==    by 0x3A6677: standard_ExecutorRun (execMain.c:363)
==25520==    by 0x3A644B: ExecutorRun (execMain.c:304)
==25520==    by 0x6976D3: PortalRunSelect (pquery.c:924)
==25520==    by 0x6972F7: PortalRun (pquery.c:768)
==25520==    by 0x68FA1F: exec_simple_query (postgres.c:1274)
==25520==
{
   Memcheck:Value4
   fun:wrapper_handler
   obj:/usr/lib/arm-linux-gnueabihf/libc.so.6
}
**25520** Valgrind detected 1 error(s) during execution of "select count(*) from
**25520**   (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
**25520**   left join
**25520**   (select * from tenk1 y order by y.unique2) y
**25520**   on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2;"

I'm mostly used to weird valgrind stuff on this platform, but it's usually about libarmmem and (possibly) thinking it might access undefined stuff when calculating checksums etc. This seems somewhat different, so I wonder if it's something real?

But also, at the same time, it's rather weird, because the report says it's this bit in pqsignal.c

(*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);

but it also says the memory was allocated in tuplestore, and that's obviously very unlikely, because it does not do anything with signals. I've only seen this once, but if it's related to signals, that's not surprising - the window may be pretty narrow.

Has anyone seen or investigated a report like this?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Conflict Detection and Resolution
On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar wrote: > > On Tue, Jun 18, 2024 at 10:17 AM Dilip Kumar wrote: > > > > > > > > I think the unbounded size of the vector could be a problem to store > > > for each event. However, while researching previous discussions, it > > > came to our notice that we have discussed this topic in the past as > > > well in the context of standbys. For recovery_min_apply_delay, we > > > decided the clock skew is not a problem as the settings of this > > > parameter are much larger than typical time deviations between servers > > > as mentioned in docs. Similarly for casual reads [1], there was a > > > proposal to introduce max_clock_skew parameter and suggesting the user > > > to make sure to have NTP set up correctly. We have tried to check > > > other databases (like Ora and BDR) where CDR is implemented but didn't > > > find anything specific to clock skew. So, I propose to go with a GUC > > > like max_clock_skew such that if the difference of time between the > > > incoming transaction's commit time and the local time is more than > > > max_clock_skew then we raise an ERROR. It is not clear to me that > > > putting bigger effort into clock skew is worth especially when other > > > systems providing CDR feature (like Ora or BDR) for decades have not > > > done anything like vector clocks. It is possible that this is less of > > > a problem w.r.t CDR and just detecting the anomaly in clock skew is > > > good enough. > > > > I believe that if we've accepted this solution elsewhere, then we can > > also consider the same. Basically, we're allowing the application to > > set its tolerance for clock skew. And, if the skew exceeds that > > tolerance, it's the application's responsibility to synchronize; > > otherwise, an error will occur. This approach seems reasonable. > > This model can be further extended by making the apply worker wait if > the remote transaction's commit_ts is greater than the local > timestamp. This ensures that no local transactions occurring after the > remote transaction appear to have happened earlier due to clock skew > instead we make them happen before the remote transaction by delaying > the remote transaction apply. Essentially, by having the remote > application wait until the local timestamp matches the remote > transaction's timestamp, we ensure that the remote transaction, which > seems to occur after concurrent local transactions due to clock skew, > is actually applied after those transactions. > > With this model, there should be no ordering errors from the > application's perspective as well if synchronous commit is enabled. > The transaction initiated by the publisher cannot be completed until > it is applied to the synchronous subscriber. This ensures that if the > subscriber's clock is lagging behind the publisher's clock, the > transaction will not be applied until the subscriber's local clock is > in sync, preventing the transaction from being completed out of order. > As per the discussion, this idea will help us to resolve transaction ordering issues due to clock skew. I was thinking of having two variables max_clock_skew (indicates how much clock skew is acceptable), max_clock_skew_options: ERROR, LOG, WAIT (indicates the action we need to take once the clock skew is detected). There could be multiple ways to provide these parameters, one is providing them as GUCs, and another at the subscription or the table level. 
I am wondering whether users would only care about a table or a set of tables, or whether they would like to set such variables at the system level. We already have a SKIP option (which allows us to skip transactions up to a particular LSN) at the subscription level, so I am wondering whether it makes sense to provide these new conflict-resolution parameters at the same level as well.

--
With Regards, Amit Kapila.
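A hedged sketch to make the two proposed knobs concrete (the function and parameter names here, including max_clock_skew_ms, are illustrative only; nothing like this exists in the tree):

/*
 * If the remote commit timestamp is ahead of the local clock, either wait it
 * out or raise an error, depending on how far ahead it is.
 */
static void
handle_commit_ts_skew(TimestampTz remote_commit_ts, int max_clock_skew_ms)
{
	long		ahead_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
														   remote_commit_ts);

	if (ahead_ms <= 0)
		return;					/* remote commit is not in our future */

	if (ahead_ms > max_clock_skew_ms)
		ereport(ERROR,
				(errmsg("remote commit timestamp is %ld ms ahead of the local clock",
						ahead_ms)));

	/* wait until the local clock passes the remote commit timestamp */
	pg_usleep(ahead_ms * 1000L);
}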
Re: What is a typical precision of gettimeofday()?
(resending to list and other CC:s )

Hi Tom

This is my current patch which also adds running % and optionally uses faster way to count leading zeros, though I did not see a change from that.

It also bucketizes first 128 ns to get better overview of exact behaviour.

We may want to put reporting this behind a flag

---
Hannu

On Wed, Jun 19, 2024 at 6:36 PM Tom Lane wrote:
>
> Peter Eisentraut writes:
> > AFAICT, pg_test_timing doesn't use gettimeofday(), so this doesn't
> > really address the original question.
>
> It's not exactly hard to make it do so (see attached).
>
> I tried this on several different machines, and my conclusion is that
> gettimeofday() reports full microsecond precision on any platform
> anybody is likely to be running PG on today. Even my one surviving
> pet dinosaur, NetBSD 10 on PowerPC Mac (mamba), shows results like
> this:
>
> $ ./pg_test_timing
> Testing timing overhead for 3 seconds.
> Per loop time including overhead: 901.41 ns
> Histogram of timing durations:
>    < us   % of total      count
>       1      10.46074     348148
>       2      89.51495    2979181
>       4       0.00574        191
>       8       0.00430        143
>      16       0.00691        230
>      32       0.00376        125
>      64       0.00012          4
>     128       0.00303        101
>     256       0.00027          9
>     512       0.00009          3
>    1024       0.00009          3
>
> I also modified pg_test_timing to measure nanoseconds not
> microseconds (second patch attached), and got this:
>
> $ ./pg_test_timing
> Testing timing overhead for 3 seconds.
> Per loop time including overhead: 805.50 ns
> Histogram of timing durations:
>    < ns   % of total      count
>       1      19.84234     739008
>       2       0.00000          0
>       4       0.00000          0
>       8       0.00000          0
>      16       0.00000          0
>      32       0.00000          0
>      64       0.00000          0
>     128       0.00000          0
>     256       0.00000          0
>     512       0.00000          0
>    1024      80.14013    2984739
>    2048       0.00078         29
>    4096       0.00658        245
>    8192       0.00290        108
>   16384       0.00252         94
>   32768       0.00250         93
>   65536       0.00016          6
>  131072       0.00185         69
>  262144       0.00008          3
>  524288       0.00008          3
> 1048576       0.00008          3
>
> confirming that when the result changes it generally does so by 1usec.
>
> Applying just the second patch, I find that clock_gettime on this
> old hardware seems to be limited to 1us resolution, but on my more
> modern machines (mac M1, x86_64) it can tick at 40ns or less.
> Even a raspberry pi 4 shows
>
> $ ./pg_test_timing
> Testing timing overhead for 3 seconds.
> Per loop time including overhead: 69.12 ns
> Histogram of timing durations:
>    < ns   % of total      count
>       1       0.00000          0
>       2       0.00000          0
>       4       0.00000          0
>       8       0.00000          0
>      16       0.00000          0
>      32       0.00000          0
>      64      37.59583   16317040
>     128      62.38568   27076131
>     256       0.01674       7265
>     512       0.00002          8
>    1024       0.00000          0
>    2048       0.00000          0
>    4096       0.00153        662
>    8192       0.00019         83
>   16384       0.00001          3
>   32768       0.00001          5
>
> suggesting that the clock_gettime resolution is better than 64 ns.
>
> So I concur with Hannu that it's time to adjust pg_test_timing to
> resolve nanoseconds not microseconds. I gather he's created a
> patch that does more than mine below, so I'll wait for that.
>
> regards, tom lane

diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..20b2785f50 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -19,9 +19,12 @@ static void handle_args(int argc, char *argv[]);
 static uint64 test_timing(unsigned int duration);
 static void output(uint64 loop_count);

-/* record duration in powers of 2 microseconds */
+/* record duration in powers of 2 nanoseconds */
 long long int histogram[32];

+/* record duration of first 128 ns directly */
+long long int direct_histogram[128];
+
 int
 main(int argc, char *argv[])
 {
@@ -130,10 +133,10 @@ test_timing(unsigned int duration)
 				end_time,
 				temp;

-	total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+	total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;

 	INSTR_TIME_SET_CURRENT(start_time);
-	cur = INSTR_TIME_GET_MICROSEC(start_time);
+	cur = INSTR_TIME_GET_NANOSEC(start_time);

 	while (time_elapsed < total_time)
 	{
@@ -142,7 +145,7 @@ test_timing(unsigned int duration)
 		prev = cur;
 		INSTR_TIME_SET_CURRENT(temp);
-		cur = INSTR_TIME_GET_MICROSEC(temp);
+		cur = INSTR_TIME_GET_NANOSEC(temp);
 		diff = cur - prev;

 		/* Did time go backwards? */
@@
Re: What is a typical precision of gettimeofday()?
Another thing I changed in reporting was to report <= ns instead of < ns This was inspired by not wanting to report "zero ns" as "< 1 ns" and easiest was to change them all to <= On Thu, Jun 20, 2024 at 12:41 PM Hannu Krosing wrote: > > (resending to list and other CC:s ) > > Hi Tom > > This is my current patch which also adds running % and optionally uses > faster way to count leading zeros, though I did not see a change from > that. > > It also bucketizes first 128 ns to get better overview of exact behaviour. > > We may want to put reporting this behind a flag > > --- > Hannu > > On Wed, Jun 19, 2024 at 6:36 PM Tom Lane wrote: > > > > Peter Eisentraut writes: > > > AFAICT, pg_test_timing doesn't use gettimeofday(), so this doesn't > > > really address the original question. > > > > It's not exactly hard to make it do so (see attached). > > > > I tried this on several different machines, and my conclusion is that > > gettimeofday() reports full microsecond precision on any platform > > anybody is likely to be running PG on today. Even my one surviving > > pet dinosaur, NetBSD 10 on PowerPC Mac (mamba), shows results like > > this: > > > > $ ./pg_test_timing > > Testing timing overhead for 3 seconds. > > Per loop time including overhead: 901.41 ns > > Histogram of timing durations: > > < us % of total count > > 1 10.46074 348148 > > 2 89.514952979181 > > 4 0.00574191 > > 8 0.00430143 > > 16 0.00691230 > > 32 0.00376125 > > 64 0.00012 4 > >128 0.00303101 > >256 0.00027 9 > >512 0.9 3 > > 1024 0.9 3 > > > > I also modified pg_test_timing to measure nanoseconds not > > microseconds (second patch attached), and got this: > > > > $ ./pg_test_timing > > Testing timing overhead for 3 seconds. > > Per loop time including overhead: 805.50 ns > > Histogram of timing durations: > > < ns % of total count > > 1 19.84234 739008 > > 2 0.0 0 > > 4 0.0 0 > > 8 0.0 0 > > 16 0.0 0 > > 32 0.0 0 > > 64 0.0 0 > >128 0.0 0 > >256 0.0 0 > >512 0.0 0 > > 1024 80.140132984739 > > 2048 0.00078 29 > > 4096 0.00658245 > > 8192 0.00290108 > > 16384 0.00252 94 > > 32768 0.00250 93 > > 65536 0.00016 6 > > 131072 0.00185 69 > > 262144 0.8 3 > > 524288 0.8 3 > > 1048576 0.8 3 > > > > confirming that when the result changes it generally does so by 1usec. > > > > Applying just the second patch, I find that clock_gettime on this > > old hardware seems to be limited to 1us resolution, but on my more > > modern machines (mac M1, x86_64) it can tick at 40ns or less. > > Even a raspberry pi 4 shows > > > > $ ./pg_test_timing > > Testing timing overhead for 3 seconds. > > Per loop time including overhead: 69.12 ns > > Histogram of timing durations: > > < ns % of total count > > 1 0.0 0 > > 2 0.0 0 > > 4 0.0 0 > > 8 0.0 0 > > 16 0.0 0 > > 32 0.0 0 > > 64 37.59583 16317040 > >128 62.38568 27076131 > >256 0.01674 7265 > >512 0.2 8 > > 1024 0.0 0 > > 2048 0.0 0 > > 4096 0.00153662 > > 8192 0.00019 83 > > 16384 0.1 3 > > 32768 0.1 5 > > > > suggesting that the clock_gettime resolution is better than 64 ns. > > > > So I concur with Hannu that it's time to adjust pg_test_timing to > > resolve nanoseconds not microseconds. I gather he's created a > > patch that does more than mine below, so I'll wait for that. > > > > regards, tom lane > >
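As an illustration of the bucketing described above (a sketch only; the function and array names are made up, and the real patch may differ): the power-of-2 histogram bucket can be found by counting leading zeros with pg_leftmost_one_pos64() from port/pg_bitutils.h instead of shifting in a loop, and durations below 128 ns can additionally be counted exactly.

static inline void
record_duration_ns(int64 diff, long long histogram[64], long long direct_histogram[128])
{
	/* bucket i counts durations that need i bits, i.e. durations < 2^i ns */
	int			bucket = (diff > 0) ? pg_leftmost_one_pos64(diff) + 1 : 0;

	histogram[bucket]++;

	if (diff < 128)
		direct_histogram[diff]++;	/* exact counts for the first 128 ns */
}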
Re: Extension security improvement: Add support for extensions with an owned schema
On Wed, 19 Jun 2024 at 17:22, David G. Johnston wrote: > > On Wed, Jun 19, 2024 at 8:19 AM Jelte Fennema-Nio wrote: > >> >> Because part of it would >> only be relevant once we support upgrading from PG18. So for now the >> upgrade_code I haven't actually run. > > > Does it apply against v16? If so, branch off there, apply it, then upgrade > from the v16 branch to master. I realized it's possible to do an "upgrade" with pg_upgrade from v17 to v17. So I was able to test both the pre and post PG18 upgrade logic manually by changing the version in this line: if (fout->remoteVersion >= 18) As expected the new pg_upgrade code was severely broken. Attached is a new patch where the pg_upgrade code now actually works. v3-0001-Add-support-for-extensions-with-an-owned-schema.patch Description: Binary data
Re: confusing valgrind report about tuplestore+wrapper_handler (?) on 32-bit arm
Em qui., 20 de jun. de 2024 às 07:28, Tomas Vondra < tomas.von...@enterprisedb.com> escreveu: > Hi, > > While running valgrind on 32-bit ARM (rpi5 with debian), I got this > really strange report: > > > ==25520== Use of uninitialised value of size 4 > ==25520==at 0x94A550: wrapper_handler (pqsignal.c:108) > ==25520==by 0x4D7826F: ??? (sigrestorer.S:64) > ==25520== Uninitialised value was created by a heap allocation > ==25520==at 0x8FB780: palloc (mcxt.c:1340) > ==25520==by 0x913067: tuplestore_begin_common (tuplestore.c:289) > ==25520==by 0x91310B: tuplestore_begin_heap (tuplestore.c:331) > ==25520==by 0x3EA717: ExecMaterial (nodeMaterial.c:64) > ==25520==by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464) > ==25520==by 0x3EF73F: ExecProcNode (executor.h:274) > ==25520==by 0x3F0637: ExecMergeJoin (nodeMergejoin.c:703) > ==25520==by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464) > ==25520==by 0x3C47DB: ExecProcNode (executor.h:274) > ==25520==by 0x3C4D4F: fetch_input_tuple (nodeAgg.c:561) > ==25520==by 0x3C8233: agg_retrieve_direct (nodeAgg.c:2364) > ==25520==by 0x3C7E07: ExecAgg (nodeAgg.c:2179) > ==25520==by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464) > ==25520==by 0x3A5EC3: ExecProcNode (executor.h:274) > ==25520==by 0x3A8FBF: ExecutePlan (execMain.c:1646) > ==25520==by 0x3A6677: standard_ExecutorRun (execMain.c:363) > ==25520==by 0x3A644B: ExecutorRun (execMain.c:304) > ==25520==by 0x6976D3: PortalRunSelect (pquery.c:924) > ==25520==by 0x6972F7: PortalRun (pquery.c:768) > ==25520==by 0x68FA1F: exec_simple_query (postgres.c:1274) > ==25520== > { > >Memcheck:Value4 >fun:wrapper_handler >obj:/usr/lib/arm-linux-gnueabihf/libc.so.6 > } > **25520** Valgrind detected 1 error(s) during execution of "select > count(*) from > **25520** (select * from tenk1 x order by x.thousand, x.twothousand, > x.fivethous) x > **25520** left join > **25520** (select * from tenk1 y order by y.unique2) y > **25520** on x.thousand = y.unique2 and x.twothousand = y.hundred and > x.fivethous = y.unique2;" > > > I'm mostly used to weird valgrind stuff on this platform, but it's > usually about libarmmmem and (possibly) thinking it might access > undefined stuff when calculating checksums etc. > > This seems somewhat different, so I wonder if it's something real? It seems like a false positive to me. According to valgrind's documentation: https://valgrind.org/docs/manual/mc-manual.html#mc-manual.value " This can lead to false positive errors, as the shared memory can be initialised via a first mapping, and accessed via another mapping. The access via this other mapping will have its own V bits, which have not been changed when the memory was initialised via the first mapping. The bypass for these false positives is to use Memcheck's client requests VALGRIND_MAKE_MEM_DEFINED and VALGRIND_MAKE_MEM_UNDEFINED to inform Memcheck about what your program does (or what another process does) to these shared memory mappings. " best regards, Ranier Vilela
Re: Conflict Detection and Resolution
On Thu, Jun 20, 2024 at 3:21 PM Amit Kapila wrote: > On Wed, Jun 19, 2024 at 2:51 PM Ashutosh Bapat > wrote: > > > > On Wed, Jun 19, 2024 at 12:03 PM Amit Kapila > wrote: > >> > >> > I doubt that it would be that simple. The application will have to > intervene and tell one of the employees that their reservation has failed. > It looks natural that the first one to reserve the room should get the > reservation, but implementing that is more complex than resolving a > conflict in the database. In fact, mostly it will be handled outside > database. > >> > > >> > >> Sure, the application needs some handling but I have tried to explain > >> with a simple way that comes to my mind and how it can be realized > >> with db involved. This is a known conflict detection method but note > >> that I am not insisting to have "earliest_timestamp_wins". Even, if we > >> want this we can have a separate discussion on this and add it later. > >> > > > > It will be good to add a minimal set of conflict resolution strategies > to begin with, while designing the feature for extensibility. I imagine the > first version might just detect the conflict and throw error or do nothing. > That's already two simple conflict resolution strategies with minimal > efforts. We can add more complicated ones incrementally. > > > > Agreed, splitting the work into multiple patches would help us to > finish the easier ones first. > > I have thought to divide it such that in the first patch, we detect > conflicts like 'insert_exists', 'update_differ', 'update_missing', and > 'delete_missing' (the definition of each could be found in the initial > email [1]) and throw an ERROR or write them in LOG. Various people > agreed to have this as a separate committable work [2]. This can help > users to detect and monitor the conflicts in a better way. I have > intentionally skipped update_deleted as it would require more > infrastructure and it would be helpful even without that. > Since we are in the initial months of release, it will be good to take a stock of whether the receiver receives all the information needed for most (if not all) of the conflict detection and resolution strategies. If there are any missing pieces, we may want to add those in PG18 so that improved conflict detection and resolution on a higher version receiver can still work. > > In the second patch, we can implement simple built-in resolution > strategies like apply and skip (which can be named as remote_apply and > keep_local, see [3][4] for details on these strategies) with ERROR or > LOG being the default strategy. We can allow these strategies to be > configured at the global and table level. > > In the third patch, we can add monitoring capability for conflicts and > resolutions as mentioned by Jonathan [5]. Here, we can have stats like > how many conflicts of a particular type have happened. > That looks like a plan. Thanks for chalking it out. -- Best Wishes, Ashutosh Bapat
Re: suspicious valgrind reports about radixtree/tidstore on arm64
Em qua., 19 de jun. de 2024 às 20:52, Tom Lane escreveu:
> I wrote:
> > I hypothesize that the reason we're not seeing equivalent failures
> > on x86_64 is one of
>
> > 1. x86_64 valgrind is stupider than aarch64, and fails to track that
> > the contents of the SIMD registers are only partially defined.
>
> > 2. x86_64 valgrind is smarter than aarch64, and is able to see
> > that the "mask off invalid entries" step removes all the
> > potentially-uninitialized bits.
>
> Hah: it's the second case. If I patch radixtree.h as attached,
> then x86_64 valgrind complains about
>
> ==00:00:00:32.759 247596== Conditional jump or move depends on
> uninitialised value(s)
> ==00:00:00:32.759 247596==    at 0x52F668: local_ts_node_16_search_eq
> (radixtree.h:1018)
>
> showing that it knows that the result of vector8_highbit_mask is
> only partly defined.

I wouldn't be surprised if RT_NODE_16_GET_INSERTPOS (src/include/lib/radixtree.h) suffers from the same problem, even with the Assert trying to protect it. Does the fix not apply here too?

best regards,
Ranier Vilela
Re: confusing valgrind report about tuplestore+wrapper_handler (?) on 32-bit arm
On 6/20/24 13:32, Ranier Vilela wrote: > Em qui., 20 de jun. de 2024 às 07:28, Tomas Vondra < > tomas.von...@enterprisedb.com> escreveu: > >> Hi, >> >> While running valgrind on 32-bit ARM (rpi5 with debian), I got this >> really strange report: >> >> >> ==25520== Use of uninitialised value of size 4 >> ==25520==at 0x94A550: wrapper_handler (pqsignal.c:108) >> ==25520==by 0x4D7826F: ??? (sigrestorer.S:64) >> ==25520== Uninitialised value was created by a heap allocation >> ==25520==at 0x8FB780: palloc (mcxt.c:1340) >> ==25520==by 0x913067: tuplestore_begin_common (tuplestore.c:289) >> ==25520==by 0x91310B: tuplestore_begin_heap (tuplestore.c:331) >> ==25520==by 0x3EA717: ExecMaterial (nodeMaterial.c:64) >> ==25520==by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464) >> ==25520==by 0x3EF73F: ExecProcNode (executor.h:274) >> ==25520==by 0x3F0637: ExecMergeJoin (nodeMergejoin.c:703) >> ==25520==by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464) >> ==25520==by 0x3C47DB: ExecProcNode (executor.h:274) >> ==25520==by 0x3C4D4F: fetch_input_tuple (nodeAgg.c:561) >> ==25520==by 0x3C8233: agg_retrieve_direct (nodeAgg.c:2364) >> ==25520==by 0x3C7E07: ExecAgg (nodeAgg.c:2179) >> ==25520==by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464) >> ==25520==by 0x3A5EC3: ExecProcNode (executor.h:274) >> ==25520==by 0x3A8FBF: ExecutePlan (execMain.c:1646) >> ==25520==by 0x3A6677: standard_ExecutorRun (execMain.c:363) >> ==25520==by 0x3A644B: ExecutorRun (execMain.c:304) >> ==25520==by 0x6976D3: PortalRunSelect (pquery.c:924) >> ==25520==by 0x6972F7: PortalRun (pquery.c:768) >> ==25520==by 0x68FA1F: exec_simple_query (postgres.c:1274) >> ==25520== >> { >> >>Memcheck:Value4 >>fun:wrapper_handler >>obj:/usr/lib/arm-linux-gnueabihf/libc.so.6 >> } >> **25520** Valgrind detected 1 error(s) during execution of "select >> count(*) from >> **25520** (select * from tenk1 x order by x.thousand, x.twothousand, >> x.fivethous) x >> **25520** left join >> **25520** (select * from tenk1 y order by y.unique2) y >> **25520** on x.thousand = y.unique2 and x.twothousand = y.hundred and >> x.fivethous = y.unique2;" >> >> >> I'm mostly used to weird valgrind stuff on this platform, but it's >> usually about libarmmmem and (possibly) thinking it might access >> undefined stuff when calculating checksums etc. >> >> This seems somewhat different, so I wonder if it's something real? > > It seems like a false positive to me. > > According to valgrind's documentation: > https://valgrind.org/docs/manual/mc-manual.html#mc-manual.value > > " This can lead to false positive errors, as the shared memory can be > initialised via a first mapping, and accessed via another mapping. The > access via this other mapping will have its own V bits, which have not been > changed when the memory was initialised via the first mapping. The bypass > for these false positives is to use Memcheck's client requests > VALGRIND_MAKE_MEM_DEFINED and VALGRIND_MAKE_MEM_UNDEFINED to inform > Memcheck about what your program does (or what another process does) to > these shared memory mappings. " > But that's about shared memory, and the report has nothing to do with shared memory AFAICS. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: jsonapi type fixups
On 2024-06-18 Tu 7:48 AM, Peter Eisentraut wrote: I have this patch series that fixes up the types of the new incremental JSON API a bit. Specifically, it uses "const" throughout so that the top-level entry points such as pg_parse_json_incremental() can declare their arguments as const char * instead of just char *. This just works, it doesn't require any new casting tricks. In fact, it removes a few unconstify() calls. Also, a few arguments and variables that relate to object sizes should be size_t rather than int. At this point, this mainly makes the API better self-documenting. I don't think it actually works to parse larger than 2 GB chunks (not tested). I think this is mostly OK. The change at line 1857 of jsonapi.c looks dubious, though. The pointer variable p looks anything but constant. Perhaps I'm misunderstanding. It would also be nice to reword the comment at line 3142 of jsonfuncs.c, so it can still fit on one line. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: confusing valgrind report about tuplestore+wrapper_handler (?) on 32-bit arm
Em qui., 20 de jun. de 2024 às 08:54, Tomas Vondra < tomas.von...@enterprisedb.com> escreveu: > > > On 6/20/24 13:32, Ranier Vilela wrote: > > Em qui., 20 de jun. de 2024 às 07:28, Tomas Vondra < > > tomas.von...@enterprisedb.com> escreveu: > > > >> Hi, > >> > >> While running valgrind on 32-bit ARM (rpi5 with debian), I got this > >> really strange report: > >> > >> > >> ==25520== Use of uninitialised value of size 4 > >> ==25520==at 0x94A550: wrapper_handler (pqsignal.c:108) > >> ==25520==by 0x4D7826F: ??? (sigrestorer.S:64) > >> ==25520== Uninitialised value was created by a heap allocation > >> ==25520==at 0x8FB780: palloc (mcxt.c:1340) > >> ==25520==by 0x913067: tuplestore_begin_common (tuplestore.c:289) > >> ==25520==by 0x91310B: tuplestore_begin_heap (tuplestore.c:331) > >> ==25520==by 0x3EA717: ExecMaterial (nodeMaterial.c:64) > >> ==25520==by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464) > >> ==25520==by 0x3EF73F: ExecProcNode (executor.h:274) > >> ==25520==by 0x3F0637: ExecMergeJoin (nodeMergejoin.c:703) > >> ==25520==by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464) > >> ==25520==by 0x3C47DB: ExecProcNode (executor.h:274) > >> ==25520==by 0x3C4D4F: fetch_input_tuple (nodeAgg.c:561) > >> ==25520==by 0x3C8233: agg_retrieve_direct (nodeAgg.c:2364) > >> ==25520==by 0x3C7E07: ExecAgg (nodeAgg.c:2179) > >> ==25520==by 0x3B2FF7: ExecProcNodeFirst (execProcnode.c:464) > >> ==25520==by 0x3A5EC3: ExecProcNode (executor.h:274) > >> ==25520==by 0x3A8FBF: ExecutePlan (execMain.c:1646) > >> ==25520==by 0x3A6677: standard_ExecutorRun (execMain.c:363) > >> ==25520==by 0x3A644B: ExecutorRun (execMain.c:304) > >> ==25520==by 0x6976D3: PortalRunSelect (pquery.c:924) > >> ==25520==by 0x6972F7: PortalRun (pquery.c:768) > >> ==25520==by 0x68FA1F: exec_simple_query (postgres.c:1274) > >> ==25520== > >> { > >> > >>Memcheck:Value4 > >>fun:wrapper_handler > >>obj:/usr/lib/arm-linux-gnueabihf/libc.so.6 > >> } > >> **25520** Valgrind detected 1 error(s) during execution of "select > >> count(*) from > >> **25520** (select * from tenk1 x order by x.thousand, x.twothousand, > >> x.fivethous) x > >> **25520** left join > >> **25520** (select * from tenk1 y order by y.unique2) y > >> **25520** on x.thousand = y.unique2 and x.twothousand = y.hundred and > >> x.fivethous = y.unique2;" > >> > >> > >> I'm mostly used to weird valgrind stuff on this platform, but it's > >> usually about libarmmmem and (possibly) thinking it might access > >> undefined stuff when calculating checksums etc. > >> > >> This seems somewhat different, so I wonder if it's something real? > > > > It seems like a false positive to me. > > > > According to valgrind's documentation: > > https://valgrind.org/docs/manual/mc-manual.html#mc-manual.value > > > > " This can lead to false positive errors, as the shared memory can be > > initialised via a first mapping, and accessed via another mapping. The > > access via this other mapping will have its own V bits, which have not > been > > changed when the memory was initialised via the first mapping. The bypass > > for these false positives is to use Memcheck's client requests > > VALGRIND_MAKE_MEM_DEFINED and VALGRIND_MAKE_MEM_UNDEFINED to inform > > Memcheck about what your program does (or what another process does) to > > these shared memory mappings. " > > > > But that's about shared memory, and the report has nothing to do with > shared memory AFAICS. > You can try once: Selecting --expensive-definedness-checks=yes causes Memcheck to use the most accurate analysis possible. 
This minimises false error rates but can cause up to 30% performance degradation. I did a search through my reports and none refer to this particular source. best regards, Ranier Vilela
Custom TupleTableSlotOps while Initializing Custom Scan
Hello Team, Good Day,

I have been working on adding a CustomScanState object to the executor state in my project. As part of CustomScanState, I execute queries and store their results in a Tuplestorestate object. After storing all tuples in the Tuplestorestate, I retrieve each tuple and place it in the TupleTableSlot using the tuplestore_gettupleslot() function. However, I encounter an error: "trying to store a minimal tuple into the wrong type of slot."

Upon debugging, I discovered that the TupleTableSlot only holds virtual tuples (tupleTableSlot->tts_ops is set to TTSOpsVirtual). In contrast, tuplestore_gettupleslot() calls ExecStoreMinimalTuple(), which expects TupleTableSlotOps of type TTSOpsMinimalTuple.

Further investigation revealed that in the ExecInitCustomScan() function within the nodeCustom.c source file, where the ScanTupleSlot and ResultTupleSlot are initialized, users can choose custom slots by setting slotOps in CustomScanState. We initialize the ScanTupleSlot based on the user-specified slotOps, but for the ResultTupleSlot we proceed with TTSOpsVirtual instead of the custom slotOps, which is causing the issue.

Is this behavior expected? Is there a way to store tuples in slots according to the TupleTableSlot type? I found a function ExecForceStoreMinimalTuple() which could be used in my case. We need to pass the MinimalTuple to this function, but I was unable to find a way to fetch the tuple from the tuplestorestate. We do have the tuplestore_gettuple() function to get the minimal tuple, but it is a static function; is there any other function like that?

Below is the code snippet of ExecInitCustomScan(); for simplicity I removed some code in the function. I took it from the nodeCustom.c file in the PG source.

CustomScanState *
ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
{
    CustomScanState *css;
    const TupleTableSlotOps *slotOps;

    css = castNode(CustomScanState, cscan->methods->CreateCustomScanState(cscan));

    // --- CODE STARTED ---

    /*
     * Use a custom slot if specified in CustomScanState or use virtual slot
     * otherwise.
     */
    slotOps = css->slotOps;
    if (!slotOps)
        slotOps = &TTSOpsVirtual;

    if (cscan->custom_scan_tlist != NIL || scan_rel == NULL)
    {
        /* Here we are using the slotOps provided by the user */
        ExecInitScanTupleSlot(estate, &css->ss, scan_tupdesc, slotOps);
    }
    else
    {
        /* Here we are using the slotOps provided by the user */
        ExecInitScanTupleSlot(estate, &css->ss, RelationGetDescr(scan_rel), slotOps);
    }

    /* Here TTSOpsVirtual is hard-coded */
    ExecInitResultTupleSlotTL(&css->ss.ps, &TTSOpsVirtual);

    // --- CODE ENDED ---
}
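A possible workaround, assuming it is acceptable to copy each tuple once: fetch from the tuplestore into a private TTSOpsMinimalTuple slot (which is what tuplestore_gettupleslot() can store into), then convert it into whatever slot type the node actually uses via ExecCopySlot(). This is only a sketch; the function name and the css/tstore parameters are hypothetical stand-ins for the CustomScanState and Tuplestorestate described above, and in a real node the temporary slot would be created once at initialization time and reused rather than rebuilt on every call.

#include "postgres.h"

#include "executor/tuptable.h"
#include "nodes/execnodes.h"
#include "utils/tuplestore.h"

/* Sketch only: return the next tuple from "tstore" in the node's result slot. */
static TupleTableSlot *
fetch_next_from_tuplestore(CustomScanState *css, Tuplestorestate *tstore)
{
	TupleTableSlot *resultslot = css->ss.ps.ps_ResultTupleSlot;
	TupleTableSlot *tmpslot;

	/* a slot whose ops match what tuplestore_gettupleslot() stores */
	tmpslot = MakeSingleTupleTableSlot(resultslot->tts_tupleDescriptor,
									   &TTSOpsMinimalTuple);

	if (!tuplestore_gettupleslot(tstore, true, false, tmpslot))
	{
		ExecDropSingleTupleTableSlot(tmpslot);
		return ExecClearTuple(resultslot);	/* no more tuples */
	}

	/* ExecCopySlot() converts into the destination slot's own format */
	ExecCopySlot(resultslot, tmpslot);
	ExecDropSingleTupleTableSlot(tmpslot);

	return resultslot;
}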
Re: suspicious valgrind reports about radixtree/tidstore on arm64
On Thu, Jun 20, 2024 at 4:58 PM John Naylor wrote: > > On Thu, Jun 20, 2024 at 8:12 AM Masahiko Sawada wrote: > > > On Thu, Jun 20, 2024 at 7:54 AM Tom Lane wrote: > > > > > > I don't know if there's any reason why the current order > > > is preferable.) > > > > IIUC there is no particular reason for the current order in RT_NODE_48. > > Yeah. I found that simply swapping them enables clang to avoid > double-initialization, but gcc still can't figure it out and must be > told to stop at slot_idxs[]. I'd prefer to do it that way and document > that slot_idxs is purposefully the last member of the fixed part of > the struct. +1 Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: jsonapi type fixups
On 2024-06-20 Th 8:05 AM, Andrew Dunstan wrote: On 2024-06-18 Tu 7:48 AM, Peter Eisentraut wrote: I have this patch series that fixes up the types of the new incremental JSON API a bit. Specifically, it uses "const" throughout so that the top-level entry points such as pg_parse_json_incremental() can declare their arguments as const char * instead of just char *. This just works, it doesn't require any new casting tricks. In fact, it removes a few unconstify() calls. Also, a few arguments and variables that relate to object sizes should be size_t rather than int. At this point, this mainly makes the API better self-documenting. I don't think it actually works to parse larger than 2 GB chunks (not tested). I think this is mostly OK. The change at line 1857 of jsonapi.c looks dubious, though. The pointer variable p looks anything but constant. Perhaps I'm misunderstanding. Ignore this comment, moment of brain fade. Of course it's the string that's constant, not the pointer. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
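In case it helps anyone skimming the thread, the distinction behind that last remark is just the usual C one; a trivial, self-contained illustration, unrelated to the jsonapi code itself:

static void
const_example(const char *s)
{
	const char *p = s;		/* the characters are const, the pointer is not */

	p++;					/* fine: the pointer itself may change */
	/* *p = 'x'; */			/* would not compile: the pointed-to chars are const */
	(void) p;
}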
Re: Pgoutput not capturing the generated columns
On Thu, 20 Jun 2024 at 12:52, vignesh C wrote: > > On Wed, 19 Jun 2024 at 21:43, Peter Eisentraut wrote: > > > > On 19.06.24 13:22, Shubham Khanna wrote: > > > All the comments are handled. > > > > > > The attached Patch contains all the suggested changes. > > > > Please also take a look at the proposed patch for virtual generated > > columns [0] and consider how that would affect your patch. I think your > > feature can only replicate *stored* generated columns. So perhaps the > > documentation and terminology in your patch should reflect that. > > This patch is unable to manage virtual generated columns because it > stores NULL values for them. Along with documentation the initial sync > command being generated also should be changed to sync data > exclusively for stored generated columns, omitting virtual ones. I > suggest treating these changes as a separate patch(0003) for future > merging or a separate commit, depending on the order of patch > acceptance. > I have addressed the issue and updated the documentation accordingly. And created a new 0003 patch. Please refer to v9-0003 patch for the same in [1]. [1]: https://www.postgresql.org/message-id/CANhcyEXmjLEPNgOSAtjS4YGb9JvS8w-SO9S%2BjRzzzXo2RavNWw%40mail.gmail.com Thanks and Regards, Shlok Kyal
Re: Pluggable cumulative statistics
Hi, On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote: > Hi all, > > 2) Make the shmem pgstats pluggable so as it is possible for extensions > to register their own stats kinds. Thanks for the patch! I like the idea of having custom stats (it has also been somehow mentioned in [1]). > 2) is actually something that can be used for more things than > just pg_stat_statements, because people love extensions and > statistics (spoiler: I do). +1 > * Making custom stats data persistent is an interesting problem, and > there are a couple of approaches I've considered: > ** Allow custom kinds to define callbacks to read and write data from > a source they'd want, like their own file through a fd. This has the > disadvantage to remove the benefit of c) above. > ** Store everything in the existing stats file, adding one type of > entry like 'S' and 'N' with a "custom" type, where the *name* of the > custom stats kind is stored instead of its ID computed from shared > memory. What about having 2 files? - One is the existing stats file - One "predefined" for all the custom stats (so what you've done minus the in-core stats). This one would not be configurable and the extensions will not need to know about it. Would that remove the benefit from c) that you mentioned up-thread? [1]: https://www.postgresql.org/message-id/20220818195124.c7ipzf6c5v7vxymc%40awork3.anarazel.de Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Conflict Detection and Resolution
On Thu, Jun 20, 2024 at 5:06 PM Ashutosh Bapat wrote: > > On Thu, Jun 20, 2024 at 3:21 PM Amit Kapila wrote: >> >> On Wed, Jun 19, 2024 at 2:51 PM Ashutosh Bapat >> wrote: >> > >> > On Wed, Jun 19, 2024 at 12:03 PM Amit Kapila >> > wrote: >> >> >> >> > I doubt that it would be that simple. The application will have to >> >> > intervene and tell one of the employees that their reservation has >> >> > failed. It looks natural that the first one to reserve the room should >> >> > get the reservation, but implementing that is more complex than >> >> > resolving a conflict in the database. In fact, mostly it will be >> >> > handled outside database. >> >> > >> >> >> >> Sure, the application needs some handling but I have tried to explain >> >> with a simple way that comes to my mind and how it can be realized >> >> with db involved. This is a known conflict detection method but note >> >> that I am not insisting to have "earliest_timestamp_wins". Even, if we >> >> want this we can have a separate discussion on this and add it later. >> >> >> > >> > It will be good to add a minimal set of conflict resolution strategies to >> > begin with, while designing the feature for extensibility. I imagine the >> > first version might just detect the conflict and throw error or do >> > nothing. That's already two simple conflict resolution strategies with >> > minimal efforts. We can add more complicated ones incrementally. >> > >> >> Agreed, splitting the work into multiple patches would help us to >> finish the easier ones first. >> >> I have thought to divide it such that in the first patch, we detect >> conflicts like 'insert_exists', 'update_differ', 'update_missing', and >> 'delete_missing' (the definition of each could be found in the initial >> email [1]) and throw an ERROR or write them in LOG. Various people >> agreed to have this as a separate committable work [2]. This can help >> users to detect and monitor the conflicts in a better way. I have >> intentionally skipped update_deleted as it would require more >> infrastructure and it would be helpful even without that. > > > Since we are in the initial months of release, it will be good to take a > stock of whether the receiver receives all the information needed for most > (if not all) of the conflict detection and resolution strategies. If there > are any missing pieces, we may want to add those in PG18 so that improved > conflict detection and resolution on a higher version receiver can still work. > Good point. This can help us to detect conflicts if required even when we move to a higher version. As we continue to discuss/develop the features, I hope we will be able to see any missing pieces. >> >> >> In the second patch, we can implement simple built-in resolution >> strategies like apply and skip (which can be named as remote_apply and >> keep_local, see [3][4] for details on these strategies) with ERROR or >> LOG being the default strategy. We can allow these strategies to be >> configured at the global and table level. >> >> In the third patch, we can add monitoring capability for conflicts and >> resolutions as mentioned by Jonathan [5]. Here, we can have stats like >> how many conflicts of a particular type have happened. > > > That looks like a plan. Thanks for chalking it out. > Thanks! -- With Regards, Amit Kapila.
Re: Logical Replication of sequences
On Wed, Jun 19, 2024 at 8:33 PM vignesh C wrote: > > On Tue, 18 Jun 2024 at 16:10, Amit Kapila wrote: > > > > > > Agreed and I am not sure which is better because there is a value in > > keeping the state name the same for both sequences and tables. We > > probably need more comments in code and doc updates to make the > > behavior clear. We can start with the sequence state as 'init' for > > 'needs-to-be-sycned' and 'ready' for 'synced' and can change if others > > feel so during the review. > > Here is a patch which does the sequence synchronization in the > following lines from the above discussion: > Thanks for summarizing the points discussed. I would like to confirm whether the patch replicates new sequences that are created implicitly/explicitly for a publication defined as ALL SEQUENCES. -- With Regards, Amit Kapila.
Visibility bug with prepared transaction with subtransactions on standby
Hi, Konstantin and I found an MVCC bug with: - a prepared transaction, - which has a subtransaction, - on a hot standby, - after starting the standby from a shutdown checkpoint. See the test case in the attached patch to demonstrate this. The last query in the new test returns incorrect result on master, causing the test to fail. The problem --- When you shut down a primary with a prepared transaction, and start a hot standby server from the shutdown checkpoint, the hot standby server goes through this code at startup: if (wasShutdown) oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids); else oldestActiveXID = checkPoint.oldestActiveXid; Assert(TransactionIdIsValid(oldestActiveXID)); /* Tell procarray about the range of xids it has to deal with */ ProcArrayInitRecovery(XidFromFullTransactionId(TransamVariables->nextXid)); /* * Startup subtrans only. CLOG, MultiXact and commit timestamp * have already been started up and other SLRUs are not maintained * during recovery and need not be started yet. */ StartupSUBTRANS(oldestActiveXID); /* * If we're beginning at a shutdown checkpoint, we know that * nothing was running on the primary at this point. So fake-up an * empty running-xacts record and use that here and now. Recover * additional standby state for prepared transactions. */ if (wasShutdown) { RunningTransactionsData running; TransactionId latestCompletedXid; /* * Construct a RunningTransactions snapshot representing a * shut down server, with only prepared transactions still * alive. We're never overflowed at this point because all * subxids are listed with their parent prepared transactions. */ running.xcnt = nxids; running.subxcnt = 0; running.subxid_overflow = false; running.nextXid = XidFromFullTransactionId(checkPoint.nextXid); running.oldestRunningXid = oldestActiveXID; latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid); TransactionIdRetreat(latestCompletedXid); Assert(TransactionIdIsNormal(latestCompletedXid)); running.latestCompletedXid = latestCompletedXid; running.xids = xids; ProcArrayApplyRecoveryInfo(&running); StandbyRecoverPreparedTransactions(); } The problem is that the RunningTransactions snapshot constructed here does not include subtransaction XIDs of the prepared transactions, only the main XIDs. Because of that, snapshots taken in the standby will consider the sub-XIDs as aborted rather than in-progress. That leads to two problems if the prepared transaction is later committed: - We will incorrectly set hint bits on tuples inserted/deleted by the subtransactions, which leads to incorrect query results later if the prepared transaction is committed. - If you acquire an MVCC snapshot and hold to it while the prepared transaction commits, the subtransactions will suddenly become visible to the old snapshot. History --- StandbyRecoverPreparedTransactions has this comment: * The lack of calls to SubTransSetParent() calls here is by design; * those calls are made by RecoverPreparedTransactions() at the end of recovery * for those xacts that need this. I think that's wrong; it really should update pg_subtrans. It used to, a long time ago, but commit 49e92815497 changed it. Reading the discussions that led to that change, seems that we somehow didn't realize that it's important to distinguish between in-progress and aborted transactions in a standby. 
On that thread, Nikhil posted [1] a test case that is almost exactly the same test case that I used to find this, but apparently the visibility in standby in that scenario was not tested thoroughly back then. [1] https://www.postgresql.org/message-id/CAMGcDxde4XjDyTjGvZCPVQROpXw1opfpC0vjpCkzc1pcQBqvrg%40mail.gmail.com Fix --- Attached is a patch to fix this, with a test case.
Re: Visibility bug with prepared transaction with subtransactions on standby
On 20/06/2024 16:41, Heikki Linnakangas wrote: Attached is a patch to fix this, with a test case. The patch did not compile, thanks to a last-minute change in a field name. Here's a fixed version. -- Heikki Linnakangas Neon (https://neon.tech) From ccb0bd36619f03055382205eddd7a0e9b64bba95 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 20 Jun 2024 15:49:05 +0300 Subject: [PATCH v2 1/2] tests: Trim newline from result returned by BackgroundPsql->query This went unnoticed, because only a few existing callers of BackgroundPsql->query used the result, and the ones that did were not bothered by an extra newline. I noticed because I was about to add a new test that checks the result. --- src/test/perl/PostgreSQL/Test/BackgroundPsql.pm | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm index 4091c311b8..041b3dfa7d 100644 --- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm +++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm @@ -217,13 +217,13 @@ sub query my $banner = "background_psql: QUERY_SEPARATOR"; $self->{stdin} .= "$query\n;\n\\echo $banner\n"; - pump_until($self->{run}, $self->{timeout}, \$self->{stdout}, qr/$banner/); + pump_until($self->{run}, $self->{timeout}, \$self->{stdout}, qr/$banner\n/); die "psql query timed out" if $self->{timeout}->is_expired; $output = $self->{stdout}; # remove banner again, our caller doesn't care - $output =~ s/\n$banner$//s; + $output =~ s/\n$banner\n$//s; # clear out output for the next query $self->{stdout} = ''; -- 2.39.2 From f44c5ec2d3383240c495c36e05da812791e3f5d4 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 20 Jun 2024 17:09:41 +0300 Subject: [PATCH v2 2/2] Fix MVCC bug with prepared xact with subxacts on standby We did not recover the subtransaction IDs of prepared transactions when starting a hot standby from a shutdown checkpoint. As a result, such subtransactions were considered as aborted, rather than in-progress. That would lead to hint bits being set incorrectly, and the subtransactions suddenly becoming visible to old snapshots when the prepared transaction was committed. To fix, update pg_subtrans with prepared transactions's subxids when starting hot standby from a shutdown checkpoint. The snapshots taken from that state need to be marked as "suboverflowed", so that we also check the pg_subtrans. Discussion: XXX --- src/backend/access/transam/twophase.c | 7 ++--- src/backend/access/transam/xlog.c | 14 + src/backend/storage/ipc/procarray.c | 18 +-- src/backend/storage/ipc/standby.c | 6 ++-- src/include/storage/standby.h | 10 +- src/test/recovery/t/009_twophase.pl | 45 +++ src/tools/pgindent/typedefs.list | 1 + 7 files changed, 84 insertions(+), 17 deletions(-) diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index bf451d42ff..9a8257fcaf 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -2035,9 +2035,8 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p) * This is never called at the end of recovery - we use * RecoverPreparedTransactions() at that point. * - * The lack of calls to SubTransSetParent() calls here is by design; - * those calls are made by RecoverPreparedTransactions() at the end of recovery - * for those xacts that need this. + * This updates pg_subtrans, so that any subtransactions will be correctly + * seen as in-progress in snapshots taken during recovery. 
*/ void StandbyRecoverPreparedTransactions(void) @@ -2057,7 +2056,7 @@ StandbyRecoverPreparedTransactions(void) buf = ProcessTwoPhaseBuffer(xid, gxact->prepare_start_lsn, - gxact->ondisk, false, false); + gxact->ondisk, true, false); if (buf != NULL) pfree(buf); } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 330e058c5f..3d084d1c38 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5777,6 +5777,9 @@ StartupXLOG(void) RunningTransactionsData running; TransactionId latestCompletedXid; +/* Update pg_subtrans entries for any prepared transactions */ +StandbyRecoverPreparedTransactions(); + /* * Construct a RunningTransactions snapshot representing a * shut down server, with only prepared transactions still @@ -5785,7 +5788,7 @@ StartupXLOG(void) */ running.xcnt = nxids; running.subxcnt = 0; -running.subxid_overflow = false; +running.subxid_status = SUBXIDS_IN_SUBTRANS; running.nextXid = XidFromFullTransactionId(checkPoint.nextXid); running.oldestRunningXid = oldestActiveXID; latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid); @@ -5795,8 +5798,6 @@ StartupXLOG(void) running.xids = xids; ProcA
Re: Pluggable cumulative statistics
Hi, On Thu, Jun 20, 2024 at 09:46:30AM +0900, Michael Paquier wrote: > On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote: > > - How should the persistence of the custom stats be achieved? > > Callbacks to give custom stats kinds a way to write/read their data, > > push everything into a single file, or support both? > > - Should this do like custom RMGRs and assign to each stats kinds ID > > that are set in stone rather than dynamic ones? > These two questions have been itching me in terms of how it would work > for extension developers, after noticing that custom RMGRs are used > more than I thought: > https://wiki.postgresql.org/wiki/CustomWALResourceManagers > > The result is proving to be nicer, shorter by 300 lines in total and > much simpler when it comes to think about the way stats are flushed > because it is possible to achieve the same result as the first patch > set without manipulating any of the code paths doing the read and > write of the pgstats file. I think it makes sense to follow the same "behavior" as the custom wal resource managers. That, indeed, looks much more simpler than v1. > In terms of implementation, pgstat.c's KindInfo data is divided into > two parts, for efficiency: > - The exiting in-core stats with designated initializers, renamed as > built-in stats kinds. > - The custom stats kinds are saved in TopMemoryContext, Agree that a backend lifetime memory area is fine for that purpose. > and can only > be registered with shared_preload_libraries. The patch reserves a set > of 128 harcoded slots for all the custom kinds making the lookups for > the KindInfos quite cheap. + MemoryContextAllocZero(TopMemoryContext, + sizeof(PgStat_KindInfo *) * PGSTAT_KIND_CUSTOM_SIZE); and that's only 8 * PGSTAT_KIND_CUSTOM_SIZE bytes in total. I had a quick look at the patches (have in mind to do more): > With that in mind, the patch set is more pleasant to the eye, and the > attached v2 consists of: > - 0001 and 0002 are some cleanups, same as previously to prepare for > the backend-side APIs. 0001 and 0002 look pretty straightforward at a quick look. > - 0003 adds the backend support to plug-in custom stats. 1 === It looks to me that there is a mix of "in core" and "built-in" to name the non custom stats. Maybe it's worth to just use one? As I can see (and as you said above) this is mainly inspired by the custom resource manager and 2 === and 3 === are probably copy/paste consequences. 2 === + if (pgstat_kind_custom_infos[idx] != NULL && + pgstat_kind_custom_infos[idx]->name != NULL) + ereport(ERROR, + (errmsg("failed to register custom cumulative statistics \"%s\" with ID %u", kind_info->name, kind), +errdetail("Custom resource manager \"%s\" already registered with the same ID.", + pgstat_kind_custom_infos[idx]->name))); s/Custom resource manager/Custom cumulative statistics/ 3 === + ereport(LOG, + (errmsg("registered custom resource manager \"%s\" with ID %u", + kind_info->name, kind))); s/custom resource manager/custom cumulative statistics/ > - 0004 includes documentation. Did not look yet. > - 0005 is an example of how to use them, with a TAP test providing > coverage. Did not look yet. As I said, I've in mind to do a more in depth review. I've noted the above while doing a quick read of the patches so thought it makes sense to share them now while at it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Remove distprep
On Thu, Jun 20, 2024 at 09:29:45AM +0200, Peter Eisentraut wrote: > On 16.06.24 21:34, Noah Misch wrote: > > On Thu, Oct 05, 2023 at 05:46:46PM +0200, Peter Eisentraut wrote: > > > --- a/src/backend/Makefile > > > +++ b/src/backend/Makefile > > > > > $(top_builddir)/src/include/storage/lwlocknames.h: > > > storage/lmgr/lwlocknames.h > > > - prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \ > > > - cd '$(dir $@)' && rm -f $(notdir $@) && \ > > > - $(LN_S) "$$prereqdir/$(notdir $<)" . > > > + rm -f '$@' > > > + $(LN_S) ../../backend/$< '$@' > > > $(top_builddir)/src/include/utils/wait_event_types.h: > > > utils/activity/wait_event_types.h > > > - prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \ > > > - cd '$(dir $@)' && rm -f $(notdir $@) && \ > > > - $(LN_S) "$$prereqdir/$(notdir $<)" . > > > + rm -f '$@' > > > + $(LN_S) ../../backend/$< '$@' > > > > These broke the > > https://www.postgresql.org/docs/17/installation-platform-notes.html#INSTALLATION-NOTES-MINGW > > build, where LN_S='cp -pR'. On other platforms, "make LN_S='cp -pR'" > > reproduces this. Reverting the above lines fixes things. The buildfarm has > > no coverage for that build scenario (fairywren uses Meson). > > Is it just these two instances? Yes. > Commit 721856ff24b contains a few more hunks that change something about > LN_S. Are those ok? I'm guessing "make LN_S='cp -pR'" didn't have a problem with those because they have "." as the second argument. "cp -pR ../foo ." is compatible with "ln -s ../foo ." in that both are interpreting "../bar" relative to the same directory. Not so for "cp -pR ../foo bar/baz" vs. "ln -s ../foo bar/baz".
Re: jsonpath Time and Timestamp Special Cases
On Apr 29, 2024, at 20:45, David E. Wheeler wrote: > I noticed that the jsonpath date/time functions (.time() and timestamp(), et > al.) don’t support some valid but special-case PostgreSQL values, notably > `infinity`, `-infinity`, and, for times, '24:00:00`: Looking at ECMA-404[1], “The JSON data interchange syntax”, it says, of numbers: > Numeric values that cannot be represented as sequences of digits (such as > Infinity and NaN) are not permitted. So it makes sense that the JSON path standard would be the same, since such JSON explicitly cannot represent these values as numbers. Still not sure about `24:00:00` as a time, though. I presume the jsonpath standard disallows it. Best, David [1]: https://ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf
Re: datfrozenxid > relfrozenxid w/ crash before XLOG_HEAP_INPLACE
On Thu, Jun 20, 2024 at 12:17:44PM +0500, Andrey M. Borodin wrote: > On 20 Jun 2024, at 06:29, Noah Misch wrote: > > This resolves the last inplace update defect known to me. > > That’s a huge amount of work, thank you! > > Do I get it right, that inplace updates are catalog-specific and some other > OOM corruptions [0] and Standby corruptions [1] are not related to this fix. > Bot cases we observed on regular tables. In core code, inplace updates are specific to pg_class and pg_database. Adding PGXN modules, only the citus extension uses them on some other table. [0] definitely looks unrelated. > Or that might be effect of vacuum deepening corruption after observing wrong > datfrozenxid? Wrong datfrozenxid can cause premature clog truncation, which can cause "could not access status of transaction". While $SUBJECT could cause that, I think it would happen on both primary and standby. [1] seems to be about a standby lacking clog present on the primary, which is unrelated. > [0] > https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47 > [1] > https://www.postgresql.org/message-id/flat/CAFj8pRBEFMxxFSCVOSi-4n0jHzSaxh6Ze_cZid5eG%3Dtsnn49-A%40mail.gmail.com
Re: Special-case executor expression steps for common combinations
On 10/12/23 11:48 AM, Daniel Gustafsson wrote: Thoughts? I have looked at the patch and it still applies, builds and passes the test cases and I personally think these optimizations are pretty much no-brainers that we should do and it is a pity nobody has had the time to review this patch. 1) The no-return case should help with the JIT, making jitted code faster. 2) The specialized strict steps helps with many common queries in the interpreted mode. The code itself looks really good (great work!) but I have two comments on it. 1) I think the patch should be split into two. The two different optimizations are not related at all other than that they create specialized versions of expressions steps. Having them as separate makes the commit history easier to read for future developers. 2) We could generate functions which return void rather than NULL and therefore not have to do a return at all but I am not sure that small optimization and extra clarity would be worth the hassle. The current approach with adding Assert() is ok with me. Daniel, what do you think? Andreas
Re: Special-case executor expression steps for common combinations
On 6/20/24 5:22 PM, Andreas Karlsson wrote: On 10/12/23 11:48 AM, Daniel Gustafsson wrote: Thoughts? I have looked at the patch and it still applies, builds and passes the test cases and I personally think these optimizations are pretty much no-brainers that we should do and it is a pity nobody has had the time to review this patch. Forgot to write that I am planning to also try to do some benchmarks to see if I can reproduce the speedups. :) Andreas
Re: suspicious valgrind reports about radixtree/tidstore on arm64
Ranier Vilela writes: > Em qua., 19 de jun. de 2024 às 20:52, Tom Lane escreveu: >> Hah: it's the second case. If I patch radixtree.h as attached, >> then x86_64 valgrind complains about >> ==00:00:00:32.759 247596== Conditional jump or move depends on >> uninitialised value(s) >> ==00:00:00:32.759 247596==at 0x52F668: local_ts_node_16_search_eq >> (radixtree.h:1018) >> showing that it knows that the result of vector8_highbit_mask is >> only partly defined. > I wouldn't be surprised if *RT_NODE_16_GET_INSERTPOS* > (src/include/lib/radixtree.h), > does not suffer from the same problem? Dunno, I only saw valgrind complaints in local_ts_node_16_search_eq, and Tomas reported the same. It seems moderately likely to me that this is a bug in aarch64 valgrind. Still, if it is that then people will have to deal with it for awhile yet. It won't cost us anything meaningful to work around it (he says, without having done actual measurements...) regards, tom lane
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Hannu Krosing writes: > Is there anything that could be back-patched with reasonable effort ? Afraid not. The whole thing is dependent on pg_shdepend entries that won't exist in older branches. regards, tom lane
Re: jsonpath Time and Timestamp Special Cases
On 06/20/24 10:54, David E. Wheeler wrote: > Still not sure about `24:00:00` as a time, though. I presume the jsonpath > standard disallows it. In 9075-2 9.46 "SQL/JSON path language: syntax and semantics", the behavior of the .time() and .time_tz() and similar item methods defers to the behavior of SQL's CAST. For example, .time(PS) (where PS is the optional precision spec) expects to be applied to a character string X from the JSON source, and its success/failure and result are the same as for CAST(X AS TIME PS). The fact that our CAST(X AS TIME) will succeed for '24:00:00' might be its own extension (or violation) of the spec (I haven't checked that), but given that it does, we could debate whether it violates the jsonpath spec for our jsonpath .time() to behave the same way. The same argument may also apply for ±infinity. Regards, -Chap
Re: Extension security improvement: Add support for extensions with an owned schema
On Wed, Jun 19, 2024 at 1:50 PM Jelte Fennema-Nio wrote: > I do think, even if we have this, there would be other good reasons to > use "owned schemas" for extension authors. At least the following two: > 1. To have a safe search_path that can be used in SET search_path on a > function (see also [1]). > 2. To make it easy for extension authors to avoid conflicts with other > extensions/UDFs. (1) is a very good point. (2) I don't know about one way or the other. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SQL/JSON query functions context_item doc entry and type requirement
On Thu, Jun 20, 2024 at 5:46 PM Amit Langote wrote: > > On Thu, Jun 20, 2024 at 1:03 AM David G. Johnston > wrote: > > On Wed, Jun 19, 2024 at 8:29 AM jian he wrote: > >> > >> On Mon, Jun 17, 2024 at 9:05 PM Chapman Flack wrote: > >> > > >> > Hi, > >> > > >> > On 06/17/24 02:43, Amit Langote wrote: > >> > > context_item expression can be a value of > >> > > any type that can be cast to jsonb. This includes types > >> > > such as char, text, bpchar, > >> > > character varying, and bytea (with > >> > > ENCODING UTF8), as well as any domains over these types. > >> > > >> > Reading this message in conjunction with [0] makes me think that we are > >> > really talking about a function that takes a first parameter of type > >> > jsonb, > >> > and behaves exactly that way (so any cast required is applied by the > >> > system > >> > ahead of the call). Under those conditions, this seems like an unusual > >> > sentence to add in the docs, at least until we have also documented that > >> > tan's argument can be of any type that can be cast to double precision. > >> > > >> > >> I guess it would be fine to add an unusual sentence to the docs. > >> > >> imagine a function: array_avg(anyarray) returns anyelement. > >> array_avg calculate an array's elements's avg. like > >> array('{1,2,3}'::int[]) returns 2. > >> but array_avg won't make sense if the input argument is a date array. > >> so mentioning in the doc: array_avg can accept anyarray, but anyarray > >> cannot date array. > >> seems ok. > > > > > > There is existing wording for this: > > > > "The expression can be of any JSON type, any character string type, or > > bytea in UTF8 encoding." > > > > If you add this sentence to the paragraph the link that already exists, > > which simply points the reader to this sentence, becomes redundant and > > should be removed. > > I've just posted a patch in the other thread [1] to restrict > context_item to be of jsonb type, which users would need to ensure by > adding an explicit cast if needed. I think that makes this > clarification unnecessary. > > > As for table 9.16.3 - it is unwieldy already. Lets try and make the core > > syntax shorter, not longer. We already have precedence in the subsequent > > json_table section - give each major clause item a name then below the > > table define the syntax and meaning for those names. Unlike in that > > section - which probably should be modified too - context_item should have > > its own description line. > > I had posted a patch a little while ago at [1] to render the syntax a > bit differently with each function getting its own syntax synopsis. > Resending it here; have addressed Jian He's comments. > > -- @@ -18746,6 +18752,7 @@ ERROR: jsonpath array subscript is out of bounds PASSING values. +Returns the result of applying the SQL/JSON If the path expression returns multiple SQL/JSON items, it might be necessary to wrap the result using the WITH WRAPPER clause to make it a valid JSON string. If the wrapper is +Returns the result of applying the SQL/JSON is redundant? playing around with it. found some minor issues: json_exists allow: DEFAULT expression ON ERROR, which is not mentioned in the doc. for example: select JSON_EXISTS(jsonb '{"a": [1,2,3]}', 'strict $.a[5]' default true ON ERROR); select JSON_EXISTS(jsonb '{"a": [1,2,3]}', 'strict $.a[5]' default 0 ON ERROR); select JSON_EXISTS(jsonb '{"a": [1,2,3]}', 'strict $.a[5]' default 11 ON ERROR); JSON_VALUE on error, on empty semantics should be the same as json_query. 
like: [ { ERROR | NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression } ON EMPTY ] [ { ERROR | NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression } ON ERROR ]) examples: select JSON_value(jsonb '[]' , '$' empty array on error); select JSON_value(jsonb '[]' , '$' empty object on error);
Re: ON ERROR in json_query and the like
On Thu, Jun 20, 2024 at 2:19 AM Amit Langote wrote: > Hi, > > On Mon, Jun 17, 2024 at 10:07 PM Markus Winand > wrote: > > > On 17.06.2024, at 08:20, Amit Langote wrote: > > > Agree that the documentation needs to be clear about this. I'll update > > > my patch at [1] to add a note next to table 9.16.3. SQL/JSON Query > > > Functions. > > > > Considering another branch of this thread [1] I think the > > "Supported Features” appendix of the docs should mention that as well. > > > > The way I see it is that the standards defines two overloaded > > JSON_QUERY functions, of which PostgreSQL will support only one. > > In case of valid JSON, the implied CAST makes it look as though > > the second variant of these function was supported as well but that > > illusion totally falls apart once the JSON is not valid anymore. > > > > I think it affects the following feature IDs: > > > > - T821, Basic SQL/JSON query operators > > For JSON_VALUE, JSON_TABLE and JSON_EXISTS > > - T828, JSON_QUERY > > > > Also, how hard would it be to add the functions that accept > > character strings? Is there, besides the effort, any thing else > > against it? I’m asking because I believe once released it might > > never be changed — for backward compatibility. > > Hmm, I'm starting to think that adding the implied cast to json wasn't > such a great idea after all, because it might mislead the users to > think that JSON parsing is transparent (respects ON ERROR), which is > what you are saying, IIUC. > > Actually, the implied cast is exactly the correct thing to do here - the issue is that we aren't using the newly added soft errors infrastructure [1] to catch the result of that cast and instead produce whatever output on error tells us to produce. This seems to be in the realm of doability so we should try in the interest of being standard conforming. I'd even argue to make this an open item unless and until the attempt is agreed upon to have failed (or it succeeds of course). David J. [1] d9f7f5d32f201bec61fef8104aafcb77cecb4dcb
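For context, here is a minimal sketch of what "catching" that cast with the soft-error machinery referenced above could look like: call the jsonb input function through InputFunctionCallSafe() with an ErrorSaveContext, so that invalid input reports failure instead of throwing, and the caller can then do whatever the ON ERROR clause asks for. This only illustrates the mechanism from commit [1]; it is not the actual change being proposed for the SQL/JSON code, and the helper name is made up.

#include "postgres.h"

#include "catalog/pg_type.h"
#include "fmgr.h"
#include "nodes/miscnodes.h"
#include "utils/lsyscache.h"

/*
 * Try to coerce a C string to jsonb without throwing on invalid JSON.
 * Returns true and sets *result on success, false on a soft error.
 */
static bool
coerce_string_to_jsonb_safe(const char *str, Datum *result)
{
	ErrorSaveContext escontext = {T_ErrorSaveContext};
	Oid			typinput;
	Oid			typioparam;
	FmgrInfo	flinfo;

	getTypeInputInfo(JSONBOID, &typinput, &typioparam);
	fmgr_info(typinput, &flinfo);

	/* jsonb_in() reports parse errors softly into escontext */
	if (!InputFunctionCallSafe(&flinfo, unconstify(char *, str),
							   typioparam, -1,
							   (Node *) &escontext, result))
		return false;			/* caller decides: NULL, DEFAULT, or ERROR */

	return true;
}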
Re: Extension security improvement: Add support for extensions with an owned schema
On Jun 19, 2024, at 11:28, Robert Haas wrote: > But I wonder if there might also be another possible approach: could > we, somehow, prevent object references in extension scripts from > resolving to anything other than the system catalogs and the contents > of that extension? Perhaps with a control file setting to specify a > list of trusted extensions which we're also allowed to reference? It would also have to allow access to other extensions it depends upon. D
Re: Direct SSL connection and ALPN loose ends
On Mon, Jun 17, 2024 at 9:23 AM Jacob Champion wrote: > > I think the behavior with v2 and v3 errors should be the same. And I > > think an immediate failure is appropriate on any v2/v3 error during > > negotiation, assuming we don't use those errors for things like "TLS not > > supported", which would warrant a fallback. > > For GSS encryption, it was my vague understanding that older servers > respond with an error rather than the "not supported" indication. For > TLS, though, the decision in a49fbaaf (immediate failure) seemed > reasonable. Would an open item for this be appropriate? Thanks, --Jacob
Re: Extension security improvement: Add support for extensions with an owned schema
On Jun 19, 2024, at 13:50, Jelte Fennema-Nio wrote: > This indeed does sound like the behaviour that pretty much every > existing extension wants to have. One small addition/clarification > that I would make to your definition: fully qualified references to > other objects should still be allowed. That would be tricky for referring to objects from other extensions that have no defined schema or that are relocatable. > 1. To have a safe search_path that can be used in SET search_path on a > function (see also [1]). > 2. To make it easy for extension authors to avoid conflicts with other > extensions/UDFs. These would indeed be nice improvements IMO. Best, David
call for applications: mentoring program for code contributors
Hi, I'm working to start a mentoring program where code contributors can be mentored by current committers. Applications are now open: https://forms.gle/dgjmdxtHYXCSg6aB7 Nine committers have volunteered to mentor one person each; hence, the anticipated number of acceptances is less than or equal to nine. In the future, we may have more mentors, or some mentors may be willing to take more than one mentee, or some mentoring relationships may end, opening up spots for new people, but right now I have nine slots maximum. Even if less than nine people apply initially, that doesn't guarantee that your application will be accepted, because the way this works is you can only be matched to a committer if you want to be matched with them and they want to be matched with you. If you don't already have a significant track record on pgsql-hackers, it is probably unlikely that you will find a mentor in this program at this time. Even if you do, you may not match with a mentor for any number of reasons: not enough slots, time zone, language issues, your particular interests as contrasted with those of the mentors, etc. The basic expectation around mentorship is that your mentor will have a voice call with you at least once per month for at least one hour. Before that call, you should give them some idea what you'd like to talk about and they should do some non-zero amount of preparation. During that call, they'll try to give you some useful advice. Maybe they'll be willing to do other things, too, like review and commit your patches, or email back and forth with you off-list, or chat using an instant messaging service, but if they do any of that stuff, that's extra. Either the mentor or the mentee is free to end the mentoring relationship at any time for any reason, or for no reason. If that happens, please let me know, whether it's because of an explicit decision on someone's part, or because somehow the monthly voice calls have ceased to occur. Periodically, someone -- most likely not me, since a few people have been kind enough to offer help -- will contact mentors and mentees to get feedback on how things are going. We'll use this feedback to improve the program, which might involve adjusting mentoring assignments, or might involve taking such other actions as the situation may suggest. In the future, I would like to expand this program to include non-committer mentors. The idea would be that committers would most likely want to mentor more senior contributors and senior non-committers could mentor more junior contributors, so that we pay it all forward. If this is something you'd be interested in participating in, whether as a co-organizer, mentor, or mentee, please let me know. It might also be advantageous to expand this program, or have a separate program, to mentor people making non-code contributions e.g. mentoring for conference organizers. I've chosen to focus on mentorship for code contribution because I know enough about it to function as an organizer for such an effort. If you apply for this program, you can expect to receive an email from me in the next couple of weeks letting you know the result of your application. If for some reason that does not occur, please feel free to email me privately, but note that I'll want to give a bit of time for people to see this email and fill out the form before doing anything, and then I'll need to talk over possibilities with the mentors before finalizing anything, so it will take a bit of time. 
Finally, I would like to extend a special thanks to the mentors for volunteering to mentor, and a more general thanks to everyone who contributes to PostgreSQL in any way or is interested in doing so for their interest in and hard work on the project. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Re: call for applications: mentoring program for code contributors
On Jun 20, 2024, at 13:12, Robert Haas wrote: > I'm working to start a mentoring program where code contributors can > be mentored by current committers. Applications are now open: > https://forms.gle/dgjmdxtHYXCSg6aB7 This is amazing! Thank you for putting it together, Robert. Best, David
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Then maybe we should put a query / function in the release notes to clean up the existing mess. Thinking of it we should do it anyway, as the patch only prevents new messiness from happening and does not fix existing issues. I could share a query to update the pg_init_privs with non-existent role to replace it with the owner of the object if we figure out a correct place to publish it. --- Hannu On Thu, Jun 20, 2024 at 5:35 PM Tom Lane wrote: > > Hannu Krosing writes: > > Is there anything that could be back-patched with reasonable effort ? > > Afraid not. The whole thing is dependent on pg_shdepend entries > that won't exist in older branches. > > regards, tom lane
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Or perhaps we should still also patch pg_dump to ignore the aclentries which refer to roles that do not exist in the database ? On Thu, Jun 20, 2024 at 7:41 PM Hannu Krosing wrote: > > Then maybe we should put a query / function in the release notes to > clean up the existing mess. > > Thinking of it we should do it anyway, as the patch only prevents new > messiness from happening and does not fix existing issues. > > I could share a query to update the pg_init_privs with non-existent > role to replace it with the owner of the object if we figure out a > correct place to publish it. > > --- > Hannu > > > > > On Thu, Jun 20, 2024 at 5:35 PM Tom Lane wrote: > > > > Hannu Krosing writes: > > > Is there anything that could be back-patched with reasonable effort ? > > > > Afraid not. The whole thing is dependent on pg_shdepend entries > > that won't exist in older branches. > > > > regards, tom lane
Re: DROP OWNED BY fails to clean out pg_init_privs grants
On Thu, Jun 20, 2024 at 2:25 PM Tom Lane wrote: > Hannu Krosing writes: > > Or perhaps we should still also patch pg_dump to ignore the aclentries > > which refer to roles that do not exist in the database ? > > I didn't want to do that before, and I still don't. Given that this > issue has existed since pg_init_privs was invented (9.6) without > prior reports, I don't think it's a big enough problem in practice > to be worth taking extraordinary actions for. If we don't fix it in the code and we don't document it anywhere, the next person who hits it is going to have to try to discover the fact that there's a problem from the pgsql-hackers archives. That doesn't seem good. I don't have an educated opinion about what we should do here specifically, and I realize that we don't have any official place to document known issues, but it can be pretty inconvenient for users not to know about them. -- Robert Haas EDB: http://www.enterprisedb.com
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Hannu Krosing writes: > Or perhaps we should still also patch pg_dump to ignore the aclentries > which refer to roles that do not exist in the database ? I didn't want to do that before, and I still don't. Given that this issue has existed since pg_init_privs was invented (9.6) without prior reports, I don't think it's a big enough problem in practice to be worth taking extraordinary actions for. regards, tom lane
Re: CREATE INDEX CONCURRENTLY on partitioned index
On 15.06.2024 20:40, Justin Pryzby wrote: On Thu, May 23, 2024 at 10:14:57PM +0100, Ilya Gladyshev wrote: Hi, I think it's well worth the effort to revive the patch, so I rebased it on master, updated it and will return it back to the commitfest. Alexander, Justin feel free to add yourselves as authors Thanks -- I was intending to write about this. I realized that the patch will need some isolation tests to exercise its concurrent behavior. Thanks for the suggestion, added an isolation test that verifies behaviour of partitioned CIC with simultaneous partition drop/detach going on. Also fixed some issues in the new patch that I found while writing the test. From 45f2ec9ee57a5337b77b66db3c8c5092f305a176 Mon Sep 17 00:00:00 2001 From: Ilya Gladyshev Date: Tue, 11 Jun 2024 17:48:08 +0100 Subject: [PATCH v5 2/2] Fix progress report for partitioned REINDEX --- src/backend/catalog/index.c | 11 -- src/backend/commands/indexcmds.c | 63 +--- src/include/catalog/index.h | 1 + 3 files changed, 66 insertions(+), 9 deletions(-) diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 55fdde4b24..c5bc72b350 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -3559,6 +3559,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId, volatile bool skipped_constraint = false; PGRUsage ru0; bool progress = ((params->options & REINDEXOPT_REPORT_PROGRESS) != 0); + bool partition = ((params->options & REINDEXOPT_PARTITION) != 0); bool set_tablespace = false; pg_rusage_init(&ru0); @@ -3604,8 +3605,9 @@ reindex_index(const ReindexStmt *stmt, Oid indexId, indexId }; - pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, - heapId); + if (!partition) + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, + heapId); pgstat_progress_update_multi_param(2, progress_cols, progress_vals); } @@ -3845,8 +3847,11 @@ reindex_index(const ReindexStmt *stmt, Oid indexId, index_close(iRel, NoLock); table_close(heapRelation, NoLock); - if (progress) + if (progress && !partition) + { + /* progress for partitions is tracked in the caller */ pgstat_progress_end_command(); + } } /* diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index dcb4ea89e9..6abe1f017c 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -118,6 +118,7 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt, const ReindexParams *params); static void update_relispartition(Oid relationId, bool newval); static inline void set_indexsafe_procflags(void); +static inline void progress_index_partition_done(void); /* * callback argument type for RangeVarCallbackForReindexIndex() @@ -1550,7 +1551,7 @@ DefineIndex(Oid tableId, * Update progress for an intermediate partitioned index * itself */ -pgstat_progress_incr_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, 1); +progress_index_partition_done(); } return address; @@ -1577,7 +1578,7 @@ DefineIndex(Oid tableId, if (!OidIsValid(parentIndexId)) pgstat_progress_end_command(); else if (!concurrent) - pgstat_progress_incr_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, 1); + progress_index_partition_done(); return address; } @@ -1686,7 +1687,7 @@ DefineIndex(Oid tableId, heaplocktag, heaprelid); PushActiveSnapshot(GetTransactionSnapshot()); - pgstat_progress_incr_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, 1); + progress_index_partition_done(); } /* Set as valid all partitioned indexes, including the parent */ @@ -3331,6 +3332,14 @@ ReindexPartitions(const ReindexStmt *stmt, Oid relid, const ReindexParams 
*param ListCell *lc; ErrorContextCallback errcallback; ReindexErrorInfo errinfo; + ReindexParams newparams; + int progress_params[3] = { + PROGRESS_CREATEIDX_COMMAND, + PROGRESS_CREATEIDX_PHASE, + PROGRESS_CREATEIDX_PARTITIONS_TOTAL + }; + int64 progress_values[3]; + Oid heapId = relid; Assert(RELKIND_HAS_PARTITIONS(relkind)); @@ -3392,11 +3401,28 @@ ReindexPartitions(const ReindexStmt *stmt, Oid relid, const ReindexParams *param MemoryContextSwitchTo(old_context); } + if (relkind == RELKIND_PARTITIONED_INDEX) + { + heapId = IndexGetRelation(relid, true); + } + + progress_values[0] = (params->options & REINDEXOPT_CONCURRENTLY) != 0 ? + PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY : + PROGRESS_CREATEIDX_COMMAND_REINDEX; + progress_values[1] = 0; + progress_values[2] = list_length(partitions); + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId); + pgstat_progress_update_multi_param(3, progress_params, progress_values); + /* * Process each partition listed in a separate transaction. Note that * this commits and then starts a new transaction immediately. */ - ReindexMultipleInternal(stmt, partitions, params); + newparams = *
Re: DROP OWNED BY fails to clean out pg_init_privs grants
It does happen with some regularity. At least one large cloud database provider I know of saw this more than once a month until the mitigations were integrated in the major version upgrade process. It is possible that making database upgrades easier via better automation is what made this turn up more, as now less experienced / non-DBA types are more comfortable doing the version upgrades, whereas before it would be something done by a person who can also diagnose it and manually fix pg_init_privs. Still it would be nice to have some public support for users of non-managed PostgreSQL databases as well On Thu, Jun 20, 2024 at 8:25 PM Tom Lane wrote: > > Hannu Krosing writes: > > Or perhaps we should still also patch pg_dump to ignore the aclentries > > which refer to roles that do not exist in the database ? > > I didn't want to do that before, and I still don't. Given that this > issue has existed since pg_init_privs was invented (9.6) without > prior reports, I don't think it's a big enough problem in practice > to be worth taking extraordinary actions for. > > regards, tom lane
Re: DROP OWNED BY fails to clean out pg_init_privs grants
On Thu, Jun 20, 2024 at 3:43 PM Hannu Krosing wrote: > Still it would be nice to have some public support for users of > non-managed PostgreSQL databases as well +1. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Meson far from ready on Windows
Hi, On 2024-06-19 14:47:50 +0100, Dave Page wrote: > > I'm *NOT* sure that vcpkg is the way to go, fwiw. It does seem > > advantageous to > > use one of the toolkits thats commonly built for building dependencies on > > windows, which seems to mean vcpkg or conan. > > > > I don't think requiring or expecting vcpkg or conan is reasonable at all, > for a number of reasons: > > - Neither supports all the dependencies at present. > - There are real supply chain verification concerns for vendors. > - That would be a huge change from what we've required in the last 19 > years, with no deprecation notices or warnings for packagers etc. I don't think we should hard-require one specifically. I do think it'd be good if we provided an easy recipe for dependencies to be installed though. I think such flexibility acually means it becomes *more* important to abstract away some of the concrete ways of using the various dependencies. It doesn't make sense for postgres to understand the internals of each dependency on all platforms to a detail that it can cope with all the different ways of linking against them. E.g. libxml can be built with icu, lzma, zlib support. If libxml is built statically, postgres needs to link to all those libraries as well. How can we know which of those dependencies are enabled? Even if we can make that somehow work, it's not reasonable for postgres developers adding a dependency to have to figure out how to probe all of this, when literally no other platform works that way anymore. If you look around at recipes for building postgres on windows, they all have to patch src/tools/msvc (see links at the bottom), because the builtin paths and flags just don't work outside of a single way of acquiring the dependency. The fact that this thread started only now is actually a good example for how broken the approach to internalize all knowledge about building against libraries into postgres is. This could all have been figured out 1+ years ago - but wasn't. Unless you want to require postgres devs to get constantly in the muck on windows, we'll never get that right until just before the release. I don't particularly care how we abstract away the low level linking details on windows. We can use pkgconf, we can use cmake, we can invent our own thing. But it has to be something other than hardcoding windows library paths and compiler flags into our buildsystem. And yes, that might make it a bit harder for a packager on windows, but windows is already a *massive* drag on PG developers, it has to be somewhat manageable. I do think we can make the effort of windows dependency management a lot more reasonable than it is now though, by providing a recipe for acquiring the dependency in some form. It's a lot easier to for packagers and developers to customize ontop of something like that. Frankly, the fact that there's pretty much no automated testing of the various dependencies that's accessible to non-windows devs is not a sustainable situation. > > Btw, I've been working with Bilal to add a few of the dependencies to the > > CI > > images so we can test those automatically. Using vcpkg. We got that nearly > > working, but he's on vacation this week... That does ensure both cmake and > > .pc files are generated, fwiw. > > > > Currently builds gettext, icu, libxml2, libxslt, lz4, openssl, pkgconf, > > python3, tcl, zlib, zstd. 
> > > That appears to be using Mingw/Msys, which is quite different from a VC++ > build, in part because it's a full environment with its own package manager > and packages that people have put a lot of effort into making work as they > do on unix. Err, that was a copy-paste mistake on my end and doesn't even use the vcpkg generated stuff. Here's an example build with most dependencies enabled (see below for more details): https://cirrus-ci.com/task/6497321108635648?logs=configure#L323 I started hacking a bit further on testing all dependencies, which led me down a few rabbitholes: - kerberos: When built linking against a debug runtime, it spews *ginormous* amounts of information onto stderr. Unfortunately its buildsystem doesn't separate out debugging output and linking against a debug runtime. Argh. The tests fail even with a non-debug runtime though, due to debugging output in some cases, not sure why: https://cirrus-ci.com/task/5872684519653376?logs=check_world#L502 Separately, the kerberos tests don't seem to be prepared to work on windows :(. So I disabled using it in CI for now. - Linking the backend dynamically against lz4, icu, ssl, xml, xslt, zstd, zlib slows down the tests noticeably (~20%). So I ended up building those statically. I ran into some issues with using a static libintl. I made it work, but for now reverted to a dynamic one. - Enabling nls slows down the tests by about 15%, somewhat painful. This is when statically linking; it's a bit worse when linked dynamically :(. - readline: Instead of the old
Issue with "start another primitive scan" logic during nbtree array advancement
While working on skip scan, I stumbled upon a bug on HEAD. This is an issue in my commit 5bf748b8, "Enhance nbtree ScalarArrayOp execution". The attached test case (repro_wrong_prim.sql) causes an assertion failure on HEAD. Here's the stack trace: TRAP: failed Assert("so->keyData[opsktrig].sk_strategy != BTEqualStrategyNumber"), File: "../source/src/backend/access/nbtree/nbtutils.c", Line: 2475, PID: 1765589 [0x55942a24db8f] _bt_advance_array_keys: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/access/nbtree/nbtutils.c:2475 [0x55942a24bf22] _bt_checkkeys: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/access/nbtree/nbtutils.c:3797 [0x55942a244160] _bt_readpage: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/access/nbtree/nbtsearch.c:2221 [0x55942a2434ca] _bt_first: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/access/nbtree/nbtsearch.c:1888 [0x55942a23ef88] btgettuple: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/access/nbtree/nbtree.c:259 The problem is that _bt_advance_array_keys() doesn't take sufficient care at the point where it decides whether its call to _bt_check_compare against finaltup (with the scan direction flipped around) indicates that another primitive index scan is required. The final decision is conditioned on rules about how the scan key offset sktrig that triggered the call to _bt_advance_array_keys() relates to the scan key offset that was set by the _bt_check_compare finaltup comparison. This was fragile. It breaks with this test case because of fairly subtle conditions around when and how the arrays advance, the layout of the relevant leaf page, and the placement of inequality scan keys. When assertions are disabled, we do multiple primitive index scans that land on the same leaf page, which isn't supposed to be possible anymore. The query gives correct answers, but this behavior is definitely wrong (it is simply supposed to be impossible now, per 5bf748b8's commit message). Attached is a draft bug fix patch. It nails down the test by simply testing "so->keyData[opsktrig].sk_strategy != BTEqualStrategyNumber" directly, rather than comparing scan key offsets. This is a far simpler and far more direct approach. You might wonder why I didn't do it like this in the first place. It just worked out that way. The code in question was written before I changed the design of _bt_check_compare (in the draft patch that became commit 5bf748b8). Up until not that long before the patch was committed, _bt_check_compare would set "continuescan=false" for non-required arrays. That factor made detecting whether or not the relevant _bt_check_compare call had in fact encountered a required inequality of the kind we need to detect (to decide on whether to start another primitive index scan) difficult and messy. However, the final committed patch simplified _bt_check_compare, making the approach I've taken in the bug fix patch possible. I just never made the connection before now. -- Peter Geoghegan v1-0001-Fix-nbtree-array-unsatisfied-inequality-check.patch Description: Binary data repro_wrong_prim.sql Description: Binary data
Re: pg_combinebackup --clone doesn't work
Here's a fix adding the missing headers to pg_combinebackup, and fixing some compile-time issues in the ifdef-ed block. I've done some basic manual testing today - I plan to test this a bit more tomorrow, and I'll also look at integrating this into the existing tests. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom ae712aa6316c8f7035edee9fb49e1bfe1ea30b94 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Thu, 20 Jun 2024 23:50:22 +0200 Subject: [PATCH] fix clone headers --- src/bin/pg_combinebackup/copy_file.c| 12 +++- src/bin/pg_combinebackup/pg_combinebackup.c | 8 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c index 08c3b875420..8b50e937346 100644 --- a/src/bin/pg_combinebackup/copy_file.c +++ b/src/bin/pg_combinebackup/copy_file.c @@ -13,6 +13,10 @@ #ifdef HAVE_COPYFILE_H #include #endif +#ifdef __linux__ +#include +#include +#endif #include #include #include @@ -214,6 +218,9 @@ copy_file_clone(const char *src, const char *dest, pg_fatal("error while cloning file \"%s\" to \"%s\": %m", src, dest); #elif defined(__linux__) && defined(FICLONE) { + int src_fd; + int dest_fd; + if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0) pg_fatal("could not open file \"%s\": %m", src); @@ -228,8 +235,11 @@ copy_file_clone(const char *src, const char *dest, unlink(dest); pg_fatal("error while cloning file \"%s\" to \"%s\": %s", - src, dest); + src, dest, strerror(save_errno)); } + + close(src_fd); + close(dest_fd); } #else pg_fatal("file cloning not supported on this platform"); diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c index 01d7fb98e79..f67ccf76ea2 100644 --- a/src/bin/pg_combinebackup/pg_combinebackup.c +++ b/src/bin/pg_combinebackup/pg_combinebackup.c @@ -16,6 +16,14 @@ #include #include +#ifdef HAVE_COPYFILE_H +#include +#endif +#ifdef __linux__ +#include +#include +#endif + #include "backup_label.h" #include "common/blkreftable.h" #include "common/checksum_helper.h" -- 2.45.2
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On 19/06/2024 23:00, Alexander Lakhin wrote: Please look at a new anomaly, that I've discovered in master. ... triggers a segfault: 2024-06-19 19:22:49.009 UTC [1607210:6] LOG: server process (PID 1607671) was terminated by signal 11: Segmentation fault ... server.log might also contain: 2024-06-19 19:25:38.060 UTC [1618682:5] psql ERROR: could not read blocks 3..3 in file "base/28531/28840": read only 0 of 8192 bytes Thanks for the report! I was not able to reproduce the segfault, but I do see the "could not read blocks" error very quickly with the script. In commit af0e7deb4a, I removed the call to RelationCloseSmgr() from RelationCacheInvalidate(). I thought it was no longer needed, because we no longer free the underlying SmgrRelation. However, it meant that if the relfilenode of the relation was changed, the relation keeps pointing to the SMgrRelation of the old relfilenode. So we still need the RelationCloseSmgr() call, in case the relfilenode has changed. Per attached patch. -- Heikki Linnakangas Neon (https://neon.tech) From 17a1e68a636edd87a34acf34dace2a696af81a2d Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 21 Jun 2024 01:45:53 +0300 Subject: [PATCH 1/1] Fix relcache invalidation when relfilelocator is updated In commit af0e7deb4a, I removed this call to RelationCloseSmgr(), because the dangling SMgrRelation was no longer an issue. However, we still need to call in case the relation's relfilelocator has changed, so that we open the new relfile on the next access. Reported-by: Alexander Lakhin Discussion: https://www.postgresql.org/message-id/987b1c8c-8c91-4847-ca0e-879f421680ff%40gmail.com --- src/backend/utils/cache/relcache.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 35dbb87ae3..f45880d96d 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -3032,6 +3032,9 @@ RelationCacheInvalidate(bool debug_discard) { relation = idhentry->reldesc; + /* Close all smgr references, in case the relfilelocator has changed */ + RelationCloseSmgr(relation); + /* * Ignore new relations; no other backend will manipulate them before * we commit. Likewise, before replacing a relation's relfilelocator, -- 2.39.2
Re: Pluggable cumulative statistics
On Thu, Jun 20, 2024 at 01:05:42PM +, Bertrand Drouvot wrote: > On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote: >> * Making custom stats data persistent is an interesting problem, and >> there are a couple of approaches I've considered: >> ** Allow custom kinds to define callbacks to read and write data from >> a source they'd want, like their own file through a fd. This has the >> disadvantage to remove the benefit of c) above. >> ** Store everything in the existing stats file, adding one type of >> entry like 'S' and 'N' with a "custom" type, where the *name* of the >> custom stats kind is stored instead of its ID computed from shared >> memory. > > What about having 2 files? > > - One is the existing stats file > - One "predefined" for all the custom stats (so what you've done minus the > in-core stats). This one would not be configurable and the extensions will > not need to know about it. Another thing that can be done here is to add a few callbacks to control how an entry should be written out when the dshash is scanned or read when the dshash is populated depending on the KindInfo. That's not really complicated to do as the populate part could have a cleanup phase if an error is found. I just did not do it yet because this patch set is already covering a lot, just to get the basics in. > Would that remove the benefit from c) that you mentioned up-thread? Yes, that can be slightly annoying. Splitting the stats across multiple files would mean that each stats file would have to store the redo LSN. That's not really complicated to implement, but really easy to miss. Perhaps folks implementing their own stats kinds would be aware anyway because we are going to need a callback to initialize the file to write if we do that, and the redo LSN should be provided in input of it. Giving more control to extension developers here would be OK for me, especially since they could use their own format for their output file(s). -- Michael signature.asc Description: PGP signature
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
Heikki Linnakangas writes: > In commit af0e7deb4a, I removed the call to RelationCloseSmgr() from > RelationCacheInvalidate(). I thought it was no longer needed, because we > no longer free the underlying SmgrRelation. > However, it meant that if the relfilenode of the relation was changed, > the relation keeps pointing to the SMgrRelation of the old relfilenode. > So we still need the RelationCloseSmgr() call, in case the relfilenode > has changed. Ouch. How come we did not see this immediately in testing? I'd have thought surely such a bug would be exposed by any command that rewrites a heap. regards, tom lane
Re: Direct SSL connection and ALPN loose ends
On 20/06/2024 20:02, Jacob Champion wrote: On Mon, Jun 17, 2024 at 9:23 AM Jacob Champion wrote: I think the behavior with v2 and v3 errors should be the same. And I think an immediate failure is appropriate on any v2/v3 error during negotiation, assuming we don't use those errors for things like "TLS not supported", which would warrant a fallback. For GSS encryption, it was my vague understanding that older servers respond with an error rather than the "not supported" indication. For TLS, though, the decision in a49fbaaf (immediate failure) seemed reasonable. Would an open item for this be appropriate? Added. By "negotiation" I mean the server's response to the startup packet. I.e. "supported"/"not supported"/"error". Ok, I'm still a little confused, probably a terminology issue. The server doesn't respond with "supported" or "not supported" to the startup packet, that happens earlier. I think you mean the SSLRequst / GSSRequest packet, which is sent *before* the startup packet? I think the behavior with v2 and v3 errors should be the same. And I think an immediate failure is appropriate on any v2/v3 error during negotiation, assuming we don't use those errors for things like "TLS not supported", which would warrant a fallback. For GSS encryption, it was my vague understanding that older servers respond with an error rather than the "not supported" indication. For TLS, though, the decision in a49fbaaf (immediate failure) seemed reasonable. Hmm, right, GSS encryption was introduced in v12, and older versions respond with an error to a GSSRequest. We probably could make the same assumption for GSS as we did for TLS in a49fbaaf, i.e. that an error means that something's wrong with the server, rather than that it's just very old and doesn't support GSS. But the case for that is a lot weaker case than with TLS. There are still pre-v12 servers out there in the wild. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Pluggable cumulative statistics
On Thu, Jun 20, 2024 at 02:27:14PM +, Bertrand Drouvot wrote: > On Thu, Jun 20, 2024 at 09:46:30AM +0900, Michael Paquier wrote: > I think it makes sense to follow the same "behavior" as the custom > wal resource managers. That, indeed, looks much more simpler than v1. Thanks for the feedback. >> and can only >> be registered with shared_preload_libraries. The patch reserves a set >> of 128 harcoded slots for all the custom kinds making the lookups for >> the KindInfos quite cheap. > > + MemoryContextAllocZero(TopMemoryContext, > + > sizeof(PgStat_KindInfo *) * PGSTAT_KIND_CUSTOM_SIZE); > > and that's only 8 * PGSTAT_KIND_CUSTOM_SIZE bytes in total. Enlarging that does not worry me much. Just not too much. >> With that in mind, the patch set is more pleasant to the eye, and the >> attached v2 consists of: >> - 0001 and 0002 are some cleanups, same as previously to prepare for >> the backend-side APIs. > > 0001 and 0002 look pretty straightforward at a quick look. 0002 is quite independentn. Still, 0001 depends a bit on the rest. Anyway, the Kind is already 4 bytes and it cleans up some APIs that used int for the Kind, so enforcing signedness is just cleaner IMO. >> - 0003 adds the backend support to plug-in custom stats. > > 1 === > > It looks to me that there is a mix of "in core" and "built-in" to name the > non custom stats. Maybe it's worth to just use one? Right. Perhaps better to remove "in core" and stick to "builtin", as I've used the latter for the variables and such. > As I can see (and as you said above) this is mainly inspired by the custom > resource manager and 2 === and 3 === are probably copy/paste consequences. > > 2 === > > + if (pgstat_kind_custom_infos[idx] != NULL && > + pgstat_kind_custom_infos[idx]->name != NULL) > + ereport(ERROR, > + (errmsg("failed to register custom cumulative > statistics \"%s\" with ID %u", kind_info->name, kind), > +errdetail("Custom resource manager \"%s\" > already registered with the same ID.", > + > pgstat_kind_custom_infos[idx]->name))); > > s/Custom resource manager/Custom cumulative statistics/ > > 3 === > > + ereport(LOG, > + (errmsg("registered custom resource manager \"%s\" > with ID %u", > + kind_info->name, kind))); > > s/custom resource manager/custom cumulative statistics/ Oops. Will fix. -- Michael signature.asc Description: PGP signature
Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
On 21/06/2024 02:12, Tom Lane wrote: Heikki Linnakangas writes: In commit af0e7deb4a, I removed the call to RelationCloseSmgr() from RelationCacheInvalidate(). I thought it was no longer needed, because we no longer free the underlying SmgrRelation. However, it meant that if the relfilenode of the relation was changed, the relation keeps pointing to the SMgrRelation of the old relfilenode. So we still need the RelationCloseSmgr() call, in case the relfilenode has changed. Ouch. How come we did not see this immediately in testing? I'd have thought surely such a bug would be exposed by any command that rewrites a heap. There is a RelationCloseSmgr() call in RelationClearRelation(), which covers the common cases. This only occurs during RelationCacheInvalidate(), when pg_class's relfilenumber was changed. Hmm, looking closer, I think this might be a more appropriate place for the RelationCloseSmgr() call: /* * If it's a mapped relation, immediately update its rd_locator in * case its relfilenumber changed. We must do this during phase 1 * in case the relation is consulted during rebuild of other * relcache entries in phase 2. It's safe since consulting the * map doesn't involve any access to relcache entries. */ if (RelationIsMapped(relation)) RelationInitPhysicalAddr(relation); That's where we change the relfilenumber, before the RelationClearRelation() call. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Direct SSL connection and ALPN loose ends
On Thu, Jun 20, 2024 at 4:13 PM Heikki Linnakangas wrote: > > By "negotiation" I mean the server's response to the startup packet. > > I.e. "supported"/"not supported"/"error". > > Ok, I'm still a little confused, probably a terminology issue. The > server doesn't respond with "supported" or "not supported" to the > startup packet, that happens earlier. I think you mean the SSLRequst / > GSSRequest packet, which is sent *before* the startup packet? Yes, sorry. (I'm used to referring to those as startup packets too, ha.) > Hmm, right, GSS encryption was introduced in v12, and older versions > respond with an error to a GSSRequest. > > We probably could make the same assumption for GSS as we did for TLS in > a49fbaaf, i.e. that an error means that something's wrong with the > server, rather than that it's just very old and doesn't support GSS. But > the case for that is a lot weaker case than with TLS. There are still > pre-v12 servers out there in the wild. Right. Since we default to gssencmode=prefer, if you have Kerberos creds in your environment, I think this could potentially break existing software that connects to v11 servers once you upgrade libpq. Thanks, --Jacob
Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Hi, If vacuum fails to remove a tuple with xmax older than VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed, it will ERROR out when determining whether or not to freeze the tuple with "cannot freeze committed xmax". In back branches starting with 14, failing to remove tuples older than OldestXmin during pruning caused vacuum to infinitely loop in lazy_scan_prune(), as investigated on this [1] thread. On master, after 1ccc1e05ae removed the retry loop in lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang could no longer happen, but we can still attempt to freeze dead tuples visibly killed before OldestXmin -- resulting in an ERROR. Pruning may fail to remove dead tuples with xmax before OldestXmin if the tuple is not considered removable by GlobalVisState. For vacuum, the GlobalVisState is initially calculated at the beginning of vacuuming the relation -- at the same time and with the same value as VacuumCutoffs->OldestXmin. A backend's GlobalVisState may be updated again when it is accessed if a new snapshot is taken or if something caused ComputeXidHorizons() to be called. This can happen, for example, at the end of a round of index vacuuming when GetOldestNonRemovableTransactionId() is called. Normally this may result in GlobalVisState's horizon moving forward -- potentially allowing more dead tuples to be removed. However, if a disconnected standby with a running transaction older than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum initially calculates GlobalVisState and OldestXmin but before GlobalVisState is updated, the value of GlobalVisState->maybe_needed could go backwards. If this happens in the middle of vacuum's first pruning and freezing pass, it is possible that pruning/freezing could then encounter a tuple whose xmax is younger than GlobalVisState->maybe_needed and older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum() would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it. But the heap_pre_freeze_checks() would ERROR out with "cannot freeze committed xmax". This check is to avoid freezing dead tuples. We can fix this by always removing tuples considered dead before VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby has a transaction that sees that tuple as alive, because it will simply wait to replay the removal until it would be correct to do so or recovery conflict handling will cancel the transaction that sees the tuple as alive and allow replay to continue. Attached is the suggested fix for master plus a repro. I wrote it as a recovery suite TAP test, but I am _not_ proposing we add it to the ongoing test suite. It is, amongst other things, definitely prone to flaking. I also had to use loads of data to force two index vacuuming passes now that we have TIDStore, so it is a slow test. If you want to run the repro with meson, you'll have to add 't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with: meson test postgresql:recovery / recovery/099_vacuum_hang If you use autotools, you can run it with: make check PROVE_TESTS="t/099_vacuum_hang.pl" The repro forces a round of index vacuuming after the standby reconnects and before pruning a dead tuple whose xmax is older than OldestXmin. At the end of the round of index vacuuming, _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId(), thereby updating the backend's GlobalVisState and moving maybe_needed backwards. 
Then vacuum's first pass will continue with pruning and find our later inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin. I make sure that the standby reconnects between vacuum_get_cutoffs() and pruning because I have a cursor on the page keeping VACUUM FREEZE from getting a cleanup lock. See the repro for step-by-step explanations of how it works. I have a modified version of this that repros the infinite loop on 14-16 with substantially less data. See it here [2]. Also, the repro attached to this mail won't work on 14 and 15 because of changes to background_psql. - Melanie [1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee [2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com From fb74a07d98d19147556fca1dc911a22d50a496e5 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Thu, 20 Jun 2024 16:38:30 -0400 Subject: [PATCH v1 1/2] Repro for vacuum ERROR out trying to freeze old dead tuple This repro is not stable enough to be added as a test to the regression suite. It is for demonstration purposes only. --- src/test/recovery/t/099_vacuum_hang.pl | 267 + 1 file changed, 267 insertions(+) create mode 100644 src/test/recovery/t/099_vacuum_hang.pl diff --git a/src/test/recovery/t/099_vacuum_hang.pl b/src/test/recovery/t/099_vacuum_hang.pl new file mo
PG 17 and GUC variables
FYI, looking at the release notes, I see 15 GUC variables added in this release, and two removed. That 15 number seemed unusually high so I thought I would report it. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman wrote: > If vacuum fails to remove a tuple with xmax older than > VacuumCutoffs->OldestXmin and younger than > GlobalVisState->maybe_needed, it will ERROR out when determining > whether or not to freeze the tuple with "cannot freeze committed > xmax". > > In back branches starting with 14, failing to remove tuples older than > OldestXmin during pruning caused vacuum to infinitely loop in > lazy_scan_prune(), as investigated on this [1] thread. This is a great summary. > We can fix this by always removing tuples considered dead before > VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby > has a transaction that sees that tuple as alive, because it will > simply wait to replay the removal until it would be correct to do so > or recovery conflict handling will cancel the transaction that sees > the tuple as alive and allow replay to continue. I think that this is the right general approach. > The repro forces a round of index vacuuming after the standby > reconnects and before pruning a dead tuple whose xmax is older than > OldestXmin. > > At the end of the round of index vacuuming, _bt_pendingfsm_finalize() > calls GetOldestNonRemovableTransactionId(), thereby updating the > backend's GlobalVisState and moving maybe_needed backwards. Right. I saw details exactly consistent with this when I used GDB against a production instance. I'm glad that you were able to come up with a repro that involves exactly the same basic elements, including index page deletion. -- Peter Geoghegan
Re: ON ERROR in json_query and the like
On Fri, Jun 21, 2024 at 1:11 AM David G. Johnston wrote: > On Thu, Jun 20, 2024 at 2:19 AM Amit Langote wrote: >> On Mon, Jun 17, 2024 at 10:07 PM Markus Winand >> wrote: >> > > On 17.06.2024, at 08:20, Amit Langote wrote: >> > > Agree that the documentation needs to be clear about this. I'll update >> > > my patch at [1] to add a note next to table 9.16.3. SQL/JSON Query >> > > Functions. >> > >> > Considering another branch of this thread [1] I think the >> > "Supported Features” appendix of the docs should mention that as well. >> > >> > The way I see it is that the standards defines two overloaded >> > JSON_QUERY functions, of which PostgreSQL will support only one. >> > In case of valid JSON, the implied CAST makes it look as though >> > the second variant of these function was supported as well but that >> > illusion totally falls apart once the JSON is not valid anymore. >> > >> > I think it affects the following feature IDs: >> > >> > - T821, Basic SQL/JSON query operators >> > For JSON_VALUE, JSON_TABLE and JSON_EXISTS >> > - T828, JSON_QUERY >> > >> > Also, how hard would it be to add the functions that accept >> > character strings? Is there, besides the effort, any thing else >> > against it? I’m asking because I believe once released it might >> > never be changed — for backward compatibility. >> >> Hmm, I'm starting to think that adding the implied cast to json wasn't >> such a great idea after all, because it might mislead the users to >> think that JSON parsing is transparent (respects ON ERROR), which is >> what you are saying, IIUC. >> > > Actually, the implied cast is exactly the correct thing to do here - the > issue is that we aren't using the newly added soft errors infrastructure [1] > to catch the result of that cast and instead produce whatever output on error > tells us to produce. This seems to be in the realm of doability so we should > try in the interest of being standard conforming. Soft error handling *was* used for catching cast errors in the very early versions of this patch (long before I got involved and the infrastructure you mention got added). It was taken out after Pavel said [1] that he didn't like producing NULL instead of throwing an error. Not sure if Pavel's around but it would be good to know why he didn't like it at the time. I can look into making that work again, but that is not going to make beta2. > I'd even argue to make this an open item unless and until the attempt is > agreed upon to have failed (or it succeeds of course). OK, adding an open item. -- Thanks, Amit Langote [1] https://www.postgresql.org/message-id/CAFj8pRCnzO2cnHi5ebXciV%3DtuGVvAQOW9uPU%2BDQV1GkL31R%3D-g%40mail.gmail.com
Re: SQL/JSON query functions context_item doc entry and type requirement
On Thu, Jun 20, 2024 at 9:01 AM jian he wrote: > On Thu, Jun 20, 2024 at 5:46 PM Amit Langote > wrote: > > > > On Thu, Jun 20, 2024 at 1:03 AM David G. Johnston > > wrote: > > > On Wed, Jun 19, 2024 at 8:29 AM jian he > wrote: > > >> > > >> On Mon, Jun 17, 2024 at 9:05 PM Chapman Flack > wrote: > > >> > > > >> > Hi, > > >> > > > >> > On 06/17/24 02:43, Amit Langote wrote: > > >> > > context_item expression can be a value > of > > >> > > any type that can be cast to jsonb. This includes > types > > >> > > such as char, text, > bpchar, > > >> > > character varying, and bytea (with > > >> > > ENCODING UTF8), as well as any domains over these > types. > > >> > > > >> > Reading this message in conjunction with [0] makes me think that we > are > > >> > really talking about a function that takes a first parameter of > type jsonb, > > >> > and behaves exactly that way (so any cast required is applied by > the system > > >> > ahead of the call). Under those conditions, this seems like an > unusual > > >> > sentence to add in the docs, at least until we have also documented > that > > >> > tan's argument can be of any type that can be cast to double > precision. > > >> > > > >> > > >> I guess it would be fine to add an unusual sentence to the docs. > > >> > > >> imagine a function: array_avg(anyarray) returns anyelement. > > >> array_avg calculate an array's elements's avg. like > > >> array('{1,2,3}'::int[]) returns 2. > > >> but array_avg won't make sense if the input argument is a date array. > > >> so mentioning in the doc: array_avg can accept anyarray, but anyarray > > >> cannot date array. > > >> seems ok. > > > > > > > > > There is existing wording for this: > > > > > > "The expression can be of any JSON type, any character string type, or > bytea in UTF8 encoding." > > > > > > If you add this sentence to the paragraph the link that already > exists, which simply points the reader to this sentence, becomes redundant > and should be removed. > > > > I've just posted a patch in the other thread [1] to restrict > > context_item to be of jsonb type, which users would need to ensure by > > adding an explicit cast if needed. I think that makes this > > clarification unnecessary. > > > > > As for table 9.16.3 - it is unwieldy already. Lets try and make the > core syntax shorter, not longer. We already have precedence in the > subsequent json_table section - give each major clause item a name then > below the table define the syntax and meaning for those names. Unlike in > that section - which probably should be modified too - context_item should > have its own description line. > > > > I had posted a patch a little while ago at [1] to render the syntax a > > bit differently with each function getting its own syntax synopsis. > > Resending it here; have addressed Jian He's comments. > > > > -- > I was thinking more like: diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index c324906b22..b9d157663a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18692,8 +18692,10 @@ $.* ? (@ like_regex "^\\d+$") json_exists json_exists ( -context_item, path_expression PASSING { value AS varname } , ... - { TRUE | FALSE | UNKNOWN | ERROR } ON ERROR ) +context_item, +path_expression +variable_definitions +on_error_boolean) Returns true if the SQL/JSON path_expression @@ -18732,12 +18734,14 @@ ERROR: jsonpath array subscript is out of bounds json_query json_query ( -context_item, path_expression PASSING { value AS varname } , ... 
- RETURNING data_type FORMAT JSON ENCODING UTF8 - { WITHOUT | WITH { CONDITIONAL | UNCONDITIONAL } } ARRAY WRAPPER - { KEEP | OMIT } QUOTES ON SCALAR STRING - { ERROR | NULL | EMPTY { ARRAY | OBJECT } | DEFAULT expression } ON EMPTY - { ERROR | NULL | EMPTY { ARRAY | OBJECT } | DEFAULT expression } ON ERROR ) +context_item, +path_expression +variable_definitions +return_clause +wrapping_clause +quoting_clause +on_empty_set +on_error_set) Returns the result of applying the SQL/JSON @@ -18809,11 +18813,12 @@ DETAIL: Missing "]" after array dimensions. json_value json_value ( -context_item, path_expression - PASSING { value AS varname } , ... - RETURNING data_type - { ERROR | NULL | DEFAULT expression } ON EMPTY - { ERROR | NULL | DEFAULT expression } ON ERROR ) +context_item, +path_expression +variable_definitions +return_type +on_empty_value +on_error_value) Returns the result of applying the SQL/JSON Then defining each of those below the table - keeping the on_error variants together. > playing around with it. > found some minor issues: > >
Re: ON ERROR in json_query and the like
On Thu, Jun 20, 2024 at 5:22 PM Amit Langote wrote: > > Soft error handling *was* used for catching cast errors in the very > early versions of this patch (long before I got involved and the > infrastructure you mention got added). It was taken out after Pavel > said [1] that he didn't like producing NULL instead of throwing an > error. Not sure if Pavel's around but it would be good to know why he > didn't like it at the time. > > I'm personally in the "make it error" camp but "make it conform to the standard" is a stronger membership (in general). I see this note in your linked thread: > By the standard, it is implementation-defined whether JSON parsing errors > should be caught by ON ERROR clause. Absent someone contradicting that claim I retract my position here and am fine with failing if these "functions" are supplied with something that cannot be cast to json. I'd document them like functions that accept json with the implications that any casting to json happens before the function is called and thus its arguments do not apply to that step. Given that we have "expression IS JSON" the ability to hack together a case expression to get non-erroring behavior exists. David J.
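For illustration, a sketch of that kind of guard (table and column names here are invented, not from the thread):

SELECT CASE WHEN t.payload IS JSON
            THEN JSON_QUERY(t.payload::jsonb, '$.a' NULL ON ERROR)
       END AS a
  FROM (VALUES ('{"a": 1}'), ('not json')) AS t(payload);

Since CASE evaluates only the branch it takes, the cast to jsonb is never attempted for rows that fail the IS JSON test, so no parse error is thrown for them.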
Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
On Wed, Jun 19, 2024 at 05:14:50AM +, Hayato Kuroda (Fujitsu) wrote: > I have an unclear point. According to the comment atop GetInsertRecPtr(), it > just > returns the approximated value - the position of the last full WAL page [1]. > If there is a continuation WAL record which across a page, will it return the > halfway point of the WAL record (end of the first WAL page)? If so, the > proposed > fix seems not sufficient. We have to point out the exact the end of the > record. Yeah, that's the part of the patch I am confused by. How are we sure that this is the correct LSN to rely on? If that is the case, the patch does not offer an explanation about why it is better. WalSndWaitForWal() is called only in the context of the page callback for a logical WAL sender. Shouldn't we make the flush conditional on what's stored in XLogReaderState.missingContrecPtr? Aka, if we know that we're in the middle of the decoding of a continuation record, we should wait until we've dealt with it, no? In short, I would imagine that WalSndWaitForWal() should know more about what XLogReaderState is doing. But perhaps I'm missing something. -- Michael signature.asc Description: PGP signature
Re: PG 17 and GUC variables
On Thu, Jun 20, 2024 at 08:01:19PM -0400, Bruce Momjian wrote: > FYI, looking at the release notes, I see 15 GUC variables added in this > release, and two removed. That 15 number seemed unusually high so I > thought I would report it. Scanning pg_settings across the two versions, I'm seeing: - Removed GUCs between 16 and 17: db_user_namespace old_snapshot_threshold trace_recovery_messages - Added GUCs between 16 and 17: allow_alter_system commit_timestamp_buffers enable_group_by_reordering event_triggers huge_pages_status io_combine_limit max_notify_queue_pages multixact_member_buffers multixact_offset_buffers notify_buffers serializable_buffers standby_slot_names subtransaction_buffers summarize_wal sync_replication_slots trace_connection_negotiation transaction_buffers transaction_timeout wal_summary_keep_time So that makes for 3 removed, 19 additions and a +16. -- Michael signature.asc Description: PGP signature
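For reference, a scan like the following, run against each major version and with the two outputs diffed, reproduces the lists above:

SELECT name FROM pg_settings ORDER BY name;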
configure error when CFLAGS='-Wall -Werror
Hi, I rely on some of the compiler's checks to reduce simple coding issues; I use clang 18.1.6 for now. However, "CFLAGS='-Wall -Werror ' ./configure" fails, while running './configure' directly is OK. I'm not sure why that happens. More details are below: (master)> echo $CC clang (master)> clang --version clang version 18.1.6 (https://gitee.com/mirrors/llvm-project.git 1118c2e05e67a36ed8ca250524525cdb66a55256) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /usr/local/bin (master)> CFLAGS='-Wall -Werror ' ./configure checking for clang option to accept ISO C89... unsupported checking for clang option to accept ISO C99... unsupported configure: error: C compiler "clang" does not support C99 In config.log, we can see: configure:4433: clang -qlanglvl=extc89 -c -Wall -Werror conftest.c >&5 clang: error: unknown argument: '-qlanglvl=extc89' and clang doesn't support -qlanglvl. In 'configure', we can see that the related code is: """ for ac_arg in '' -std=gnu99 -std=c99 -c99 -AC99 -D_STDC_C99= -qlanglvl=extc99 do CC="$ac_save_CC $ac_arg" if ac_fn_c_try_compile "$LINENO"; then : ac_cv_prog_cc_c99=$ac_arg fi rm -f core conftest.err conftest.$ac_objext test "x$ac_cv_prog_cc_c99" != "xno" && break done rm -f conftest.$ac_ext CC=$ac_save_CC # Error out if the compiler does not support C99, as the codebase # relies on that. if test "$ac_cv_prog_cc_c99" = no; then as_fn_error $? "C compiler \"$CC\" does not support C99" "$LINENO" 5 fi """ So my questions are: 1. Given that clang never supports '-qlanglvl', why does removing the CFLAGS matter? 2. If you are using clang as well, what CFLAGS do you use that work? For example: IIRC, clang doesn't report an error for a variable that is set but not used by default, so we have to add some extra flags for that. -- Best Regards Andy Fan
minor doc issue in 9.16.2.1.1. Boolean Predicate Check Expressions
hi. - 9.16.2.1.1. Boolean Predicate Check Expressions As an extension to the SQL standard, a PostgreSQL path expression can be a Boolean predicate, whereas the SQL standard allows predicates only within filters. While SQL-standard path expressions return the relevant element(s) of the queried JSON value, predicate check expressions return the single three-valued result of the predicate: true, false, or unknown. For example, we could write this SQL-standard filter expression: - slight inconsistency, "SQL-standard" versus "SQL standard" "path expression can be a Boolean predicate", why capital "Boolean"? "predicate check expressions return the single three-valued result of the predicate: true, false, or unknown." "unknown" is wrong, because `select 'unknown'::jsonb;` will fail. here "unknown" should be "null"? see jsonb_path_query doc entry also.
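To make the point concrete, this is roughly how the three outcomes surface in jsonb_path_query (expected outputs shown as comments; worth verifying on a live server):

SELECT jsonb_path_query('{"a": 1}', '$.a > 0');    -- true
SELECT jsonb_path_query('{"a": 1}', '$.a > 5');    -- false
SELECT jsonb_path_query('{"a": "x"}', '$.a > 5');  -- null, i.e. the "unknown" case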
Re: An inefficient query caused by unnecessary PlaceHolderVar
On Mon, Jan 15, 2024 at 1:50 PM Richard Guo wrote: > Updated this patch over 29f114b6ff, which indicates that we should apply > the same rules for PHVs. Here is a new rebase of this patch, with some tweaks to comments. I've also updated the commit message to better explain the context. To recap, this patch tries to avoid wrapping Vars and PHVs from subquery output that are lateral references to something outside the subquery. Typically this kind of wrapping is necessary when the Var/PHV references the non-nullable side of the outer join from the nullable side, because we need to ensure that it is evaluated at the right place and hence is forced to null when the outer join should do so. But if the referenced rel is under the same lowest nulling outer join, we can actually omit the wrapping. The PHVs generated from such kind of wrapping imply lateral dependencies and force us to resort to nestloop joins, so we'd better get rid of them. This patch performs this check by remembering the relids of the nullable side of the lowest outer join the subquery is within. So I think it improves the overall plan in the related cases with very little extra cost. Thanks Richard v4-0001-Avoid-unnecessary-wrapping-for-Vars-and-PHVs.patch Description: Binary data
Re: Pgoutput not capturing the generated columns
Hi Shubham, here are some more patch v8-0001 comments that I missed yesterday. == src/test/subscription/t/011_generated.pl 1. Are the PRIMARY KEY qualifiers needed for the new tab2, tab3 tables? I don't think so. ~~~ 2. +$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab2"); +is( $result, qq(4|8 +5|10), 'confirm generated columns ARE replicated when the subscriber-side column is not generated'); + +$node_publisher->safe_psql('postgres', "INSERT INTO tab3 VALUES (4), (5)"); + +$node_publisher->wait_for_catchup('sub3'); + +$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab3"); +is( $result, qq(4|24 +5|25), 'confirm generated columns are NOT replicated when the subscriber-side column is also generated'); + It would be prudent to do explicit "SELECT a,b" instead of "SELECT *", for readability and to avoid any surprises. == Kind Regards, Peter Smith. Fujitsu Australia.
Re: improve predefined roles documentation
On Tue, Jun 18, 2024 at 9:52 AM Nathan Bossart wrote: > On Mon, Jun 17, 2024 at 02:10:22PM -0400, Robert Haas wrote: > > On Thu, Jun 13, 2024 at 3:48 PM Nathan Bossart > wrote: > >> I think we could improve matters by abandoning the table and instead > >> documenting these roles more like we document GUCs, i.e., each one has a > >> section below it where we can document it in as much detail as we want. > >> Some of these roles should probably be documented together (e.g., > >> pg_read_all_data and pg_write_all_data), so the ordering is unlikely to > be > >> perfect, but I'm hoping it would still be a net improvement. > > > > +1. I'm not sure about all of the details, but I like the general idea. > > Here is a first try. I did pretty much exactly what I proposed in the > quoted text, so I don't have much else to say about it. I didn't see an > easy way to specify multiple ids and xreflabels for a given entry, so the > entries that describe multiple roles just use the name of the first role > listed. In practice, I think this just means you need to do a little extra > work when linking to one of the other roles from elsewhere in the docs, > which doesn't seem too terrible. > > I like this. Losing the table turned out to be ok. Thank you. I would probably put pg_monitor first in the list. + A user granted this role cannot however send signals to a backend owned by a superuser. Remove "however", or put commas around it. I prefer the first option. Do we really need to repeat "even without having it explicitly" everywhere? + This role does not have the role attribute BYPASSRLS set. Even if it did, that attribute isn't inherited anyway... "This role is still governed by any row level security policies that may be in force. Consider setting the BYPASSRLS attribute on member roles." (assuming they intend it to be ALL data then doing the bypassrls even if they are not today using it doesn't hurt) pg_stat_scan_tables - This explanation leaves me wanting more. Maybe give an example of such a function? I think the bar is set a bit too high just talking about a specific lock level. "As these roles are able to access any file on the server file system," We forbid running under root so this isn't really true. They do have operating system level access logged in as the database process owner. They are able to access all PostgreSQL files on the server file system and usually can run a wide-variety of commands on the server. "access, therefore great care should be taken" I would go with: "access. Great care should be taken" Seems more impactful as its own sentence then at the end of a long multi-part sentence. "server with COPY any other file-access functions." - s/with/using/ David J.
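As a concrete sketch of the BYPASSRLS suggestion (role name invented), since membership in pg_read_all_data does not bypass row security by itself, the attribute has to be set on the member role:

CREATE ROLE reporting LOGIN;
GRANT pg_read_all_data TO reporting;
ALTER ROLE reporting BYPASSRLS;  -- role attributes are not inherited through membership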
Re: minor doc issue in 9.16.2.1.1. Boolean Predicate Check Expressions
On Thu, Jun 20, 2024 at 7:30 PM jian he wrote: > "predicate check expressions return the single three-valued result of > the predicate: true, false, or unknown." > "unknown" is wrong, because `select 'unknown'::jsonb;` will fail. > here "unknown" should be "null"? see jsonb_path_query doc entry also. > > The syntax for json_exists belies this claim (assuming our docs are accurate there). Its "on error" options are true/false/unknown. Additionally, the predicate test operator is named "is unknown" not "is null". The result of the predicate test, which is never produced as a value, only a concept, is indeed "unknown" - which then devolves to false when it is practically applied to determining whether to output the path item being tested. As it does also when used in a parth expression. postgres=# select json_value('[null]','$[0] < 1'); json_value f postgres=# select json_value('[null]','$[0] == null'); json_value t Not sure how to peek inside the jsonpath system here though... postgres=# select json_value('[null]','($[0] < 1) == null'); ERROR: syntax error at or near "==" of jsonpath input LINE 1: select json_value('[null]','($[0] < 1) == null'); I am curious if that produces true (the unknown is left as null) or false (the unknown becomes false immediately). David J.
Re: configure error when CFLAGS='-Wall -Werror
Andy Fan writes: > I relies on some compiler's check to reduce some simple coding issues, I > use clang 18.1.6 for now. however "CFLAGS='-Wall -Werror ' ./configure" > would fail, Nope, you cannot do that: -Werror breaks many of configure's tests. See https://www.postgresql.org/docs/current/install-make.html#CONFIGURE-ENVVARS for the standard workaround. regards, tom lane
Re: Add pg_get_acl() function get the ACL for a database object
On Thu, Jun 20, 2024 at 08:32:57AM +0200, Joel Jacobson wrote: > I've added overloaded versions for regclass and regproc so far: > > \df pg_get_acl > List of functions >Schema |Name| Result data type | Argument data types | Type > ++--++-- > pg_catalog | pg_get_acl | aclitem[]| classid oid, objid oid | func > pg_catalog | pg_get_acl | aclitem[]| objid regclass | func > pg_catalog | pg_get_acl | aclitem[]| objid regproc | func > (3 rows) Interesting idea. I am not really convinced that the regproc and regclass overloads are necessary, considering the fact that one of the goals mentioned, as far as I understand, is to be able to get an idea of the ACLs associated with an object and its dependencies in pg_depend and/or pg_shdepend. Another one is to reduce the JOIN burden when querying a set of them, like attribute ACLs. Perhaps the documentation should add one or two examples to show this point? +tup = get_catalog_object_by_oid(rel, Anum_oid, objectId); +if (!HeapTupleIsValid(tup)) +elog(ERROR, "cache lookup failed for object %u of catalog \"%s\"", +objectId, RelationGetRelationName(rel)); get_catalog_object_by_oid() is handled differently here than in functions like pg_identify_object(). Shouldn't we return NULL for this case? That would be more useful when using this function with one or more large scans. -- Michael signature.asc Description: PGP signature
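For example, something like this could go into the documentation (a sketch against the two-argument signature above; the role name is invented):

-- ACLs of all objects a given role has been granted privileges on
SELECT d.classid::regclass AS catalog,
       d.objid,
       pg_get_acl(d.classid, d.objid)
  FROM pg_shdepend AS d
 WHERE d.refclassid = 'pg_authid'::regclass
   AND d.refobjid = 'some_role'::regrole
   AND d.deptype = 'a';  -- ACL dependencies only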
Re: Pgoutput not capturing the generated columns
Hi, here are some review comments for patch v9-0002. == src/backend/replication/logical/relation.c 1. logicalrep_rel_open - if (attr->attisdropped) + if (attr->attisdropped || + (!MySubscription->includegencols && attr->attgenerated)) You replied to my question from the previous review [1, #2] as follows: In case 'include_generated_columns' is 'true'. column list in remoterel will have an entry for generated columns. So, in this case if we skip every attr->attgenerated, we will get a missing column error. ~ TBH, the reason seems very subtle to me. Perhaps that "(!MySubscription->includegencols && attr->attgenerated))" condition should be coded as a separate "if", so then you can include a comment similar to your answer, to explain it. == src/backend/replication/logical/tablesync.c make_copy_attnamelist: NITPICK - punctuation in function comment NITPICK - add/reword some more comments NITPICK - rearrange comments to be closer to the code they are commenting ~ 2. make_copy_attnamelist. + /* + * Construct column list for COPY. + */ + for (int i = 0; i < rel->remoterel.natts; i++) + { + if(!gencollist[i]) + attnamelist = lappend(attnamelist, + makeString(rel->remoterel.attnames[i])); + } IIUC isn't this assuming that the attribute number (aka column order) is the same on the subscriber side (e.g. gencollist idx) and on the publisher side (e.g. remoterel.attnames[i]). AFAIK logical replication does not require this ordering must be match like that, therefore I am suspicious this new logic is accidentally imposing new unwanted assumptions/restrictions. I had asked the same question before [1-#4] about this code, but there was no reply. Ideally, there would be more test cases for when the columns (including the generated ones) are all in different orders on the pub/sub tables. ~~~ 3. General - varnames. It would help with understanding if the 'attgenlist' variables in all these functions are re-named to make it very clear that this is referring to the *remote* (publisher-side) table genlist, not the subscriber table genlist. ~~~ 4. + int i = 0; + appendStringInfoString(&cmd, "COPY (SELECT "); - for (int i = 0; i < lrel.natts; i++) + foreach_ptr(ListCell, l, attnamelist) { - appendStringInfoString(&cmd, quote_identifier(lrel.attnames[i])); - if (i < lrel.natts - 1) + appendStringInfoString(&cmd, quote_identifier(strVal(l))); + if (i < attnamelist->length - 1) appendStringInfoString(&cmd, ", "); + i++; } 4a. I think the purpose of the new macros is to avoid using ListCell, and also 'l' is an unhelpful variable name. Shouldn't this code be more like: foreach_node(String, att_name, attnamelist) ~ 4b. The code can be far simpler if you just put the comma (", ") always up-front except the *first* iteration, so you can avoid checking the list length every time. For example: if (i++) appendStringInfoString(&cmd, ", "); == src/test/subscription/t/011_generated.pl 5. General. Hmm. This patch 0002 included many formatting changes to tables tab2 and tab3 and subscriptions sub2 and sub3 but they do not belong here. The proper formatting for those needs to be done back in patch 0001 where they were introduced. Patch 0002 should just concentrate only on the new stuff for patch 0002. ~ 6. CREATE TABLES would be better in pairs IMO it will be better if the matching CREATE TABLE for pub and sub are kept together, instead of separating them by doing all pub then all sub. I previously made the same comment for patch 0001, so maybe it will be addressed next time... ~ 7. 
SELECT * +$result = + $node_subscriber->safe_psql('postgres', "SELECT * FROM tab4 ORDER BY a"); It will be prudent to do explicit "SELECT a,b,c" instead of "SELECT *", for readability and so there are no surprises. == 99. Please also refer to my attached nitpicks diff for numerous cosmetic changes, and apply if you agree. == [1] https://www.postgresql.org/message-id/CAHut%2BPtAsEc3PEB1KUk1kFF5tcCrDCCTcbboougO29vP1B4E2Q%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c index 0fbf661..ddeb6a8 100644 --- a/src/backend/replication/logical/tablesync.c +++ b/src/backend/replication/logical/tablesync.c @@ -694,7 +694,7 @@ process_syncing_tables(XLogRecPtr current_lsn) /* * Create list of columns for COPY based on logical relation mapping. Do not - * include generated columns, of the subscription table, in the column list. + * include generated columns of the subscription table in the column list. */ static List * make_copy_attnamelist(LogicalRepRelMapEntry *rel, bool *attgenlist) @@ -706,6 +706,7 @@ make_copy_attnamelist(LogicalRepRelMapEntry *rel, bool *attgenlist) desc = RelationGetDescr(rel->localrel); gencollist = palloc0(MaxTupleAttributeNumber * sizeof(bool)); + /* Loop to handle subscription table generated columns. */ for (int i = 0; i < des
Re: Add pg_get_acl() function get the ACL for a database object
Michael Paquier writes: > On Thu, Jun 20, 2024 at 08:32:57AM +0200, Joel Jacobson wrote: >> I've added overloaded versions for regclass and regproc so far: >> >> \df pg_get_acl >> List of functions >> Schema |Name| Result data type | Argument data types | Type >> ++--++-- >> pg_catalog | pg_get_acl | aclitem[]| classid oid, objid oid | func >> pg_catalog | pg_get_acl | aclitem[]| objid regclass | func >> pg_catalog | pg_get_acl | aclitem[]| objid regproc | func >> (3 rows) > Interesting idea. Doesn't that result in "cannot resolve ambiguous function call" failures? regards, tom lane
Re: Add pg_get_acl() function get the ACL for a database object
On Thu, 20 Jun 2024 at 02:33, Joel Jacobson wrote: > On Wed, Jun 19, 2024, at 16:23, Isaac Morland wrote: > > I have no idea how often this would be useful, but I wonder if it could > > work to have overloaded single-parameter versions for each of > > regprocedure (pg_proc.proacl), regclass (pg_class.relacl), …. To call, > > just cast the OID to the appropriate reg* type. > > > > For example: To get the ACL for table 'example_table', call pg_get_acl > > ('example_table'::regclass) > > +1 > > New patch attached. > > I've added overloaded versions for regclass and regproc so far: > > \df pg_get_acl > List of functions >Schema |Name| Result data type | Argument data types | Type > > ++--++-- > pg_catalog | pg_get_acl | aclitem[]| classid oid, objid oid | func > pg_catalog | pg_get_acl | aclitem[]| objid regclass | func > pg_catalog | pg_get_acl | aclitem[]| objid regproc | func > (3 rows) Those were just examples. I think for completeness there should be 5 overloads: [input type] → [relation.aclattribute] regproc/regprocedure → pg_proc.proacl regtype → pg_type.typacl regclass → pg_class.relacl regnamespace → pg_namespace.nspacl I believe the remaining reg* types don't correspond to objects with ACLs, and the remaining ACL fields are for objects which don't have a corresponding reg* type. In general I believe the reg* types are underutilized. All over the place I see examples where people write code to generate SQL statements and they take schema and object name and then format with %I.%I when all that is needed is a reg* value and then format it with a simple %s (of course, need to make sure the SQL will execute with the same search_path as when the SQL was generated, or generate with an empty search_path).
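A small sketch of that last point (object and role names invented); with an empty search_path the regclass output comes back schema-qualified and quoted as needed:

SET search_path = '';
SELECT format('GRANT SELECT ON %s TO reader;',
              'public.example_table'::regclass);
-- produces: GRANT SELECT ON public.example_table TO reader;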
Re: Add pg_get_acl() function get the ACL for a database object
On Thu, 20 Jun 2024 at 23:44, Tom Lane wrote: > Michael Paquier writes: > > On Thu, Jun 20, 2024 at 08:32:57AM +0200, Joel Jacobson wrote: > >> I've added overloaded versions for regclass and regproc so far: > >> > >> \df pg_get_acl > >> List of functions > >> Schema |Name| Result data type | Argument data types | Type > >> > ++--++-- > >> pg_catalog | pg_get_acl | aclitem[]| classid oid, objid oid | > func > >> pg_catalog | pg_get_acl | aclitem[]| objid regclass | > func > >> pg_catalog | pg_get_acl | aclitem[]| objid regproc | > func > >> (3 rows) > > > Interesting idea. > > Doesn't that result in "cannot resolve ambiguous function call" > failures? If you try to pass an oid directly, as a value of type oid, you should get "function is not unique". But if you cast a string or numeric value to the appropriate reg* type for the object you are using, it should work fine. I have functions which reset object permissions on all objects in a specified schema back to the default state as if they had been freshly created which rely on this. They work very well, and allow me to have a privilege-granting script for each project which always fully resets all the privileges back to a known state.
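Concretely, assuming the overloads proposed up-thread (object name invented):

SELECT pg_get_acl('example_table'::regclass);       -- resolves to the regclass variant
SELECT pg_get_acl('example_table'::regclass::oid);  -- bare oid: fails with "function is not unique"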
Re: ON ERROR in json_query and the like
On Fri, Jun 21, 2024 at 10:01 AM David G. Johnston wrote: > On Thu, Jun 20, 2024 at 5:22 PM Amit Langote wrote: >> >> >> Soft error handling *was* used for catching cast errors in the very >> early versions of this patch (long before I got involved and the >> infrastructure you mention got added). It was taken out after Pavel >> said [1] that he didn't like producing NULL instead of throwing an >> error. Not sure if Pavel's around but it would be good to know why he >> didn't like it at the time. >> > > I'm personally in the "make it error" camp but "make it conform to the > standard" is a stronger membership (in general). > > I see this note in your linked thread: > > > By the standard, it is implementation-defined whether JSON parsing errors > > should be caught by ON ERROR clause. > > Absent someone contradicting that claim I retract my position here and am > fine with failing if these "functions" are supplied with something that > cannot be cast to json. I'd document them like functions that accept json > with the implications that any casting to json happens before the function is > called and thus its arguments do not apply to that step. Thanks for that clarification. So, there are the following options: 1. Disallow anything but jsonb for context_item (the patch I posted yesterday) 2. Continue allowing context_item to be non-json character or utf-8 encoded bytea strings, but document that any parsing errors do not respect the ON ERROR clause. 3. Go ahead and fix implicit casts to jsonb so that any parsing errors respect ON ERROR (no patch written yet). David's vote seems to be 2, which is my inclination too. Markus' vote seems to be either 1 or 3. Anyone else? -- Thanks, Amit Langote
Re: Allow non-superuser to cancel superuser tasks.
On Fri, Jun 14, 2024 at 12:06:36PM +0500, Andrey M. Borodin wrote: > I’ve tried to dig into the test. > The problem is CV is allocated in > > inj_state = GetNamedDSMSegment("injection_points”, > > which seems to be destroyed in > > shmem_exit() calling dsm_backend_shutdown() > > This happens before we broadcast that sleep is over. > I think this might happen with any wait on injection point if it is > pg_terminate_backend()ed. Except if I am missing something, this is not a problem for a normal backend, for example with one using a `SELECT injection_points_run()`. > Is there way to wake up from CV sleep before processing actual termination? I am honestly not sure if this is worth complicating the sigjmp path of the autovacuum worker just for the sake of this test. It seems to me that it would be simple enough to move the injection point autovacuum-worker-start within the transaction block a few lines down in do_autovacuum(), no? -- Michael signature.asc Description: PGP signature
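For context, the normal-backend case contrasted above looks roughly like this (SQL functions as provided by the injection_points test module; the point name is made up):

  SELECT injection_points_attach('some-point', 'wait');
  SELECT injection_points_run('some-point');     -- this backend sleeps on the CV
  -- from another session:
  SELECT injection_points_wakeup('some-point');
  SELECT injection_points_detach('some-point');

Terminating the waiting backend in this scenario is unproblematic, which is the contrast being drawn with the autovacuum worker case.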
Re: Pluggable cumulative statistics
At Thu, 13 Jun 2024 16:59:50 +0900, Michael Paquier wrote in > * The kind IDs may change across restarts, meaning that any stats data > associated to a custom kind is stored with the *name* of the custom > stats kind. Depending on the discussion happening here, I'd be open > to use the same concept as custom RMGRs, where custom kind IDs are > "reserved", fixed in time, and tracked in the Postgres wiki. It is > cheaper to store the stats this way, as well, while managing conflicts > across extensions available in the community ecosystem. I prefer to avoid having a central database if possible. If we don't intend to move stats data alone out of a cluster for use in another one, can't we store the relationship between stats names and numeric IDs (or index numbers) in a separate file, which is loaded just before and synced just after extension preloading finishes? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: ON ERROR in json_query and the like
pá 21. 6. 2024 v 2:22 odesílatel Amit Langote napsal: > On Fri, Jun 21, 2024 at 1:11 AM David G. Johnston > wrote: > > On Thu, Jun 20, 2024 at 2:19 AM Amit Langote > wrote: > >> On Mon, Jun 17, 2024 at 10:07 PM Markus Winand > wrote: > >> > > On 17.06.2024, at 08:20, Amit Langote > wrote: > >> > > Agree that the documentation needs to be clear about this. I'll > update > >> > > my patch at [1] to add a note next to table 9.16.3. SQL/JSON Query > >> > > Functions. > >> > > >> > Considering another branch of this thread [1] I think the > >> > "Supported Features” appendix of the docs should mention that as well. > >> > > >> > The way I see it is that the standards defines two overloaded > >> > JSON_QUERY functions, of which PostgreSQL will support only one. > >> > In case of valid JSON, the implied CAST makes it look as though > >> > the second variant of these function was supported as well but that > >> > illusion totally falls apart once the JSON is not valid anymore. > >> > > >> > I think it affects the following feature IDs: > >> > > >> > - T821, Basic SQL/JSON query operators > >> > For JSON_VALUE, JSON_TABLE and JSON_EXISTS > >> > - T828, JSON_QUERY > >> > > >> > Also, how hard would it be to add the functions that accept > >> > character strings? Is there, besides the effort, any thing else > >> > against it? I’m asking because I believe once released it might > >> > never be changed — for backward compatibility. > >> > >> Hmm, I'm starting to think that adding the implied cast to json wasn't > >> such a great idea after all, because it might mislead the users to > >> think that JSON parsing is transparent (respects ON ERROR), which is > >> what you are saying, IIUC. > >> > > > > Actually, the implied cast is exactly the correct thing to do here - the > issue is that we aren't using the newly added soft errors infrastructure > [1] to catch the result of that cast and instead produce whatever output on > error tells us to produce. This seems to be in the realm of doability so > we should try in the interest of being standard conforming. > > Soft error handling *was* used for catching cast errors in the very > early versions of this patch (long before I got involved and the > infrastructure you mention got added). It was taken out after Pavel > said [1] that he didn't like producing NULL instead of throwing an > error. Not sure if Pavel's around but it would be good to know why he > didn't like it at the time. > > I can look into making that work again, but that is not going to make > beta2. > > > I'd even argue to make this an open item unless and until the attempt > is agreed upon to have failed (or it succeeds of course). > > OK, adding an open item. > At this time, when I wrote this mail, I didn't exactly notice the standard, so broken format should be handled there too. In this time, there was no support for soft errors ever in Postgres, so handling broken formats was inconsistent. Standard describes format errors, but exactly doesn't describe if this is error like missing key or broken json format. Maybe wrongly, but intuitively for me, these errors are of different kinds and broken input data is a different case than missing key (but fully valid json). I didn't find the exact sentence in standard when I searched it (but it was four years ago). My position in this case is not extra strong. The original patch was written and tested to be compatible with Oracle (what is a strong argument and feature). 
On the other hand, some things at the time required subtransactions, which was wrong (soft errors were introduced only later). Compatibility with Oracle is a strong argument, but Oracle itself is not fully compatible with the standard, and some cases are special in Oracle because an empty string there is NULL and is therefore handled differently. At that time my motivation was to reduce the patch to a "safe" minimum that committers could accept. The patch was written in 2017 (I think). Handling a broken input format was one issue that I thought could be solved later. The main point of my mail back then was the fact that Postgres and Oracle have DIFFERENT ideas of what correct JSON is! '{a:10}' is valid in Oracle, but not in Postgres. With the default NULL ON ERROR, Oracle then returns 10 and Postgres NULL. I thought this could be very messy and that it is better to just raise an exception. Regards Pavel > -- > Thanks, Amit Langote > [1] > https://www.postgresql.org/message-id/CAFj8pRCnzO2cnHi5ebXciV%3DtuGVvAQOW9uPU%2BDQV1GkL31R%3D-g%40mail.gmail.com >
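To make the divergence concrete on the Postgres side (a sketch; the Oracle result is as reported above and not reproduced here):

  SELECT JSON_QUERY('{a:10}', '$.a' NULL ON ERROR);
  -- Oracle accepts the unquoted key and returns 10; Postgres rejects the input
  -- as invalid JSON, and per this thread that parse error is currently raised
  -- regardless of ON ERROR (catching it would instead yield NULL).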
Re: Pluggable cumulative statistics
On Fri, Jun 21, 2024 at 01:09:10PM +0900, Kyotaro Horiguchi wrote: > At Thu, 13 Jun 2024 16:59:50 +0900, Michael Paquier > wrote in >> * The kind IDs may change across restarts, meaning that any stats data >> associated to a custom kind is stored with the *name* of the custom >> stats kind. Depending on the discussion happening here, I'd be open >> to use the same concept as custom RMGRs, where custom kind IDs are >> "reserved", fixed in time, and tracked in the Postgres wiki. It is >> cheaper to store the stats this way, as well, while managing conflicts >> across extensions available in the community ecosystem. > > I prefer to avoid having a central database if possible. I was thinking the same originally, but the experience with custom RMGRs has made me change my mind. There are more of these than I thought originally: https://wiki.postgresql.org/wiki/CustomWALResourceManagers > If we don't intend to move stats data alone out of a cluster for use > in another one, can't we store the relationship between stats names > and numeric IDs (or index numbers) in a separate file, which is loaded > just before and synced just after extension preloading finishes? Yeah, I've implemented a prototype that does exactly something like that with a restriction on the stats name to NAMEDATALEN, except that I've added the kind ID <-> kind name mapping at the beginning of the main stats file. At the end, it still felt weird and over-engineered to me, like the v1 prototype of upthread, because we finish with a strange mix when reloading the dshash where the builtin ID are handled with fixed values, with more code paths required when doing the serialize callback dance for stats kinds like replication slots, because the custom kinds need to update their hash keys to the new values based on the ID/name mapping stored at the beginning of the file itself. The equation complicates itself a bit more once you'd try to add more ways to write some stats kinds to other places, depending on what a custom kind wants to achieve. I can see the benefits of both approaches, still fixing the IDs in time leads to a lot of simplicity in this infra, which is very appealing on its own before tackling the next issues where I would rely on the proposed APIs. -- Michael signature.asc Description: PGP signature
Re: ON ERROR in json_query and the like
On Fri, Jun 21, 2024 at 6:01, Amit Langote wrote: > On Fri, Jun 21, 2024 at 10:01 AM David G. Johnston > wrote: > > On Thu, Jun 20, 2024 at 5:22 PM Amit Langote > wrote: > >> > >> > >> Soft error handling *was* used for catching cast errors in the very > >> early versions of this patch (long before I got involved and the > >> infrastructure you mention got added). It was taken out after Pavel > >> said [1] that he didn't like producing NULL instead of throwing an > >> error. Not sure if Pavel's around but it would be good to know why he > >> didn't like it at the time. > >> > > > > I'm personally in the "make it error" camp but "make it conform to the > standard" is a stronger membership (in general). > > > > I see this note in your linked thread: > > > > > By the standard, it is implementation-defined whether JSON parsing > errors > > > should be caught by ON ERROR clause. > > > > Absent someone contradicting that claim I retract my position here and > am fine with failing if these "functions" are supplied with something that > cannot be cast to json. I'd document them like functions that accept json > with the implications that any casting to json happens before the function > is called and thus its arguments do not apply to that step. > > Thanks for that clarification. > > So, there are the following options: > > 1. Disallow anything but jsonb for context_item (the patch I posted > yesterday) > > 2. Continue allowing context_item to be non-json character or utf-8 > encoded bytea strings, but document that any parsing errors do not > respect the ON ERROR clause. > > 3. Go ahead and fix implicit casts to jsonb so that any parsing errors > respect ON ERROR (no patch written yet). > > David's vote seems to be 2, which is my inclination too. Markus' vote > seems to be either 1 or 3. Anyone else? > @3 could get messy (although it would be close to Oracle and the standard). I don't think it is safe: for example, '{a:10}' is valid in Oracle but not in Postgres, so @3 would give different results on the two systems (better to raise an exception). The effect of @1 and @2 is similar. @1 is cleaner in that the user has to cast explicitly, so the cast does not need to be handled specially; @2 is more user friendly, because it accepts an unknown string literal. From a developer perspective I prefer @1, from a user perspective I prefer @2. Maybe @2 is a good compromise. Regards Pavel > > -- > Thanks, Amit Langote >
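The user-facing difference between @1 and @2, roughly (sample document made up):

  -- @1: only jsonb is accepted, so the caller must cast explicitly
  SELECT JSON_QUERY('{"a": 1}'::jsonb, '$.a');
  -- @2: a plain string literal is also accepted and cast behind the scenes,
  -- with any parse failure raised before ON ERROR is consulted
  SELECT JSON_QUERY('{"a": 1}', '$.a');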
Re: ON ERROR in json_query and the like
On Thursday, June 20, 2024, Pavel Stehule wrote: > > > pá 21. 6. 2024 v 6:01 odesílatel Amit Langote > napsal: > >> On Fri, Jun 21, 2024 at 10:01 AM David G. Johnston >> wrote: >> >> > > By the standard, it is implementation-defined whether JSON parsing >> errors >> > > should be caught by ON ERROR clause. >> > >> > Absent someone contradicting that claim I retract my position here and >> am fine with failing if these "functions" are supplied with something that >> cannot be cast to json. I'd document them like functions that accept json >> with the implications that any casting to json happens before the function >> is called and thus its arguments do not apply to that step. >> >> Thanks for that clarification. >> >> So, there are the following options: >> >> 1. Disallow anything but jsonb for context_item (the patch I posted >> yesterday) >> >> 2. Continue allowing context_item to be non-json character or utf-8 >> encoded bytea strings, but document that any parsing errors do not >> respect the ON ERROR clause. >> >> 3. Go ahead and fix implicit casts to jsonb so that any parsing errors >> respect ON ERROR (no patch written yet). >> >> David's vote seems to be 2, which is my inclination too. Markus' vote >> seems to be either 1 or 3. Anyone else? >> > > @3 can be possibly messy (although be near Oracle or standard). I don't > think it is safe - one example '{a:10}' is valid for Oracle, but not for > Postgres, and using @3 impacts different results (better to raise an > exception). > > The effect of @1 and @2 is similar - @1 is better so the user needs to > explicitly cast, so maybe it is cleaner, so the cast should not be handled, > @2 is more user friendly, because it accepts unknown string literal. From a > developer perspective I prefer @1, from a user perspective I prefer @2. > Maybe @2 is a good compromise. > 2 also has the benefit of being standard conforming while 1 does not. 3 is also conforming and I wouldn’t object to it had we already done it that way. But since 2 is conforming too and implemented, and we are in beta, I'm thinking we need to go with this option. David J.
Re: Pgoutput not capturing the generated columns
Hi Shubham/Shlok. FYI, my patch describing the current PG17 behaviour of logical replication of generated columns was recently pushed [1]. Note that this will have some impact on your patch set; e.g., you are changing the current replication behaviour, so the "Generated Columns" section note will now need to be modified by your patches. == [1] https://github.com/postgres/postgres/commit/7a089f6e6a14ca3a5dc8822c393c6620279968b9 Kind Regards, Peter Smith. Fujitsu Australia
Re: ON ERROR in json_query and the like
> On 21.06.2024, at 03:00, David G. Johnston wrote: > > On Thu, Jun 20, 2024 at 5:22 PM Amit Langote wrote: > > Soft error handling *was* used for catching cast errors in the very > early versions of this patch (long before I got involved and the > infrastructure you mention got added). It was taken out after Pavel > said [1] that he didn't like producing NULL instead of throwing an > error. Not sure if Pavel's around but it would be good to know why he > didn't like it at the time. > > > I'm personally in the "make it error" camp but "make it conform to the > standard" is a stronger membership (in general). > > I see this note in your linked thread: > > > By the standard, it is implementation-defined whether JSON parsing errors > > should be caught by ON ERROR clause. > > Absent someone contradicting that claim I retract my position here and am > fine with failing if these "functions" are supplied with something that > cannot be cast to json. I'd document them like functions that accept json > with the implications that any casting to json happens before the function is > called and thus its arguments do not apply to that step. That claim was also made in 2020, before the current (2023) SQL standard was released — yet it might have been the same. My understanding of the 2023 standard is that ON ERROR covers invalid JSON because the conversion from a character string to JSON is done deeply nested inside the JSON_QUERY & Co functions. 9.47 Processing Function GR 3 triggers 9.46, “SQL/JSON path language: syntax and semantics” Where GR 11 says: GR 11) The result of evaluating a is a completion condition, and, if that completion condition is successful completion (0), then an SQL/JSON sequence. For conciseness, the result will be stated either as an exception condition or as an SQL/JSON sequence (in the latter case, the completion condition successful completion (0) is implicit). Unsuccessful completion conditions are not automatically raised and do not terminate application of the General Rules in this Subclause. a) If JPCV is specified, then Case: • i) If PARSED is True, then the result of evaluating JPCV is JT. • ii) If the declared type of JT is JSON, then the result of evaluating JPCV is JT. • iii) Otherwise: • 1) The General Rules of Subclause 9.42, “Parsing JSON text”, are applied with JT as JSON TEXT, an implementation-defined (IV185) as UNIQUENESS CONSTRAINT, and FO as FORMAT OPTION; let ST be the STATUS and let CISJI be the SQL/JSON ITEM returned from the application of those General Rules. • 2) Case: • A) If ST is not successful completion (0), then the result of evaluating JPCV is ST. • B) Otherwise, the result of evaluating JPCV is CISJI. In case of an exception, it is passed along to clause 9.44 Converting an SQL/JSON sequence to an SQL/JSON item where GR 5b ultimately says (the exception is in TEMPST in the meanwhile): —— • b) If TEMPST is an exception condition, then Case: i) If ONERROR is ERROR, then let OUTST be TEMPST. ii) Otherwise, let OUTST be successful completion (0). Case: • 1) If ONERROR is NULL, then let JV be the null value. • 2) If ONERROR is EMPTY ARRAY, then let JV be an SQL/JSON array that has no SQL/JSON elements. • 3) If ONERROR is EMPTY OBJECT, then let JV be an SQL/JSON object that has no SQL/JSON members. —— Let me know if I’m missing something here. 
The whole idea that a cast is implied outside of JSON_QUERY & Co might be covered by a clause that generally allows implementations to cast as they like (I don't have the ref at hand, but I think such a clause is somewhere). On the other hand, the 2023 standard doesn't even cover an **explicit** cast from character strings to JSON as per 6.13 SR 7 (that's where the matrix of source and destination types for CAST is given). So my bet is this: * I'm pretty sure that JSON parsing errors being subject to ON ERROR is conforming. That's also "backed" by the Oracle and Db2 (LUW) implementations. * Implying a CAST might be OK, but I have doubts. * I don't see how failing without being subject to ON ERROR (as it is now in 17beta1) could possibly be covered by the standard. But as we all know: the standard is confusing. If somebody thinks differently, references would be greatly appreciated. -markus
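For reference, the GR 5b outcomes enumerated in the quoted rules map onto the SQL-level clause like this (a sketch, using a strict-mode path error on valid JSON so that ON ERROR is actually reached):

  SELECT JSON_QUERY(jsonb '{"a": 1}', 'strict $.b' EMPTY ARRAY ON ERROR);   -- []
  SELECT JSON_QUERY(jsonb '{"a": 1}', 'strict $.b' EMPTY OBJECT ON ERROR);  -- {}
  SELECT JSON_QUERY(jsonb '{"a": 1}', 'strict $.b' ERROR ON ERROR);         -- error is raised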
Re: ON ERROR in json_query and the like
> On 21.06.2024, at 06:46, David G. Johnston wrote: >> >> On Thursday, June 20, 2024, Pavel Stehule wrote: >> >> >> pá 21. 6. 2024 v 6:01 odesílatel Amit Langote >> napsal: >> On Fri, Jun 21, 2024 at 10:01 AM David G. Johnston >> wrote: >> >> > > By the standard, it is implementation-defined whether JSON parsing errors >> > > should be caught by ON ERROR clause. >> > >> > Absent someone contradicting that claim I retract my position here and am >> > fine with failing if these "functions" are supplied with something that >> > cannot be cast to json. I'd document them like functions that accept json >> > with the implications that any casting to json happens before the function >> > is called and thus its arguments do not apply to that step. >> >> Thanks for that clarification. >> >> So, there are the following options: >> >> 1. Disallow anything but jsonb for context_item (the patch I posted >> yesterday) >> >> 2. Continue allowing context_item to be non-json character or utf-8 >> encoded bytea strings, but document that any parsing errors do not >> respect the ON ERROR clause. >> >> 3. Go ahead and fix implicit casts to jsonb so that any parsing errors >> respect ON ERROR (no patch written yet). >> >> David's vote seems to be 2, which is my inclination too. Markus' vote >> seems to be either 1 or 3. Anyone else? With a very strong preference of 3. >> >> @3 can be possibly messy (although be near Oracle or standard). I don't >> think it is safe - one example '{a:10}' is valid for Oracle, but not for >> Postgres, and using @3 impacts different results (better to raise an >> exception). The question of what is valid JSON is a different question, I guess. My original report is about something that is invalid everywhere. Having that in line would be a start. Also I believe Oracle’s habit to accept unquoted object keys is not covered by the standard (unless defined as a JSON format and also explicitly using the corresponding FORMAT clause). >> The effect of @1 and @2 is similar - @1 is better so the user needs to >> explicitly cast, so maybe it is cleaner, so the cast should not be handled, >> @2 is more user friendly, because it accepts unknown string literal. From a >> developer perspective I prefer @1, from a user perspective I prefer @2. >> Maybe @2 is a good compromise. > > 2 also has the benefit of being standard conforming while 1 does not. Why do you think so? Do you have any references or is this just based on previous statements in this discussion? -markus