Re: Add tuples_skipped to pg_stat_progress_copy

2024-01-24 Thread Masahiko Sawada
On Tue, Jan 23, 2024 at 1:02 AM torikoshia  wrote:
>
> On 2024-01-17 14:47, Masahiko Sawada wrote:
> > On Wed, Jan 17, 2024 at 2:22 PM torikoshia 
> > wrote:
> >>
> >> Hi,
> >>
> >> 132de9968840c introduced the SAVE_ERROR_TO option to COPY, making it
> >> possible to skip malformed data, but there is no way to watch the number
> >> of skipped rows during COPY.
> >>
> >> The attached patch adds tuples_skipped to pg_stat_progress_copy, which
> >> counts the number of tuples skipped because the source data is malformed.
> >> If SAVE_ERROR_TO is not specified, this column remains zero.
> >>
> >> The advantage would be that users can quickly notice and stop COPYing
> >> when there is a larger amount of skipped data than expected, for
> >> example.
> >>
> >> As described in the commit log, more choices for SAVE_ERROR_TO, such as
> >> 'log', are expected to be added, and using such options may also let us
> >> know the number of skipped tuples during COPY, but exposing the count in
> >> pg_stat_progress_copy would be easier to monitor.
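
As a rough illustration of the monitoring described above (tuples_skipped is the column added by the patch; the rest of the query and the session setup are just an example):

```
-- while a COPY that skips malformed rows is running in another session:
SELECT relid::regclass AS tab, command,
       tuples_processed, tuples_skipped
FROM pg_stat_progress_copy;
```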
> >>
> >>
> >> What do you think?
> >
> > +1
> >
> > The patch is pretty simple. Here is a comment:
> >
> > +   (if SAVE_ERROR_TO is specified, otherwise
> > zero).
> > +  
> > + 
> >
> > To be precise, this counter only advances when a value other than
> > 'ERROR' is specified for the SAVE_ERROR_TO option.
>
> Thanks for your comment and review!
>
> Updated the patch according to your comment and the option name change in
> b725b7eec.

Thanks! The patch looks good to me. I'm going to push it tomorrow,
barring any objections.

>
>
> BTW, based on this patch, I think we can add another option which
> specifies the maximum tolerable number of malformed rows.
> I remember this was discussed in [1], and I feel it would be useful when
> loading 'dirty' data where there is a limit to how dirty it can be.
> Attached 0002 is a WIP patch for this (I haven't added docs yet).
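
A purely hypothetical sketch of what such an option could look like (the option name and syntax are not settled; 0002 is WIP):

```
-- hypothetical syntax: give up once more than 100 malformed rows were skipped
COPY target_table FROM '/path/to/dirty.csv' (FORMAT csv, REJECT_LIMIT 100);
```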

Yeah, it could be a good option.

> This may be better discussed in another thread, but any comments (e.g. on
> the necessity of this option, or the option name) are welcome.

I'd recommend forking a new thread for this option. As far as I
remember, there was also an opinion that the "reject limit" stuff is not
very useful.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-24 Thread Michael Paquier
On Wed, Jan 24, 2024 at 02:49:36PM +0900, Sutou Kouhei wrote:
> For COPY TO:
> 
> 0001: This adds CopyToRoutine and use it for text/csv/binary
> formats. No implementation change. This just move codes.

10M without this change:

format,elapsed time (ms)
text,1090.763
csv,1136.103
binary,1137.141

10M with this change:

format,elapsed time (ms)
text,1082.654
csv,1196.991
binary,1069.697

These numbers point out that binary is faster by 6%, csv is slower by
5%, while text stays around what looks like noise range.  That's not
negligible.  Are these numbers reproducible?  If they are, that could
be a problem for anybody doing bulk-loading of large data sets.  I am
not sure I understand where the improvement for binary comes from by
reading the patch, but perhaps perf would tell more for each format?
The loss with csv could be blamed on the extra manipulations of the
function pointers, likely.
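
For anyone who wants to cross-check, a rough way to repeat this kind of measurement in psql (the table definition and row count here are only illustrative, not the setup used above):

```
\timing on
CREATE TABLE copy_bench AS
  SELECT i AS id, i::text AS t FROM generate_series(1, 10000000) i;
COPY copy_bench TO '/tmp/bench.text' (FORMAT text);
COPY copy_bench TO '/tmp/bench.csv'  (FORMAT csv);
COPY copy_bench TO '/tmp/bench.bin'  (FORMAT binary);
```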
--
Michael




Re: Add \syncpipeline command to pgbench

2024-01-24 Thread Michael Paquier
On Tue, Jan 23, 2024 at 01:08:24PM +0900, Michael Paquier wrote:
> Anyway, I wanted to study this one and learn a bit more about the
> error stuff that was happening on pgbench side.

Well, I've spent some time studying this part, and the error handling
looked correct based on the safety measures added in
49f7c6c44a5f, so I've just moved on and applied it.
--
Michael




Re: Synchronizing slots from primary to standby

2024-01-24 Thread Amit Kapila
On Wed, Jan 24, 2024 at 11:24 AM Masahiko Sawada  wrote:
>
> On Wed, Jan 24, 2024 at 2:43 PM Amit Kapila  wrote:
> >
> > >
> > > +/* GUC variable */
> > > +bool   enable_syncslot = false;
> > >
> > > Is enable_syncslot a really good name? We use "enable" prefix only for
> > > planner parameters such as enable_seqscan, and it seems to me that
> > > "slot" is not specific. Other candidates are:
> > >
> > > * synchronize_replication_slots = on|off
> > > * synchronize_failover_slots = on|off
> > >
> >
> > I would prefer the second one. Would it be better to just say
> > sync_failover_slots?
>
> Works for me. But if we want to extend this option for non-failover
> slots as well in the future, synchronize_replication_slots (or
> sync_replication_slots) seems better. We can extend it by having an
> enum later. For example, the values can be on, off, or failover etc.
>

I see your point. Let us see if others have any suggestions on this.

-- 
With Regards,
Amit Kapila.




Re: Remove unused fields in ReorderBufferTupleBuf

2024-01-24 Thread Masahiko Sawada
Sorry for being absent for a while.

On Mon, Jan 22, 2024 at 11:19 PM Aleksander Alekseev
 wrote:
>
> Hi,
>
> > > But why didn't you pursue your idea of getting rid of the wrapper
> > > structure ReorderBufferTupleBuf which after this patch will have just
> > > one member? I think there could be hassles in backpatching bug-fixes
> > > in some cases but in the longer run it would make the code look clean.

+1

> >
> > Indeed. In fact, it turned out that I had suggested the same above but
> > apparently forgot:
> >
> > > On top of that IMO it doesn't make much sense to keep a one-field
> > > wrapper structure. Perhaps we should get rid of it entirely and just
> > > use HeapTupleData instead.
> >
> > After actually trying the refactoring I agree that the code becomes
> > cleaner and it's going to be beneficial in the long run. Here is the
> > patch.
>
> I made a mistake in v4:
>
> ```
> -alloc_len = tuple_len + SizeofHeapTupleHeader;
> +alloc_len = tuple_len + HEAPTUPLESIZE;
> ```
>
> Here is the corrected patch.

Thank you for updating the patch! I have some comments:

-    tuple = (ReorderBufferTupleBuf *)
+    tuple = (HeapTuple)
         MemoryContextAlloc(rb->tup_context,
-                           sizeof(ReorderBufferTupleBuf) +
+                           HEAPTUPLESIZE +
                            MAXIMUM_ALIGNOF +
                            alloc_len);
-    tuple->alloc_tuple_size = alloc_len;
-    tuple->tuple.t_data = ReorderBufferTupleBufData(tuple);
+    tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);

Why do we need to add MAXIMUM_ALIGNOF? Since HEAPTUPLESIZE is the
MAXALIGN'd size, it seems we don't need it. heap_form_tuple() does a
similar thing but doesn't add it.
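
For reference, the allocation pattern in heap_form_tuple() alluded to here looks roughly like this (a simplified sketch, not a verbatim copy of the backend code):

```
#include "postgres.h"
#include "access/htup_details.h"

/*
 * Simplified sketch: HEAPTUPLESIZE is already MAXALIGN'd, so t_data starts
 * at a maxaligned offset without adding MAXIMUM_ALIGNOF explicitly.
 */
static HeapTuple
alloc_heap_tuple_sketch(MemoryContext cxt, Size data_len)
{
	HeapTuple	tuple;

	tuple = (HeapTuple) MemoryContextAlloc(cxt, HEAPTUPLESIZE + data_len);
	tuple->t_len = data_len;
	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);

	return tuple;
}
```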

---
 xl_parameter_change *xlrec =
-(xl_parameter_change *)
XLogRecGetData(buf->record);
+(xl_parameter_change *)
XLogRecGetData(buf->record);

Unnecessary change.

---
-/*
- * Free a ReorderBufferTupleBuf.
- */
-void
-ReorderBufferReturnTupleBuf(ReorderBuffer *rb, ReorderBufferTupleBuf *tuple)
-{
-pfree(tuple);
-}
-

Why does ReorderBufferReturnTupleBuf need to be moved from
reorderbuffer.c to reorderbuffer.h? It seems not related to this
refactoring patch so I think we should do it in a separate patch if we
really want it. I'm not sure it's necessary, though.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-01-24 Thread Bertrand Drouvot
Hi,

On Wed, Jan 24, 2024 at 01:51:54PM +0530, Amit Kapila wrote:
> On Wed, Jan 24, 2024 at 11:24 AM Masahiko Sawada  
> wrote:
> >
> > On Wed, Jan 24, 2024 at 2:43 PM Amit Kapila  wrote:
> > >
> > > >
> > > > +/* GUC variable */
> > > > +bool   enable_syncslot = false;
> > > >
> > > > Is enable_syncslot a really good name? We use "enable" prefix only for
> > > > planner parameters such as enable_seqscan, and it seems to me that
> > > > "slot" is not specific. Other candidates are:
> > > >
> > > > * synchronize_replication_slots = on|off
> > > > * synchronize_failover_slots = on|off
> > > >
> > >
> > > I would prefer the second one. Would it be better to just say
> > > sync_failover_slots?
> >
> > Works for me. But if we want to extend this option for non-failover
> > slots as well in the future, synchronize_replication_slots (or
> > sync_replication_slots) seems better. We can extend it by having an
> > enum later. For example, the values can be on, off, or failover etc.
> >
> 
> I see your point. Let us see if others have any suggestions on this.

I also see Sawada-San's point and I'd vote for "sync_replication_slots". Then
for the current feature I think "failover" and "on" should be the values to
turn the feature on (assuming "on" would mean "all kinds of supported slots").
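
In postgresql.conf terms, the proposals on the table would look roughly like this (names and values are still under discussion; nothing here is committed):

```
# boolean form, as in the current patch (renamed from enable_syncslot)
sync_failover_slots = on

# enum-style form, extensible to other slot kinds later
sync_replication_slots = failover   # or: on | off
```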

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: index prefetching

2024-01-24 Thread Tomas Vondra
On 1/24/24 01:51, Melanie Plageman wrote:
> On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra
>  wrote:
>>
>> On 1/19/24 22:43, Melanie Plageman wrote:
>>
>>> We fill a queue with blocks from TIDs that we fetched from the index.
>>> The queue is saved in a scan descriptor that is made available to the
>>> streaming read callback. Once the queue is full, we invoke the table
>>> AM specific index_fetch_tuple() function which calls
>>> pg_streaming_read_buffer_get_next(). When the streaming read API
>>> invokes the callback we registered, it simply dequeues a block number
>>> for prefetching.
>>
>> So in a way there are two queues in IndexFetchTableData. One (blk_queue)
>> is being filled from IndexNext, and then the queue in StreamingRead.
> 
> I've changed the name from blk_queue to tid_queue to fix the issue you
> mention in your later remarks.
> I suppose there are two queues. The tid_queue is just to pass the
> block requests to the streaming read API. The prefetch distance will
> be the smaller of the two sizes.
> 

FWIW I think the two queues are a nice / elegant approach. In hindsight
my problems with trying to utilize the StreamingRead were due to trying
to use the block-oriented API directly from places that work with TIDs,
and this just makes that go away.

I wonder what the overhead of shuffling stuff between queues will be,
but hopefully not too high (that's my assumption).

>>> The only change to the streaming read API is that now, even if the
>>> callback returns InvalidBlockNumber, we may not be finished, so make
>>> it resumable.
>>
>> Hmm, not sure when can the callback return InvalidBlockNumber before
>> reaching the end. Perhaps for the first index_fetch_heap call? Any
>> reason not to fill the blk_queue before calling index_fetch_heap?
> 
> The callback will return InvalidBlockNumber whenever the queue is
> empty. Let's say your queue size is 5 and your effective prefetch
> distance is 10 (some combination of the PgStreamingReadRange sizes and
> PgStreamingRead->max_ios). The first time you call index_fetch_heap(),
> the callback returns InvalidBlockNumber. Then the tid_queue is filled
> with 5 tids. Then index_fetch_heap() is called.
> pg_streaming_read_look_ahead() will prefetch all 5 of these TID's
> blocks, emptying the queue. Once all 5 have been dequeued, the
> callback will return InvalidBlockNumber.
> pg_streaming_read_buffer_get_next() will return one of the 5 blocks in
> a buffer and save the associated TID in the per_buffer_data. Before
> index_fetch_heap() is called again, we will see that the queue is not
> full and fill it up again with 5 TIDs. So, the callback will return
> InvalidBlockNumber 3 times in this scenario.
> 

Thanks for the explanation. Yes, I didn't realize that the queues may be
of different lengths, at which point it makes sense to return an invalid
block number to signal that the TID queue is empty.
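
To make the control flow concrete, the callback side of this design would look roughly like the sketch below. This assumes the not-yet-committed streaming read API from 0001; the struct member and helper names (tid_queue, tid_queue_dequeue) are illustrative, not the actual patch code.

```
/*
 * Streaming read callback sketch: dequeue the next TID, remember it in the
 * per-buffer data, and hand its block number to the streaming read machinery.
 * Returning InvalidBlockNumber only means "the TID queue is empty right now",
 * not "the scan is finished" -- the caller refills the queue and resumes.
 */
static BlockNumber
index_prefetch_next_block(PgStreamingRead *pgsr, void *pgsr_private,
						  void *per_buffer_data)
{
	IndexFetchTableData *scan = (IndexFetchTableData *) pgsr_private;
	ItemPointerData tid;

	if (!tid_queue_dequeue(scan->tid_queue, &tid))
		return InvalidBlockNumber;	/* queue empty, possibly not done */

	*(ItemPointerData *) per_buffer_data = tid;
	return ItemPointerGetBlockNumber(&tid);
}
```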

>>> Structurally, this changes the timing of when the heap blocks are
>>> prefetched. Your code would get a tid from the index and then prefetch
>>> the heap block -- doing this until it filled a queue that had the
>>> actual tids saved in it. With my approach and the streaming read API,
>>> you fetch tids from the index until you've filled up a queue of block
>>> numbers. Then the streaming read API will prefetch those heap blocks.
>>
>> And is that a good/desirable change? I'm not saying it's not, but maybe
>> we should not be filling either queue in one go - we don't want to
>> overload the prefetching.
> 
> We can focus on the prefetch distance algorithm maintained in the
> streaming read API and then make sure that the tid_queue is larger
> than the desired prefetch distance maintained by the streaming read
> API.
> 

Agreed. I think I wasn't quite right to be concerned about "overloading"
the prefetch, because that depends entirely on the StreamingRead API
queue. A large TID queue can't cause overload of anything.

What could happen is a TID queue being too small, so the prefetch can't
hit the target distance. But that can happen already, e.g. indexes that
are correlated and/or index-only scans with all-visible pages.

>>> There are also table AM layering violations in my sketch which would
>>> have to be worked out (not to mention some resource leakage I didn't
>>> bother investigating [which causes it to fail tests]).
>>>
>>> 0001 is all of Thomas' streaming read API code that isn't yet in
>>> master and 0002 is my rough sketch of index prefetching using the
>>> streaming read API
>>>
>>> There are also numerous optimizations that your index prefetching
>>> patch set does that would need to be added in some way. I haven't
>>> thought much about it yet. I wanted to see what you thought of this
>>> approach first. Basically, is it workable?
>>
>> It seems workable, yes. I'm not sure it's much simpler than my patch
>> (considering a lot of the code is in the optimizations, which are
>> missing from this patch).
>>
>> I think the question is where should the optimizations happen

RE: Documentation to upgrade logical replication cluster

2024-01-24 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for updating the patch! Basically your patch looks good.
The lines below are my comments on v3.

01.

```
 
  The output plugins referenced by the slots on the old cluster must be
  installed in the new PostgreSQL executable directory.
 
```

PostgreSQL must be marked as <productname>.

02.

```

pg_ctl -D /opt/PostgreSQL/data1 stop -l logfile

```

I checked and found that -l is a no-op when `pg_ctl stop` is specified. Can we
remove it?
The documentation does not list -l for the stop command.
All the similar lines should be fixed as well.
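
In other words, the stop examples could simply be:

```
pg_ctl -D /opt/PostgreSQL/data1 stop
```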

03.

```
On node3, create any tables that were created in
the upgraded node2 between
 and now,
```

If tables are newly defined on node1 between steps 1 and 11, they are not
defined on node3.
So they must be defined on node3 as well.

04.

```
  
   
```

Even if the referred step is correct, the ID should be allocated to the step,
not the para. That's why the rendering is a bit strange.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Synchronizing slots from primary to standby

2024-01-24 Thread Amit Kapila
On Tue, Jan 23, 2024 at 5:13 PM shveta malik  wrote:
>
> Thanks Ajin for testing the patch. PFA v66 which fixes this issue.
>

I think we should try to commit the patch as all of the design
concerns are resolved now. To achieve that, can we split the failover
setting patch into the following: (a) setting failover property via
SQL commands and display it in pg_replication_slots (b) replication
protocol command (c) failover property via subscription commands?

It will make each patch smaller, and it would be easier to detect any
problems after commit.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-01-24 Thread shveta malik
On Wed, Jan 24, 2024 at 2:38 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Wed, Jan 24, 2024 at 01:51:54PM +0530, Amit Kapila wrote:
> > On Wed, Jan 24, 2024 at 11:24 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Jan 24, 2024 at 2:43 PM Amit Kapila  
> > > wrote:
> > > >
> > > > >
> > > > > +/* GUC variable */
> > > > > +bool   enable_syncslot = false;
> > > > >
> > > > > Is enable_syncslot a really good name? We use "enable" prefix only for
> > > > > planner parameters such as enable_seqscan, and it seems to me that
> > > > > "slot" is not specific. Other candidates are:
> > > > >
> > > > > * synchronize_replication_slots = on|off
> > > > > * synchronize_failover_slots = on|off
> > > > >
> > > >
> > > > I would prefer the second one. Would it be better to just say
> > > > sync_failover_slots?
> > >
> > > Works for me. But if we want to extend this option for non-failover
> > > slots as well in the future, synchronize_replication_slots (or
> > > sync_replication_slots) seems better. We can extend it by having an
> > > enum later. For example, the values can be on, off, or failover etc.
> > >
> >
> > I see your point. Let us see if others have any suggestions on this.
>
> I also see Sawada-San's point and I'd vote for "sync_replication_slots". Then
> for the current feature I think "failover" and "on" should be the values to
> turn the feature on (assuming "on" would mean "all kinds of supported slots").

Even if others agree and we change this GUC name to
"sync_replication_slots", I feel we should keep the values as "on" and
"off" currently, where "on" would mean 'sync failover slots' (docs can
state that clearly).  I do not think we should support sync of "all
kinds of supported slots" in the first version. Maybe we can think
about it for future versions.

thanks
Shveta




s_lock_test no longer works

2024-01-24 Thread Alvaro Herrera
I just discovered that doing "make -C src/backend/storage/lmgr check" no
longer works, because commit 92daeca45df6 ("Add wait event for
pg_usleep() in perform_spin_delay()") added a requirement for
my_wait_event_info to be present at link time:

$ LC_ALL=C make -C src/backend/storage/lmgr/ s_lock_test
make: Entering directory 
'/home/alvherre/Code/pgsql-build/master/src/backend/storage/lmgr'
gcc -I. -I../../../../src/include -I/pgsql/source/master/src/include  
-D_GNU_SOURCE  -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type 
-Wshadow=compatible-local -Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g 
-O2 -DS_LOCK_TEST=1 /pgsql/source/master/src/backend/storage/lmgr/s_lock.c \
 -L ../../../../src/common -lpgcommon \
-L ../../../../src/port -lpgport -lm -o s_lock_test
/usr/bin/ld: /run/user/1000/alvherre-tmp/ccMaAvVj.o: warning: relocation 
against `my_wait_event_info' in read-only section `.text'
/usr/bin/ld: /run/user/1000/alvherre-tmp/ccMaAvVj.o: in function 
`pgstat_report_wait_start':
/pgsql/source/master/src/include/utils/wait_event.h:94: undefined reference to 
`my_wait_event_info'
/usr/bin/ld: /run/user/1000/alvherre-tmp/ccMaAvVj.o: in function 
`pgstat_report_wait_end':
/pgsql/source/master/src/include/utils/wait_event.h:107: undefined reference to 
`my_wait_event_info'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
make: *** [Makefile:35: s_lock_test] Error 1
make: Leaving directory 
'/home/alvherre/Code/pgsql-build/master/src/backend/storage/lmgr'


This is after I added -lm, to fix these other problems:

/home/alvherre/Code/pgsql-build/master/src/common/../../../../../../../pgsql/source/master/src/common/pg_prng.c:269:
 undefined reference to `log'
/usr/bin/ld: 
/home/alvherre/Code/pgsql-build/master/src/common/../../../../../../../pgsql/source/master/src/common/pg_prng.c:269:
 undefined reference to `sin'
/usr/bin/ld: 
/home/alvherre/Code/pgsql-build/master/src/common/../../../../../../../pgsql/source/master/src/common/pg_prng.c:269:
 undefined reference to `sqrt'


On my machine, it's enough to patch s_lock.c to have a local definition
for the missing symbol.  Since the file already has a test mode, it
turns out to be quite simple -- attached.


I do wonder if we want to keep this around, given that it's been broken
for more than a year and nobody seems to have noticed, and the Meson
build does not support the test as a target.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
 Are you not unsure you want to delete Firefox?
   [Not unsure] [Not not unsure][Cancel]
   http://smylers.hates-software.com/2008/01/03/566e45b2.html
>From 92913e4d8cdde22b03d0b32547718f5ca4170747 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 24 Jan 2024 12:10:27 +0100
Subject: [PATCH] fix s_lock_test compile

---
 src/backend/storage/lmgr/Makefile | 2 +-
 src/backend/storage/lmgr/s_lock.c | 5 +
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index 81da6ee13a..89cbb8de49 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -34,7 +34,7 @@ endif
 s_lock_test: s_lock.c $(top_builddir)/src/common/libpgcommon.a $(top_builddir)/src/port/libpgport.a
 	$(CC) $(CPPFLAGS) $(CFLAGS) -DS_LOCK_TEST=1 $(srcdir)/s_lock.c \
 		$(TASPATH) -L $(top_builddir)/src/common -lpgcommon \
-		-L $(top_builddir)/src/port -lpgport -o s_lock_test
+		-L $(top_builddir)/src/port -lpgport -lm -o s_lock_test
 
 # see notes in src/backend/parser/Makefile
 lwlocknames.c: lwlocknames.h
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 0e5f7ab0b9..a47027eb63 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -61,6 +61,11 @@
 #define MIN_DELAY_USEC		1000L
#define MAX_DELAY_USEC		1000000L
 
+#ifdef S_LOCK_TEST
+static uint32 local_my_wait_event_info;
+uint32 *my_wait_event_info = &local_my_wait_event_info;
+
+#endif
 
 slock_t		dummy_spinlock;
 
-- 
2.39.2



Re: Network failure may prevent promotion

2024-01-24 Thread Fujii Masao
On Tue, Jan 23, 2024 at 6:43 PM Heikki Linnakangas  wrote:
> There's an existing AmWalReceiverProcess() macro too. Let's use that.

+1

> Hmm, but doesn't bgworker_die() have that problem with exit(1)ing in the
> signal handler?

Yes, that's a problem. This issue has been raised several times before,
but has not been resolved yet.

> I also wonder if we should replace SignalHandlerForShutdownRequest()
> completely with die(), in all processes? The difference is that
> SignalHandlerForShutdownRequest() uses ShutdownRequestPending, while
> die() uses ProcDiePending && InterruptPending to indicate that the
> signal was received. Or do some of the processes want to check for
> ShutdownRequestPending only at specific places, and don't want to get
> terminated at the any random CHECK_FOR_INTERRUPTS()?

For example, checkpointer seems to want to handle a shutdown request
only when no other checkpoint is in progress because initiating a shutdown
checkpoint while another checkpoint is running could lead to issues.

Also I just wonder if even walreceiver can exit safely at any random
CHECK_FOR_INTERRUPTS()...

Regards,

-- 
Fujii Masao




Re: [PATCH] Add native windows on arm64 support

2024-01-24 Thread Dave Cramer
On Tue, 23 Jan 2024 at 18:32, Michael Paquier  wrote:

> On Tue, Jan 23, 2024 at 04:13:05PM -0500, Dave Cramer wrote:
> > On Tue, 23 Jan 2024 at 08:46, Dave Cramer 
> wrote:
> >> The attached patch works with v17. I will work on getting a buildfarm
> >> animal up shortly
>
> Thanks for doing that!
>
> > I can build using meson manually, but for some reason building with the
> > buildfarm fails.
> >
> > I'm not sure which files to attach to this to help. Andrew, what files do
> > you need ?
>
> Probably build.ninja and the contents of meson-logs/ would offer
> enough information?
> --
> Michael
>

I managed to get it to build; the vcvarsall arch needs to be x64. I need to
add some options, but the patch above needs to be applied to build it.

Dave


Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2024-01-24 Thread Matthias van de Meent
On Fri, 19 Jan 2024 at 05:55, Dilip Kumar  wrote:
>
> On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent
>  wrote:
> >
> > Hi,
> >
> > Currently, nbtree code compares each and every column of an index
> > tuple during the binary search on the index page. With large indexes
> > that have many duplicate prefix column values (e.g. an index on (bool,
> > bool, uuid) ) that means a lot of wasted time getting to the right
> > column.
> >
> > The attached patch improves on that by doing per-page dynamic prefix
> > truncation: If we know that on both the right and left side there are
> > index tuples where the first two attributes are equal to the scan key,
> > we skip comparing those attributes at the current index tuple and
> > start with comparing attribute 3, saving two attribute compares. We
> > gain performance whenever comparing prefixing attributes is expensive
> > and when there are many tuples with a shared prefix - in unique
> > indexes this doesn't gain much, but we also don't lose much in this
> > case.
> >
> > This patch was originally suggested at [0], but it was mentioned that
> > it could be pulled out into its own thread. Earlier, the
> > performance gains were not clearly there for just this patch, but
> > after further benchmarking this patch stands on its own for
> > performance: it sees no obvious degradation of performance, while
> > gaining 0-5% across various normal indexes on the cc-complete sample
> > dataset, with the current worst-case index shape getting a 60%+
> > improved performance on INSERTs in the tests at [0].
>
> +1 for the idea, I have some initial comments while reading through the patch.

Thank you for the review.

> 1.
> Commit message refers to a non-existing reference '(see [0])'.

Noted, I'll update that.

> 2.
> +When we do a binary search on a sorted set (such as a BTree), we know that a
> +tuple will be smaller than its left neighbour, and larger than its right
> +neighbour.
>
> I think this should be 'larger than left neighbour and smaller than
> right neighbour' instead of the other way around.

Noted, will be fixed, too.

> 3.
> +With the above optimizations, dynamic prefix truncation improves the worst
> +case complexity of indexing from O(tree_height * natts * log(tups_per_page))
> +to O(tree_height * (3*natts + log(tups_per_page)))
>
> Where do the 3*natts come from?  Is it related to setting up the
> dynamic prefix at each level?

Yes: we need to establish prefixes for both a tuple that's ahead of
the to-be-compared tuple and one that's after the to-be-compared
tuple. Assuming a homogeneous random distribution of scan key accesses
across the page (not always the case, but IMO a reasonable starting
point), this averages to 3 unprefixed compares before you have
established both a higher and a lower prefix.
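
A rough sketch of the idea in the context of the page binary search (variable and helper names here are illustrative, not the patch's actual code):

```
/*
 * Sketch only: lowcmpcol/highcmpcol record how many leading attributes are
 * already known to be equal to the scan key at the current low/high bounds,
 * so each comparison can start at the smaller of the two instead of at
 * attribute 1.  _bt_compare_from() is a hypothetical variant of _bt_compare()
 * that starts at attribute 'cmpcol' and reports back the first attribute
 * that differed (or natts + 1 if all compared attributes were equal).
 */
static OffsetNumber
bt_binsrch_dynamic_prefix(Relation rel, BTScanInsert key, Page page,
						  OffsetNumber low, OffsetNumber high)
{
	AttrNumber	lowcmpcol = 1;
	AttrNumber	highcmpcol = 1;

	while (high > low)
	{
		OffsetNumber mid = low + ((high - low) / 2);
		AttrNumber	cmpcol = Min(lowcmpcol, highcmpcol);

		if (_bt_compare_from(rel, key, page, mid, &cmpcol) > 0)
		{
			low = mid + 1;
			lowcmpcol = cmpcol;		/* prefix established on the low side */
		}
		else
		{
			high = mid;
			highcmpcol = cmpcol;	/* prefix established on the high side */
		}
	}
	return low;
}
```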

> 4.
> + /*
> + * All tuple attributes are equal to the scan key, only later attributes
> + * could potentially not equal the scan key.
> + */
> + *compareattr = ntupatts + 1;
>
> Can you elaborate on this more? If all tuple attributes are equal to
> the scan key then what do those 'later attributes' mean?

In inner pages, tuples may not have all key attributes, as some may
have been truncated away in page splits. So, tuples that have at least
the same prefix as this (potentially truncated) tuple will need to be
compared starting at the first missing attribute of this tuple, i.e.
ntupatts + 1.

Kind regards,

Matthias van de Meent




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-24 Thread Andrew Dunstan



On 2024-01-24 We 03:11, Michael Paquier wrote:

On Wed, Jan 24, 2024 at 02:49:36PM +0900, Sutou Kouhei wrote:

For COPY TO:

0001: This adds CopyToRoutine and use it for text/csv/binary
formats. No implementation change. This just move codes.

10M without this change:

 format,elapsed time (ms)
 text,1090.763
 csv,1136.103
 binary,1137.141

10M with this change:

 format,elapsed time (ms)
 text,1082.654
 csv,1196.991
 binary,1069.697

These numbers point out that binary is faster by 6%, csv is slower by
5%, while text stays around what looks like noise range.  That's not
negligible.  Are these numbers reproducible?  If they are, that could
be a problem for anybody doing bulk-loading of large data sets.  I am
not sure to understand where the improvement for binary comes from by
reading the patch, but perhaps perf would tell more for each format?
The loss with csv could be blamed on the extra manipulations of the
function pointers, likely.



I don't think that's at all acceptable.

We've spent quite a lot of blood, sweat and tears over the years to make
COPY fast, and we should not sacrifice any of that lightly.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Remove redundant HandleWalWriterInterrupts()

2024-01-24 Thread Fujii Masao
Hi,

Because of commit 1bdd54e662, the code of HandleWalWriterInterrupts()
became the same as HandleMainLoopInterrupts(). So I'd like to propose
removing HandleWalWriterInterrupts() and making walwriter use
HandleMainLoopInterrupts() instead for improved code simplicity. Thoughts?

Patch attached.

Regards,

-- 
Fujii Masao


v1-0001-Remove-redundant-HandleWalWriterInterrupts.patch
Description: Binary data


GSoC 2024

2024-01-24 Thread Pavlo Golub
Greetings -hackers and -mentors,

Google Summer of Code is back for 2024! Please review this
announcement blog post:
https://opensource.googleblog.com/2023/11/google-summer-of-code-2024-celebrating-20th-year.html

And please make yourself familiar with the GSoC 2024 timeline:
https://developers.google.com/open-source/gsoc/timeline

Now is the time to work on getting together a set of projects we'd
like to have GSoC students work on over the summer. Like last year, we
must have a good set of projects for students to choose from before
the deadline for mentoring organizations.

The deadline for Mentoring organizations to apply is February 6.
The list of accepted organizations will be published on February 21.

Unsurprisingly, we'll need to have an Ideas page again, so I've gone
ahead and created one (copying last year's):
https://wiki.postgresql.org/wiki/GSoC_2024

Google discusses what makes a good "Ideas" list here:
https://google.github.io/gsocguides/mentor/defining-a-project-ideas-list.html

All the entries are marked with '2023' to indicate they were pulled
from last year. If the project from last year is still relevant,
please update it to '2024'. Make sure to revise all the information
(in particular, list yourself as a mentor and remove the other
mentors, as appropriate). Please also update the project's scope to be
reasonable for the time students are asked to spend.

Having at least two mentors per project is an excellent idea to
decrease the load. Please consider listing yourself as a co-mentor for
projects with only one mentor listed. The program's goals are not
limited solely to code writing. One of the goals is to inspire new
developers to begin participating in open-source communities and to
help open-source projects identify and bring in new developers.

New entries are welcome and encouraged; note them as '2024' when you add them.

Projects from last year that were worked on but had significant
follow-on work to be completed are also welcome - update the
description appropriately and mark it as being for '2024'.

When we get closer to submitting our application, I'll clean out the
'2023' entries that have yet to receive any updates. Also, if any
projects are no longer appropriate (maybe they were completed, for
example, and no longer need work), please feel free to remove them. We
may have missed some updates where a GSoC project was achieved
independently of GSoC.

As a reminder, each idea on the page should be in the format that the
other entries are in and should include:
1 Project Description
2 Skills needed
3 Difficulty Level
4 Project Size
5 Potential Mentors
6 Expected Outcomes
7 References

As with last year, please consider PostgreSQL to be an "Umbrella"
project, and anything that would be regarded as a "PostgreSQL Family"
per the News/Announce policy [1] is likely to be acceptable as a
PostgreSQL GSoC project.

In other words, if you're a contributor or developer on WAL-G, barman,
pgBackRest, pgwatch, pgagroal, pgexporter, pgmoneta, pgpool,
pgbouncer, the PostgreSQL website (pgweb), the PgEU/PgUS website code
(pgeu-system), pgAdmin4, DBeaver, HeidiSQL, pldebugger, pspg, the PG
RPMs (pgrpms), the JDBC driver, the ODBC driver, or any of the many
other PG Family projects, please feel free to add a project for
consideration!

Let's have another great year of GSoC with PostgreSQL!
PGForce be with you!

Best regards,
Pavlo Golub,
on behalf of PostgreSQL GSoC Admins

[1]: https://www.postgresql.org/about/policies/news-and-events/




Re: UUID v7

2024-01-24 Thread Aleksander Alekseev
Hi,

> Cfbot also seems to be happy with the patch so I'm changing the CF
> entry status to RfC.

I've found a bug:

```
=# select now() - interval '5000 years';
                ?column?
-----------------------------------------
 2977-01-24 15:29:01.779462+02:30:17 BC

Time: 0.957 ms

=# select uuidv7(now() - interval '5000 years');
                uuidv7
--------------------------------------
 720c1868-0764-7677-99cd-265b84ea08b9

=# select uuid_extract_time('720c1868-0764-7677-99cd-265b84ea08b9');
     uuid_extract_time
----------------------------
 5943-08-26 21:30:44.836+03
```

-- 
Best regards,
Aleksander Alekseev




Re: UUID v7

2024-01-24 Thread Andrey M. Borodin



> On 24 Jan 2024, at 17:31, Aleksander Alekseev  
> wrote:
> 
> Hi,
> 
>> Cfbot also seems to be happy with the patch so I'm changing the CF
>> entry status to RfC.
> 
> I've found a bug:
> 
> ```
> =# select now() - interval '5000 years';
>?column?
> 
> 2977-01-24 15:29:01.779462+02:30:17 BC
> 
> Time: 0.957 ms
> 
> =# select uuidv7(now() - interval '5000 years');
>uuidv7
> --
> 720c1868-0764-7677-99cd-265b84ea08b9
> 
> =# select uuid_extract_time('720c1868-0764-7677-99cd-265b84ea08b9');
> uuid_extract_time
> 
> 5943-08-26 21:30:44.836+03
> ```

The UUIDv7 range does not correspond to the timestamp range. But its purpose is
not to store a timestamp, but to be a unique identifier. So I don't think it's
worth throwing an error when an overflowing value is given. BTW, if you subtract
some nanoseconds you will not get back the timestamp you put into the UUID either.
A UUID does not store a timestamp; it only uses it to generate an identifier. Some
value can be extracted back, but with limited precision, limited range, and only
if the UUID was generated precisely per the specification in the standard (and the
standard allows deviation! Most implementations try to trade off something).


Best regards, Andrey Borodin.



Re: Support TZ format code in to_timestamp()

2024-01-24 Thread Aleksander Alekseev
Hi,

> > Anyway, v2-0001 below is the previous patch rebased up to current
> > (only line numbers change), and then v2-0002 responds to your
> > and Daniel's review comments.
>
> LGTM.

```
+SELECT to_timestamp('2011-12-18 11:38 JUNK', 'YYYY-MM-DD HH12:MI TZ');  -- error
+ERROR:  invalid value "JUNK" for "TZ"
+DETAIL:  Time zone abbreviation is not recognized.
+SELECT to_timestamp('2011-12-18 11:38 ...', 'YYYY-MM-DD HH12:MI TZ');  -- error
+ERROR:  invalid value ".." for "TZ"
```

Shouldn't the second error display the full value "..." (three dots)
similarly to the previous one? Also I think we need at least one
negative test for OF.
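
For the OF case, something along these lines might do (against the patched to_timestamp(); the expected error output is deliberately omitted here):

```
SELECT to_timestamp('2011-12-18 11:38 JUNK', 'YYYY-MM-DD HH12:MI OF');  -- error
```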

Other than that v2 looks OK.

-- 
Best regards,
Aleksander Alekseev




Re: UUID v7

2024-01-24 Thread Aleksander Alekseev
Hi,

> UUIDv7 range does not correspond to timestamp range. But it’s purpose is not 
> in storing timestamp, but in being unique identifier. So I don’t think it 
> worth throwing an error when overflowing value is given. BTW if you will 
> subtract some nanoseconds - you will not get back timestamp you put into UUID 
> too.
> UUID does not store timpestamp, it only uses it to generate an identifier. 
> Some value can be extracted back, but with limited precision, limited range 
> and only if UUID was generated precisely by the specification in standard 
> (and standard allows deviation! Most of implementation try to tradeoff 
> something).

I don't claim that UUIDv7's purpose is storing timestamps, but I think
the invariant:

```
uuid_extract_time(uuidv7(X)) == X
```

and (!) even more importantly:

```
if X > Y then uuidv7(X) > uuidv7(Y)
```

... should hold. Otherwise you can calculate crc64(X) or sha256(X)
internally in order to generate an unique ID and claim that it's fine.

Values that violate named invariants should be rejected with an error.

-- 
Best regards,
Aleksander Alekseev




Re: Network failure may prevent promotion

2024-01-24 Thread Fujii Masao
On Wed, Jan 24, 2024 at 8:29 PM Fujii Masao  wrote:
>
> On Tue, Jan 23, 2024 at 6:43 PM Heikki Linnakangas  wrote:
> > There's an existing AmWalReceiverProcess() macro too. Let's use that.
>
> +1
>
> > Hmm, but doesn't bgworker_die() have that problem with exit(1)ing in the
> > signal handler?
>
> Yes, that's a problem. This issue was raised sometimes so far,
> but has not been resolved yet.
>
> > I also wonder if we should replace SignalHandlerForShutdownRequest()
> > completely with die(), in all processes? The difference is that
> > SignalHandlerForShutdownRequest() uses ShutdownRequestPending, while
> > die() uses ProcDiePending && InterruptPending to indicate that the
> > signal was received. Or do some of the processes want to check for
> > ShutdownRequestPending only at specific places, and don't want to get
> > terminated at the any random CHECK_FOR_INTERRUPTS()?
>
> For example, checkpointer seems to want to handle a shutdown request
> only when no other checkpoint is in progress because initiating a shutdown
> checkpoint while another checkpoint is running could lead to issues.

This comment of mine is not right... Sorry for the noise.

Regards,

-- 
Fujii Masao




Re: UUID v7

2024-01-24 Thread Andrey M. Borodin



> On 24 Jan 2024, at 18:02, Aleksander Alekseev  
> wrote:
> 
> Hi,
> 
>> UUIDv7 range does not correspond to timestamp range. But it’s purpose is not 
>> in storing timestamp, but in being unique identifier. So I don’t think it 
>> worth throwing an error when overflowing value is given. BTW if you will 
>> subtract some nanoseconds - you will not get back timestamp you put into 
>> UUID too.
>> UUID does not store timpestamp, it only uses it to generate an identifier. 
>> Some value can be extracted back, but with limited precision, limited range 
>> and only if UUID was generated precisely by the specification in standard 
>> (and standard allows deviation! Most of implementation try to tradeoff 
>> something).
> 
> I don't claim that UUIDv7 purpose is storing timestamps, but I think
> the invariant:
> 
> ```
> uuid_extract_time(uidv7(X)) == X
> ```
> 
> and (!) even more importantly:
> 
> ```
> if X > Y then uuidv7(X) > uuidv7(Y)
> ```
> 
> ... should hold.
Function to extract timestamp does not provide any guarantees at all. Standard 
states this, see Kyzer answers upthread.
Moreover, the standard urges against relying on the assumption that if uuidX was
generated before uuidY, then uuidX < uuidY: that will usually happen, but it is
not guaranteed. All that is guaranteed is uniqueness under certain conditions.

> Otherwise you can calculate crc64(X) or sha256(X)
> internally in order to generate an unique ID and claim that it's fine.
> 
> Values that violate named invariants should be rejected with an error.

Think of the value that you pass to the uuid generation function as entropy.
It's there to ensure uniqueness and promote ordering (but not guarantee it).


Best regards, Andrey Borodin.



Re: UUID v7

2024-01-24 Thread Aleksander Alekseev
Hi,

> Values that violate named invariants should be rejected with an error.

To clarify, I don't think we should bother about the precision part.
"Equals" in the example above means "equal within UUIDv7 precision",
same for "more" and "less". However, years 2977 BC and 5943 AC are
clearly not equal, thus 2977 BC should be rejected as an invalid value
for UUIDv7.

-- 
Best regards,
Aleksander Alekseev




Re: pgbench: allow to cancel queries during benchmark

2024-01-24 Thread Yugo NAGATA
On Fri, 19 Jan 2024 17:46:03 +0900 (JST)
Tatsuo Ishii  wrote:

> >> +/* send cancel requests to all connections */
> >> +static void
> >> +cancel_all()
> >> +{
> >> +  for (int i = 0; i < nclients; i++)
> >> +  {
> >> +  char errbuf[1];
> >> +  if (client_states[i].cancel != NULL)
> >> +  (void) PQcancel(client_states[i].cancel, errbuf, 
> >> sizeof(errbuf));
> >> +  }
> >> +}
> >> +
> >> 
> >> Why in case of errors from PQCancel the error message is neglected? I
> >> think it's better to print out the error message in case of error.
> > 
> > Is the message useful for pgbench users? I saw the error is ignored
> > in pg_dump, for example in bin/pg_dump/parallel.c
> 
> I think the situation is different from pg_dump.  Unlike pg_dump, if
> PQcancel does not work, users can fix the problem by using
> pg_terminate_backend or kill command. In order to make this work, an
> appropriate error message is essential.

Makes sense. I fixed it to emit an error message when PQcancel fails.

Also, I added some comments about the signal handling on Windows
to explain why a different approach than on non-Windows is required:

+* On Windows, a callback function is set in which query cancel requests
+* are sent to all benchmark queries running in the backend. This is
+* required because all threads running queries continue to run without
+* being interrupted even when the signal is received.
+*

Attached is the updated patch, v6.

> Best reagards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
> 
> 


-- 
Yugo NAGATA 
>From 52579f3d31a2927d8818953fabf8a908466e4fcf Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Mon, 24 Jul 2023 21:53:28 +0900
Subject: [PATCH v6] Allow pgbench to cancel queries during benchmark

Previously, Ctrl+C during a benchmark killed pgbench immediately,
but queries were not cancelled and they kept on running in the
backend until they tried to send the result to pgbench.
This commit fixes this so that cancel requests are sent to all
connections before pgbench exits.

In thread #0, setup_cancel_handler is called before the benchmark
so that CancelRequested is set when SIGINT is sent. When SIGINT
is sent during the benchmark, on non-Windows, thread #0 will be
interrupted, return from I/O wait, and send cancel requests to
all connections. After queries are cancelled, other threads also
be interrupted and pgbench will exit at the end. On Windows, cancel
requests are sent in the callback function specified by
setup_cancel_hander.
---
 src/bin/pgbench/pgbench.c| 94 
 src/bin/pgbench/t/001_pgbench_with_server.pl | 42 +
 2 files changed, 136 insertions(+)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 2574454839..e69c4af68a 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -596,6 +596,7 @@ typedef enum
 typedef struct
 {
 	PGconn	   *con;			/* connection handle to DB */
+	PGcancel   *cancel;			/* query cancel */
 	int			id;/* client No. */
 	ConnectionStateEnum state;	/* state machine's current state. */
 	ConditionalStack cstack;	/* enclosing conditionals state */
@@ -638,6 +639,8 @@ typedef struct
  * here */
 } CState;
 
+CState	*client_states;		/* status of all clients */
+
 /*
  * Thread state
  */
@@ -837,6 +840,10 @@ static void add_socket_to_set(socket_set *sa, int fd, int idx);
 static int	wait_on_socket_set(socket_set *sa, int64 usecs);
 static bool socket_has_input(socket_set *sa, int fd, int idx);
 
+#ifdef WIN32
+static void pgbench_cancel_callback(void);
+#endif
+
 /* callback used to build rows for COPY during data loading */
 typedef void (*initRowMethod) (PQExpBufferData *sql, int64 curr);
 
@@ -3646,6 +3653,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 		st->state = CSTATE_ABORTED;
 		break;
 	}
+	st->cancel = PQgetCancel(st->con);
 
 	/* reset now after connection */
 	now = pg_time_now();
@@ -4677,6 +4685,21 @@ disconnect_all(CState *state, int length)
 		finishCon(&state[i]);
 }
 
+/* send cancel requests to all connections */
+static void
+cancel_all()
+{
+	for (int i = 0; i < nclients; i++)
+	{
+		char errbuf[256];
+		if (client_states[i].cancel != NULL)
+		{
+			if (!PQcancel(client_states[i].cancel, errbuf, sizeof(errbuf)))
+pg_log_error("Could not send cancel request: %s", errbuf);
+		}
+	}
+}
+
 /*
  * Remove old pgbench tables, if any exist
  */
@@ -7153,6 +7176,9 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* enable threads to access the status of all clients */
+	client_states = state;
+
 	/* other CState initializations */
 	for (i = 0; i < nclients; i++)
 	{
@@ -7365,6 +7391,39 @@ threadRun(void *arg)
 	StatsData	last,
 aggs;
 
+	/*
+	 * Query cancellation is handled only in thread #0.
+	 *
+	 * On Windows, a callback function is set in which query cancel requests
+	 * are sent to all benchmark queries running in 

Re: UUID v7

2024-01-24 Thread Aleksander Alekseev
Hi,

> Function to extract timestamp does not provide any guarantees at all. 
> Standard states this, see Kyzer answers upthread.
> Moreover, the standard urges against relying on the assumption that if uuidX
> was generated before uuidY, then uuidX < uuidY: that will usually happen, but
> it is not guaranteed.
> All that is guaranteed is uniqueness under certain conditions.
>
> > Otherwise you can calculate crc64(X) or sha256(X)
> > internally in order to generate an unique ID and claim that it's fine.
> >
> > Values that violate named invariants should be rejected with an error.
>
> Think about the value that you pass to uuid generation function as an 
> entropy. It’s there to ensure uniqueness and promote ordering (but not 
> guarantee).

If the standard doesn't guarantee something, that doesn't mean it forbids
us from giving stronger guarantees. I'm convinced that these guarantees
will be useful in real-world applications, at least the ones acting
exclusively within Postgres.

This being said, I understand your point of view too. Let's see what
other people think.

-- 
Best regards,
Aleksander Alekseev




Re: Remove redundant HandleWalWriterInterrupts()

2024-01-24 Thread Bharath Rupireddy
On Wed, Jan 24, 2024 at 5:50 PM Fujii Masao  wrote:
>
> Hi,
>
> Because of commit 1bdd54e662, the code of HandleWalWriterInterrupts()
> became the same as HandleMainLoopInterrupts(). So I'd like to propose to
> remove HandleWalWriterInterrupts() and make walwriter use
> HandleMainLoopInterrupts() instead for improved code simplicity. Thought?
>
> Patch attached.

Nice catch. Indeed they both are the same after commit 1bdd54e662. The
patch LGTM.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: make BuiltinTrancheNames less ugly

2024-01-24 Thread Alvaro Herrera
On 2024-Jan-23, Alvaro Herrera wrote:

> I'm a total newbie to Meson, so it's likely that there are better ways
> to implement this.  I'll leave this here for a little bit in case
> anybody wants to comment.

OK, I pushed the array definition, and here's the other bits as a
followup patch.  I'll add it to the next commitfest, though I hope to
get it committed before then, either in this form or whatever different
Meson trickery is recommended.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801
>From 3d24b89855888a6650ec1aafb3579d810bfec5ac Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 23 Jan 2024 10:36:14 +0100
Subject: [PATCH] Remove IndividualLWLockNames

We can just merge the lwlock names into the BuiltinTrancheNames array.
This requires that Meson compiles the file with -I. in CPPFLAGS, which
in turn requires some additional contortions for DTrace support in
FreeBSD.
---
 src/backend/meson.build  |  4 +++-
 src/backend/storage/lmgr/Makefile|  3 ++-
 src/backend/storage/lmgr/generate-lwlocknames.pl | 10 ++
 src/backend/storage/lmgr/lwlock.c| 13 -
 src/backend/storage/lmgr/meson.build | 12 ++--
 5 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/src/backend/meson.build b/src/backend/meson.build
index 8767aaba67..57a52c37e0 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -127,7 +127,9 @@ backend_objs = [postgres_lib.extract_all_objects(recursive: false)]
 if dtrace.found() and host_system != 'darwin'
   backend_input += custom_target(
 'probes.o',
-input: ['utils/probes.d', postgres_lib.extract_objects(backend_sources, timezone_sources)],
+input: ['utils/probes.d',
+  postgres_lib.extract_objects(backend_sources, timezone_sources),
+  lwlock_lib.extract_objects(lwlock_source)],
 output: 'probes.o',
 command: [dtrace, '-C', '-G', '-o', '@OUTPUT@', '-s', '@INPUT@'],
 install: false,
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index 504480e847..81da6ee13a 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -12,13 +12,14 @@ subdir = src/backend/storage/lmgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
+override CPPFLAGS := -I. $(CPPFLAGS)
+
 OBJS = \
 	condition_variable.o \
 	deadlock.o \
 	lmgr.o \
 	lock.o \
 	lwlock.o \
-	lwlocknames.o \
 	predicate.o \
 	proc.o \
 	s_lock.o \
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index 7b93ecf6c1..a679a4ff54 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -10,7 +10,6 @@ use Getopt::Long;
 my $output_path = '.';
 
 my $lastlockidx = -1;
-my $continue = "\n";
 
 GetOptions('outdir:s' => \$output_path);
 
@@ -29,8 +28,6 @@ print $h $autogen;
 print $h "/* there is deliberately not an #ifndef LWLOCKNAMES_H here */\n\n";
 print $c $autogen, "\n";
 
-print $c "const char *const IndividualLWLockNames[] = {";
-
 #
 # First, record the predefined LWLocks listed in wait_event_names.txt.  We'll
 # cross-check those with the ones in lwlocknames.txt.
@@ -97,12 +94,10 @@ while (<$lwlocknames>)
 	while ($lastlockidx < $lockidx - 1)
 	{
 		++$lastlockidx;
-		printf $c "%s	\"\"", $continue, $lastlockidx;
-		$continue = ",\n";
+		printf $c "[%s] = \"\",\n", $lastlockidx, $lastlockidx;
 	}
-	printf $c "%s	\"%s\"", $continue, $trimmedlockname;
+	printf $c "[%s] = \"%s\",\n", $lockidx, $trimmedlockname;
 	$lastlockidx = $lockidx;
-	$continue = ",\n";
 
 	print $h "#define $lockname (&MainLWLockArray[$lockidx].lock)\n";
 }
@@ -112,7 +107,6 @@ die
   . "lwlocknames.txt"
   if $i < scalar @wait_event_lwlocks;
 
-printf $c "\n};\n";
 print $h "\n";
 printf $h "#define NUM_INDIVIDUAL_LWLOCKS		%s\n", $lastlockidx + 1;
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 98fa6035cc..8aad9c6690 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -115,8 +115,8 @@ StaticAssertDecl(LW_VAL_EXCLUSIVE > (uint32) MAX_BACKENDS,
  * There are three sorts of LWLock "tranches":
  *
  * 1. The individually-named locks defined in lwlocknames.h each have their
- * own tranche.  The names of these tranches appear in IndividualLWLockNames[]
- * in lwlocknames.c.
+ * own tranche.  The names of these tranches come from lwlocknames.c into
+ * BuiltinTranchNames[] below.
  *
  * 2. There are some predefined tranches for built-in groups of locks.
  * These are listed in enum BuiltinTrancheIds in lwlock.h, and their names
@@ -129,9 +129,8 @@ StaticAssertDecl(LW_VAL_EXCLUSIVE > (uint32) MAX_BACKENDS,
  * All these names are user-visible as wait event names, so choose wit

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-24 Thread Sutou Kouhei
Hi,

In <10025bac-158c-ffe7-fbec-32b426291...@dunslane.net>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Wed, 24 Jan 2024 07:15:55 -0500,
  Andrew Dunstan  wrote:

> 
> On 2024-01-24 We 03:11, Michael Paquier wrote:
>> On Wed, Jan 24, 2024 at 02:49:36PM +0900, Sutou Kouhei wrote:
>>> For COPY TO:
>>>
>>> 0001: This adds CopyToRoutine and use it for text/csv/binary
>>> formats. No implementation change. This just move codes.
>> 10M without this change:
>>
>>  format,elapsed time (ms)
>>  text,1090.763
>>  csv,1136.103
>>  binary,1137.141
>>
>> 10M with this change:
>>
>>  format,elapsed time (ms)
>>  text,1082.654
>>  csv,1196.991
>>  binary,1069.697
>>
>> These numbers point out that binary is faster by 6%, csv is slower by
>> 5%, while text stays around what looks like noise range.  That's not
>> negligible.  Are these numbers reproducible?  If they are, that could
>> be a problem for anybody doing bulk-loading of large data sets.  I am
>> not sure to understand where the improvement for binary comes from by
>> reading the patch, but perhaps perf would tell more for each format?
>> The loss with csv could be blamed on the extra manipulations of the
>> function pointers, likely.
> 
> 
> I don't think that's at all acceptable.
> 
> We've spent quite a lot of blood sweat and tears over the years to make COPY
> fast, and we should not sacrifice any of that lightly.

These numbers aren't reproducible, because these benchmarks were
executed on my normal machine, not a machine dedicated to
benchmarking. The machine runs other processes such as an
editor and a Web browser.

For example, here are some results with master
(94edfe250c6a200d2067b0debfe00b4122e9b11e):

Format,N records,Elapsed time (ms)
csv,1000,1073.715
csv,1000,1022.830
csv,1000,1073.584
csv,1000,1090.651
csv,1000,1052.259

Here are some results with master + the 0001 patch:

Format,N records,Elapsed time (ms)
csv,1000,1025.356
csv,1000,1067.202
csv,1000,1014.563
csv,1000,1032.088
csv,1000,1058.110


I uploaded my benchmark script so that you can run the same
benchmark on your machine:

https://gist.github.com/kou/be02e02e5072c91969469dbf137b5de5

Could anyone try the benchmark with master and master+0001?


Thanks,
-- 
kou




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-24 Thread Sutou Kouhei
Hi,

In <20240124.144936.67229716500876806@clear-code.com>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Wed, 24 Jan 2024 14:49:36 +0900 (JST),
  Sutou Kouhei  wrote:

> I've implemented custom COPY format feature based on the
> current design discussion. See the attached patches for
> details.

I forgot to mention one note. Documentation isn't included
in these patches. I'll write it after all (or some) patches
are merged. Is it OK?


Thanks,
-- 
kou




Re: WIP Incremental JSON Parser

2024-01-24 Thread Andrew Dunstan


On 2024-01-22 Mo 21:02, Andrew Dunstan wrote:


On 2024-01-22 Mo 18:01, Andrew Dunstan wrote:


On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:


On 2024-01-22 Mo 01:29, Peter Smith wrote:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.



Thanks.

Let's see if the attached does better.




This time for sure! (Watch me pull a rabbit out of my hat!)


It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed 
it, and I think the test will now pass.






Fixed one problem but there are some others. I'm hoping this will 
satisfy the cfbot.






The cfbot reports an error on a 32-bit build:


# Running: pg_basebackup -D 
/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2
 --no-sync -cfast --incremental 
/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR:  could not parse backup 
manifest: file size is not an integer
pg_basebackup: removing data directory 
"/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) #   Failed test 'incremental backup from node1'

I have set up a Debian 12 EC2 instance following the recipe at 
, 
and ran what I think are the same tests dozens of times, but the failure 
did not reappear in my setup. Unfortunately, the test doesn't show the 
failing manifest or log the failing field, so trying to divine what 
happened here is more than difficult.


Not sure how to address this.


cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: UUID v7

2024-01-24 Thread Andrey M. Borodin



> On 24 Jan 2024, at 18:29, Aleksander Alekseev  
> wrote:
> 
> Hi,
> 
>> Function to extract timestamp does not provide any guarantees at all. 
>> Standard states this, see Kyzer answers upthread.
>> Moreover, the standard urges against relying on the assumption that if uuidX
>> was generated before uuidY, then uuidX < uuidY: that will usually happen, but
>> it is not guaranteed.
>> All that is guaranteed is uniqueness under certain conditions.
>> 
>>> Otherwise you can calculate crc64(X) or sha256(X)
>>> internally in order to generate an unique ID and claim that it's fine.
>>> 
>>> Values that violate named invariants should be rejected with an error.
>> 
>> Think about the value that you pass to uuid generation function as an 
>> entropy. It’s there to ensure uniqueness and promote ordering (but not 
>> guarantee).
> 
> If the standard doesn't guarantee something it doesn't mean it forbids
> us to give stronger guarantees.
No, the standard makes these guarantees impossible.
If we insist that uuid_extract_time(uuidv7(time))==time, we won't be able to 
generate uuidv7 most of the time. uuidv7(now()) will always ERROR-out.
The standard implies a more coarse-grained timestamp than we have.

Also, please note that uuidv7(time+1us) and uuidv7(time) will have the same 
internal timestamp, so even though time+1us > time, the second UUID will still 
be greater.

Neither of the invariants you proposed can be reasonably guaranteed. Upholding 
either of them greatly reduces the usability of UUID v7.


Best regards, Andrey Borodin.



Re: UUID v7

2024-01-24 Thread Aleksander Alekseev
Hi,

> Also, please note that uuidv7(time+1us) and uuidv7(time) will have the same 
> internal timestamp, so even though time+1us > time, the second UUID will still 
> be greater.
>
> Neither of the invariants you proposed can be reasonably guaranteed. Upholding 
> either of them greatly reduces the usability of UUID v7.

Again, personally I don't insist on the 1us precision [1]. Only the
fact that timestamp from the far past generates UUID from the future
bothers me.

[1]: 
https://postgr.es/m/CAJ7c6TPCSprWwVNdOB%3D%3DpgKZPqO5q%3DHRgmU7zmYqz9Dz5ffVYw%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev




Re: SSL tests fail on OpenSSL v3.2.0

2024-01-24 Thread Jelte Fennema-Nio
I ran into an SSL issue when using the MSYS2/MINGW build of Postgres
for the PgBouncer test suite. Postgres crashed whenever you tried to
open an ssl connection to it.
https://github.com/msys2/MINGW-packages/issues/19851

I'm wondering if the issue described in this thread could be related
to the issue I ran into. Afaict the merged patch has not been released
yet.




Re: Fixing backslash dot for COPY FROM...CSV

2024-01-24 Thread Daniel Verite
Robert Haas wrote:

> Those links unfortunately seem not to be entirely specific to this
> issue. Other, related things seem to be discussed there, and it's not
> obvious that everyone agrees on what to do, or really that anyone
> agrees on what to do. The best link that I found for this exact issue
> is
> https://www.postgresql.org/message-id/e1tdnvq-0001ju...@wrigleys.postgresql.org
> but the thread isn't very conclusive and is so old that any
> conclusions reached then might no longer be considered valid today.

To refresh the problem statement, 4 cases that need fixing as
of HEAD can be distinguished:

#1. copy csv from file, single column, no quoting involved.
COPY will stop at \. and ignore the rest of the file without
any error or warning.

$ cat >/tmp/file.csv 

Re: Support TZ format code in to_timestamp()

2024-01-24 Thread Tom Lane
Aleksander Alekseev  writes:
> +SELECT to_timestamp('2011-12-18 11:38 JUNK', '-MM-DD HH12:MI TZ');  -- 
> error
> +ERROR:  invalid value "JUNK" for "TZ"
> +DETAIL:  Time zone abbreviation is not recognized.
> +SELECT to_timestamp('2011-12-18 11:38 ...', '-MM-DD HH12:MI TZ');  -- 
> error
> +ERROR:  invalid value ".." for "TZ"

> Shouldn't the second error display the full value "..." (three dots)
> similarly to the previous one?

That's coming from the pre-existing OF code, which is looking for
an integer of at most two digits.  I'm not especially inclined to
mess with that, and even if I were I'd think it should be a separate
patch.

> Also I think we need at least one
> negative test for OF.

OK.

regards, tom lane




Re: make dist using git archive

2024-01-24 Thread Tristan Partin

On Tue Jan 23, 2024 at 3:30 AM CST, Peter Eisentraut wrote:

On 22.01.24 21:04, Tristan Partin wrote:
> I am not really following why we can't use the builtin Meson dist 
> command. The only difference from my testing is it doesn't use a 
> --prefix argument.


Here are some problems I have identified:

1. meson dist internally runs gzip without the -n option.  That makes 
the tar.gz archive include a timestamp, which in turn makes it not 
reproducible.


2. Because gzip includes a platform indicator in the archive, the 
produced tar.gz archive is not reproducible across platforms.  (I don't 
know if gzip has an option to avoid that.  git archive uses an internal 
gzip implementation that handles this.)


3. Meson does not support tar.bz2 archives.

4. Meson uses git archive internally, but then unpacks and repacks the 
archive, which loses the ability to use git get-tar-commit-id.


5. I have found that the tar archives created by meson and git archive 
include the files in different orders.  I suspect that the Python 
tarfile module introduces some either randomness or platform dependency.


6. meson dist is also slower because of the additional work.

7. meson dist produces .sha256sum files but we have called them .sha256. 
  (This is obviously trivial, but it is something that would need to be 
dealt with somehow nonetheless.)


Most or all of these issues are fixable, either upstream in Meson or by 
adjusting our own requirements.  But for now this route would have some 
significant disadvantages.


Thanks Peter. I will bring these up with upstream!

--
Tristan Partin
Neon (https://neon.tech)




Re: SSL tests fail on OpenSSL v3.2.0

2024-01-24 Thread Tristan Partin

On Wed Jan 24, 2024 at 9:58 AM CST, Jelte Fennema-Nio wrote:

I ran into an SSL issue when using the MSYS2/MINGW build of Postgres
for the PgBouncer test suite. Postgres crashed whenever you tried to
open an ssl connection to it.
https://github.com/msys2/MINGW-packages/issues/19851

I'm wondering if the issue described in this thread could be related
to the issue I ran into. Afaict the merged patch has not been released
yet.


Do you have a backtrace? Given that the version is 3.2.0, seems likely.

--
Tristan Partin
Neon (https://neon.tech)




Re: SQL:2011 application time

2024-01-24 Thread Peter Eisentraut

On 18.01.24 04:59, Paul Jungwirth wrote:

Here are new patches consolidating feedback from several emails.


I have committed 0001 and 0002 (the primary key support).

The only significant tweak I did was the error messages in 
GetOperatorFromWellKnownStrategy(), to make the messages translatable 
better and share wording with other messages.  These messages are 
difficult to reach, so we'll probably have to wait for someone to 
actually encounter them to see if they are useful.


I would like to work on 0003 and 0004 (the foreign key support) during 
February/March.  The patches beyond that are probably too optimistic for 
PG17.  I recommend you focus getting 0003/0004 in good shape soon.






Re: s_lock_test no longer works

2024-01-24 Thread Tom Lane
Alvaro Herrera  writes:
> I do wonder if we want to keep this around, given that it's been more
> than one year broken and nobody seems to have noticed, and the Meson
> build does not support the test as a target.

The last time it was broken, it took us multiple years to notice, too.
I'm not sure that that's a reason to remove the test scaffolding,
though.  You'd probably only really use it to smoke-test some new
spinlock assembly code, and how often does anyone do that anymore?

regards, tom lane




Re: UUID v7

2024-01-24 Thread Andrey M. Borodin


> On 24 Jan 2024, at 20:46, Aleksander Alekseev  
> wrote:
> 
> Only the
> fact that timestamp from the far past generates UUID from the future
> bothers me.

PFA implementation of guard checks, but I'm afraid that this can cause failures 
in ID generation unexpected to the user...
See tests

+-- errors in edge cases of UUID v7
+SELECT 1 FROM uuidv7('1970-01-01 00:00:00+00'::timestamptz - interval '0ms');
+SELECT uuidv7('1970-01-01 00:00:00+00'::timestamptz - interval '1ms'); -- ERROR expected
+SELECT 1 FROM uuidv7(uuid_extract_time('--7FFF-B000-'));
+SELECT uuidv7(uuid_extract_time('--7FFF-B000-')+'1ms'); -- ERROR expected

The range is from 1970-01-01 00:00:00 to 10889-08-02 05:31:50.655. I'm not sure
we should give this information in the error message...
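For illustration, the guard boils down to a range check on the 48-bit unix_ts_ms
field; a minimal sketch (not the code from the attached patch, and the constant
and variable names here are made up):

/* Sketch only: reject timestamps outside the 48-bit unix_ts_ms range. */
#define UUIDV7_MAX_UNIX_TS_MS	((INT64CONST(1) << 48) - 1)	/* 10889-08-02 05:31:50.655 */

	if (unix_ts_ms < 0 || unix_ts_ms > UUIDV7_MAX_UNIX_TS_MS)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("timestamp out of range for UUID version 7")));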
Thanks!


Best regards, Andrey Borodin.


v13-0001-Implement-UUID-v7.patch
Description: Binary data


Re: UUID v7

2024-01-24 Thread Marcos Pegoraro
Is starting from 1970 enough?
What if a user wants to have a UUID of his birth date?

regards
Marcos

Em qua., 24 de jan. de 2024 às 13:54, Andrey M. Borodin <
x4...@yandex-team.ru> escreveu:

>
>
> > On 24 Jan 2024, at 20:46, Aleksander Alekseev 
> wrote:
> >
> > Only the
> > fact that timestamp from the far past generates UUID from the future
> > bothers me.
>
> PFA implementation of guard checks, but I'm afraid that this can cause
> failures in ID generation unexpected to the user...
> See tests
>
> +-- errors in edge cases of UUID v7
> +SELECT 1 FROM uuidv7('1970-01-01 00:00:00+00'::timestamptz - interval
> '0ms');
> +SELECT uuidv7('1970-01-01 00:00:00+00'::timestamptz - interval '1ms'); --
> ERROR expected
> +SELECT 1 FROM
> uuidv7(uuid_extract_time('--7FFF-B000-'));
> +SELECT
> uuidv7(uuid_extract_time('--7FFF-B000-')+'1ms'); --
> ERROR expected
>
> The range is from 1970-01-01 00:00:00 to 10889-08-02 05:31:50.655. I'm not
> sure we should give this information in the error message...
> Thanks!
>
>
> Best regards, Andrey Borodin.
>


Re: cleanup patches for incremental backup

2024-01-24 Thread Nathan Bossart
I'm seeing some recent buildfarm failures for pg_walsummary:


https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-01-14%2006%3A21%3A58

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2024-01-17%2021%3A10%3A36

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-20%2018%3A58%3A49

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-01-23%2002%3A46%3A57

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-23%2020%3A23%3A36

The signature looks nearly identical in each:

#   Failed test 'WAL summary file exists'
#   at t/002_blocks.pl line 79.

#   Failed test 'stdout shows block 0 modified'
#   at t/002_blocks.pl line 85.
#   ''
# doesn't match '(?^m:FORK main: block 0$)'

I haven't been able to reproduce the issue on my machine, and I haven't
figured out precisely what is happening yet, but I wanted to make sure
there is awareness.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Set log_lock_waits=on by default

2024-01-24 Thread Michael Banck
Hi,

On Thu, Jan 11, 2024 at 03:24:55PM +0100, Michael Banck wrote:
> On Thu, Dec 21, 2023 at 02:29:05PM +0100, Laurenz Albe wrote:
> > Here is a patch to implement this.
> > Being stuck behind a lock for more than a second is almost
> > always a problem, so it is reasonable to turn this on by default.
> 
> I also think that this should be set to on.
> 
> I had a look at the patch and it works fine. 
>
> Regarding the documentation, maybe the back-reference at
> deadlock_timeout could be made a bit more explicit then as well, as in
> the attached patch, but this is mostly bikeshedding.

I've marked it ready for committer now, as the above really is
bikeshedding.


Michael




Re: Add system identifier to backup manifest

2024-01-24 Thread Robert Haas
On Mon, Jan 22, 2024 at 2:22 AM Amul Sul  wrote:
> Thinking a bit more on this, I realized parse_manifest_file() has many out
> parameters. Instead parse_manifest_file() should simply return manifest data
> like load_backup_manifest().  Attached 0001 patch doing the same, and removed
> parser_context structure, and added manifest_data, and did the required
> adjustments to pg_verifybackup code.

 InitializeBackupManifest(&manifest, opt->manifest,
-                         opt->manifest_checksum_type);
+                         opt->manifest_checksum_type,
+                         GetSystemIdentifier());

InitializeBackupManifest() can just call GetSystemIdentifier() itself,
instead of passing another parameter, I think.

+   if (manifest_version == 1)
+       context->error_cb(context,
+                         "%s: backup manifest doesn't support incremental changes",
+                         private_context->backup_directory);

I think this is weird. First, there doesn't seem to be any reason to
bounce through error_cb() here. You could just call pg_fatal(), as we
do elsewhere in this file. Second, there doesn't seem to be any need
to include the backup directory in this error message. We include the
file name (not the directory name) in errors that pertain to the file
itself, like if we can't open or read it. But we don't do that for
semantic errors about the manifest contents (cf.
combinebackup_per_file_cb). This file would need a lot fewer changes
if you didn't feed the backup directory name through here. Third, the
error message is not well-chosen, because a user who looks at it won't
understand WHY the manifest doesn't support incremental changes. I
suggest "backup manifest version 1 does not support incremental
backup".

+   /* Incremental backups supported on manifest version 2 or later */
+   if (manifest_version == 1)
+       context->error_cb(context,
+                         "incremental backups cannot be taken for this backup");

Let's use the same error message as in the previous case here also.

+   for (i = 0; i < n_backups; i++)
+   {
+       if (manifests[i]->system_identifier != system_identifier)
+       {
+           char       *controlpath;
+
+           controlpath = psprintf("%s/%s", prior_backup_dirs[i], "global/pg_control");
+
+           pg_fatal("manifest is from different database system: manifest database system identifier is %llu, %s system identifier is %llu",
+                    (unsigned long long) manifests[i]->system_identifier,
+                    controlpath,
+                    (unsigned long long) system_identifier);
+       }
+   }

check_control_files() already verifies that all of the control files
contain the same system identifier as each other, so what we're really
checking here is that the backup manifest in each directory has the
same system identifier as the control file in that same directory. One
problem is that backup manifests are optional here, as per the comment
in load_backup_manifests(), so you need to skip over NULL entries
cleanly to avoid seg faulting if one is missing. I also think the
error message should be changed. How about "%s: manifest system
identifier is %llu, but control file has %llu"?
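Concretely, the NULL skip plus the suggested wording could look roughly like
this (an untested sketch; the path passed for %s is only a guess):

    for (i = 0; i < n_backups; i++)
    {
        /* Backup manifests are optional; skip any directory that lacks one. */
        if (manifests[i] == NULL)
            continue;

        if (manifests[i]->system_identifier != system_identifier)
            pg_fatal("%s: manifest system identifier is %llu, but control file has %llu",
                     prior_backup_dirs[i],
                     (unsigned long long) manifests[i]->system_identifier,
                     (unsigned long long) system_identifier);
    }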

+   context->error_cb(context,
+                     "manifest is from different database system: manifest database system identifier is %llu, pg_control database system identifier is %llu",
+                     (unsigned long long) manifest_system_identifier,
+                     (unsigned long long) system_identifier);

And here, while I'm kibitzing, how about "manifest system identifier
is %llu, but this system's identifier is %llu"?

-   qr/could not open directory/,
+   qr/could not open file/,

I don't think that the expected error message here should be changing.
Does it really, with the latest patch version? Why? Can we fix that?

+   else if (!parse->saw_system_identifier_field &&
+            strcmp(parse->manifest_version, "1") != 0)

I don't think this has any business testing the manifest version.
That's something to sort out at some later stage.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC: GROUP BY optimization

2024-01-24 Thread Nathan Bossart
A recent buildfarm failure [0] seems to indicate a name collision with the
"abc" index in the aggregates.sql test and the "abc" table in
namespace.sql.

[0] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2024-01-24%2014%3A05%3A14

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-01-24 Thread Robert Haas
On Wed, Jan 24, 2024 at 12:08 PM Nathan Bossart
 wrote:
> I'm seeing some recent buildfarm failures for pg_walsummary:
>
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-01-14%2006%3A21%3A58
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2024-01-17%2021%3A10%3A36
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-20%2018%3A58%3A49
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-01-23%2002%3A46%3A57
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-23%2020%3A23%3A36
>
> The signature looks nearly identical in each:
>
> #   Failed test 'WAL summary file exists'
> #   at t/002_blocks.pl line 79.
>
> #   Failed test 'stdout shows block 0 modified'
> #   at t/002_blocks.pl line 85.
> #   ''
> # doesn't match '(?^m:FORK main: block 0$)'
>
> I haven't been able to reproduce the issue on my machine, and I haven't
> figured out precisely what is happening yet, but I wanted to make sure
> there is awareness.

This is weird. There's a little more detail in the log file,
regress_log_002_blocks, e.g. from the first failure you linked:

[11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
[11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
[11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 0/155BAF0
# 1
...
[11:18:21.349](0.000s) #  got: 'pg_walsummary: error: could
not open file 
"/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0001014E0D08155BAF0
# 1.summary": No such file or directory'

The "examining summary" line is generated based on the output of
pg_available_wal_summaries(). The way that works is that the server
calls readdir(), disassembles the filename into a TLI and two LSNs,
and returns the result. Then, a fraction of a second later, the test
script reassembles those components into a filename and finds the file
missing. If the logic to translate between filenames and TLIs & LSNs
were incorrect, the test would fail consistently. So the only
explanation that seems to fit the facts is the file disappearing out
from under us. But that really shouldn't happen. We do have code to
remove such files in MaybeRemoveOldWalSummaries(), but it's only
supposed to be nuking files more than 10 days old.
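For reference, the round trip is essentially filename -> (TLI, start LSN, end
LSN) -> filename, roughly like this (a from-memory sketch, not a quote of the
actual code; tli, start_lsn and end_lsn are stand-ins):

    char        path[MAXPGPATH];

    /* Reassemble a summary file name from the TLI and the LSN bounds. */
    snprintf(path, MAXPGPATH, XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
             tli,
             LSN_FORMAT_ARGS(start_lsn),
             LSN_FORMAT_ARGS(end_lsn));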

So I don't really have a theory here as to what could be happening. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-01-24 Thread Tomas Vondra



On 1/23/24 21:47, Robert Haas wrote:
> On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra
>  wrote:
>> 1) desirability: We want a built-in way to handle sequences in logical
>> replication. I think everyone agrees this is not a way to do distributed
>> sequences in an active-active setups, but that there are other use cases
>> that need this feature - typically upgrades / logical failover.
> 
> Yeah. I find it extremely hard to take seriously the idea that this
> isn't a valuable feature. How else are you supposed to do a logical
> failover without having your entire application break?
> 
>> 2) performance: There was concern about the performance impact, and that
>> it affects everyone, including those who don't replicate sequences (as
>> the overhead is mostly incurred before calls to output plugin etc.).
>>
>> The agreement was that the best way is to have a CREATE SUBSCRIPTION
>> option that would instruct the upstream to decode sequences. By default
>> this option is 'off' (because that's the no-overhead case), but it can
>> be enabled for each subscription.
> 
> Seems reasonable, at least unless and until we come up with something better.
> 
>> 3) correctness: The last point is about making "transactional" flag
>> correct when the snapshot state changes mid-transaction, originally
>> pointed out by Dilip [4]. Per [5] this however happens to work
>> correctly, because while we identify the change as 'non-transactional'
>> (which is incorrect), we immediately throw it again (so we don't try to
>> apply it, which would error-out).
> 
> I've said this before, but I still find this really scary. It's
> unclear to me that we can simply classify updates as transactional or
> non-transactional and expect things to work. If it's possible, I hope
> we have a really good explanation somewhere of how and why it's
> possible. If we do, can somebody point me to it so I can read it?
> 

I did try to explain how this works (and why) in a couple places:

1) the commit message
2) reorderbuffer header comment
3) ReorderBufferSequenceIsTransactional comment (and nearby)

It's possible this does not meet your expectations, ofc. Maybe there
should be a separate README for this - I haven't found anything like
that for logical decoding in general, which is why I did (1)-(3).

> To be possibly slightly more clear about my concern, I think the scary
> case is where we have transactional and non-transactional things
> happening to the same sequence in close temporal proximity, either
> within the same session or across two or more sessions.  If a
> non-transactional change can get reordered ahead of some transactional
> change upon which it logically depends, or behind some transactional
> change that logically depends on it, then we have trouble. I also
> wonder if there are any cases where the same operation is partly
> transactional and partly non-transactional.
> 

I certainly understand this concern, and to some extent I even share it.
Having to differentiate between transactional and non-transactional
changes certainly confused me more than once. It's especially confusing,
because the decoding implicitly changes the perceived ordering/atomicity
of the events.

That being said, I don't think it gets reordered the way you're concerned
about. The "transactionality" is determined by a relfilenode change, so
how could the reordering happen? We'd have to misidentify a change in
either direction - and for a nontransactional->transactional change that's
clearly not possible. There has to be a new relfilenode in that xact.

In the other direction (transactional->nontransactional), it can happen
if we fail to decode the relfilenode record. Which is what we discussed
earlier, but came to the conclusion that it actually works OK.

Of course, there might be bugs. I spent quite a bit of effort reviewing
and testing this, but there still might be something wrong. But I think
that applies to any feature.

What would be worse is some sort of thinko in the approach in general. I
don't have a good answer to that, unfortunately - I think it works, but
how would I know for sure? We explored multiple alternative approaches
and all of them crashed and burned ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: UUID v7

2024-01-24 Thread Andrey Borodin



> On 24 Jan 2024, at 22:00, Marcos Pegoraro  wrote:
> 
> Is starting from 1970 enough?
Per the standard, the unix_ts_ms field is the number of milliseconds since the
UNIX epoch, 1970-01-01.

> What if a user wants to have a UUID of his birth date?

I've claimed my
0078c135-bd00-70b1-865a-63c3741922a5

But again, UUIDs are not designed to store a timestamp. They are unique, and v7
promotes data locality via time-ordering.


Best regards, Andrey Borodin.



Re: make dist using git archive

2024-01-24 Thread Tristan Partin

On Wed Jan 24, 2024 at 10:18 AM CST, Tristan Partin wrote:

On Tue Jan 23, 2024 at 3:30 AM CST, Peter Eisentraut wrote:
> On 22.01.24 21:04, Tristan Partin wrote:
> > I am not really following why we can't use the builtin Meson dist 
> > command. The only difference from my testing is it doesn't use a 
> > --prefix argument.

>
> Here are some problems I have identified:
>
> 1. meson dist internally runs gzip without the -n option.  That makes 
> the tar.gz archive include a timestamp, which in turn makes it not 
> reproducible.


It doesn't look like Python provides the facilities to affect this.

> 2. Because gzip includes a platform indicator in the archive, the 
> produced tar.gz archive is not reproducible across platforms.  (I don't 
> know if gzip has an option to avoid that.  git archive uses an internal 
> gzip implementation that handles this.)


Same reason as above.


> 3. Meson does not support tar.bz2 archives.


Submitted https://github.com/mesonbuild/meson/pull/12770.

> 4. Meson uses git archive internally, but then unpacks and repacks the 
> archive, which loses the ability to use git get-tar-commit-id.


Because Meson allows projects to distribute arbitrary files via 
meson.add_dist_script(), and can include subprojects via `meson dist 
--include-subprojects`, this doesn't seem like an easily solvable 
problem.


> 5. I have found that the tar archives created by meson and git archive 
> include the files in different orders.  I suspect that the Python 
> tarfile module introduces some either randomness or platform dependency.


Seems likely.


> 6. meson dist is also slower because of the additional work.


Not easily solvable due to 4.

> 7. meson dist produces .sha256sum files but we have called them .sha256. 
>   (This is obviously trivial, but it is something that would need to be 
> dealt with somehow nonetheless.)

>
> Most or all of these issues are fixable, either upstream in Meson or by 
> adjusting our own requirements.  But for now this route would have some 
> significant disadvantages.


Thanks Peter. I will bring these up with upstream!


I think the solution to point 4 is to not unpack/repack if there are no 
dist scripts and/or subprojects to distribute. I can take a look at 
this later. I think this would also solve points 1, 2, 5, and 6 because 
at that point meson is just calling git-archive.


--
Tristan Partin
Neon (https://neon.tech)




Re: cleanup patches for incremental backup

2024-01-24 Thread Nathan Bossart
On Wed, Jan 24, 2024 at 12:46:16PM -0500, Robert Haas wrote:
> The "examining summary" line is generated based on the output of
> pg_available_wal_summaries(). The way that works is that the server
> calls readdir(), disassembles the filename into a TLI and two LSNs,
> and returns the result. Then, a fraction of a second later, the test
> script reassembles those components into a filename and finds the file
> missing. If the logic to translate between filenames and TLIs & LSNs
> were incorrect, the test would fail consistently. So the only
> explanation that seems to fit the facts is the file disappearing out
> from under us. But that really shouldn't happen. We do have code to
> remove such files in MaybeRemoveOldWalSummaries(), but it's only
> supposed to be nuking files more than 10 days old.
> 
> So I don't really have a theory here as to what could be happening. :-(

There might be an overflow risk in the cutoff time calculation, but I doubt
that's the root cause of these failures:

/*
 * Files should only be removed if the last modification time precedes the
 * cutoff time we compute here.
 */
cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
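If that calculation ever did become a problem (wal_summary_keep_time is in
minutes), widening before the multiplication would avoid it; a trivial sketch:

    cutoff_time = time(NULL) - (time_t) wal_summary_keep_time * 60;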

Otherwise, I think we'll probably need to add some additional logging to
figure out what is happening...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: index prefetching

2024-01-24 Thread Tomas Vondra
On 1/22/24 08:21, Konstantin Knizhnik wrote:
> 
> On 22/01/2024 1:39 am, Tomas Vondra wrote:
>>> Why we can prefer covering index  to compound index? I see only two good
>>> reasons:
>>> 1. Extra columns type do not  have comparison function need for AM.
>>> 2. The extra columns are never used in query predicate.
>>>
>> Or maybe you don't want to include the columns in a UNIQUE constraint?
>>
> Do you mean that compound index (a,b) can not be used to enforce
> uniqueness of "a"?
> If so, I agree.
> 

Yes.

>>> If you are going to use this columns in query predicates I do not see
>>> much sense in creating inclusive index rather than compound index.
>>> Do you?
>>>
>> But this is also about conditions that can't be translated into index
>> scan keys. Consider this:
>>
>> create table t (a int, b int, c int);
>> insert into t select 1000 * random(), 1000 * random(), 1000 * random()
>> from generate_series(1,100) s(i);
>> create index on t (a,b);
>> vacuum analyze t;
>>
>> explain (analyze, buffers) select * from t where a = 10 and mod(b,10) =
>> 111;
>>     QUERY PLAN
>>
>> -
>>   Index Scan using t_a_b_idx on t  (cost=0.42..3670.74 rows=5 width=12)
>> (actual time=4.562..4.564 rows=0 loops=1)
>>     Index Cond: (a = 10)
>>     Filter: (mod(b, 10) = 111)
>>     Rows Removed by Filter: 974
>>     Buffers: shared hit=980
>>     Prefetches: blocks=901
>>   Planning Time: 0.304 ms
>>   Execution Time: 5.146 ms
>> (8 rows)
>>
>> Notice that this still fetched ~1000 buffers in order to evaluate the
>> filter on "b", because it's complex and can't be transformed into a nice
>> scan key.
> 
> Oh yes.
> Looks like I didn't understand the logic of when a predicate is included in
> the index condition and when not.
> It seemed natural that only a predicate which specifies some
> range can be included in the index condition.
> But that is not the case:
> 
> postgres=# explain select * from t where a = 10 and b in (10,20,30);
>  QUERY PLAN
> -
>  Index Scan using t_a_b_idx on t  (cost=0.42..25.33 rows=3 width=12)
>    Index Cond: ((a = 10) AND (b = ANY ('{10,20,30}'::integer[])))
> (2 rows)
> 
> So I thought an ANY predicate using index keys is included in the index condition.
> But it is not true (as your example shows).
> 
> But IMHO mod(b,10)=11 or (b+1) < 100 are both quite rare predicates;
> this is why I named these use cases "exotic".

Not sure I agree with describing this as "exotic".

The same thing applies to an arbitrary function call. And those are
pretty common in conditions - date_part/date_trunc. Arithmetic
expressions are not that uncommon either. Also, users sometimes have
conditions comparing multiple keys (a 
> In any case, if we have some columns in the index tuple it is desirable to use
> them for filtering before extracting the heap tuple.
> But I'm afraid it will not be so easy to implement...
> 

I'm not sure what you mean. The patch does that, more or less. There's
issues that need to be solved (e.g. to decide when not to do this), and
how to integrate that into the scan interface (where the quals are
evaluated at the end).

What do you mean when you say "will not be easy to implement"? What
problems do you foresee?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP Incremental JSON Parser

2024-01-24 Thread Robert Haas
On Wed, Jan 24, 2024 at 10:04 AM Andrew Dunstan  wrote:
> The cfbot reports an error on a 32 bit build 
> :
>
> # Running: pg_basebackup -D 
> /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2
>  --no-sync -cfast --incremental 
> /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
> pg_basebackup: error: could not upload manifest: ERROR:  could not parse 
> backup manifest: file size is not an integer
> pg_basebackup: removing data directory 
> "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
> [02:41:07.830](0.073s) not ok 2 - incremental backup from node1
> [02:41:07.830](0.000s) #   Failed test 'incremental backup from node1'
>
> I have set up a Debian 12 EC2 instance following the recipe at 
> ,
>  and ran what I think are the same tests dozens of times, but the failure did 
> not reappear in my setup. Unfortunately, the test doesn't show the failing 
> manifest or log the failing field, so trying to divine what happened here is 
> more than difficult.
>
> Not sure how to address this.

Yeah, that's really odd. The backup size field is printed like this:

appendStringInfo(&buf, "\"Size\": %zu, ", size);

And parsed like this:

size = strtoul(parse->size, &ep, 10);
if (*ep)
    json_manifest_parse_failure(parse->context,
                                "file size is not an integer");

I confess to bafflement -- how could the output of the first fail to
be parsed by the second? The manifest has to look pretty much valid in
order not to error out before it gets to this check, with just that
one field corrupted. But I don't understand how that could happen.

I agree that the error reporting could be better here, but it didn't
seem worth spending time on when I wrote the code. I figured the only
way we could end up with something like "Size": "Walrus" is if the
user was messing with us on purpose. Apparently that's not so, yet the
mechanism eludes me. Or maybe it's not some random string, but is
something like an empty string or a number with trailing garbage or a
number that's out of range. But I don't see how any of those things
can happen either.
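One cheap hardening that would at least catch empty, out-of-range, or
trailing-garbage values explicitly is to also check errno and the end pointer;
a sketch (untested, and reporting the offending string would be a separate
change):

    errno = 0;
    size = strtoul(parse->size, &ep, 10);
    if (errno == ERANGE || ep == parse->size || *ep != '\0')
        json_manifest_parse_failure(parse->context,
                                    "file size is not an integer");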

Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2024-01-24 Thread Tomas Vondra


On 1/22/24 07:35, Konstantin Knizhnik wrote:
> 
> On 22/01/2024 1:47 am, Tomas Vondra wrote:
>> h, right. Well, you're right in this case we perhaps could set just one
>> of those flags, but the "purpose" of the two places is quite different.
>>
>> The "prefetch" flag is fully controlled by the prefetcher, and it's up
>> to it to change it (e.g. I can easily imagine some new logic touching
>> setting it to "false" for some reason).
>>
>> The "data" flag is fully controlled by the custom callbacks, so whatever
>> the callback stores, will be there.
>>
>> I don't think it's worth simplifying this. In particular, I don't think
>> the callback can assume it can rely on the "prefetch" flag.
>>
> Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not
> cause any extra space overhead (because of alignment), but allows to
> avoid dynamic memory allocation (not sure if it is critical, but nice to
> avoid if possible).
> 

Because it's specific to index-only scans, while IndexPrefetchEntry is a
generic thing, for all places.

However:

(1) Melanie actually presented a very different way to implement this,
relying on the StreamingRead API. So chances are this struct won't
actually be used.

(2) After going through Melanie's patch, I realized this is actually
broken. The IOS case needs to keep more stuff, not just the all-visible
flag, but also the index tuple. Otherwise it'll just operate on the last
tuple read from the index, which happens to be in xs_ituple. Attached is
a patch with a trivial fix.
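In other words, the per-entry payload for the IOS case needs to carry roughly
this much (an illustrative sketch, not the struct from the attached patch):

typedef struct IndexPrefetchIOSData
{
    bool        all_visible;    /* VM bit observed when the TID was queued */
    IndexTuple  itup;           /* copy of the matching index tuple */
} IndexPrefetchIOSData;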


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 5ac954b8d13fbb9419204195c15cc594870e9702 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 17 Nov 2023 23:54:19 +0100
Subject: [PATCH v20240124 1/2] Prefetch heap pages during index scans

Index scans are a significant source of random I/O on the indexed heap,
but can't benefit from kernel read-ahead. For bitmap scans that is not
an issue, because they do prefetch explicitly, but for plain index scans
this is a major bottleneck - reading page at a time does not allow
saturating modern storage systems.

This enhances index scans (including index-only scans) to prefetch heap
pages. The scan maintains a queue of future TIDs received from an index,
prefetch the associated heap page, and then eventually pass the TID to
the caller.

To eliminate unnecessary prefetches, a small cache of recent prefetches
is maintained, and the prefetches are skipped. Furthermore, sequential
access patterns are detected and not prefetched, on the assumption that
the kernel read-ahead will do this more efficiently.

These optimizations are best-effort heuristics - we don't know if the
kernel will actually prefetch the pages on it's own, and we can't easily
check that. Moreover, different kernels (and kernel) versions may behave
differently.

Note: For shared buffers we can easily check if a page is cached, and
the PrefetchBuffer() function already takes care of that. These
optimizations are primarily about the page cache.

The prefetching is also disabled for plans that may not be executed only
once - these plans may change direction, interfering with the prefetch
queue. Consider scrollable cursors with backwards scans. This might get
improved to allow the prefetcher to handle direction changes, but it's
not clear if it's worth it.

Note: If a plan changes the scan direction, that inherently wastes the
issued prefetches. If the direction changes often, it likely means a lot
of the pages are still cached. Similarly, if a plan pauses for a long
time, the already prefetched pages may get evicted.
---
 src/backend/commands/explain.c   |  18 +
 src/backend/executor/Makefile|   1 +
 src/backend/executor/execMain.c  |  12 +
 src/backend/executor/execPrefetch.c  | 884 +++
 src/backend/executor/instrument.c|   4 +
 src/backend/executor/nodeIndexonlyscan.c | 113 ++-
 src/backend/executor/nodeIndexscan.c |  68 +-
 src/include/executor/executor.h  |  52 ++
 src/include/executor/instrument.h|   2 +
 src/include/nodes/execnodes.h|  10 +
 src/tools/pgindent/typedefs.list |   3 +
 11 files changed, 1162 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/executor/execPrefetch.c

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 843472e6ddc..b6631cdb2de 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3568,6 +3568,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 		!INSTR_TIME_IS_ZERO(usage->local_blk_write_time));
 		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
 	   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+		bool		has_prefetches = (usage->blks_prefetches > 0);
 		bool		show_planning = (planning && (has_shared ||
   has_local || has_temp ||
   has_s

Re: cleanup patches for incremental backup

2024-01-24 Thread Robert Haas
On Wed, Jan 24, 2024 at 1:05 PM Nathan Bossart  wrote:
> There might be an overflow risk in the cutoff time calculation, but I doubt
> that's the root cause of these failures:
>
> /*
>  * Files should only be removed if the last modification time 
> precedes the
>  * cutoff time we compute here.
>  */
> cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
>
> Otherwise, I think we'll probably need to add some additional logging to
> figure out what is happening...

Where, though? I suppose we could:

1. Change the server code so that it logs each WAL summary file
removed at a log level high enough to show up in the test logs.

2. Change the TAP test so that it prints out readdir(WAL summary
directory) at various points in the test.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: dblink query interruptibility

2024-01-24 Thread Fujii Masao
On Wed, Nov 22, 2023 at 10:29 AM Noah Misch  wrote:
>
> === Background
>
> Something as simple as the following doesn't respond to cancellation.  In
> v15+, any DROP DATABASE will hang as long as it's running:
>
>   SELECT dblink_exec(
> $$dbname='$$||current_database()||$$' port=$$||current_setting('port'),
> 'SELECT pg_sleep(15)');
>
> https://postgr.es/m/4b584c99.8090...@enterprisedb.com proposed a fix back in
> 2010.  Latches and the libpqsrv facility have changed the server programming
> environment since that patch.  The problem statement also came up here:
>
> On Thu, Dec 08, 2022 at 06:08:15PM -0800, Andres Freund wrote:
> > dblink.c uses a lot of other blocking libpq functions, which obviously also
> > isn't ok.
>
>
> === Key decisions
>
> This patch adds to libpqsrv facility.

I found that this patch was committed at d3c5f37dd5 and changed the
error message in postgres_fdw slightly. Here's an example:

#1. Begin a new transaction.
#2. Execute a query accessing to a foreign table, like SELECT * FROM

#3. Terminate the *remote* session corresponding to the foreign table.
#4. Commit the transaction, and then currently the following error
message is output.

ERROR:  FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
   invalid socket

Previously, before commit d3c5f37dd5, the error message at #4 did not
include "invalid socket." Now, after the commit, it does. Is this
change intentional?

+ /* Consume whatever data is available from the socket */
+ if (PQconsumeInput(conn) == 0)
+ {
+ /* trouble; expect PQgetResult() to return NULL */
+ break;
+ }
+ }
+
+ /* Now we can collect and return the next PGresult */
+ return PQgetResult(conn);

This code appears to cause the change. When the remote session ends,
PQconsumeInput() returns 0 and marks conn->socket as invalid.
Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
and appending "invalid socket" to the error message.

I think the "invalid socket" message is unsuitable in this scenario,
and PQgetResult() should not be called after PQconsumeInput() returns
0. Thought?

Regards,

-- 
Fujii Masao




Re: cleanup patches for incremental backup

2024-01-24 Thread Nathan Bossart
On Wed, Jan 24, 2024 at 02:08:08PM -0500, Robert Haas wrote:
> On Wed, Jan 24, 2024 at 1:05 PM Nathan Bossart  
> wrote:
>> Otherwise, I think we'll probably need to add some additional logging to
>> figure out what is happening...
> 
> Where, though? I suppose we could:
> 
> 1. Change the server code so that it logs each WAL summary file
> removed at a log level high enough to show up in the test logs.
> 
> 2. Change the TAP test so that it prints out readdir(WAL summary
> directory) at various points in the test.

That seems like a reasonable starting point.  Even if it doesn't help
determine the root cause, it should at least help rule out concurrent
summary removal.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Remove redundant HandleWalWriterInterrupts()

2024-01-24 Thread Nathan Bossart
On Wed, Jan 24, 2024 at 07:30:17PM +0530, Bharath Rupireddy wrote:
> On Wed, Jan 24, 2024 at 5:50 PM Fujii Masao  wrote:
>> Because of commit 1bdd54e662, the code of HandleWalWriterInterrupts()
>> became the same as HandleMainLoopInterrupts(). So I'd like to propose to
>> remove HandleWalWriterInterrupts() and make walwriter use
>> HandleMainLoopInterrupts() instead for improved code simplicity. Thought?
>>
>> Patch attached.
> 
> Nice catch. Indeed they both are the same after commit 1bdd54e662. The
> patch LGTM.

LGTM, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: POC: GROUP BY optimization

2024-01-24 Thread Alexander Korotkov
On Wed, Jan 24, 2024 at 7:38 PM Nathan Bossart  wrote:
> A recent buildfarm failure [0] seems to indicate a name collision with the
> "abc" index in the aggregates.sql test and the "abc" table in
> namespace.sql.
>
> [0] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2024-01-24%2014%3A05%3A14

Thank you for catching this.  Fixed.

--
Regards,
Alexander Korotkov




Re: s_lock_test no longer works

2024-01-24 Thread Andres Freund
Hi,

On 2024-01-24 12:14:17 +0100, Alvaro Herrera wrote:
> I do wonder if we want to keep this around, given that it's been more
> than one year broken and nobody seems to have noticed, and the Meson
> build does not support the test as a target.

Perhaps we should just make the test built and run by default instead?  OTOH,
regress.c:test_spinlock() actually covers about as much as the standalone
test...

Greetings,

Andres Freund




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

2024-01-24 Thread Kirk Wolak
On Mon, Jan 22, 2024 at 1:30 AM Kirk Wolak  wrote:

> On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson  wrote:
>
>> > On 19 Jan 2024, at 23:09, Kirk Wolak  wrote:
>>
>> ...
>> ./configure  --with-llvm
>> LLVM_CONFIG=/path/to/llvm-config
>>
>> --
>> Daniel Gustafsson
>>
>
> Thank you, that made it possible to build and run...
> UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
> I am watching it already consuming 6% of my system memory.
>
>
Daniel,
  In the previous email, I made note that once the JIT was enabled, the
problem exists in 17Devel.
I re-included my script, which forced the JIT to be used...

  I attached an updated script that forced the settings.
But this is still leaking memory (outside of the
pg_backend_memory_context() calls).
Probably because it's at the LLVM level?  And it does NOT happen from
planning/opening the query.  It appears I have to fetch the rows to see the
problem.

Thanks in advance.  Let me know if I should be doing something different?

Kirk Out!
PS: I was wondering if we had a function that showed the total memory of the
backend, to help determine whether we might have a third-party leak (an
increase in total memory consumed is not noticed by
pg_backend_memory_contexts).

#include "postgres.h"
#include 

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_backend_memory_footprint);

Datum pg_backend_memory_footprint(PG_FUNCTION_ARGS) {
long memory_usage_bytes = 0;
struct rusage usage;

getrusage(RUSAGE_SELF, &usage);
memory_usage_bytes = usage.ru_maxrss * 1024;

PG_RETURN_INT64(memory_usage_bytes);
}


Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-24 Thread Robert Haas
On Thu, Jan 18, 2024 at 9:23 PM Melanie Plageman
 wrote:
> I have attached a rebased version of the former 0004 as v11-0001.

This looks correct to me, although I wouldn't mind some more eyes on
it. However, I think that the comments still need more work.

Specifically:

 /*
  * Prune, freeze, and count tuples.
  *
  * Accumulates details of remaining LP_DEAD line pointers on page in
  * dead_items array.  This includes LP_DEAD line pointers that we
  * pruned ourselves, as well as existing LP_DEAD line pointers that
  * were pruned some time earlier.  Also considers freezing XIDs in the
  * tuple headers of remaining items with storage. It also determines
  * if truncating this block is safe.
  */
-lazy_scan_prune(vacrel, buf, blkno, page,
-vmbuffer, all_visible_according_to_vm,
-&has_lpdead_items);
+if (got_cleanup_lock)
+lazy_scan_prune(vacrel, buf, blkno, page,
+vmbuffer, all_visible_according_to_vm,
+&has_lpdead_items);

I think this comment needs adjusting. Possibly, the preceding calls to
lazy_scan_noprune() and lazy_scan_new_or_empty() could even use a bit
better comments, but in those cases, you're basically keeping the same
code with the same comment, so it's kinda defensible. Here, though,
you're making the call conditional without any comment update.

 /*
  * Final steps for block: drop cleanup lock, record free space in the
  * FSM.
  *
  * If we will likely do index vacuuming, wait until
  * lazy_vacuum_heap_rel() to save free space. This doesn't just save
  * us some cycles; it also allows us to record any additional free
  * space that lazy_vacuum_heap_page() will make available in cases
  * where it's possible to truncate the page's line pointer array.
  *
+ * Our goal is to update the freespace map the last time we touch the
+ * page. If the relation has no indexes, or if index vacuuming is
+ * disabled, there will be no second heap pass; if this particular
+ * page has no dead items, the second heap pass will not touch this
+ * page. So, in those cases, update the FSM now.
+ *
  * Note: It's not in fact 100% certain that we really will call
  * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index
  * vacuuming (and so must skip heap vacuuming).  This is deemed okay
  * because it only happens in emergencies, or when there is very
  * little free space anyway. (Besides, we start recording free space
  * in the FSM once index vacuuming has been abandoned.)
  */

I think this comment needs a rewrite, not just sticking the other
comment in the middle of it. There's some duplication between these
two comments, and merging it all together should iron that out.
Personally, I think my comment (which was there before, this commit
only moves it here) is clearer than what's already here about the
intent, but it's lacking some details that are captured in the other
two paragraphs, and we probably don't want to lose those details.

If you'd like, I can try rewriting these comments to my satisfaction
and you can reverse-review the result. Or you can rewrite them and
I'll re-review the result. But I think this needs to be a little less
mechanical. It's not just about shuffling all the comments around so
that all the text ends up somewhere -- we also need to consider the
degree to which the meaning becomes duplicated when it all gets merged
together.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: s_lock_test no longer works

2024-01-24 Thread Tom Lane
Andres Freund  writes:
> On 2024-01-24 12:14:17 +0100, Alvaro Herrera wrote:
>> I do wonder if we want to keep this around, given that it's been more
>> than one year broken and nobody seems to have noticed, and the Meson
>> build does not support the test as a target.

> Perhaps we should just make the test built and run by default instead?  OTOH,
> regress.c:test_spinlock() actually covers about as much as the standalone
> test...

If your spinlocks aren't working, it's unlikely you'll get as far as
being able to run test_spinlock().  I think the standalone test does
have some value; it's just that it's not needed very often these days.

regards, tom lane




Re: index prefetching

2024-01-24 Thread Melanie Plageman
On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra
 wrote:
>
> On 1/24/24 01:51, Melanie Plageman wrote:
>
> >>> There are also table AM layering violations in my sketch which would
> >>> have to be worked out (not to mention some resource leakage I didn't
> >>> bother investigating [which causes it to fail tests]).
> >>>
> >>> 0001 is all of Thomas' streaming read API code that isn't yet in
> >>> master and 0002 is my rough sketch of index prefetching using the
> >>> streaming read API
> >>>
> >>> There are also numerous optimizations that your index prefetching
> >>> patch set does that would need to be added in some way. I haven't
> >>> thought much about it yet. I wanted to see what you thought of this
> >>> approach first. Basically, is it workable?
> >>
> >> It seems workable, yes. I'm not sure it's much simpler than my patch
> >> (considering a lot of the code is in the optimizations, which are
> >> missing from this patch).
> >>
> >> I think the question is where should the optimizations happen. I suppose
> >> some of them might/should happen in the StreamingRead API itself - like
> >> the detection of sequential patterns, recently prefetched blocks, ...
> >
> > So, the streaming read API does detection of sequential patterns and
> > not prefetching things that are in shared buffers. It doesn't handle
> > avoiding prefetching recently prefetched blocks yet AFAIK. But I
> > daresay this would be relevant for other streaming read users and
> > could certainly be implemented there.
> >
>
> Yes, the "recently prefetched stuff" cache seems like a fairly natural
> complement to the pattern detection and shared-buffers check.
>
> FWIW I wonder if we should make some of this customizable, so that
> systems with customized storage (e.g. neon or with direct I/O) can e.g.
> disable some of these checks. Or replace them with their version.

That's a promising idea.

> >> But I'm not sure what to do about optimizations that are more specific
> >> to the access path. Consider for example the index-only scans. We don't
> >> want to prefetch all the pages, we need to inspect the VM and prefetch
> >> just the not-all-visible ones. And then pass the info to the index scan,
> >> so that it does not need to check the VM again. It's not clear to me how
> >> to do this with this approach.
> >
> > Yea, this is an issue I'll need to think about. To really spell out
> > the problem: the callback dequeues a TID from the tid_queue and looks
> > up its block in the VM. It's all visible. So, it shouldn't return that
> > block to the streaming read API to fetch from the heap because it
> > doesn't need to be read. But, where does the callback put the TID so
> > that the caller can get it? I'm going to think more about this.
> >
>
> Yes, that's the problem for index-only scans. I'd generalize it so that
> it's about the callback being able to (a) decide if it needs to read the
> heap page, and (b) store some custom info for the TID.

Actually, I think this is no big deal. See attached. I just don't
enqueue tids whose blocks are all visible. I had to switch the order
from fetch heap then fill queue to fill queue then fetch heap.

While doing this I noticed some wrong results in the regression tests
(like in the alter table test), so I suspect I have some kind of
control flow issue. Perhaps I should fix the resource leak so I can
actually see the failing tests :)

As for your a) and b) above.

Regarding a): We discussed allowing speculative prefetching and
separating the logic for prefetching from actually reading blocks (so
you can prefetch blocks you ultimately don't read). We decided this
may not belong in a streaming read API. What do you think?

Regarding b): We can store per buffer data for anything that actually
goes down through the streaming read API, but, in the index only case,
we don't want the streaming read API to know about blocks that it
doesn't actually need to read.

- Melanie
From f6cb591ba520351ab7f0e7cbf9d6df3dacda6b44 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v3 1/2] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c  |  40 +-
 src/backend/access/transam/xlogutils.c   |   2 +-
 src/backend/postmaster/bgwriter.c|   8 +-
 src/backend/postmaster/checkpointer.c|  15 +-
 src/backend/storage/Makefile |   2 +-
 src/backend/storage/aio/Makefile |  14 +
 src/backend/storage/aio/meson.build  |   5 +
 src/backend/storage/aio/streaming_read.c | 435 ++
 src/backend/storage/buffer/bufmgr.c  | 560 +++
 src/backend/storage/buffer/localbuf.c|  14 +-
 src/backend/storage/meson.build  |   1 +
 src/backend/storage/smgr/smgr.c  |  49 +-
 src/include/storage/bufmgr.h |  22 +
 src/include/storage/smgr.h   |   4 +-
 src/include/storage/streaming_read.h |  45 ++
 src/include/utils/rel.h  |   6 -
 src/tools/pgindent/typedefs.list

Re: s_lock_test no longer works

2024-01-24 Thread Andres Freund
Hi,

On 2024-01-24 15:05:12 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2024-01-24 12:14:17 +0100, Alvaro Herrera wrote:
> >> I do wonder if we want to keep this around, given that it's been more
> >> than one year broken and nobody seems to have noticed, and the Meson
> >> build does not support the test as a target.
> 
> > Perhaps we should just make the test built and run by default instead?  
> > OTOH,
> > regress.c:test_spinlock() actually covers about as much as the standalone
> > test...
> 
> If your spinlocks aren't working, it's unlikely you'll get as far as
> being able to run test_spinlock().  I think the standalone test does
> have some value; it's just that it's not needed very often these days.

As long as the uncontended case works, you can get surprisingly far... But
still, fair enough. If so, I think we should just rig things so the standalone
test gets built and run by default. It's not like that'd be a measurably
expensive thing to do.

Greetings,

Andres Freund




Re: dblink query interruptibility

2024-01-24 Thread Noah Misch
On Thu, Jan 25, 2024 at 04:23:39AM +0900, Fujii Masao wrote:
> I found that this patch was committed at d3c5f37dd5 and changed the
> error message in postgres_fdw slightly. Here's an example:
> 
> #1. Begin a new transaction.
> #2. Execute a query accessing to a foreign table, like SELECT * FROM
> 
> #3. Terminate the *remote* session corresponding to the foreign table.
> #4. Commit the transaction, and then currently the following error
> message is output.
> 
> ERROR:  FATAL:  terminating connection due to administrator command
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>invalid socket
> 
> Previously, before commit d3c5f37dd5, the error message at #4 did not
> include "invalid socket." Now, after the commit, it does. Is this
> change intentional?

No.  It's a consequence of an intentional change in libpq call sequence, but I
was unaware I was changing an error message.

> + /* Consume whatever data is available from the socket */
> + if (PQconsumeInput(conn) == 0)
> + {
> + /* trouble; expect PQgetResult() to return NULL */
> + break;
> + }
> + }
> +
> + /* Now we can collect and return the next PGresult */
> + return PQgetResult(conn);
> 
> This code appears to cause the change. When the remote session ends,
> PQconsumeInput() returns 0 and marks conn->socket as invalid.
> Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
> and appending "invalid socket" to the error message.
> 
> I think the "invalid socket" message is unsuitable in this scenario,
> and PQgetResult() should not be called after PQconsumeInput() returns
> 0. Thought?

The documentation is absolute about the necessity of PQgetResult():

  PQsendQuery cannot be called again (on the same connection) until
  PQgetResult has returned a null pointer, indicating that the command is
  done.

  PQgetResult must be called repeatedly until it returns a null pointer,
  indicating that the command is done. (If called when no command is active,
  PQgetResult will just return a null pointer at once.)

Similar statements also appear in libpq-pipeline-results,
libpq-pipeline-errors, and libpq-copy.


So, unless the documentation or my reading of it is wrong there, I think the
answer is something other than skipping PQgetResult().  Perhaps PQgetResult()
should not append "invalid socket" in this case?  The extra line is a net
negative, though it's not wrong and not awful.
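
For readers following along, the call sequence in question boils down to
roughly this shape (an illustrative sketch, not the committed
dblink/postgres_fdw code; the function name is invented):

/* sketch; assumes libpq-fe.h, miscadmin.h and storage/latch.h */
static PGresult *
get_result_interruptibly(PGconn *conn)
{
	for (;;)
	{
		if (!PQisBusy(conn))
			break;				/* a complete result is available */

		/* Sleep until the socket is readable, but stay interruptible. */
		(void) WaitLatchOrSocket(MyLatch,
								 WL_LATCH_SET | WL_SOCKET_READABLE |
								 WL_EXIT_ON_PM_DEATH,
								 PQsocket(conn), -1L, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();

		/* Consume whatever data is available from the socket. */
		if (PQconsumeInput(conn) == 0)
			break;				/* trouble; expect PQgetResult() to return NULL */
	}

	/* Per the paragraphs quoted above, PQgetResult() must still be called. */
	return PQgetResult(conn);
}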

Thanks for reporting the change.




Re: UUID v7

2024-01-24 Thread Marcos Pegoraro
I understand your point, but
'2000-01-01' :: timestamp and '1900-01-01' :: timestamp are both valid
timestamps.

So looks strange if user can do
select uuidv7(TIMESTAMP '2000-01-01')
but cannot do
select uuidv7(TIMESTAMP '1900-01-01')

Regards
Marcos


Em qua., 24 de jan. de 2024 às 14:51, Andrey Borodin 
escreveu:

>
>
> > On 24 Jan 2024, at 22:00, Marcos Pegoraro  wrote:
> >
> > Is enough from 1970 ?
> Per standard unix_ts_ms field is a number of milliseconds from UNIX start
> date 1970-01-01.
>
> > How about if user wants to have an UUID of his birth date ?
>
> I've claimed my
> 0078c135-bd00-70b1-865a-63c3741922a5
>
> But again, UUIDs are not designed to store timestamp. They are unique and
> v7 promote data locality via time-ordering.
>
>
> Best regards, Andrey Borodin.


Re: UUID v7

2024-01-24 Thread Jelte Fennema-Nio
On Wed, 24 Jan 2024 at 21:47, Marcos Pegoraro  wrote:
>
> I understand your point, but
> '2000-01-01' :: timestamp and '1900-01-01' :: timestamp are both valid 
> timestamps.
>
> So looks strange if user can do
> select uuidv7(TIMESTAMP '2000-01-01')
> but cannot do
> select uuidv7(TIMESTAMP '1900-01-01')



I think that would be okay honestly. I don't think there's any
reasonable value for the uuid when a timestamp is given outside of the
date range that the uuid7 "algorithm" supports.

So +1 for erroring when you provide a timestamp outside of that range
(either too far in the past or too far in the future).
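
For what it's worth, a minimal sketch of such a range check (the function
name is invented for illustration and this is not the patch's code):
unix_ts_ms is an unsigned 48-bit count of milliseconds since the Unix
epoch, so anything outside [0, 2^48 - 1] simply cannot be represented.

#define UUIDV7_MAX_UNIX_TS_MS	((INT64CONST(1) << 48) - 1)

static void
check_uuidv7_timestamp_range(TimestampTz ts)
{
	int64		unix_ms;

	/* convert from the PostgreSQL epoch (2000-01-01) to the Unix epoch, in ms */
	unix_ms = (ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
			   USECS_PER_DAY) / 1000;

	if (unix_ms < 0 || unix_ms > UUIDV7_MAX_UNIX_TS_MS)
		ereport(ERROR,
				(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
				 errmsg("timestamp out of range for UUID version 7")));
}

(Special values such as infinity would of course need separate handling.)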




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

2024-01-24 Thread Thomas Munro
On Thu, Jan 25, 2024 at 8:51 AM Kirk Wolak  wrote:
> getrusage(RUSAGE_SELF, &usage);
> memory_usage_bytes = usage.ru_maxrss * 1024;

FWIW log_statement_stats = on shows that in the logs.  See ShowUsage()
in postgres.c.
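
For anyone reproducing this outside the logs, a self-contained sketch of
the same measurement that ShowUsage() reports as "max resident size" (the
helper name is made up; note that ru_maxrss is kilobytes on Linux but
bytes on macOS, which is what the * 1024 above assumes):

#include <sys/resource.h>

static long
max_rss_bytes(void)
{
	struct rusage usage;

	if (getrusage(RUSAGE_SELF, &usage) != 0)
		return -1;
#ifdef __APPLE__
	return usage.ru_maxrss;				/* already in bytes */
#else
	return usage.ru_maxrss * 1024L;		/* kilobytes -> bytes */
#endif
}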




Re: UUID v7

2024-01-24 Thread Sergey Prokhorenko
"Other people" think that extracting the timestamp from UUIDv7 in violation of 
the new RFC, and generating UUIDv7 from the timestamp were both terrible and 
poorly thought out ideas. The authors of the new RFC had very good reasons to 
prohibit this. And the problems you face are the best confirmation of the 
correctness of the new RFC. It’s better to throw all this gag out of the 
official patch. Don't tempt developers to break the new RFC with these 
error-producing functions.


Sergey prokhorenkosergeyprokhore...@yahoo.com.au 

On Wednesday, 24 January 2024 at 04:30:02 pm GMT+3, Aleksander Alekseev 
 wrote:  
 
 Hi,

> Function to extract timestamp does not provide any guarantees at all. 
> Standard states this, see Kyzer answers upthread.
> Moreover, the standard urges against relying on the assumption that if uuidX 
> was generated before uuidY, then uuidX sorts before uuidY; it does not 
> guarantee that.
> All what is guaranteed is the uniqueness at certain conditions.
>
> > Otherwise you can calculate crc64(X) or sha256(X)
> > internally in order to generate an unique ID and claim that it's fine.
> >
> > Values that violate named invariants should be rejected with an error.
>
> Think about the value that you pass to uuid generation function as an 
> entropy. It’s there to ensure uniqueness and promote ordering (but not 
> guarantee).

If the standard doesn't guarantee something it doesn't mean it forbids
us to give stronger guarantees. I'm convinced that these guarantees
will be useful in real-world applications, at least the ones acting
exclusively within Postgres.

This being said, I understand your point of view too. Let's see what
other people think.
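
(For concreteness, the timestamp extraction being debated amounts to
reading back the leading 48-bit unix_ts_ms field; a sketch with an
invented function name, not the proposed patch's code:)

static TimestampTz
uuidv7_timestamp_sketch(const pg_uuid_t *uuid)
{
	int64		unix_ms = 0;

	/* unix_ts_ms is the first 48 bits, big-endian, per the draft RFC */
	for (int i = 0; i < 6; i++)
		unix_ms = (unix_ms << 8) | uuid->data[i];

	/* milliseconds since the Unix epoch -> PostgreSQL timestamp (usecs) */
	return (TimestampTz) (unix_ms * 1000 -
						  (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * USECS_PER_DAY);
}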

-- 
Best regards,
Aleksander Alekseev
  

Re: Patch: Improve Boolean Predicate JSON Path Docs

2024-01-24 Thread Tom Lane
"David E. Wheeler"  writes:
> In any event, something to do with @@, perhaps to have some compatibility 
> with `jsonb @> jsonb`? I don’t know why @@ was important to have.

Yeah, that's certainly under-explained.  But it seems like I'm not
getting traction for the idea of changing the behavior, so let's
go back to just documenting it.  I spent some time going over your
text and also cleaning up nearby shaky English, and ended with v8
attached.  I'd be content to commit this if it looks good to you.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5030a1045f..99616d2298 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15889,6 +15889,9 @@ table2-mapping


 Does JSON path return any item for the specified JSON value?
+(This is useful only with SQL-standard JSON path expressions, not
+predicate check
+expressions, since those always return a value.)


 '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 2)'
@@ -15903,9 +15906,12 @@ table2-mapping


 Returns the result of a JSON path predicate check for the
-specified JSON value.  Only the first item of the result is taken into
-account.  If the result is not Boolean, then NULL
-is returned.
+specified JSON value.
+(This is useful only
+with predicate
+check expressions, not SQL-standard JSON path expressions,
+since it will return NULL if the path result is
+not a single boolean value.)


 '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2'
@@ -17029,6 +17035,9 @@ ERROR:  value too long for type character(2)

 Checks whether the JSON path returns any item for the specified JSON
 value.
+(This is useful only with SQL-standard JSON path expressions, not
+predicate check
+expressions, since those always return a value.)
 If the vars argument is specified, it must
 be a JSON object, and its fields provide named values to be
 substituted into the jsonpath expression.
@@ -17052,8 +17061,12 @@ ERROR:  value too long for type character(2)


 Returns the result of a JSON path predicate check for the specified
-JSON value.  Only the first item of the result is taken into account.
-If the result is not Boolean, then NULL is returned.
+JSON value.
+(This is useful only
+with predicate
+check expressions, not SQL-standard JSON path expressions,
+since it will either fail or return NULL if the
+path result is not a single boolean value.)
 The optional vars
 and silent arguments act the same as
 for jsonb_path_exists.
@@ -17075,6 +17088,12 @@ ERROR:  value too long for type character(2)

 Returns all JSON items returned by the JSON path for the specified
 JSON value.
+For SQL-standard JSON path expressions it returns the JSON
+values selected from target.
+For predicate
+check expressions it returns the result of the predicate
+check: true, false,
+or null.
 The optional vars
 and silent arguments act the same as
 for jsonb_path_exists.
@@ -17103,9 +17122,8 @@ ERROR:  value too long for type character(2)

 Returns all JSON items returned by the JSON path for the specified
 JSON value, as a JSON array.
-The optional vars
-and silent arguments act the same as
-for jsonb_path_exists.
+The parameters are the same as
+for jsonb_path_query.


 jsonb_path_query_array('{"a":[1,2,3,4,5]}', '$.a[*] ? (@ >= $min && @ <= $max)', '{"min":2, "max":4}')
@@ -17123,11 +17141,10 @@ ERROR:  value too long for type character(2)


 Returns the first JSON item returned by the JSON path for the
-specified JSON value.  Returns NULL if there are no
+specified JSON value, or NULL if there are no
 results.
-The optional vars
-and silent arguments act the same as
-for jsonb_path_exists.
+The parameters are the same as
+for jsonb_path_query.


 jsonb_path_query_first('{"a":[1,2,3,4,5]}', '$.a[*] ? (@ >= $min && @ <= $max)', '{"min":2, "max":4}')
@@ -17266,9 +17283,9 @@ ERROR:  value too long for type character(2)
   
 
   
-   SQL/JSON path expressions specify the items to be retrieved
-   from the JSON data, similar to XPath expressions used
-   for SQL access to XML. In PostgreSQL,
+   SQL/JSON path expressions specify item(s) to be retrieved
+   from a JSON value, similarly to XPath expressions used
+   for access to XML content. In PostgreSQL,
path expressions are implemented as the jsonpath
data type and can use any elements described in
.
@@ -17279,6 +17296,8 @

Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-24 Thread Melanie Plageman
On Wed, Jan 24, 2024 at 2:59 PM Robert Haas  wrote:
>
> On Thu, Jan 18, 2024 at 9:23 PM Melanie Plageman
>  wrote:
> > I have attached a rebased version of the former 0004 as v11-0001.
>
> This looks correct to me, although I wouldn't mind some more eyes on
> it. However, I think that the comments still need more work.
>
> Specifically:
>
>  /*
>   * Prune, freeze, and count tuples.
>   *
>   * Accumulates details of remaining LP_DEAD line pointers on page in
>   * dead_items array.  This includes LP_DEAD line pointers that we
>   * pruned ourselves, as well as existing LP_DEAD line pointers that
>   * were pruned some time earlier.  Also considers freezing XIDs in 
> the
>   * tuple headers of remaining items with storage. It also determines
>   * if truncating this block is safe.
>   */
> -lazy_scan_prune(vacrel, buf, blkno, page,
> -vmbuffer, all_visible_according_to_vm,
> -&has_lpdead_items);
> +if (got_cleanup_lock)
> +lazy_scan_prune(vacrel, buf, blkno, page,
> +vmbuffer, all_visible_according_to_vm,
> +&has_lpdead_items);
>
> I think this comment needs adjusting. Possibly, the preceding calls to
> lazy_scan_noprune() and lazy_scan_new_or_empty() could even use a bit
> better comments, but in those cases, you're basically keeping the same
> code with the same comment, so it's kinda defensible. Here, though,
> you're making the call conditional without any comment update.
>
>  /*
>   * Final steps for block: drop cleanup lock, record free space in the
>   * FSM.
>   *
>   * If we will likely do index vacuuming, wait until
>   * lazy_vacuum_heap_rel() to save free space. This doesn't just save
>   * us some cycles; it also allows us to record any additional free
>   * space that lazy_vacuum_heap_page() will make available in cases
>   * where it's possible to truncate the page's line pointer array.
>   *
> + * Our goal is to update the freespace map the last time we touch the
> + * page. If the relation has no indexes, or if index vacuuming is
> + * disabled, there will be no second heap pass; if this particular
> + * page has no dead items, the second heap pass will not touch this
> + * page. So, in those cases, update the FSM now.
> + *
>   * Note: It's not in fact 100% certain that we really will call
>   * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip 
> index
>   * vacuuming (and so must skip heap vacuuming).  This is deemed okay
>   * because it only happens in emergencies, or when there is very
>   * little free space anyway. (Besides, we start recording free space
>   * in the FSM once index vacuuming has been abandoned.)
>   */
>
> I think this comment needs a rewrite, not just sticking the other
> comment in the middle of it. There's some duplication between these
> two comments, and merging it all together should iron that out.
> Personally, I think my comment (which was there before, this commit
> only moves it here) is clearer than what's already here about the
> intent, but it's lacking some details that are captured in the other
> two paragraphs, and we probably don't want to lose those details.
>
> If you'd like, I can try rewriting these comments to my satisfaction
> and you can reverse-review the result. Or you can rewrite them and
> I'll re-review the result. But I think this needs to be a little less
> mechanical. It's not just about shuffling all the comments around so
> that all the text ends up somewhere -- we also need to consider the
> degree to which the meaning becomes duplicated when it all gets merged
> together.

I will take a stab at rewriting the comments myself first. Usually, I
try to avoid changing comments if the code isn't functionally
different because I know it adds additional review overhead and I try
to reduce that to an absolute minimum. However, I see what you are
saying and agree that it would be better to have actually good
comments instead of frankenstein comments made up of parts that were
previously considered acceptable. I'll have a new version ready by
tomorrow.

- Melanie




Re: UUID v7

2024-01-24 Thread Sergey Prokhorenko
That's right! There is no point in waiting for the official approval of the new 
RFC, which obviously will not change anything. I have been a contributor to 
this RFC for several years, and I can testify that every aspect imaginable has 
been thoroughly researched and agreed upon. Nothing new is going to appear 
in the new RFC.


Sergey prokhorenkosergeyprokhore...@yahoo.com.au 

On Monday, 22 January 2024 at 07:22:32 am GMT+3, Nikolay Samokhvalov 
 wrote:  
 
 On Fri, Jan 19, 2024 at 10:07 AM Andrey Borodin  wrote:



> On 19 Jan 2024, at 13:25, Andrey Borodin  wrote:
> 
> Also, I've added some documentation on all functions.

Here's v12. Changes:
1. Documentation improvements
2. Code comments
3. Better commit message and reviews list


Thank you, Andrey! I have just checked v12 – cleanly applied to HEAD, and 
functions work well. I especially like the fact that we keep 
uuid_extract_time(..) here – this is a great thing to have for time-based 
partitioning, and in many cases we will be able to decide not to have a 
creation timestamp column (e.g., "created_at") at all, saving 8 bytes.
The docs and comments look great too.
Overall, the patch looks mature enough. It would be great to have it in pg17. 
Yes, the RFC is not fully finalized yet, but it's very close. And many 
libraries are already including implementation of UUIDv7 – here are some 
examples:
- https://www.npmjs.com/package/uuidv7
- https://crates.io/crates/uuidv7
- https://github.com/google/uuid/pull/139
Nik  

Current Connection Information

2024-01-24 Thread Maiquel Grassi
Hi,

Would it be viable and appropriate to implement a unified function that 
provides important information about the current connection?

Just an example: "Current Connection Information".

I implemented it in PL/pgSQL to demonstrate the idea, see on GitHub:
https://github.com/maiquelgrassi/DBA-toolkit/blob/main/cluster/dba_whoami_function.sql

Regards,
Maiquel.


Re: Make documentation builds reproducible

2024-01-24 Thread Peter Smith
On Tue, Jan 23, 2024 at 12:32 PM Peter Smith  wrote:
>
> On Tue, Jan 23, 2024 at 12:13 PM Tom Lane  wrote:
> >
> > Peter Smith  writes:
> > > I usually build the HTML documentation locally using the command:
> > > make STYLE=website html
> > > This has been working forever, but seems to have broken due to commit
> > > [1] having an undeclared variable.
> >
> > Interestingly, that still works fine for me, on RHEL8 with
> >
> > docbook-dtds-1.0-69.el8.noarch
> > docbook-style-xsl-1.79.2-9.el8.noarch
> > docbook-style-dsssl-1.79-25.el8.noarch
> >
> > What docbook version are you using?
> >
>
> [postgres@CentOS7-x64 sgml]$ sudo yum list installed | grep docbook
> docbook-dtds.noarch   1.0-60.el7 @anaconda
> docbook-style-dsssl.noarch1.79-18.el7@base
> docbook-style-xsl.noarch  1.78.1-3.el7   @anaconda
>

IIUC these release notes [1] say autolink.index.see has existed since
v1.79.1, but unfortunately that is more recent than my ancient
installed v1.78.1.

From the release notes:
--
Robert Stayton: autolink.index.see.xml

New param to control automatic links in index from see and
seealso to indexterm primary.
--

==
[1] https://docbook.sourceforge.net/release/xsl/1.79.1/RELEASE-NOTES.html

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

2024-01-24 Thread Kirk Wolak
On Wed, Jan 24, 2024 at 4:16 PM Thomas Munro  wrote:

> On Thu, Jan 25, 2024 at 8:51 AM Kirk Wolak  wrote:
> > getrusage(RUSAGE_SELF, &usage);
> > memory_usage_bytes = usage.ru_maxrss * 1024;
>
> FWIW log_statement_stats = on shows that in the logs.  See ShowUsage()
> in postgres.c.
>

Thank you for this. Here is the *TERMINAL* output (below is the tail of the log).
Notice that pg_backend_memory_contexts does NOT show the memory
consumed, but your logging sure did!  (I wonder whether I should enable
logging during planning, but there are something like 82,000 cursors being
opened...)  (This version removed the FETCH and still leaks.)


7:01:08 kwolak@postgres= # *select pg_temp.fx(497);*
NOTICE:  ("9848 kB","10 MB","638 kB")
NOTICE:  ---after close, Count a: 82636, count b: 82636
NOTICE:  ("9997 kB","10 MB","648 kB")
 fx


(1 row)

Time: 525870.117 ms (08:45.870)


*Tail*:

024-01-24 17:01:08.752 EST [28804] DETAIL:  ! system usage stats:
!   0.001792 s user, 0.00 s system, 0.005349 s elapsed
!   [560.535969 s user, 31.441656 s system total]
!   185300 kB max resident size
!   232/0 [29219648/54937864] filesystem blocks in/out
!   0/25 [0/1016519] page faults/reclaims, 0 [0] swaps
!   0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!   10/1 [62671/9660] voluntary/involuntary context switches
2024-01-24 17:01:08.752 EST [28804] STATEMENT:  explain SELECT DISTINCT
seid, fr_field_name, st_field_name
  FROM pg_temp.parts
 WHERE seid <> 497 AND partnum >= '1'
 ORDER BY seid;
2024-01-24 17:01:08.759 EST [28804] LOG:  QUERY STATISTICS
2024-01-24 17:01:08.759 EST [28804] DETAIL:  ! system usage stats:
!   0.006207 s user, 0.92 s system, 0.006306 s elapsed
!   [560.542262 s user, 31.441748 s system total]
!*   185300 kB max resident size*
!   0/0 [29219648/54937864] filesystem blocks in/out
!   0/4 [0/1016523] page faults/reclaims, 0 [0] swaps
!   0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!   0/1 [62672/9661] voluntary/involuntary context switches
2024-01-24 17:01:08.759 EST [28804] STATEMENT:  SELECT 'pg_temp.fx(497); --
Not run, do \dt+ parts';
2024-01-24 17:04:30.844 EST [28746] LOG:  checkpoint starting: time
2024-01-24 17:04:32.931 EST [28746] LOG:  checkpoint complete: wrote 21
buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.008 s,
sync=0.006 s, total=2.087 s; sync files=15, longest=0.001 s, average=0.001
s; distance=98 kB, estimate=134 kB; lsn=0/16304D8, redo lsn=0/1630480
2024-01-24 17:11:06.350 EST [28804] LOG:  QUERY STATISTICS
2024-01-24 17:11:06.350 EST [28804] DETAIL:  ! system usage stats:
!   515.952870 s user, 6.688389 s system, 525.869933 s elapsed
!   [1076.495280 s user, 38.130145 s system total]
!*   708104 kB max resident size*
!   37/3840 [29589648/54941712] filesystem blocks in/out
!   0/327338 [0/1343861] page faults/reclaims, 0 [0] swaps
!   0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!   22001/5216 [84675/14878] voluntary/involuntary context
switches
2024-01-24 17:11:06.350 EST [28804] STATEMENT: * select pg_temp.fx(497);*
2024-01-24 17:12:16.162 EST [28804] LOG:  QUERY STATISTICS
2024-01-24 17:12:16.162 EST [28804] DETAIL:  ! system usage stats:
!   1.130029 s user, 0.007727 s system, 1.157486 s elapsed
!   [1077.625396 s user, 38.137921 s system total]
!   *708104 kB max resident size*
!   992/0 [29590640/54941720] filesystem blocks in/out
!   3/41 [3/1343902] page faults/reclaims, 0 [0] swaps
!   0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!   9/68 [84685/14946] voluntary/involuntary context switches
2024-01-24 17:12:16.162 EST [28804] STATEMENT:  select now();
2024-01-24 17:12:30.944 EST [28804] LOG:  QUERY STATISTICS
2024-01-24 17:12:30.944 EST [28804] DETAIL:  ! system usage stats:
!   0.004561 s user, 0.19 s system, 0.004580 s elapsed
!   [1077.630064 s user, 38.137944 s system total]
!   *708104 kB max resident size*
!   0/0 [29590640/54941728] filesystem blocks in/out
!   0/4 [3/1343906] page faults/reclaims, 0 [0] swaps
!   0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!   0/0 [84686/14947] voluntary/involuntary context switches
2024-01-24 17:12:30.944 EST [28804] STATEMENT:  select now();


Re: Patch: Improve Boolean Predicate JSON Path Docs

2024-01-24 Thread David E. Wheeler
On Jan 24, 2024, at 16:32, Tom Lane  wrote:

> "David E. Wheeler"  writes:
> 
>> In any event, something to do with @@, perhaps to have some compatibility 
>> with `jsonb @> jsonb`? I don’t know why @@ was important to have.
> 
> Yeah, that's certainly under-explained.  But it seems like I'm not
> getting traction for the idea of changing the behavior, so let's
> go back to just documenting it.

Curious about those discussions. On the one hand I find the distinction between 
the two behaviors to be odd, and to produce unexpected results when they’re not 
used in the proper context.

It reminds me of the Perl idea of context, where functions behave differently 
in scalar and list context, and if you expect list behavior in scalar context 
you’re gonna get a surprise. This is a bit of a challenge for those new to the 
language, as they’re not necessarily aware of the context.

> I spent some time going over your
> text and also cleaning up nearby shaky English, and ended with v8
> attached.  I'd be content to commit this if it looks good to you.

This looks very nice, thank you. A couple of comments.

> +  
> +   Predicate check expressions are required in the
> +   @@ operator (and the
> +   jsonb_path_match function), and should not be 
> used
> +   with the @? operator (or the
> +   jsonb_path_exists function).
> +  
> + 
> +

I had this bit here:

  
   Conversely, non-predicate jsonpath expressions should not be
   used with the @@ operator (or the
   jsonb_path_match function).
  

I think it’s important to let people know what the difference is in the 
behavior of the two forms, in every spot it’s likely to come up. SQL-standard 
JSON Path expressions should never be used in contexts (functions, operators) 
only designed to work with predicate check expressions, and the docs should say 
so IMO.

>
> -The lax mode facilitates matching of a JSON document structure and path
> -expression if the JSON data does not conform to the expected schema.
> +The lax mode facilitates matching of a JSON document and path
> +expression when the JSON data does not conform to the expected schema.


What do you think of also dropping the article from all the references to “the 
strict mode” or “the lax mode”, to make them “strict mode” and “lax mode”, 
respectively?

Thanks for the review!

Best,

David





Re: Refactoring backend fork+exec code

2024-01-24 Thread Heikki Linnakangas

On 23/01/2024 21:50, Andres Freund wrote:

On 2024-01-23 21:07:08 +0200, Heikki Linnakangas wrote:

On 22/01/2024 23:07, Andres Freund wrote:

diff --git a/src/backend/utils/activity/backend_status.c 
b/src/backend/utils/activity/backend_status.c
index 1a1050c8da1..92f24db4e18 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -257,17 +257,16 @@ pgstat_beinit(void)
else
{
/* Must be an auxiliary process */
-   Assert(MyAuxProcType != NotAnAuxProcess);
+   Assert(IsAuxProcess(MyBackendType));
/*
 * Assign the MyBEEntry for an auxiliary process.  Since it 
doesn't
 * have a BackendId, the slot is statically allocated based on 
the
-* auxiliary process type (MyAuxProcType).  Backends use slots 
indexed
-* in the range from 0 to MaxBackends (exclusive), so we use
-* MaxBackends + AuxProcType as the index of the slot for an 
auxiliary
-* process.
+* auxiliary process type.  Backends use slots indexed in the 
range
+* from 0 to MaxBackends (exclusive), and aux processes use the 
slots
+* after that.
 */
-   MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+   MyBEEntry = &BackendStatusArray[MaxBackends + MyBackendType - 
FIRST_AUX_PROC];
}


Hm, this seems less than pretty. It's probably ok for now, but it seems like a
better fix might be to just start assigning backend ids to aux procs or switch
to indexing by pgprocno?


Using pgprocno is a good idea. Come to think of it, why do we even have a
concept of backend ID that's separate from pgprocno? backend ID is used to
index the ProcState array, but AFAICS we could use pgprocno as the index to
that, too.


I think we should do that. There are a few processes not participating in
sinval, but it doesn't make enough of a difference to make sinval slower. And
I think there'd be far bigger efficiency improvements to sinvaladt than not
having a handful more entries.


And here we go. BackendID is now a 1-based index directly into the 
PGPROC array.
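
For readers skimming the attached 0001: the GetNumberFromPGProc /
GetPGProcByNumber helpers it relies on are presumably of this shape,
inferred from the hunks below rather than copied from the patch:

/* sketch of the macros the diff hunks below rely on */
#define GetNumberFromPGProc(proc)	((proc) - &ProcGlobal->allProcs[0])
#define GetPGProcByNumber(n)		(&ProcGlobal->allProcs[(n)])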


--
Heikki Linnakangas
Neon (https://neon.tech)
From 87db35c877f4492b98848dfcd0baa49cc8100c1b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 24 Jan 2024 23:15:55 +0200
Subject: [PATCH v7 1/4] Remove superfluous 'pgprocno' field from PGPROC

It was always just the index of the PGPROC entry from the beginning of
the proc array. Introduce a macro to compute it from the pointer
instead.

Reviewed-by: XXX
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407%40iki.fi
---
 src/backend/access/transam/clog.c |  4 ++--
 src/backend/access/transam/twophase.c | 11 +--
 src/backend/access/transam/xlog.c |  2 +-
 src/backend/postmaster/bgwriter.c |  2 +-
 src/backend/postmaster/pgarch.c   |  2 +-
 src/backend/postmaster/walsummarizer.c|  2 +-
 src/backend/storage/buffer/bufmgr.c   |  6 +++---
 src/backend/storage/ipc/procarray.c   |  6 +++---
 src/backend/storage/lmgr/condition_variable.c | 12 ++--
 src/backend/storage/lmgr/lwlock.c |  6 +++---
 src/backend/storage/lmgr/predicate.c  |  2 +-
 src/backend/storage/lmgr/proc.c   |  1 -
 src/include/storage/lock.h|  2 +-
 src/include/storage/proc.h|  6 +-
 14 files changed, 29 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f6e7da7ffc9..7550309c25a 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -458,7 +458,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 		 * less efficiently.
 		 */
 		if (nextidx != INVALID_PGPROCNO &&
-			ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+			GetPGProcByNumber(nextidx)->clogGroupMemberPage != proc->clogGroupMemberPage)
 		{
 			/*
 			 * Ensure that this proc is not a member of any clog group that
@@ -473,7 +473,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 
 		if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
 		   &nextidx,
-		   (uint32) proc->pgprocno))
+		   (uint32) GetNumberFromPGProc(proc)))
 			break;
 	}
 
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8426458f7f5..234c8d08ebc 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -284,7 +284,7 @@ TwoPhaseShmemInit(void)
 			TwoPhaseState->freeGXacts = &gxacts[i];
 
 			/* associate it with a PGPROC assigned by InitProcGlobal */
-			gxacts[i].pgprocno = PreparedXactProcs[i].pgprocno;
+			gxacts[i].pgprocno = GetNumberFromPG

Re: [PATCH] Add native windows on arm64 support

2024-01-24 Thread Michael Paquier
On Wed, Jan 24, 2024 at 06:45:21AM -0500, Dave Cramer wrote:
> I managed to get it to build the vcvarsall arch needs to be x64. I need to
> add some options, but the patch above needs to be applied to build it.

Nice.  If I may ask, what kind of host and/or configuration have you
used to reach a state where the code can be compiled and run tests
with meson?  If you have found specific steps, it may be a good thing
to document that on the wiki, say around [1].

Perhaps you have not included TAP?  It may be fine in terms of runtime
checks and coverage.

[1]: 
https://wiki.postgresql.org/wiki/PostgreSQL_Buildfarm_Howto#Running_on_Windows
--
Michael




Re: A performance issue with Memoize

2024-01-24 Thread David Rowley
On Fri, 20 Oct 2023 at 23:40, Richard Guo  wrote:
> The Memoize runtime stats 'Hits: 0  Misses: 1  Evictions: '
> seems suspicious to me, so I've looked into it a little bit, and found
> that the MemoizeState's keyparamids and its outerPlan's chgParam are
> always different, and that makes us have to purge the entire cache each
> time we rescan the memoize node.
>
> But why are they always different?  Well, for the query above, we have
> two NestLoopParam nodes, one (with paramno 1) is created when we replace
> outer-relation Vars in the scan qual 't1.two = s.two', the other one
> (with paramno 0) is added from the subquery's subplan_params, which was
> created when we replaced uplevel vars with Param nodes for the subquery.
> That is to say, the chgParam would be {0, 1}.
>
> When it comes to replace outer-relation Vars in the memoize keys, the
> two 't1.two' Vars are both replaced with the NestLoopParam with paramno
> 1, because it is the first NLP we see in root->curOuterParams that is
> equal to the Vars in memoize keys.  That is to say, the memoize node's
> keyparamids is {1}.

I see the function calls were put this way around in 5ebaaa494
(Implement SQL-standard LATERAL subqueries.), per:

@ -1640,6 +1641,7 @@ create_subqueryscan_plan(PlannerInfo *root, Path
*best_path,
{
scan_clauses = (List *)
replace_nestloop_params(root, (Node *) scan_clauses);
+   identify_nestloop_extparams(root, best_path->parent->subplan);
}

(identify_nestloop_extparams was later renamed to
process_subquery_nestloop_params in 46c508fbc.)

I think fixing it your way makes sense.  I don't really see any reason
why we should have two. However...

Another way it *could* be fixed would be to get rid of pull_paramids()
and change create_memoize_plan() to set keyparamids to all the param
IDs whose expressions are equal() to each of the param_exprs.  That way nodeMemoize
wouldn't purge the cache as we'd know the changing param is accounted
for in the cache.  For the record, I don't think that's better, but it
scares me a bit less as I don't know what other repercussions there
are of applying your patch to get rid of the duplicate
NestLoopParam.paramval.

I'd feel better about doing it your way if Tom could comment on if
there was a reason he put the function calls that way around in
5ebaaa494.

I also feel like we might be getting a bit close to the minor version
releases to be adjusting this stuff in back branches.

David




Re: Documentation to upgrade logical replication cluster

2024-01-24 Thread Peter Smith
Here are some review comments for patch v3.

==
doc/src/sgml/ref/pgupgrade.sgml

1.
+
+  
+   This page does not cover steps to upgrade logical replication
clusters, refer
+for details on upgrading
+   logical replication clusters.
+  
+

I felt that maybe this note was misplaced. Won't it be better to put
this down in the "Usage" section of this page?

BEFORE
These are the steps to perform an upgrade with pg_upgrade:

SUGGESTION (or something like this)
Below are the steps to perform an upgrade with pg_upgrade.

Note, the steps to upgrade logical replication clusters are not
covered here; refer to 
for details.

~~~

2.
Configure the servers for log shipping.  (You do not need to run
pg_backup_start() and
pg_backup_stop()
or take a file system backup as the standbys are still synchronized
-   with the primary.)  Only logical slots on the primary are copied to the
-   new standby, but other slots on the old standby are not copied so must
-   be recreated manually.
+   with the primary.)  In version 17.0 or later, only logical slots on the
+   primary are copied to the new standby, but other slots on the
old standby
+   are not copied so must be recreated manually.
   

This para was still unclear to me. What is version 17.0 referring to
-- the old_cluster version? Do you mean something like:
If the old cluster is < v17 then logical slots are not copied. If the
old_cluster is >= v17 then...

==
doc/src/sgml/logical-replication.sgml

3.
+   
+While upgrading a subscriber, write operations can be performed in the
+publisher, these changes will be replicated to the subscriber once the
+subscriber upgradation is completed.
+   

3a.
/publisher, these changes/publisher. These changes/

~

3b.
"upgradation" ??. See [1]

maybe just /upgradation/upgrade/

~~~

4. GENERAL - prompts/paths

I noticed in v3 you removed all the cmd prompts like:
dba@node1:/opt/PostgreSQL/postgres/17/bin$
dba@node2:/opt/PostgreSQL/postgres/18/bin$
etc.

I thought those were helpful to disambiguate which server/version was
being operated on. I wonder if there is some way to keep that information
but not make it look like a real current directory, which is what
Kuroda-san did not like:

e.g. Maybe something like the below is possible?

(dba@node1: v17) pg_upgrade...
(dba@node2: v18) pg_upgrade...

==
[1] 
https://english.stackexchange.com/questions/192187/upgradation-not-universally-accepted#:~:text=Not%20all%20dictionaries%20(or%20native,by%20most%20non%2DIE%20speakers.

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Sequence Access Methods, round two

2024-01-24 Thread Michael Paquier
On Tue, Jan 23, 2024 at 10:58:50AM +0100, Peter Eisentraut wrote:
> On 18.01.24 16:54, Matthias van de Meent wrote:
> The proposed sequence AM patch would support a different nextval function,
> but does it support additional parameters?  I haven't found that.

Yes and no.  Yes as in "the patch set can support it" and no as in
"the patch does not implement that yet".  You could do two things
here:
- Add support for reloptions for sequences.  The patch does not
include that on purpose because it already covers a lot of ground, and
that did not look like a strict requirement to me as a first shot.  It
can be implemented on top of the patch set.  That's not technically
complicated, actually, but there are some shenanigans to discuss with
the heap relation used under the hood by a sequence for the in-core
method or any other sequence AM that would need a sequence.
- Control that with GUCs defined in the AM, which may be weird, still
enough at relation level.  And enough with the current patch set.

reloptions would make the most sense to me here, I assume, to ease how we
handle nextval().

> Another use case I have wished for from time to time is creating sequences
> using different data types, for example uuids.  You'd just need to provide a
> data-type-specific "next" function.  However, in this patch, all the values
> and state are hardcoded to int64.

Yeah, because all the cases I've seen would be happy with being able
to map a result to 8 bytes with a controlled computation method.  The
size of the output generated, the set of data types that can be
supported by a table AM and the OID/name of the SQL function in charge
of retrieving the value could be controlled in the callbacks
themselves, and this would require a design of the callbacks.  The
thing is that you *will* need callbacks and an AM layer to be able to
achieve that.  I agree this can be useful.  Now this is a separate
clause in the SEQUENCE DDLs, so it sounds to me like an entirely
different feature.

FWIW, MSSQL has a concept of custom data types for one, though these
need to map to integers (see user-defined_integer_type).

Another thing is the SQL specification.  You or Vik will very likely
correct me here, but the spec mentions that sequences need to work on
integer values.  A USING clause means that we already diverge from it,
perhaps it is OK to diverge more.  How about DDL properties like
Min/Max or increment, then?

> I think the proposed patch covers too broad a range of abstraction levels.
> The use cases described above are very high level and are just concerned
> with how you get the next value.  The current internal sequence state would
> be stored in whatever way it is stored now.  But this patch also includes
> callbacks for very low-level-seeming concepts like table AMs and
> persistence.  Those seem like different things.  And the two levels should
> be combinable.  Maybe I want a local sequence of uuids or a global sequence
> of uuids, or a local sequence of integers or a global sequence of integers.
> I mean, I haven't thought this through, but I get the feeling that there
> should be more than one level of API around this.

That's a tricky question, and I don't really know how far this needs
to go.  FWIW, like table AMs I don't want the callbacks to be set in
stone across major releases.  Now I am worrying a bit about designing
callbacks that are generic but still impact performance because they
require more catalog lookups and/or function pointer manipulations for
the default cases.  Separating the computation and the in-core SQL
functions in a cleaner way is a step that helps in any case, IMO,
though I agree that the design of the callbacks influences how much is
exposed to users and AM developers.  Having only a USING clause that
gives support to integer-based results while providing a way to force
the computation is useful.  Custom data types that can be plugged into
the callbacks are also useful, but they are going to require an AM
callback layer so that an AM can decide what it needs to do with the
data type given by the user in input of CREATE SEQUENCE.
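
To make that last point concrete, here is a purely hypothetical sketch,
with invented names throughout and no relation to the posted patch's
actual API, of what a data-type-aware callback set might have to
advertise:

typedef struct SequenceAMRoutine
{
	NodeTag		type;

	/* data type of the values this AM produces (hypothetical) */
	Oid			(*result_type) (void);

	/* compute and return the next value for this sequence (hypothetical) */
	Datum		(*nextval) (Relation seqrel);

	/* validate DDL properties such as MINVALUE/MAXVALUE/INCREMENT (hypothetical) */
	void		(*check_options) (List *options);
} SequenceAMRoutine;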
--
Michael




Re: LLVM 18

2024-01-24 Thread Thomas Munro
On Wed, Jan 3, 2024 at 6:04 PM Thomas Munro  wrote:
> LLVM 16 provided a new function name[1], and LLVM 18 (not shipped yet)
> has started complaining[2] about the old spelling.
>
> Here's a patch.

And pushed.

Just in case anyone else is confused by this, be aware that they've
changed their numbering scheme.  The 18.1 schedule visible on llvm.org
doesn't imply that 18.0 has already shipped, it's just that they've
decided to start at X.1.

By the way, while testing on my Debian system with apt.llvm.org
packages, I discovered that we crash with its latest llvm-18 package,
namely:

llvm-18_1%3a18~++20240122112312+ad01447d30ed-1~exp1~20240122112329.478_amd64.deb

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x7f033e73f5f8 in llvm::InlineFunction(llvm::CallBase&,
llvm::InlineFunctionInfo&, bool, llvm::AAResults*, bool,
llvm::Function*) () from /lib/x86_64-linux-gnu/libLLVM-18.so.1

... so I re-confirmed that I wasn't hallucinating and it did work
before I disappeared for the holidays by downgrading to the one before
that from my /var/cache/apt/archives, namely:

llvm-18_1%3a18~++20231218112348+a4deb14e353c-1~exp1~20231218112405.407_amd64.deb

So I built the tip of their release/18.x branch so I could try to get
some more information out of my debugger and perhaps their assertions,
but it worked.  So I have to assume that something was broken at their
commit ad01447d30ed and has been fixed in the past few days, but I
didn't have time to dig further, and will re-check a bit later when a
fresh package shows up.




Re: Synchronizing slots from primary to standby

2024-01-24 Thread Peter Smith
Here are some review comments for the patch v67-0001.

==
1.
There are a couple of places checking for failover usage on a standby.

+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot enable failover for a replication slot"
+" created on the standby"));

and

+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot enable failover for a replication slot"
+" on the standby"));

IMO the conditions should be written the other way around (failover &&
RecoveryInProgress()) to avoid an unnecessary function call when the
'failover' flag is false, which will mostly be the case since false is the default.
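
i.e., something like the following, so the common failover = false case
short-circuits before calling RecoveryInProgress():

 if (failover && RecoveryInProgress())
 ereport(ERROR,
 errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot enable failover for a replication slot"
" created on the standby"));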

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-24 Thread Melanie Plageman
On Wed, Jan 24, 2024 at 4:34 PM Melanie Plageman
 wrote:
>
> On Wed, Jan 24, 2024 at 2:59 PM Robert Haas  wrote:
...
> > If you'd like, I can try rewriting these comments to my satisfaction
> > and you can reverse-review the result. Or you can rewrite them and
> > I'll re-review the result. But I think this needs to be a little less
> > mechanical. It's not just about shuffling all the comments around so
> > that all the text ends up somewhere -- we also need to consider the
> > degree to which the meaning becomes duplicated when it all gets merged
> > together.
>
> I will take a stab at rewriting the comments myself first. Usually, I
> try to avoid changing comments if the code isn't functionally
> different because I know it adds additional review overhead and I try
> to reduce that to an absolute minimum. However, I see what you are
> saying and agree that it would be better to have actually good
> comments instead of frankenstein comments made up of parts that were
> previously considered acceptable. I'll have a new version ready by
> tomorrow.

v12 attached has my attempt at writing better comments for this
section of lazy_scan_heap().

Above the combined FSM update code, I have written a comment that is a
revised version of your comment from above the lazy_scan_noprune() FSM
update code but with some of the additional details from the previous
comment above the lazy_scan_prune() FSM update code.

The one part that I did not incorporate was the point about how
sometimes we think we'll do a second pass on the block, so we don't
update the FSM, but then we end up not doing that pass, and it's all okay.

* Note: It's not in fact 100% certain that we really will call
* lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index
* vacuuming (and so must skip heap vacuuming).  This is deemed okay
* because it only happens in emergencies, or when there is very
* little free space anyway. (Besides, we start recording free space
* in the FSM once index vacuuming has been abandoned.)

I didn't incorporate it because I wasn't sure I understood the
situation. I can imagine us skipping updating the FSM after
lazy_scan_prune() because there are indexes on the relation and dead
items on the page and we think we'll do a second pass. Later, we end
up triggering a failsafe vacuum or, somehow, there are still too few
TIDs for the second pass, so we update do_index_vacuuming to false.
Then we wouldn't ever record this block's free space in the FSM. That
seems fine (which is what the comment says). So, what does the last
sentence mean? "Besides, we start recording..."

- Melanie
From fabf79ed6f5866634a324f740e2eb0a5530b3286 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 16 Jan 2024 17:28:07 -0500
Subject: [PATCH v12] Combine FSM updates for prune and no prune cases

lazy_scan_prune() and lazy_scan_noprune() update the freespace map with
identical conditions -- both with the goal of doing so once for each
page vacuumed. Combine both of their FSM updates. This consolidation is
easier now that cb970240f13df2 moved visibility map updates into
lazy_scan_prune().

While combining the FSM updates, simplify the logic for calling
lazy_scan_new_or_empty() and lazy_scan_noprune().

Discussion: https://postgr.es/m/CA%2BTgmoaLTvipm%3Dxx4rJLr07m908PCu%3DQH3uCjD1UOn8YaEuO2g%40mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c | 133 +++
 1 file changed, 52 insertions(+), 81 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0fb3953513d..432db4698bc 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -838,6 +838,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		Page		page;
 		bool		all_visible_according_to_vm;
 		bool		has_lpdead_items;
+		bool		got_cleanup_lock = false;
 
 		if (blkno == next_unskippable_block)
 		{
@@ -931,63 +932,41 @@ lazy_scan_heap(LVRelState *vacrel)
 		 */
 		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
 
+		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ vacrel->bstrategy);
+		page = BufferGetPage(buf);
+
 		/*
 		 * We need a buffer cleanup lock to prune HOT chains and defragment
 		 * the page in lazy_scan_prune.  But when it's not possible to acquire
 		 * a cleanup lock right away, we may be able to settle for reduced
 		 * processing using lazy_scan_noprune.
 		 */
-		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- vacrel->bstrategy);
-		page = BufferGetPage(buf);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			LockBuffer(buf, BUFFER_LOCK_SHARE);
+		got_cleanup_lock = ConditionalLockBufferForCleanup(buf);
 
-			/* Check for new or empty pages before lazy_scan_noprune call */
-			if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
-	   vmbuffer))
-			{
-/* Processed as new/empty page (lock and pin released) */
-continue;
-			}
-
-			/*
-			 * Collect LP_DEAD items in dead_items

Re: Add tuples_skipped to pg_stat_progress_copy

2024-01-24 Thread torikoshia

On 2024-01-24 17:05, Masahiko Sawada wrote:
On Tue, Jan 23, 2024 at 1:02 AM torikoshia  
wrote:


On 2024-01-17 14:47, Masahiko Sawada wrote:
> On Wed, Jan 17, 2024 at 2:22 PM torikoshia 
> wrote:
>>
>> Hi,
>>
>> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
>> skip malformed data, but there is no way to watch the number of
>> skipped
>> rows during COPY.
>>
>> Attached patch adds tuples_skipped to pg_stat_progress_copy, which
>> counts the number of skipped tuples because source data is malformed.
>> If SAVE_ERROR_TO is not specified, this column remains zero.
>>
>> The advantage would be that users can quickly notice and stop COPYing
>> when there is a larger amount of skipped data than expected, for
>> example.
>>
>> As described in commit log, it is expected to add more choices for
>> SAVE_ERROR_TO like 'log' and using such options may enable us to know
>> the number of skipped tuples during COPY, but exposed in
>> pg_stat_progress_copy would be easier to monitor.
>>
>>
>> What do you think?
>
> +1
>
> The patch is pretty simple. Here is a comment:
>
> +   (if SAVE_ERROR_TO is specified, otherwise
> zero).
> +  
> + 
>
> To be precise, this counter only advances when a value other than
> 'ERROR' is specified to SAVE_ERROR_TO option.

Thanks for your comment and review!

Updated the patch according to your comment and option name change by
b725b7eec.


Thanks! The patch looks good to me. I'm going to push it tomorrow,
barring any objections.


Thanks!



BTW, based on this patch, I think we can add another option which
specifies the maximum tolerable number of malformed rows.
I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.
Attached 0002 is WIP patch for this(I haven't added doc yet).


Yeah, it could be a good option.


This may be better discussed in another thread, but any comments(e.g.
necessity of this option, option name) are welcome.


I'd recommend forking a new thread for this option. As far as I
remember, there also was an opinion that "reject limit" stuff is not
very useful.


OK, I'll make another thread for this.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2024 at 9:13 PM Melanie Plageman
 wrote:
> I didn't incorporate it because I wasn't sure I understood the
> situation. I can imagine us skipping updating the FSM after
> lazy_scan_prune() because there are indexes on the relation and dead
> items on the page and we think we'll do a second pass. Later, we end
> up triggering a failsafe vacuum or, somehow, there are still too few
> TIDs for the second pass, so we update do_index_vacuuming to false.
> Then we wouldn't ever record this block's free space in the FSM. That
> seems fine (which is what the comment says). So, what does the last
> sentence mean? "Besides, we start recording..."

It means: when the failsafe kicks in, from that point on we won't do
any more heap vacuuming. Clearly any pages that still need to be
scanned at that point won't ever be processed by
lazy_vacuum_heap_rel(). So from that point on we should record the
free space in every scanned heap page in the "first heap pass" --
including pages that have LP_DEAD stubs that aren't going to be made
LP_UNUSED in the ongoing VACUUM.
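
In code terms, the combined FSM update being discussed boils down to
roughly this shape (a paraphrase of the v12 logic, not a quote of it);
once do_index_vacuuming flips to false after the failsafe, every later
page falls into the record-it-now branch:

	if (got_cleanup_lock && vacrel->nindexes > 0 && has_lpdead_items &&
		vacrel->do_index_vacuuming)
	{
		/* Wait until lazy_vacuum_heap_rel() to save the free space. */
		UnlockReleaseBuffer(buf);
	}
	else
	{
		Size		freespace = PageGetHeapFreeSpace(page);

		UnlockReleaseBuffer(buf);
		RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
	}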

-- 
Peter Geoghegan




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-24 Thread jian he
On Wed, Jan 24, 2024 at 10:17 PM Sutou Kouhei  wrote:
>
> I uploaded my benchmark script so that you can run the same
> benchmark on your machine:
>
> https://gist.github.com/kou/be02e02e5072c91969469dbf137b5de5
>
> Could anyone try the benchmark with master and master+0001?
>

Sorry, I made a mistake: I applied all the v6 patches, 0001 to 0008.

my tests:
CREATE unlogged TABLE data (a bigint);
SELECT setseed(0.29);
INSERT INTO data SELECT random() * 1 FROM generate_series(1, 1e7);

my setup:
meson setup --reconfigure ${BUILD} \
-Dprefix=${PG_PREFIX} \
-Dpgport=5462 \
-Dbuildtype=release \
-Ddocs_html_style=website \
-Ddocs_pdf=disabled \
-Dllvm=disabled \
-Dextra_version=_release_build

gcc version:  PostgreSQL 17devel_release_build on x86_64-linux,
compiled by gcc-11.4.0, 64-bit

apply your patch:
COPY data TO '/dev/null' WITH (FORMAT csv) \watch count=5
Time: 668.996 ms
Time: 596.254 ms
Time: 592.723 ms
Time: 591.663 ms
Time: 590.803 ms

not apply your patch, at git 729439607ad210dbb446e31754e8627d7e3f7dda
COPY data TO '/dev/null' WITH (FORMAT csv) \watch count=5
Time: 644.246 ms
Time: 583.075 ms
Time: 568.670 ms
Time: 569.463 ms
Time: 569.201 ms

I forgot to test other formats.




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-24 Thread Michael Paquier
On Wed, Jan 24, 2024 at 11:17:26PM +0900, Sutou Kouhei wrote:
> In <10025bac-158c-ffe7-fbec-32b426291...@dunslane.net>
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Wed, 24 Jan 2024 07:15:55 -0500,
>   Andrew Dunstan  wrote:
>> We've spent quite a lot of blood sweat and tears over the years to make COPY
>> fast, and we should not sacrifice any of that lightly.

Clearly.

> I uploaded my benchmark script so that you can run the same
> benchmark on your machine:
> 
> https://gist.github.com/kou/be02e02e5072c91969469dbf137b5de5

Thanks, that saves time.  I am attaching it to this email as well, for
the sake of the archives if this link is removed in the future.

> Could anyone try the benchmark with master and master+0001?

Yep.  It is one point we need to settle before deciding what to do
with this patch set, and I've done so to reach my own conclusion.

I have a rather good machine at my disposal in the cloud, so I did a
few runs with HEAD and HEAD+0001, with PGDATA mounted on a tmpfs.
Here are some results for the 10M row case, as these should be the
least prone to noise, 5 runs each: 

master
text 10M  1732.570 1684.542 1693.430 1687.696 1714.845
csv 10M   1729.113 1724.926 1727.414 1726.237 1728.865
bin 10M   1679.097 1677.887 1676.764 1677.554 1678.120

master+0001
text 10M  1702.207 1654.818 1647.069 1690.568 1654.446
csv 10M   1764.939 1714.313 1712.444 1712.323 1716.952
bin 10M   1703.061 1702.719 1702.234 1703.346 1704.137

Hmm.  The point of contention in the patch is the change to use the
CopyToOneRow callback in CopyOneRowTo(), as we go through it for each
row and we should have a worst-case scenario with a relation that has
a small attribute size.  The more rows, the more effect it would have.
The memory context switches and the StringInfo manipulations are
equally important, and there are a bunch of the latter, actually, with
optimizations around fe_msgbuf.

I've repeated a few runs across these two builds, and there is some
variance and noise, but I am going to agree with your point that the
effect of 0001 cannot be seen.  Even HEAD is showing some noise.  So I am
discarding the concerns I had after seeing the numbers you posted
upthread.

+typedef bool (*CopyToProcessOption_function) (CopyToState cstate, DefElem 
*defel);
+typedef int16 (*CopyToGetFormat_function) (CopyToState cstate);
+typedef void (*CopyToStart_function) (CopyToState cstate, TupleDesc tupDesc);
+typedef void (*CopyToOneRow_function) (CopyToState cstate, TupleTableSlot 
*slot);
+typedef void (*CopyToEnd_function) (CopyToState cstate);

We don't really need a set of typedefs here, let's put the definitions
in the CopyToRoutine struct instead.

+extern CopyToRoutine CopyToRoutineText;
+extern CopyToRoutine CopyToRoutineCSV;
+extern CopyToRoutine CopyToRoutineBinary;

All that should IMO remain in copyto.c and copyfrom.c in the initial
patch doing the refactoring.  Why not use a fetch function instead
that takes a string as input?  Then you can call that once after
parsing the List of options in ProcessCopyOptions().
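
Something along these lines, for instance (a sketch of the suggested
shape, not a finished proposal; the routine variables would then stay
private to copyto.c):

typedef struct CopyToRoutine
{
	bool		(*CopyToProcessOption) (CopyToState cstate, DefElem *defel);
	int16		(*CopyToGetFormat) (CopyToState cstate);
	void		(*CopyToStart) (CopyToState cstate, TupleDesc tupDesc);
	void		(*CopyToOneRow) (CopyToState cstate, TupleTableSlot *slot);
	void		(*CopyToEnd) (CopyToState cstate);
} CopyToRoutine;

/* CopyToRoutineText/CSV/Binary would be static variables in copyto.c */
static const CopyToRoutine *
CopyToGetRoutine(const char *format)
{
	if (strcmp(format, "text") == 0)
		return &CopyToRoutineText;
	else if (strcmp(format, "csv") == 0)
		return &CopyToRoutineCSV;
	else if (strcmp(format, "binary") == 0)
		return &CopyToRoutineBinary;

	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("COPY format \"%s\" not recognized", format)));
	return NULL;				/* keep compiler quiet */
}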

Introducing copyapi.h in the initial patch makes sense here for the TO
and FROM routines.

+/* All "text" and "csv" options are parsed in ProcessCopyOptions(). We may
+ * move the code to here later. */
Some areas, like this comment, are written in an incorrect format.

+if (cstate->opts.csv_mode)
+CopyAttributeOutCSV(cstate, colname, false,
+list_length(cstate->attnumlist) == 1);
+else
+CopyAttributeOutText(cstate, colname);

You are right that this is not worth the trouble of creating a
different set of callbacks for CSV.  This makes the result cleaner.

+getTypeBinaryOutputInfo(attr->atttypid, &out_func_oid, &isvarlena);
+fmgr_info(out_func_oid, &cstate->out_functions[attnum - 1]);

Actually, this split is interesting.  It is possible for a custom
format to plug in a custom set of out functions.  Did you make use of
something custom for your own stuff?  Actually, could it make sense to
split the assignment of cstate->out_functions into its own callback?
Sure, that's part of the start phase, but at least it would make clear
that a custom method *has* to assign these OIDs to work.  The patch
implies that as a rule, without a comment that CopyToStart *must* set
up these OIDs.

I think that 0001 and 0005 should be handled first, as pieces
independent of the rest.  Then we could move on with 0002~0004 and
0006~0008.
--
Michael
#!/bin/bash

set -eu
set -o pipefail

base_dir=$(dirname "$0")

prepare_sql()
{
  size=$1
  db_name=$2

  cat <



Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-24 Thread Michael Paquier
On Thu, Jan 25, 2024 at 10:53:58AM +0800, jian he wrote:
> apply your patch:
> COPY data TO '/dev/null' WITH (FORMAT csv) \watch count=5
> Time: 668.996 ms
> Time: 596.254 ms
> Time: 592.723 ms
> Time: 591.663 ms
> Time: 590.803 ms
> 
> not apply your patch, at git 729439607ad210dbb446e31754e8627d7e3f7dda
> COPY data TO '/dev/null' WITH (FORMAT csv) \watch count=5
> Time: 644.246 ms
> Time: 583.075 ms
> Time: 568.670 ms
> Time: 569.463 ms
> Time: 569.201 ms
> 
> I forgot to test other formats.

There can be some variance in the tests, so you'd better run many more
tests so that you can get a better idea of the mean.  Discarding the N
highest and lowest values also reduces slightly the effects of the
noise you would get across single runs.
--
Michael




Re: dblink query interruptibility

2024-01-24 Thread Fujii Masao
On Thu, Jan 25, 2024 at 5:45 AM Noah Misch  wrote:
>
> On Thu, Jan 25, 2024 at 04:23:39AM +0900, Fujii Masao wrote:
> > I found that this patch was committed at d3c5f37dd5 and changed the
> > error message in postgres_fdw slightly. Here's an example:
> >
> > #1. Begin a new transaction.
> > #2. Execute a query accessing to a foreign table, like SELECT * FROM
> > 
> > #3. Terminate the *remote* session corresponding to the foreign table.
> > #4. Commit the transaction, and then currently the following error
> > message is output.
> >
> > ERROR:  FATAL:  terminating connection due to administrator command
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> >invalid socket
> >
> > Previously, before commit d3c5f37dd5, the error message at #4 did not
> > include "invalid socket." Now, after the commit, it does. Is this
> > change intentional?
>
> No.  It's a consequence of an intentional change in libpq call sequence, but I
> was unaware I was changing an error message.
>
> > + /* Consume whatever data is available from the socket */
> > + if (PQconsumeInput(conn) == 0)
> > + {
> > + /* trouble; expect PQgetResult() to return NULL */
> > + break;
> > + }
> > + }
> > +
> > + /* Now we can collect and return the next PGresult */
> > + return PQgetResult(conn);
> >
> > This code appears to cause the change. When the remote session ends,
> > PQconsumeInput() returns 0 and marks conn->socket as invalid.
> > Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
> > and appending "invalid socket" to the error message.
> >
> > I think the "invalid socket" message is unsuitable in this scenario,
> > and PQgetResult() should not be called after PQconsumeInput() returns
> > 0. Thought?
>
> The documentation is absolute about the necessity of PQgetResult():

The documentation looks unclear to me about what should be done
when PQconsumeInput() returns 0, so I'm not sure that PQgetResult()
must be called even in that case.

As far as I can tell from functions like libpqrcv_PQgetResult() that use
PQconsumeInput(), they basically report the error message using
PQerrorMessage(), without calling PQgetResult(),
when PQconsumeInput() returns 0.
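
For reference, the shape I have in mind is roughly the following (a plain
libpq client sketch, not a proposed patch; backend code would wait on the
socket with WaitLatchOrSocket() rather than select()):

#include <stdio.h>
#include <sys/select.h>
#include "libpq-fe.h"

/*
 * Collect the next result, but if PQconsumeInput() fails, report
 * PQerrorMessage() and do not call PQgetResult() on the dead connection.
 */
static PGresult *
get_result_checked(PGconn *conn)
{
    int         sock = PQsocket(conn);

    while (PQisBusy(conn))
    {
        fd_set      read_mask;

        FD_ZERO(&read_mask);
        FD_SET(sock, &read_mask);
        if (select(sock + 1, &read_mask, NULL, NULL, NULL) < 0)
            continue;           /* interrupted, retry */

        if (PQconsumeInput(conn) == 0)
        {
            fprintf(stderr, "could not receive data: %s",
                    PQerrorMessage(conn));
            return NULL;
        }
    }

    return PQgetResult(conn);
}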

Regards,

-- 
Fujii Masao




Re: Synchronizing slots from primary to standby

2024-01-24 Thread Amit Kapila
On Wed, Jan 24, 2024 at 5:17 PM shveta malik  wrote:
>
> PFA v67. Note that the GUC (enable_syncslot) name is unchanged. Once
> we have final agreement on the name, we can make the change in the
> next version.
>
> Changes in v67 are:
>
> 1) Addressed comments by Peter given in [1].
> 2) Addressed comments by Swada-San given in [2].
> 3) Removed syncing 'failover' on standby from remote_slot. The
> 'failover' field will be false for synced slots. Since we do not
> support sync to cascading standbys yet, thus failover=true was
> misleading and unused there.
>

But what will happen after the standby is promoted? After promotion,
ideally, it should have failover enabled, so that the slots can be
synced. Also, note that corresponding subscriptions still have the
failover flag enabled. I think we should copy the 'failover' option
for the synced slots.

-- 
With Regards,
Amit Kapila.




Re: Guiding principle for dropping LLVM versions?

2024-01-24 Thread Thomas Munro
Thanks all for the discussion.  Pushed.  A few build farm animals will
now fail in the configure step as discussed, and need some adjustment
(ie disable LLVM or upgrade to LLVM 10+ for the master branch).

Next year I think we should be able to do a much bigger cleanup, by
moving to LLVM 14+.




Re: Remove redundant HandleWalWriterInterrupts()

2024-01-24 Thread Fujii Masao
On Thu, Jan 25, 2024 at 4:40 AM Nathan Bossart  wrote:
>
> On Wed, Jan 24, 2024 at 07:30:17PM +0530, Bharath Rupireddy wrote:
> > On Wed, Jan 24, 2024 at 5:50 PM Fujii Masao  wrote:
> >> Because of commit 1bdd54e662, the code of HandleWalWriterInterrupts()
> >> became the same as HandleMainLoopInterrupts(). So I'd like to propose to
> >> remove HandleWalWriterInterrupts() and make walwriter use
> >> HandleMainLoopInterrupts() instead for improved code simplicity. Thought?
> >>
> >> Patch attached.
> >
> > Nice catch. Indeed they both are the same after commit 1bdd54e662. The
> > patch LGTM.
>
> LGTM, too.

Thank you both for reviewing! I've pushed the patch.

Regards,

-- 
Fujii Masao




RE: Synchronizing slots from primary to standby

2024-01-24 Thread Zhijie Hou (Fujitsu)
On Wednesday, January 24, 2024 1:11 PM Masahiko Sawada  
wrote:
> Here are random comments on slotsyncworker.c (v66):

Thanks for the comments:

> 
> ---
> +   elog(ERROR,
> +"cannot synchronize local slot \"%s\" LSN(%X/%X)"
> +" to remote slot's LSN(%X/%X) as synchronization"
> +" would move it backwards", remote_slot->name,
> 
> Many error messages in slotsync.c are splitted into several lines, but I 
> think it
> would reduce the greppability when the user looks for the error message in the
> source code.

Thanks for the suggestion! We combined most of the messages in the new version
of the patch. Some messages, including the one above, were kept split because
they are too long (> 120 columns including the indent) to fit on the screen,
so I feel it's better to keep them split.

Best Regards,
Hou zj


Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-24 Thread Masahiko Sawada
On Wed, Jan 24, 2024 at 11:17 PM Sutou Kouhei  wrote:
>
> Hi,
>
> In <10025bac-158c-ffe7-fbec-32b426291...@dunslane.net>
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Wed, 24 Jan 2024 07:15:55 -0500,
>   Andrew Dunstan  wrote:
>
> >
> > On 2024-01-24 We 03:11, Michael Paquier wrote:
> >> On Wed, Jan 24, 2024 at 02:49:36PM +0900, Sutou Kouhei wrote:
> >>> For COPY TO:
> >>>
> >>> 0001: This adds CopyToRoutine and use it for text/csv/binary
> >>> formats. No implementation change. This just move codes.
> >> 10M without this change:
> >>
> >>  format,elapsed time (ms)
> >>  text,1090.763
> >>  csv,1136.103
> >>  binary,1137.141
> >>
> >> 10M with this change:
> >>
> >>  format,elapsed time (ms)
> >>  text,1082.654
> >>  csv,1196.991
> >>  binary,1069.697
> >>
> >> These numbers point out that binary is faster by 6%, csv is slower by
> >> 5%, while text stays around what looks like noise range.  That's not
> >> negligible.  Are these numbers reproducible?  If they are, that could
> >> be a problem for anybody doing bulk-loading of large data sets.  I am
> >> not sure to understand where the improvement for binary comes from by
> >> reading the patch, but perhaps perf would tell more for each format?
> >> The loss with csv could be blamed on the extra manipulations of the
> >> function pointers, likely.
> >
> >
> > I don't think that's at all acceptable.
> >
> > We've spent quite a lot of blood sweat and tears over the years to make COPY
> > fast, and we should not sacrifice any of that lightly.
>
> These numbers aren't reproducible. Because these benchmarks
> executed on my normal machine not a machine only for
> benchmarking. The machine runs another processes such as
> editor and Web browser.
>
> For example, here are some results with master
> (94edfe250c6a200d2067b0debfe00b4122e9b11e):
>
> Format,N records,Elapsed time (ms)
> csv,1000,1073.715
> csv,1000,1022.830
> csv,1000,1073.584
> csv,1000,1090.651
> csv,1000,1052.259
>
> Here are some results with master + the 0001 patch:
>
> Format,N records,Elapsed time (ms)
> csv,1000,1025.356
> csv,1000,1067.202
> csv,1000,1014.563
> csv,1000,1032.088
> csv,1000,1058.110
>
>
> I uploaded my benchmark script so that you can run the same
> benchmark on your machine:
>
> https://gist.github.com/kou/be02e02e5072c91969469dbf137b5de5
>
> Could anyone try the benchmark with master and master+0001?
>

I've run a similar scenario:

create unlogged table test (a int);
insert into test select c from generate_series(1, 2500) c;
copy test to '/tmp/result.csv' with (format csv); -- generates 230MB file

I've run it on HEAD and HEAD+0001 patch and here are the medians of 10
executions for each format:

HEAD:
binary 2930.353 ms
text 2754.852 ms
csv 2890.012 ms

HEAD w/ 0001 patch:
binary 2814.838 ms
text 2900.845 ms
csv 3015.210 ms

Hmm, I can see a similar trend to the one Sutou-san had; the binary format got
slightly faster whereas both the text and csv formats have a small regression
(4%~5%). I think that the improvement for binary came from the fact
that we removed "if (cstate->opts.binary)" branches from the original
CopyOneRowTo(). I've experimented with a similar optimization for csv
and text format; have different callbacks for text and csv format and
remove "if (cstate->opts.csv_mode)" branches. I've attached a patch
for that. Here are results:

HEAD w/ 0001 patch + remove branches:
binary 2824.502 ms
text 2715.264 ms
csv 2803.381 ms

The numbers look better now. I'm not sure these are within a noise
range but it might be worth considering having different callbacks for
text and csv formats.
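
To spell the idea out, the split essentially turns the shared per-row
routine into two callbacks along these lines (simplified sketch, not the
exact contents of the attached patch; memory context and end-of-row
handling omitted):

static void
CopyToTextOneRow(CopyToState cstate, TupleTableSlot *slot)
{
    bool        need_delim = false;
    ListCell   *cur;

    /* Make sure the tuple is fully deconstructed */
    slot_getallattrs(slot);

    foreach(cur, cstate->attnumlist)
    {
        int         attnum = lfirst_int(cur);
        Datum       value = slot->tts_values[attnum - 1];
        bool        isnull = slot->tts_isnull[attnum - 1];

        if (need_delim)
            CopySendChar(cstate, cstate->opts.delim[0]);
        need_delim = true;

        if (isnull)
            CopySendString(cstate, cstate->opts.null_print_client);
        else
        {
            char       *string;

            string = OutputFunctionCall(&cstate->out_functions[attnum - 1],
                                        value);
            /* no per-row "if (cstate->opts.csv_mode)" check anymore */
            CopyAttributeOutText(cstate, string);
        }
    }
}

The CSV variant is identical except that the non-null branch calls
CopyAttributeOutCSV(), so the per-row format check disappears entirely.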

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


add_callback_for_csv_format.patch
Description: Binary data


Re: UUID v7

2024-01-24 Thread Nikolay Samokhvalov
On Wed, Jan 24, 2024 at 1:52 PM Sergey Prokhorenko <
sergeyprokhore...@yahoo.com.au> wrote:

> That's right! There is no point in waiting for the official approval of
> the new RFC, which obviously will not change anything. I have been a
> contributor to this RFC
> 
> for several years, and I can testify that every aspect imaginable has been
> thoroughly researched and agreed upon. Nothing new will definitely appear
> in the new RFC.
>

From a practical point of view, these two things are extremely important to
have to support partitioning. It is better to implement limitations than
throw them away.

Without them, this functionality will be of very limited use in
databases. We need to think about large tables – which means partitioning.

Nik


Re: UUID v7

2024-01-24 Thread Nikolay Samokhvalov
On Wed, Jan 24, 2024 at 8:40 PM Nikolay Samokhvalov  wrote:

> On Wed, Jan 24, 2024 at 1:52 PM Sergey Prokhorenko <
> sergeyprokhore...@yahoo.com.au> wrote:
>
>> That's right! There is no point in waiting for the official approval of
>> the new RFC, which obviously will not change anything. I have been a
>> contributor to this RFC
>> 
>> for several years, and I can testify that every aspect imaginable has been
>> thoroughly researched and agreed upon. Nothing new will definitely
>> appear in the new RFC.
>>
>
> From a practical point of view, these two things are extremely important
> to have to support partitioning. It is better to implement limitations than
> throw them away.
>
> Without them, this functionality will be of very limited use in
> databases. We need to think about large tables – which means partitioning.
>

apologies -- this was a response to another email from you:

> "Other people" think that extracting the timestamp from UUIDv7 in
violation of the new RFC, and generating UUIDv7 from the timestamp were
both terrible and poorly thought out ideas. The authors of the new RFC had
very good reasons to prohibit this. And the problems you face are the best
confirmation of the correctness of the new RFC. It’s better to throw all
this gag out of the official patch. Don't tempt developers to break the new
RFC with these error-producing functions.


Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-24 Thread Michael Paquier
On Thu, Jan 25, 2024 at 01:36:03PM +0900, Masahiko Sawada wrote:
> Hmm, I can see a similar trend to the one Sutou-san had; the binary format got
> slightly faster whereas both the text and csv formats have a small regression
> (4%~5%). I think that the improvement for binary came from the fact
> that we removed "if (cstate->opts.binary)" branches from the original
> CopyOneRowTo(). I've experimented with a similar optimization for csv
> and text format; have different callbacks for text and csv format and
> remove "if (cstate->opts.csv_mode)" branches. I've attached a patch
> for that. Here are results:
> 
> HEAD w/ 0001 patch + remove branches:
> binary 2824.502 ms
> text 2715.264 ms
> csv 2803.381 ms
> 
> The numbers look better now. I'm not sure these are within a noise
> range but it might be worth considering having different callbacks for
> text and csv formats.

Interesting.

Your numbers imply a 0.3% speedup for text, 0.7% speedup for csv and
0.9% speedup for binary, which may be around the noise range assuming
a ~1% range.  While this does not imply a regression, that seems worth
the duplication IMO.  The patch had better document the reason why the
split is done, as well.

CopyFromTextOneRow() also has its specific branches for binary and
non-binary removed in 0005, so, assuming that I/O is not a bottleneck,
the operation would be faster because we would not evaluate this "if"
condition for each row.  Wouldn't we also see improvements for COPY
FROM with short row values, say when mounting PGDATA into a
tmpfs/ramfs?
--
Michael


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2024-01-24 Thread Peter Smith
Here are some review comments for v67-0002.

==
src/backend/replication/logical/slotsync.c

1.
+/* The sleep time (ms) between slot-sync cycles varies dynamically
+ * (within a MIN/MAX range) according to slot activity. See
+ * wait_for_slot_activity() for details.
+ */
+#define MIN_WORKER_NAPTIME_MS  200
+#define MAX_WORKER_NAPTIME_MS  3 /* 30s */
+
+static long sleep_ms = MIN_WORKER_NAPTIME_MS;

In my previous review for this, I meant for there to be no whitespace
between the #defines and the static long sleep_ms, so that the prior
comment clearly belongs to all 3 lines.
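
i.e. keep them together under the one comment, like:

/*
 * The sleep time (ms) between slot-sync cycles varies dynamically
 * (within a MIN/MAX range) according to slot activity. See
 * wait_for_slot_activity() for details.
 */
#define MIN_WORKER_NAPTIME_MS  200
#define MAX_WORKER_NAPTIME_MS  30000    /* 30s */
static long sleep_ms = MIN_WORKER_NAPTIME_MS;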

~~~

2. synchronize_one_slot

+ /*
+ * Sanity check: Make sure that concerned WAL is received and flushed
+ * before syncing slot to target lsn received from the primary server.
+ *
+ * This check should never pass as on the primary server, we have waited
+ * for the standby's confirmation before updating the logical slot.
+ */
+ latestFlushPtr = GetStandbyFlushRecPtr(NULL);
+ if (remote_slot->confirmed_lsn > latestFlushPtr)
+ {
+ ereport(LOG,
+ errmsg("skipping slot synchronization as the received slot sync"
+" LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X",
+LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
+remote_slot->name,
+LSN_FORMAT_ARGS(latestFlushPtr)));
+
+ return false;
+ }

Previously, in v65, this was an elog, but now it is an ereport. Since
this is a sanity-check condition that "should never pass", wasn't
elog the more appropriate choice?

~~~

3. synchronize_one_slot

+ /*
+ * We don't want any complicated code while holding a spinlock, so do
+ * namestrcpy() and get_database_oid() outside.
+ */
+ namestrcpy(&plugin_name, remote_slot->plugin);
+ dbid = get_database_oid(remote_slot->database, false);

IMO just simplify the whole comment, here and for the other similar
comment in local_slot_update().

SUGGESTION
/* Avoid expensive operations while holding a spinlock. */

~~~

4. synchronize_slots

+ /* Construct the remote_slot tuple and synchronize each slot locally */
+ tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
+ while (tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
+ {
+ bool isnull;
+ RemoteSlot *remote_slot = palloc0(sizeof(RemoteSlot));
+ Datum d;
+
+ remote_slot->name = TextDatumGetCString(slot_getattr(tupslot, 1, &isnull));
+ Assert(!isnull);
+
+ remote_slot->plugin = TextDatumGetCString(slot_getattr(tupslot, 2, &isnull));
+ Assert(!isnull);
+
+ /*
+ * It is possible to get null values for LSN and Xmin if slot is
+ * invalidated on the primary server, so handle accordingly.
+ */
+ d = slot_getattr(tupslot, 3, &isnull);
+ remote_slot->confirmed_lsn = isnull ? InvalidXLogRecPtr :
+ DatumGetLSN(d);
+
+ d = slot_getattr(tupslot, 4, &isnull);
+ remote_slot->restart_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d);
+
+ d = slot_getattr(tupslot, 5, &isnull);
+ remote_slot->catalog_xmin = isnull ? InvalidTransactionId :
+ DatumGetTransactionId(d);
+
+ remote_slot->two_phase = DatumGetBool(slot_getattr(tupslot, 6, &isnull));
+ Assert(!isnull);
+
+ remote_slot->database = TextDatumGetCString(slot_getattr(tupslot,
+ 7, &isnull));
+ Assert(!isnull);
+
+ d = slot_getattr(tupslot, 8, &isnull);
+ remote_slot->invalidated = isnull ? RS_INVAL_NONE :
+ get_slot_invalidation_cause(TextDatumGetCString(d));

Would it be better to get rid of the hardwired column numbers and then
be able to use the SLOTSYNC_COLUMN_COUNT already defined as a sanity
check?

SUGGESTION
int col = 0;
...
remote_slot->name = TextDatumGetCString(slot_getattr(tupslot, ++col, &isnull));
...
remote_slot->plugin = TextDatumGetCString(slot_getattr(tupslot, ++col,
&isnull));
...
d = slot_getattr(tupslot, ++col, &isnull);
remote_slot->confirmed_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d);
...
d = slot_getattr(tupslot, ++col, &isnull);
remote_slot->restart_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d);
...
d = slot_getattr(tupslot, ++col, &isnull);
remote_slot->catalog_xmin = isnull ? InvalidTransactionId :
DatumGetTransactionId(d);
...
remote_slot->two_phase = DatumGetBool(slot_getattr(tupslot, ++col, &isnull));
...
remote_slot->database = TextDatumGetCString(slot_getattr(tupslot,
++col, &isnull));
...
d = slot_getattr(tupslot, ++col, &isnull);
remote_slot->invalidated = isnull ? RS_INVAL_NONE :
get_slot_invalidation_cause(TextDatumGetCString(d));

/* Sanity check */
Assert(col == SLOTSYNC_COLUMN_COUNT);

~~~

5.
+static char *
+validate_parameters_and_get_dbname(void)
+{
+ char*dbname;

These are configuration issues, so probably all these ereports could
also set errcode(ERRCODE_INVALID_PARAMETER_VALUE).
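
i.e. each of them becoming something along the lines of (the message text
here is only a placeholder):

ereport(ERROR,
        errcode(ERRCODE_INVALID_PARAMETER_VALUE),
        errmsg("slot synchronization requires \"%s\" to be set",
               "primary_conninfo"));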

==
Kind Regards,
Peter Smith.
Fujitsu Australia



