Re: pg_basebackup has an accidentally separated help message

2023-12-25 Thread Kyotaro Horiguchi
At Mon, 25 Dec 2023 15:42:41 +0900, Michael Paquier wrote:
> On Mon, Dec 25, 2023 at 02:39:16PM +0900, Kyotaro Horiguchi wrote:
> > The attached patch contains both of the above fixes.
> 
> Good catches, let's fix them.  You have noticed that while translating
> these new messages, I guess?

Yes. So it turns out that such mistakes are found only after they have
been committed.

Because handling a large volume of translations all at once is
daunting, I am maintaining translations locally to avoid that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: May be BUG. Periodic burst growth of the checkpoint_req counter on replica.

2023-12-25 Thread Anton A. Melnikov

On 25.12.2023 02:38, Alexander Korotkov wrote:


Pushed!


Thanks a lot!

With the best regards!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Trigger violates foreign key constraint

2023-12-25 Thread Pavel Luzanov

On 22.12.2023 14:39, Laurenz Albe wrote:

Yes, that is better - shorter and avoids the passive voice.  Changed.


Thanks.


Also I don't really like "This is not considered a bug" part, since it
looks like an excuse.

In a way, it is an excuse, so why not be honest about it.


I still think that the "this is not a bug" style is not suitable for 
documentation.

But I checked the documentation and found 3-4 more such places.
Therefore, it's OK with me, especially since it really does look like a bug.

The example you provided in your other message (cascading triggers
fail if the table owner has revoked the required permissions from
herself) is not really about breaking foreign keys.  You hit a
surprising error, but referential integrity will be maintained.


Yes, referential integrity will be maintained. But perhaps
it is worth adding a section to the documentation about system triggers
that are used to implement foreign keys. Currently they are not mentioned
anywhere, but questions and problems arise from time to time.
Such a section, named "System triggers", could be added as a separate chapter
to Part VII, Internals, or as a subsection of Chapter 38, Triggers.

I thought about this after your recent excellent article [1],
which has an introduction to system triggers.

This does not negate the need for the patch being discussed.


Patch v3 is attached.


For me, it is ready for committer.

1. https://www.cybertec-postgresql.com/en/broken-foreign-keys-postgresql/

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com


Proposal to include --exclude-extension Flag in pg_dump

2023-12-25 Thread Ayush Vatsa
Hi PostgreSQL Community,
Recently I have been working with pg_dump for a project and wanted to
exclude an extension from the generated dump. I wonder why it doesn't have
--exclude-extension support when --extension exists!
Since I needed that support, I took the initiative to contribute to the
community by adding the --exclude-extension flag.
Attached is the patch for the same. Looking forward to your feedback.

Regards
Ayush Vatsa
Amazon Web Services (AWS)


v1-0001-Add-support-for-exclude-extension-in-pg_dump.patch
Description: Binary data


Re: Proposal to include --exclude-extension Flag in pg_dump

2023-12-25 Thread Ayush Vatsa
Added a CF entry for the same https://commitfest.postgresql.org/46/4721/

Regards
Ayush Vatsa
Amazon Web Services (AWS)

On Mon, 25 Dec 2023 at 15:48, Ayush Vatsa  wrote:

> Hi PostgreSQL Community,
> Recently I have been working on pg_dump regarding my project and wanted to
> exclude an extension from the dump generated. I wonder why it doesn't have
> --exclude-extension type of support whereas --extension exists!
> Since I needed that support, I took the initiative to contribute to the
> community by adding the --exclude-extension flag.
> Attached is the patch for the same. Looking forward to your feedback.
>
> Regards
> Ayush Vatsa
> Amazon Web services (AWS)
>


Re: Proposal to include --exclude-extension Flag in pg_dump

2023-12-25 Thread Junwang Zhao
Hi

On Mon, Dec 25, 2023 at 6:22 PM Ayush Vatsa  wrote:
>
> Added a CF entry for the same https://commitfest.postgresql.org/46/4721/
>
> Regards
> Ayush Vatsa
> Amazon Web Services (AWS)
>
> On Mon, 25 Dec 2023 at 15:48, Ayush Vatsa  wrote:
>>
>> Hi PostgreSQL Community,
>> Recently I have been working on pg_dump regarding my project and wanted to 
>> exclude an extension from the dump generated. I wonder why it doesn't have 
>> --exclude-extension type of support whereas --extension exists!
>> Since I needed that support, I took the initiative to contribute to the 
>> community by adding the --exclude-extension flag.
>> Attached is the patch for the same. Looking forward to your feedback.
>>
>> Regards
>> Ayush Vatsa
>> Amazon Web services (AWS)

  printf(_("  -e, --extension=PATTERN  dump the specified
extension(s) only\n"));
+ printf(_("  --exclude-extension=PATTERN  do NOT dump the specified
extension(s)\n"));
  printf(_("  -E, --encoding=ENCODING  dump the data in encoding
ENCODING\n"));

long options should not mess with short options, does the following
make sense to you?

 printf(_(" --enable-row-security enable row security (dump only
content user has\n"
 "access to)\n"));
+ printf(_("  --exclude-extension=PATTERN  do NOT dump the specified
extension(s)\n"));
 printf(_(" --exclude-table-and-children=PATTERN\n"

-- 
Regards
Junwang Zhao




Re: Avoid computing ORDER BY junk columns unnecessarily

2023-12-25 Thread Richard Guo
On Sat, Dec 23, 2023 at 1:32 AM Tom Lane  wrote:

> Heikki Linnakangas  writes:
> > On 22/12/2023 17:24, Tom Lane wrote:
> >> How much of your patchset still makes sense if we assume that we
> >> can always extract the ORDER BY column values from the index?
>
> > That would make it much less interesting. But I don't think that's a
> > good assumption. Especially in the kNN case, the ORDER BY value would
> > not be stored in the index. Most likely the index needs to calculate it
> > in some form, but it might take shortcuts like avoiding the sqrt().
>
> Yeah, fair point.  I'll try to take a look at your patchset after
> the holidays.


Agreed.

I haven't looked into these patches, but it seems that there is an issue
with how the targetlist is handled for foreign rels.  The following test
case for postgres_fdw hits the Assert in apply_tlist_labeling().

contrib_regression=# SELECT c3, c4 FROM ft1 ORDER BY c3, c1 LIMIT 1;
server closed the connection unexpectedly

When we create foreign final path in add_foreign_final_paths(), we use
root->upper_targets[UPPERREL_FINAL] as the PathTarget.  This PathTarget
is set in grouping_planner(), without removing the junk columns.  I
think this is why the above query hits the Assert.

Thanks
Richard


Re: Optimization outcome depends on the index order

2023-12-25 Thread Alexander Korotkov
On Fri, Dec 22, 2023 at 7:24 PM Andrei Lepikhov wrote:
> On 22/12/2023 11:48, Alexander Korotkov wrote:
> >  > Because we must trust all predictions made by the planner, we just
> >  > choose the most trustworthy path. According to the planner logic, it is
> >  > a path with a smaller selectivity. We can make mistakes anyway just
> >  > because of the nature of estimation.
> >
> > Even if we need to take selectivity into account here, it's still not
> > clear why this should be on top of other logic later in add_path().
> I got your point now, thanks for pointing it out. In the next version of
> the patch selectivity is used as a criteria only in the case of COSTS_EQUAL.

It looks better now.  But it's hard for me to judge these heuristics
in add_path().  Tom, what do you think about this?

--
Regards,
Alexander Korotkov




Re: Proposal to include --exclude-extension Flag in pg_dump

2023-12-25 Thread Ayush Vatsa
Hi,
> long options should not mess with short options, does the following
> make sense to you?
Yes, that makes sense; a reason to keep them together is that they are of
the same kind, but I will update the patch accordingly.

One more thing I wanted to ask: should I separate them in the
pg_dump.sgml file too, i.e., document --exclude-extension together with
the other long options?

After your feedback I will update both places in a single patch.

Regards,
Ayush Vatsa
Amazon Web Services (AWS)

On Mon, 25 Dec 2023 at 16:28, Junwang Zhao  wrote:

> Hi
>
> On Mon, Dec 25, 2023 at 6:22 PM Ayush Vatsa 
> wrote:
> >
> > Added a CF entry for the same https://commitfest.postgresql.org/46/4721/
> >
> > Regards
> > Ayush Vatsa
> > Amazon Web Services (AWS)
> >
> > On Mon, 25 Dec 2023 at 15:48, Ayush Vatsa 
> wrote:
> >>
> >> Hi PostgreSQL Community,
> >> Recently I have been working on pg_dump regarding my project and wanted
> to exclude an extension from the dump generated. I wonder why it doesn't
> have --exclude-extension type of support whereas --extension exists!
> >> Since I needed that support, I took the initiative to contribute to the
> community by adding the --exclude-extension flag.
> >> Attached is the patch for the same. Looking forward to your feedback.
> >>
> >> Regards
> >> Ayush Vatsa
> >> Amazon Web services (AWS)
>
>   printf(_("  -e, --extension=PATTERN  dump the specified
> extension(s) only\n"));
> + printf(_("  --exclude-extension=PATTERN  do NOT dump the specified
> extension(s)\n"));
>   printf(_("  -E, --encoding=ENCODING  dump the data in encoding
> ENCODING\n"));
>
> long options should not mess with short options, does the following
> make sense to you?
>
>  printf(_(" --enable-row-security enable row security (dump only
> content user has\n"
>  "access to)\n"));
> + printf(_("  --exclude-extension=PATTERN  do NOT dump the specified
> extension(s)\n"));
>  printf(_(" --exclude-table-and-children=PATTERN\n"
>
> --
> Regards
> Junwang Zhao
>


Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-12-25 Thread Amul Sul
On Mon, Dec 18, 2023 at 3:01 PM Peter Eisentraut wrote:

> On 11.12.23 13:22, Amul Sul wrote:
> >
> > create table t1 (a int, b int generated always as (a + 1) stored);
> > alter table t1 add column c int, alter column b set expression as (a
> > + c);
> > ERROR:  42703: column "c" does not exist
> >
> > I think intuitively, this ought to work.  Maybe just moving the new
> > pass
> > after AT_PASS_ADD_COL would do it.
> >
> >
> > I think we can't support that (like alter type) since we need to place
> > this new
> > pass before AT_PASS_OLD_INDEX & AT_PASS_OLD_CONSTR to re-add indexes and
> > constraints for the validation.
>
> Could we have AT_PASS_ADD_COL before AT_PASS_OLD_*?  So overall it would be
>
> ...
> AT_PASS_ALTER_TYPE,
> AT_PASS_ADD_COL, // moved
> AT_PASS_SET_EXPRESSION,  // new
> AT_PASS_OLD_INDEX,
> AT_PASS_OLD_CONSTR,
> AT_PASS_ADD_CONSTR,
> ...
>
> This appears to not break any existing tests.
>

(Sorry for the delay.)

Agreed. I made that change in the 0001 patch.

Regards,
Amul


v7-0001-Code-refactor-convert-macro-listing-to-enum.patch
Description: Binary data


v7-0002-Code-refactor-separate-function-to-find-all-depen.patch
Description: Binary data


v7-0003-Allow-to-change-generated-column-expression.patch
Description: Binary data


Re: Show WAL write and fsync stats in pg_stat_io

2023-12-25 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review and the feedback in your previous reply!

On Mon, 25 Dec 2023 at 09:40, Michael Paquier  wrote:
>
> On Mon, Dec 25, 2023 at 03:20:58PM +0900, Michael Paquier wrote:
> > pgstat_tracks_io_bktype() does not look correct to me.  Why is the WAL
> > receiver considered as something correct in the list of backend types,
> > while the intention is to *not* add it to pg_stat_io?  I have tried to
> > switch to the correct behavior of returning false for a
> > B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
> > freezes on its shutdown sequence.  Something weird is going on here.
> > Could you look at it?  See the XXX comment in the attached, which is
> > the same behavior as v6-0002.  It looks to me that the patch has
> > introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
> > incorrect way to avoid the root issue.
>
> Ah, that's because it would trigger an assertion failure:
> TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
>  io_context, io_op)"), File: "pgstat_io.c", Line: 89, PID: 6824
> postgres: standby_local: walreceiver
> (ExceptionalCondition+0xa8)[0x560d1b4dd38a]
>
> And the backtrace just tells that this is the WAL receiver
> initializing a WAL segment:
> #5  0x560d1b3322c8 in pgstat_count_io_op_n
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> cnt=1) at pgstat_io.c:89
> #6  0x560d1b33254a in pgstat_count_io_op_time
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> start_time=..., cnt=1) at pgstat_io.c:181
> #7  0x560d1ae7f932 in XLogFileInitInternal (logsegno=3, logtli=1,
> added=0x7ffd2733c6eb, path=0x7ffd2733c2e0 "pg_wal/0001", '0'
> <repeats ...>, "3") at xlog.c:3115
> #8  0x560d1ae7fc4e in XLogFileInit (logsegno=3, logtli=1) at
> xlog.c:3215

Correct.

>
> Wouldn't it be simpler to just bite the bullet in this case and handle
> WAL receivers in the IO tracking?

There is one problem and I couldn't decide how to solve it. We need to
handle read IO in WALRead() in xlogreader.c. How many bytes the
WALRead() function will read is controlled by a variable and it can be
different from XLOG_BLCKSZ. This is a problem because pg_stat_io's
op_bytes column is a constant.

Here are all WALRead() function calls:

1- read_local_xlog_page_guts() in xlogutils.c => WALRead(XLOG_BLCKSZ)
=> always reads XLOG_BLCKSZ.

2- summarizer_read_local_xlog_page() in walsummarizer.c =>
WALRead(XLOG_BLCKSZ) => always reads XLOG_BLCKSZ.

3- logical_read_xlog_page() in walsender.c => WALRead(XLOG_BLCKSZ) =>
always reads XLOG_BLCKSZ.

4- XLogSendPhysical() in walsender.c => WALRead(nbytes) =>  nbytes can
be different from XLOG_BLCKSZ.

5- WALDumpReadPage() in pg_waldump.c => WALRead(count) => count can be
different from XLOG_BLCKSZ.

4 and 5 are the problematic calls.

Melanie's answer to this problem on previous discussions:

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman  wrote:
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

Using op_bytes as 1 solves this problem, but since it would be different
from the rest of the pg_stat_io view, it could be hard to understand.
And there is no clear separation by backend type, as can be seen from the
walsender.
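
To make the trade-off concrete, here is a rough sketch (not a proposed
patch) of what a problematic call site could do if op_bytes were 1 and
the counters tracked bytes. The function name, the error handling, and
the IOCONTEXT_NORMAL choice are my assumptions; the pgstat_* calls are
the ones visible in the backtrace above:

#include "postgres.h"
#include "pgstat.h"

/* Hypothetical sketch: account a variable-sized WAL read in bytes,
 * assuming op_bytes == 1 for IOOBJECT_WAL rows. */
static void
wal_read_counted(int fd, char *buf, size_t nbytes, off_t offset)
{
    instr_time  io_start = pgstat_prepare_io_time();
    ssize_t     rb = pg_pread(fd, buf, nbytes, offset);

    if (rb < 0 || (size_t) rb != nbytes)
        elog(ERROR, "could not read WAL: %m");

    /* cnt is a byte count here, not a page count */
    pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
                            io_start, (uint32) rb);
}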

>
> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
> treat op_bytes * number of reads as an approximation of the number of
> bytes read. I don't actually know what makes more sense. I don't think I
> would like having a number for bytes that is not accurate.

Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
The pg_pread() call tries to read XLOG_BLCKSZ bytes, but that is not
guaranteed, and we don't count the IO if it couldn't read XLOG_BLCKSZ.
IMO this is not as important as the previous problem, but it still is a
problem.

I would be glad to hear opinions on these problems.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-25 Thread Michail Nikolaev
Hello!

It seems like the idea of "old" snapshot is still a valid one.

> Should this deal with any potential XID wraparound, too?

As far as I understand, in our case we are not affected by this in any way.
Vacuum on our table is not possible because of locking, so nothing
can be frozen (see below).
In the case of a very long index build, transaction-ID limits will
stop new connections through the current regular infrastructure,
because it is based on relation data (not on the actual xmin of
backends).

> How does this behave when the newly inserted tuple's xmin gets frozen?
> This would be allowed to happen during heap page pruning, afaik - no
> rules that I know of which are against that - but it would create
> issues where normal snapshot visibility rules would indicate it
> visible to both snapshots regardless of whether it actually was
> visible to the older snapshot when that snapshot was created...

As far as I can see, heap_page_prune never freezes any tuples.
In a regular vacuum it is used this way: call heap_page_prune,
then heap_prepare_freeze_tuple, and then
heap_freeze_execute_prepared.

Merry Christmas,
Mikhail.




Re: apply pragma system_header to python headers

2023-12-25 Thread Tom Lane
I wrote:
> Probably a better way is to invent a separate header "plpython_system.h"
> that just includes the Python headers, to scope the pragma precisely.
> (I guess it could have the fixup #defines we're wrapping those headers
> in, too.)

> The same idea would work in plperl.

After updating one of my test machines to Fedora 39, I'm seeing these
Python warnings too.  So here's a fleshed-out patch proposal for doing
it like that.
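
For orientation, the guts of the new wrapper header are just the pragma
plus the Perl includes, roughly like this (a simplified sketch of the
idea, not the file in the patch; the real header also carries the fixup
#defines we currently wrap those headers in):

/* plperl_system.h (sketch): pull in the Perl headers with the
 * system_header pragma scoped to this file only. */
#ifndef PL_PERL_SYSTEM_H
#define PL_PERL_SYSTEM_H

#ifdef HAVE_PRAGMA_GCC_SYSTEM_HEADER
#pragma GCC system_header
#endif

#define PERL_NO_GET_CONTEXT
#include "EXTERN.h"
#include "perl.h"

#endif                          /* PL_PERL_SYSTEM_H */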

regards, tom lane

diff --git a/src/pl/plperl/GNUmakefile b/src/pl/plperl/GNUmakefile
index 975f540b3e..31fb212bf1 100644
--- a/src/pl/plperl/GNUmakefile
+++ b/src/pl/plperl/GNUmakefile
@@ -108,11 +108,11 @@ uninstall: uninstall-lib uninstall-data
 
 install-data: installdirs
 	$(INSTALL_DATA) $(addprefix $(srcdir)/, $(DATA)) '$(DESTDIR)$(datadir)/extension/'
-	$(INSTALL_DATA) $(srcdir)/plperl.h $(srcdir)/ppport.h '$(DESTDIR)$(includedir_server)'
+	$(INSTALL_DATA) $(srcdir)/plperl.h $(srcdir)/plperl_system.h $(srcdir)/ppport.h '$(DESTDIR)$(includedir_server)'
 
 uninstall-data:
 	rm -f $(addprefix '$(DESTDIR)$(datadir)/extension'/, $(notdir $(DATA)))
-	rm -f $(addprefix '$(DESTDIR)$(includedir_server)'/, plperl.h ppport.h)
+	rm -f $(addprefix '$(DESTDIR)$(includedir_server)'/, plperl.h plperl_system.h ppport.h)
 
 .PHONY: install-data uninstall-data
 
diff --git a/src/pl/plperl/meson.build b/src/pl/plperl/meson.build
index 6a28f40a11..182e0fe169 100644
--- a/src/pl/plperl/meson.build
+++ b/src/pl/plperl/meson.build
@@ -71,6 +71,7 @@ install_data(
 
 install_headers(
   'plperl.h',
+  'plperl_system.h',
   'ppport.h',
   install_dir: dir_include_server,
 )
diff --git a/src/pl/plperl/plperl.h b/src/pl/plperl/plperl.h
index 5d8e51089a..7f338084a9 100644
--- a/src/pl/plperl/plperl.h
+++ b/src/pl/plperl/plperl.h
@@ -18,200 +18,11 @@
 /* defines free() by way of system headers, so must be included before perl.h */
 #include "mb/pg_wchar.h"
 
-/* stop perl headers from hijacking stdio and other stuff on Windows */
-#ifdef WIN32
-#define WIN32IO_IS_STDIO
-#endif			/* WIN32 */
-
 /*
- * Supply a value of PERL_UNUSED_DECL that will satisfy gcc - the one
- * perl itself supplies doesn't seem to.
+ * Pull in Perl headers via a wrapper header, to control the scope of
+ * the system_header pragma therein.
  */
-#define PERL_UNUSED_DECL pg_attribute_unused()
-
-/*
- * Sometimes perl carefully scribbles on our *printf macros.
- * So we undefine them here and redefine them after it's done its dirty deed.
- */
-#undef vsnprintf
-#undef snprintf
-#undef vsprintf
-#undef sprintf
-#undef vfprintf
-#undef fprintf
-#undef vprintf
-#undef printf
-
-/*
- * Perl scribbles on the "_" macro too.
- */
-#undef _
-
-/*
- * ActivePerl 5.18 and later are MinGW-built, and their headers use GCC's
- * __inline__.  Translate to something MSVC recognizes. Also, perl.h sometimes
- * defines isnan, so undefine it here and put back the definition later if
- * perl.h doesn't.
- */
-#ifdef _MSC_VER
-#define __inline__ inline
-#ifdef isnan
-#undef isnan
-#endif
-/* Work around for using MSVC and Strawberry Perl >= 5.30. */
-#define __builtin_expect(expr, val) (expr)
-#endif
-
-/*
- * Regarding bool, both PostgreSQL and Perl might use stdbool.h or not,
- * depending on configuration.  If both agree, things are relatively harmless.
- * If not, things get tricky.  If PostgreSQL does but Perl does not, define
- * HAS_BOOL here so that Perl does not redefine bool; this avoids compiler
- * warnings.  If PostgreSQL does not but Perl does, we need to undefine bool
- * after we include the Perl headers; see below.
- */
-#ifdef PG_USE_STDBOOL
-#define HAS_BOOL 1
-#endif
-
-/*
- * Newer versions of the perl headers trigger a lot of warnings with our
- * compiler flags (at least -Wdeclaration-after-statement,
- * -Wshadow=compatible-local are known to be problematic). The system_header
- * pragma hides warnings from within the rest of this file, if supported.
- */
-#ifdef HAVE_PRAGMA_GCC_SYSTEM_HEADER
-#pragma GCC system_header
-#endif
-
-/*
- * Get the basic Perl API.  We use PERL_NO_GET_CONTEXT mode so that our code
- * can compile against MULTIPLICITY Perl builds without including XSUB.h.
- */
-#define PERL_NO_GET_CONTEXT
-#include "EXTERN.h"
-#include "perl.h"
-
-/*
- * We want to include XSUB.h only within .xs files, because on some platforms
- * it undesirably redefines a lot of libc functions.  But it must appear
- * before ppport.h, so use a #define flag to control inclusion here.
- */
-#ifdef PG_NEED_PERL_XSUB_H
-/*
- * On Windows, win32_port.h defines macros for a lot of these same functions.
- * To avoid compiler warnings when XSUB.h redefines them, #undef our versions.
- */
-#ifdef WIN32
-#undef accept
-#undef bind
-#undef connect
-#undef fopen
-#undef fstat
-#undef kill
-#undef listen
-#undef lstat
-#undef mkdir
-#undef open
-#undef putenv
-#undef recv
-#undef rename
-#undef select
-#undef send
-#undef socket
-#undef stat
-#undef unlink
-#endif
-
-#include "XSUB.h"
-#endif
-
-/* put back our 

No LINGUAS file yet for pg_combinebackup

2023-12-25 Thread Tom Lane
Since dc2123400, any "make" in src/bin/pg_combinebackup whines

cat: ./po/LINGUAS: No such file or directory

if you have selected --enable-nls.  This is evidently from

AVAIL_LANGUAGES := $(shell cat $(srcdir)/po/LINGUAS)

I don't particularly care to see that warning until whenever
it is that the translations first get populated.  Perhaps
we should hack up nls-global.mk to hide the warning, but
that might bite somebody someday.  I'm inclined to just
add an empty LINGUAS file as a temporary measure.
Thoughts?

regards, tom lane




Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-25 Thread Pavel Borisov
Hi, Alexander!

On Mon, 25 Dec 2023 at 02:51, Alexander Korotkov wrote:

> On Tue, Dec 12, 2023 at 3:22 PM Alexander Korotkov 
> wrote:
> >
> > On Mon, Dec 11, 2023 at 6:16 PM Peter Geoghegan  wrote:
> > > Will you be in Prague this week? If not this might have to wait.
> >
> > Sorry, I wouldn't be in Prague this week.  Due to my current
> > immigration status, I can't travel.
> > I wish you to have a lovely time in Prague.  I'm OK to wait, review
> > once you can.  I will probably provide a more polished version
> > meanwhile.
>
> Please find the revised patchset attached.  It comes with revised
> comments and commit messages.  Besides bug fixing the second patch
> makes optimization easier to understand.  Now the flag used for
> skipping checks of same direction required keys is named
> continuescanPrechecked and means exactly that *continuescan flag is
> known to be true for the last item on the page.
>
> Any objections to pushing these two patches?
>

I've reviewed both patches:
0001 - pure refactoring, replacing passing an argument via a struct
member with passing it explicitly as a function argument. It's justified by
the fact that firstPage is localized to only a few places. The patch looks
simple and good enough.

0002:
continuescanPrechecked is semantically much better than the
previous requiredMatchedByPrecheck, which confused me earlier. Thanks!

From the new comments, it looks a little bit hard to understand who does
what. Semantics like "if caller told" in the comments look clearer to me.
Could you especially give attention to these comments:

"If they wouldn't be matched, then the *continuescan flag would be set for
the current item and the last item on the page accordingly."
"If the key is required for the opposite direction scan, we need to know
there was already at least one matching item on the page.  For those keys."

> Prechecking the value of the continuescan flag for the last item on the
>+ * page (according to the scan direction).
Maybe, in this case, it would be clearer as: "...(for backwards scan
it will be the first item on a page)"

Otherwise the patch 0002 looks like a good fix for the bug to be pushed.

Kind regards,
Pavel Borisov


Re: Show WAL write and fsync stats in pg_stat_io

2023-12-25 Thread Michael Paquier
On Mon, Dec 25, 2023 at 04:09:34PM +0300, Nazir Bilal Yavuz wrote:
> On Wed, 9 Aug 2023 at 21:52, Melanie Plageman  
> wrote:
>> If there is any combination of BackendType and IOContext which will
>> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
>> op_bytes. For other cases, we may have to consider using op_bytes 1 and
>> tracking reads and write IOOps in number of bytes (instead of number of
>> pages). I don't actually know if there is a clear separation by
>> BackendType for these different cases.
> 
> Using op_bytes as 1 solves this problem but since it will be different
> from the rest of the pg_stat_io view it could be hard to understand.
> There is no clear separation by backends as it can be seen from the walsender.

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, it can be understood as doing an
operation N times, but it would be much less than that.  Another
solution would be to use NULL (as a synonym of "I don't know") and
then document that in this case all the bigint counters of pg_stat_io
track the number of bytes rather than the number of operations?

>> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
>> treat op_bytes * number of reads as an approximation of the number of
>> bytes read. I don't actually know what makes more sense. I don't think I
>> would like having a number for bytes that is not accurate.
> 
> Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
> pg_pread() call tries to read XLOG_BLCKSZ but it is not certain and we
> don't count IO if it couldn't read XLOG_BLCKSZ. IMO, this is not as
> important as the previous problem but it still is a problem.
> 
> I would be glad to hear opinions on these problems.

Correctness matters a lot for monitoring, IMO.
--
Michael


signature.asc
Description: PGP signature


Re: pread, pwrite, etc return ssize_t not int

2023-12-25 Thread Thomas Munro
On Mon, Dec 25, 2023 at 7:09 AM Tom Lane  wrote:
> Coverity whinged this morning about the following bit in
> the new pg_combinebackup code:
>
> 644         unsigned    rb;
> 645
> 646         /* Read the block from the correct source, except if dry-run. */
> 647         rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
> 648         if (rb != BLCKSZ)
> 649         {
> >>> CID 1559912:  Control flow issues  (NO_EFFECT)
> >>> This less-than-zero comparison of an unsigned value is never true. "rb < 0U".
> 650             if (rb < 0)
> 651                 pg_fatal("could not read file \"%s\": %m", s->filename);
>
> It's dead right to complain of course.  (I kind of think that the
> majority of places where reconstruct.c is using "unsigned" variables
> are poorly-thought-through; many of them look like they should be
> size_t, and I suspect some other ones beside this one are flat wrong
> or at least unnecessarily fragile.  But I digress.)

Yeah.
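
(As a standalone illustration of the pitfall, outside pg_combinebackup:
with an unsigned result variable the "rb < 0" test is dead code, so the
result of a read must land in a signed type first.)

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    char        buffer[8192];
    ssize_t     rb;             /* signed: read() returns ssize_t */

    rb = read(0, buffer, sizeof(buffer));
    if (rb < 0)                 /* meaningful only because rb is signed */
        perror("read");
    else
        printf("read %zd bytes\n", rb);
    return 0;
}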

> While looking around for other places that might've made comparable
> mistakes, I noted that we have places that are storing the result of
> pg_pread[v]/pg_pwrite[v] into an "int" variable even though they are
> passing a size_t count argument that there is no obvious reason to
> believe must fit in int.  This seems like trouble waiting to happen,
> so I fixed some of these in the attached.  The major remaining place
> that I think we ought to change is the newly-minted
> FileRead[V]/FileWrite[V] functions, which are declared to return int
> but really should be returning ssize_t IMO.  I didn't do that here
> though.

Agreed in theory.  Note that we've only been using size_t in fd.c
functions since:

commit 2d4f1ba6cfc2f0a977f1c30bda9848041343e248
Author: Peter Eisentraut 
Date:   Thu Dec 8 08:51:38 2022 +0100

Update types in File API

Make the argument types of the File API match stdio better:

- Change the data buffer to void *, from char *.
- Change FileWrite() data buffer to const on top of that.
- Change amounts to size_t, from int.

I guess it was an oversight not to change the return type to match at
the same time.  That said, I think that would only be for tidiness.

Some assorted observations:

1.  We don't yet require "large file" support, meaning that we use
off_t our fd.c and lseek()/p*() replacement functions, but we know it
is only 32 bits on Windows, and we avoid creating large files.  I
think that means that a hypothetical very large write would break that
assumption, creating data whose position cannot be named in those
calls.  We could fix that on Windows by adjusting our wrappers, either
to work with pgoff_t instead off off_t (yuck) or redefining off_t
(yuck), but given that both options are gross, so far we have imagined
that we should move towards using large files only conditionally, on
sizeof(off_t) >= 8.

2.  Windows' native read() and write() functions still have prototypes
like 1988 POSIX, eg int read(int filedes, void *buf, unsigned int
nbyte).  It was 2001 POSIX that changed them to ssize_t read(int
filedesc, void *buf, size_t nbytes).  I doubt there is much that can
be done about that, except perhaps adding a wrapper that caps it.
Seems like overkill for a hypothetical sort of a problem...

3.  Windows' native ReadFile() and WriteFile() functions, which our
'positionified' pg_p*() wrapper functions use, work in terms of DWORD
= unsigned long which is 32 bit on that cursed ABI.  Our wrappers
should probably cap.

I think a number of Unixoid systems implemented the POSIX interface
change by capping internally, which doesn't matter much in practice
because no one really tries to transfer gigabytes at once, and any
non-trivial transfer size probably requires handling short transfers.
For example, man read on Linux:

   On Linux, read() (and similar system calls) will transfer at most
   0x7ffff000 (2,147,479,552) bytes, returning the number of bytes
   actually transferred.  (This is true on both 32-bit and 64-bit
   systems.)
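
A capping wrapper of the sort alluded to in points 2 and 3 could look
roughly like this (a sketch; the wrapper name is made up and the cap
constant mirrors the Linux behaviour quoted above):

#include <unistd.h>

/* Illustrative cap, mirroring Linux's internal MAX_RW_COUNT. */
#define MAX_TRANSFER_SIZE ((size_t) 0x7ffff000)

/* Sketch of a read wrapper for platforms whose native call takes a
 * 32-bit count (e.g. Windows): cap the request and return a short
 * read, which non-trivial callers must be prepared for anyway. */
static ssize_t
read_capped(int fd, void *buf, size_t nbytes)
{
    if (nbytes > MAX_TRANSFER_SIZE)
        nbytes = MAX_TRANSFER_SIZE;
    return read(fd, buf, nbytes);
}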

> We could go further by insisting that *all* uses of pg_pread/pg_pwrite
> use ssize_t result variables.  I think that's probably overkill --- in
> the example above, which is only asking to write BLCKSZ worth of data,
> surely an int is sufficient.  But you could argue that allowing this
> pattern at all creates risk of copy/paste errors.

Yeah.

> Of course the real elephant in the room is that plain old read(2)
> and write(2) also return ssize_t.  I've not attempted to vet every
> call of those, and I think it'd likely be a waste of effort, as
> we're unlikely to ever try to shove more than INT_MAX worth of
> data through them.  But it's a bit harder to make that argument
> for the iovec-based file APIs.  I think we ought to try to keep
> our uses of those functions clean on this point.

Yeah I think it's OK for a caller that knows it's passing in an int
value to (implicitly) cast the return to int. 

Re: Statistics Import and Export

2023-12-25 Thread Tomas Vondra
Hi,

I finally had time to look at the last version of the patch, so here are a
couple of thoughts and questions in somewhat random order. Please take this
as a bit of brainstorming, and push back if you disagree with some of my
comments.

In general, I like the goal of this patch - not having statistics is a
common issue after an upgrade, and people sometimes don't even realize
they need to run analyze. So, it's definitely worth improving.

I'm not entirely sure about the other use case - allowing people to
tweak optimizer statistics on a running cluster, to see what would be
the plan in that case. Or more precisely - I agree that would be an
interesting and useful feature, but maybe the interface should not be
the same as for the binary upgrade use case?


interfaces
--

When I thought about the ability to dump/load statistics in the past, I
usually envisioned some sort of DDL that would do the export and import.
So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
or something like that, and that'd do all the work. This would mean
stats are "first-class citizens" and it'd be fairly straightforward to
add this into pg_dump, for example. Or at least I think so ...

Alternatively we could have the usual "functional" interface, with
functions to export/import statistics, replacing the DDL commands.

Unfortunately, none of this works for the pg_upgrade use case, because
existing cluster versions would not support this new interface, of
course. That's a significant flaw, as it'd make this useful only for
upgrades of future versions.

So I think for the pg_upgrade use case, we don't have much choice other
than using "custom" export through a view, which is what the patch does.

However, for the other use case (tweaking optimizer stats) this is not
really an issue - that always happens on the same instance, so no issue
with not having the "export" function and so on. I'd bet there are more
convenient ways to do this than using the export view. I'm sure it could
share a lot of the infrastructure, ofc.

I suggest we focus on the pg_upgrade use case for now. In particular, I
think we really need to find a good way to integrate this into
pg_upgrade. I'm not against having custom CLI commands, but it's still a
manual thing - I wonder if we could extend pg_dump to dump stats, or
make it built-in into pg_upgrade in some way (possibly disabled by
default, or something like that).


JSON format
---

As for the JSON format, I wonder if we need that at all? Isn't that an
unnecessary layer of indirection? Couldn't we simply dump pg_statistic
and pg_statistic_ext_data in CSV, or something like that? The amount of
new JSONB code seems to be very small, so it's OK I guess.

I'm still a bit unsure about the "right" JSON schema. I find it a bit
inconvenient that the JSON objects mimic the pg_statistic schema very
closely. In particular, it has one array for stakind values, another
array for stavalues, array for stanumbers etc. I understand generating
this JSON in SQL is fairly straightforward, and for the pg_upgrade use
case it's probably OK. But my concern is it's not very convenient for
the "manual tweaking" use case, because the "related" fields are
scattered in different parts of the JSON.

That's pretty much why I envisioned a format "grouping" the arrays for a
particular type of statistics (MCV, histogram) into the same object, as
for example in

 {
   "mcv" : {"values" : [...], "frequencies" : [...]}
   "histogram" : {"bounds" : [...]}
 }

But that's probably much harder to generate from plain SQL (at least I
think so, I haven't tried).


data missing in the export
--

I think the data needs to include more information. Maybe not for the
pg_upgrade use case, where it's mostly guaranteed not to change, but for
the "manual tweak" use case it can change. And I don't think we want two
different formats - we want one, working for everything.

Consider for example the staopN and stacollN fields - if we clone
the stats from one table to the other, and the table uses different
collations, will that still work? Similarly, I think we should include
the type of each column, because it's absolutely not guaranteed the
import function will fail if the type changes. For example, if the type
changes from integer to text, that will work, but the ordering will
absolutely not be the same. And so on.

For the extended statistics export, I think we need to include also the
attribute names and expressions, because these can be different between
the statistics. And not only that - the statistics values reference the
attributes by positions, but if the two tables have the attributes in a
different order (when ordered by attnum), that will break stuff.


more strict checks
--

I think the code should be a bit more "defensive" when importing stuff,
and do at least some sanity checks. For the pg_upgrade use case this
should be mostly non-issue (except for maybe helping to d

Re: Track in pg_replication_slots the reason why slots conflict?

2023-12-25 Thread Amit Kapila
On Thu, Dec 21, 2023 at 8:21 PM Bertrand Drouvot wrote:
>
> On Thu, Dec 21, 2023 at 07:55:51PM +0530, Amit Kapila wrote:
> > On Thu, Dec 21, 2023 at 5:05 PM Andres Freund  wrote:
> > > I'm not entirely sure I understand the difference - just whether we add 
> > > one
> > > new column or replace the existing 'conflicting' column? I can see 
> > > arguments
> > > for either.
> > >
> >
> > Agreed. I think the argument against replacing the existing
> > 'conflicting' column is that there is a chance that it is being used
> > by some monitoring script which I guess shouldn't be a big deal to
> > change. So, if we don't see that as a problem, I would prefer to have
> > a single column with conflict reason where one of its values indicates
> > there is no conflict.
>
> +1
>

Does anyone else have a preference on whether to change the existing
column or add a new one?
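
To make the single-column option concrete, here is a sketch of how the
existing invalidation cause could be mapped to a display value (the
function and the display strings are illustrative; the enum members are
the ones that exist today in slot.h):

static const char *
slot_conflict_reason(ReplicationSlotInvalidationCause cause)
{
    switch (cause)
    {
        case RS_INVAL_NONE:
            return NULL;        /* no conflict */
        case RS_INVAL_WAL_REMOVED:
            return "wal_removed";
        case RS_INVAL_HORIZON:
            return "rows_removed";
        case RS_INVAL_WAL_LEVEL:
            return "wal_level_insufficient";
    }
    return NULL;                /* keep compiler quiet */
}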

-- 
With Regards,
Amit Kapila.




Re: Avoid computing ORDER BY junk columns unnecessarily

2023-12-25 Thread Richard Guo
On Fri, Dec 22, 2023 at 2:38 AM Heikki Linnakangas  wrote:

> v1-0004-Omit-columns-from-final-tlist-that-were-only-need.patch
>
> The main patch in this series.


This patch filters out the junk columns created for ORDER BY/GROUP BY,
and retains the junk columns created for RowLocks.  I'm afraid this may
have a problem with out-of-order resnos.  For instance,

create table mytable (foo text, bar text);

# explain select foo from mytable order by bar for update of mytable;
server closed the connection unexpectedly

This query triggers another Assert in apply_tlist_labeling:

Assert(dest_tle->resno == src_tle->resno);

At first there are three TargetEntry items: foo, bar and ctid, with
resnos being 1, 2 and 3.  And then the second item 'bar' is removed,
leaving only two items: foo and ctid, with resnos being 1 and 3.  So now
we have a missing resno, and finally hit the Assert.

Thanks
Richard


Re: Optimization outcome depends on the index order

2023-12-25 Thread Andrei Lepikhov

On 25/12/2023 18:36, Alexander Korotkov wrote:

On Fri, Dec 22, 2023 at 7:24 PM Andrei Lepikhov wrote:

On 22/12/2023 11:48, Alexander Korotkov wrote:

  > Because we must trust all predictions made by the planner, we just
  > choose the most trustworthy path. According to the planner logic, it is
  > a path with a smaller selectivity. We can make mistakes anyway just
  > because of the nature of estimation.

Even if we need to take selectivity into account here, it's still not
clear why this should be on top of other logic later in add_path().

I got your point now, thanks for pointing it out. In the next version of
the patch selectivity is used as a criteria only in the case of COSTS_EQUAL.


It looks better now.  But it's hard for me to judge these heuristics
in add_path().  Tom, what do you think about this?

Just food for thought:
Another approach I considered was to initialize the relation index
list according to some consistent rule: place unique indexes at the head
of the list, arrange indexes according to the number of columns involved,
and sort in some lexical order.

But IMO, the implemented approach looks better.

--
regards,
Andrei Lepikhov
Postgres Professional





Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-25 Thread Masahiko Sawada
On Thu, Dec 21, 2023 at 4:41 PM John Naylor  wrote:
>
> On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada  wrote:
> >
> > I found the following comment and wanted to discuss:
> >
> > // this might be better as "iterate over nodes", plus a callback to
> > RT_DUMP_NODE,
> > // which should really only concern itself with single nodes
> > RT_SCOPE void
> > RT_DUMP(RT_RADIX_TREE *tree)
> >
> > If it means we need to somehow use the iteration functions also for
> > dumping the whole tree, it would probably need to refactor the
> > iteration codes so that the RT_DUMP() can use them while dumping
> > visited nodes. But we need to be careful of not adding overheads to
> > the iteration performance.
>
> Yeah, some months ago I thought a callback interface would make some
> things easier. I don't think we need that at the moment (possibly
> never), so that comment can be just removed. As far as these debug
> functions, I only found useful the stats and dumping a single node,
> FWIW.
>
> I've attached v47, which is v46 plus some fixes for radix tree.
>
> 0004 - moves everything for "delete" to the end -- gradually other
> things will be grouped together in a sensible order
>
> 0005 - trivial

LGTM.

>
> 0006 - shrink nodes -- still needs testing, but nothing crashes yet.

Cool. The coverage test results showed the shrink codes are also covered.

> This shows some renaming might be good: Previously we had
> RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've
> added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply
> not copying the slot to be deleted. This means when growing it would
> be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that
> reserves a new slot for the caller in the new node, but the caller
> must do the insert itself.

Agreed.

> Note that there are some practical
> restrictions/best-practices on whether shrinking should happen after
> deletion or vice versa. Hopefully it's clear, but let me know if the
> description can be improved. Also, it doesn't yet shrink from size
> class 32 to 16, but it could with a bit of work.

Sounds reasonable.

>
> 0007 - trivial, but could use a better comment. I also need to make
> sure stats reporting works (may also need some cleanup work).
>
> 0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago
> if DSA could just free all our allocated segments without throwing
> away the DSA, and that's still a good question.

LGTM.

>
> 0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think
> this name is better than RT_UPDATE_ITER_STACK, so maybe we should go
> back to that).

Will rename it.

> The assert doesn't fire, so I guess it does what it's
> supposed to?

Yes.

> For me, the iteration logic is still the most confusing
> piece out of the whole radix tree. Maybe that could be helped with
> some better variable names, but I wonder if it needs more invasive
> work.

True. Maybe more comments would also help.

>
> 0010 - some fixes for number of children accounting in node256
>
> 0011 - Long overdue pgindent of radixtree.h, without trying to fix up
> afterwards. Feel free to throw out and redo if this interferes with
> ongoing work.
>

LGTM.

I'm working on the below review comments and most of them are already
incorporated on the local branch:

> The rest are from your v46. The bench doesn't work for tid store
> anymore, so I squashed "disable bench for CI" until we get back to
> that. Some more review comments (note: patch numbers are for v47, but
> I changed nothing from v46 in this area):
>
> 0013:
>
> + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
> + * and stored in the radix tree.
>
> Recently outdated. The variable length values seems to work, so let's
> make everything match.
>
> +#define MAX_TUPLES_PER_PAGE  MaxOffsetNumber
>
> Maybe we don't need this macro anymore? The name no longer fits, in any case.

Removed.

>
> +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber 
> *offsets,
> + int num_offsets)
> +{
> + char buf[MaxBlocktableEntrySize];
> + BlocktableEntry *page = (BlocktableEntry *) buf;
>
> I'm not sure this is safe with alignment. Maybe rather than plain
> "char", it needs to be a union with BlocktableEntry, or something.

I tried it in the new patch set, but could you explain why it might not
be safe with respect to alignment?
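
For reference, what I tried looks roughly like this (the union and
variable names are illustrative):

/* The BlocktableEntry member gives the buffer the alignment that
 * BlocktableEntry requires, which a plain char array does not
 * guarantee. */
typedef union
{
    BlocktableEntry entry;
    char        data[MaxBlocktableEntrySize];
} BlocktableEntryBuffer;

...and then in TidStoreSetBlockOffsets():

    BlocktableEntryBuffer buf;
    BlocktableEntry *page = &buf.entry; /* suitably aligned */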

>
> +static inline BlocktableEntry *
> +tidstore_iter_kv(TidStoreIter *iter, uint64 *key)
> +{
> + if (TidStoreIsShared(iter->ts))
> + return shared_rt_iterate_next(iter->tree_iter.shared, key);
> +
> + return local_rt_iterate_next(iter->tree_iter.local, key);
> +}
>
> In the old encoding scheme, this function did something important, but
> now it's a useless wrapper with one caller.

Removed.

>
> + /*
> + * In the shared case, TidStoreControl and radix_tree are backed by the
> + * same DSA area and rt_memory_usage() returns the value including both.
> + * So we don't need to add the size of TidStoreControl separ