Re: Remove MSVC scripts from the tree

2023-11-20 Thread Michael Paquier
On Mon, Nov 20, 2023 at 08:00:09AM +0100, Peter Eisentraut wrote:
> I had suggested this arrangement as a way to reduce churn in this patch set.
> We'd just move over the existing separate chapter into a new section, and
> then later consider further rearrangements.
>
> It's not always clear where all of these things should go, as there are so
> many dimensions.  For example, the existing sentence "After you have
> everything installed, it is suggested that you run psql under CMD.EXE, as
> the MSYS console has buffering issues.", does that apply to MinGW, or really
> MSYS, or does it also apply if you build with Visual Something?

Even for this specific one, are you sure that it still applies?  :D

> Ultimately, I don't think MinGW needs to be its own section.

Yes, agreed.  The end result should be one single sect2 for Windows
divided into multiple sect3, perhaps themselves divided into more
sect4 for each build method.  As a whole, before refactoring all that,
I'd be in favor of a slightly different strategy once the MSVC scripts
and install-windows.sgml with its stuff specific to src/tools/msvc are
gone:
- Review this whole section of the docs and trim everything that we
think is now irrelevant.
- Look at the rest and see how it can be efficiently refactored into
balanced sections.

Your suggestion to create a new sect2 for "Windows", as much as Andres'
suggestion, is OK by me as an intermediate step, and I suspect that the
end result will likely not be that.
--
Michael




Re: Simplify if/else logic of walsender CreateReplicationSlot

2023-11-20 Thread Michael Paquier
On Mon, Nov 20, 2023 at 06:01:42PM +1100, Peter Smith wrote:
> While reviewing another patch I was looking at the walsender's static
> function CreateReplicationSlot.
> 
> I found that the current logic seemed to have some unnecessary if/else
> checking which can be simplified.

Good idea.  What you are suggesting here improves the readability of
this code, so +1.
--
Michael




Inquiry on Generating Bitmaps from Filter Conditions in Index Scans

2023-11-20 Thread Jinjing Zhou
Hi hackers, 

I hope this message finds you well. I am reaching out to seek guidance on a 
specific aspect of PostgreSQL's index scanning functionality.

I am currently working on a vector search extension for postgres, where I need 
to generate bitmaps based on filter conditions during an index scan. The goal 
is to optimize the query performance by efficiently identifying the rows that 
meet the given criteria.

The query plan looks like this
> Index Scan using products_feature_idx on products  (cost=0.00..27.24 rows=495 
> width=12)
>          Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector)
>          Filter: ((price > '0.2'::double precision) AND (price <= 
> '0.7'::double precision))

We have a custom index for the ORDER BY clause on the feature column. Now we 
want to utilize indexes on other columns, like the price column. We want to 
access the bitmap of the price column's filter condition inside the feature 
column index scan. Is there any way I can achieve this goal?

Any help or guidance is appreciated!

Thanks.
Jinjing Zhou



Re: Use of backup_label not noted in log

2023-11-20 Thread Michael Paquier
On Sat, Nov 18, 2023 at 01:49:15PM -0800, Andres Freund wrote:
> Note that the LSN in the "continuing" case is the one the backup started at,
> not where recovery will start.
> 
> I've wondered whether it's worth also adding an explicit message just after
> ReachedEndOfBackup(), but it seems far less urgent due to the existing
> "consistent recovery state reached at %X/%X" message.

Upgrading the surrounding DEBUG1 to a LOG is another option, but I
agree that I've seen that be less of an actual problem in the field
compared to the famous I-removed-a-backup-label-and-I'm-still-up,
until the user sees signs of corruption after recovery has finished,
sometimes days after putting the instance back online.

> Playing around with this a bit, I'm wondering if we instead should remove that
> message, and emit something more informative earlier on. If there's a problem,
> you kinda want the information before we finally get to the loop in
> PerformWalRecovery(). If e.g. there's no WAL you'll only get
> LOG:  invalid checkpoint record
> PANIC:  could not locate a valid checkpoint record

I was looking at this code a few weeks ago and have an item on my to-do
list about sending a patch to make this exact PANIC message more
talkative, as there are a lot of instances running with
log_min_messages > log.

> which is delightfully lacking in details.

With a user panicking as much as the server itself, that's even more
tasty.

+   if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
+   ereport(LOG,
+   (errmsg("continuing to start from base backup with redo LSN %X/%X",
+   LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));

"Continuing to start" sounds a bit weird to me, though, considering
that there are a few LOGs that say "starting" when there is a signal
file, but I don't have a better idea on top of my mind.  So that
sounds OK here.
--
Michael




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-20 Thread Dilip Kumar
On Fri, Nov 17, 2023 at 7:28 PM Alvaro Herrera  wrote:
>
> On 2023-Nov-17, Dilip Kumar wrote:

I think I need some more clarification on some of the review comments.

> > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera  
> > wrote:
> > >
> > > I just noticed that 0003 does some changes to
> > > TransactionGroupUpdateXidStatus() that haven't been adequately
> > > explained AFAICS.  How do you know that these changes are safe?
> >
> > IMHO this is safe as well as logical to do w.r.t. performance.  It's
> > safe because whenever we are updating any page in a group we are
> > acquiring the respective bank lock in exclusive mode, and in extreme
> > cases, if there are pages from different banks, then we switch the
> > lock as well before updating the pages from different groups.
>
> Looking at the coverage for this code,
> https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> it seems in our test suites we never hit the case where there is
> anything in the "nextidx" field for commit groups.

1)
I was looking into your coverage report and I have attached a
screenshot from the same; it seems we do hit the block where nextidx
is not INVALID_PGPROCNO, which means there is some other process
besides the group leader.  Although I have already started exploring
injection points, I just wanted to be sure what your main concern
about the coverage is, so I thought of checking that first.

 470          :     /*
 471          :      * If the list was not empty, the leader will update the status of our
 472          :      * XID. It is impossible to have followers without a leader because the
 473          :      * first process that has added itself to the list will always have
 474          :      * nextidx as INVALID_PGPROCNO.
 475          :      */
 476       98 :     if (nextidx != INVALID_PGPROCNO)
 477          :     {
 478        2 :         int extraWaits = 0;
 479          :
 480          :         /* Sleep until the leader updates our XID status. */
 481        2 :         pgstat_report_wait_start(WAIT_EVENT_XACT_GROUP_UPDATE);
 482          :         for (;;)
 483          :         {
 484          :             /* acts as a read barrier */
 485        2 :             PGSemaphoreLock(proc->sem);
 486        2 :             if (!proc->clogGroupMember)
 487        2 :                 break;
 488        0 :             extraWaits++;
 489          :         }

2) Do we really need one separate lwlock tranche for each SLRU?

IMHO if we use the same lwlock tranche then the wait event will show
the same wait event name, right? And that would be confusing for the
user, whether we are waiting for Subtransaction or Multixact or
anything else.  Is my understanding not correct here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: POC, WIP: OR-clause support for indexes

2023-11-20 Thread Andrei Lepikhov

On 10/11/2023 16:20, Alena Rybakina wrote:

> I added log like that: ERROR:  unrecognized node type: 0.
>
> I fixed this issue and added some cosmetic refactoring.
> The changes are presented in the or_patch_changes.diff file.


Looking into the patch, I found some trivial improvements (see attachment).
Also, it is not obvious that using a string representation of the clause
as a hash table key is needed here. Also, by making a copy of the node
in get_key_nconst_node(), you replace the location field, but in the
case of a complex expression, you don't do the same with other nodes.
I propose to generate an expression hash instead and prove the equality
of two expressions by calling equal().


--
regards,
Andrei Lepikhov
Postgres Professional
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 25a4235dbd..de27d2646c 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -178,6 +178,10 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
 	HTAB	   *or_group_htab = NULL;
 	int			len_ors = list_length(expr_orig->args);
 
+	/* If this is not an 'OR' expression, skip the transformation */
+	if (!or_transform_limit || expr_orig->boolop != OR_EXPR || len_ors < 2)
+		return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
+
 	MemSet(&info, 0, sizeof(info));
 	info.keysize = sizeof(char *);
 	info.entrysize = sizeof(OrClauseGroupEntry);
@@ -188,10 +192,6 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
 										  &info,
 										  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
-	/* If this is not an 'OR' expression, skip the transformation */
-	if (expr_orig->boolop != OR_EXPR || !or_transform_limit || len_ors == 1 || !or_group_htab)
-		return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
-
 	foreach(lc, expr_orig->args)
 	{
 		Node	   *arg = lfirst(lc);


Re: Question about the Implementation of vector32_is_highbit_set on ARM

2023-11-20 Thread John Naylor
On Wed, Nov 8, 2023 at 2:44 PM Xiang Gao  wrote:
>  * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
>  * check each 32-bit element, but that would require an additional mask
>  * operation on x86.
>  */

> But I still don't understand why the vmaxvq_u32 intrinsic is not used on the
> Arm platform.

The current use case expects all 1's or all 0's in a 32-bit lane. If
anyone tried using it for arbitrary values, vmaxvq_u32 could give a
different answer than on x86 using _mm_movemask_epi8, so I think
that's the origin of that comment. But it's still a maintenance hazard
as is, since x86 wouldn't work for arbitrary values. It seems the path
forward is to rename this function to vector32_is_any_lane_set(), as
in the attached (untested on Arm). That would allow each
implementation to use the most efficient path, whether it's by 8- or
32-bit lanes. If we someday needed to look at only the high bits, we
would need a new function that performed the necessary masking on x86.

It's possible this method could shave cycles on Arm in some 8-bit lane
cases where we don't actually care about the high bit specifically,
since the movemask equivalent is slow on that platform, but I haven't
looked yet.
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 59aa8245ed..f536905d4d 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -151,7 +151,7 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 		result = vector32_or(tmp1, tmp2);
 
 		/* see if there was a match */
-		if (vector32_is_highbit_set(result))
+		if (vector32_is_any_lane_set(result))
 		{
 			Assert(assert_result == true);
 			return true;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..40558bbca8 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -78,7 +78,7 @@ static inline bool vector8_has_zero(const Vector8 v);
 static inline bool vector8_has_le(const Vector8 v, const uint8 c);
 static inline bool vector8_is_highbit_set(const Vector8 v);
 #ifndef USE_NO_SIMD
-static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline bool vector32_is_any_lane_set(const Vector32 v);
 #endif
 
 /* arithmetic operations */
@@ -278,22 +278,21 @@ vector8_is_highbit_set(const Vector8 v)
 }
 
 /*
- * Exactly like vector8_is_highbit_set except for the input type, so it
- * looks at each byte separately.
+ * Return true if any 32-bit lane is set, otherwise false.
  *
- * XXX x86 uses the same underlying type for 8-bit, 16-bit, and 32-bit
- * integer elements, but Arm does not, hence the need for a separate
- * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
- * check each 32-bit element, but that would require an additional mask
- * operation on x86.
+ * XXX: We assume all lanes are either all zeros or all ones.
  */
 #ifndef USE_NO_SIMD
 static inline bool
-vector32_is_highbit_set(const Vector32 v)
+vector32_is_any_lane_set(const Vector32 v)
 {
 #if defined(USE_NEON)
-	return vector8_is_highbit_set((Vector8) v);
+	return vmaxvq_u32(v) > 0;
 #else
+	/*
+	 * There is no 32-bit version of _mm_movemask_epi8, but we can still use
+	 * the 8-bit version.
+	 */
 	return vector8_is_highbit_set(v);
 #endif
 }


Stop the search once replication origin is found

2023-11-20 Thread Antonin Houska
Although it's not performance-critical, I think it just makes sense to break
the loop in replorigin_session_setup() as soon as we've found the origin.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b0255ffd25..460e3dcc38 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -1144,6 +1144,7 @@ replorigin_session_setup(RepOriginId node, int acquired_by)
 
 		/* ok, found slot */
 		session_replication_state = curstate;
+		break;
 	}
 
 


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-20 Thread Andrey M. Borodin



> On 20 Nov 2023, at 13:51, Dilip Kumar  wrote:
> 
> 2) Do we really need one separate lwlock tranche for each SLRU?
> 
> IMHO if we use the same lwlock tranche then the wait event will show
> the same wait event name, right? And that would be confusing for the
> user, whether we are waiting for Subtransaction or Multixact or
> anything else.  Is my understanding not correct here?

If we give the user multiple GUCs to tweak, I think we should also give them
a way to understand which GUC to tweak when they observe wait times.

Best regards, Andrey Borodin.



Re: POC, WIP: OR-clause support for indexes

2023-11-20 Thread a.rybakina

On 20.11.2023 11:52, Andrei Lepikhov wrote:

> On 10/11/2023 16:20, Alena Rybakina wrote:
>> I added log like that: ERROR: unrecognized node type: 0.
>>
>> I fixed this issue and added some cosmetic refactoring.
>> The changes are presented in the or_patch_changes.diff file.
>
> Looking into the patch, I found some trivial improvements (see
> attachment).
> Also, it is not obvious that using a string representation of the
> clause as a hash table key is needed here. Also, by making a copy of
> the node in get_key_nconst_node(), you replace the location field,
> but in the case of a complex expression, you don't do the same with
> other nodes.
> I propose to generate an expression hash instead and prove the
> equality of two expressions by calling equal().

Thank you! I agree with your changes.




Re: PL/pgSQL: Incomplete item Allow handling of %TYPE arrays, e.g. tab.col%TYPE[]

2023-11-20 Thread Pavel Stehule
Hi

On Tue, Oct 17, 2023 at 3:20 AM Quan Zongliang wrote:

>
> Attached new patch.
> More explicit error messages based on type.
>
>
> On 2023/10/16 18:15, Quan Zongliang wrote:
> >
> >
> > Implement TODO item:
> > PL/pgSQL
> > Incomplete item Allow handling of %TYPE arrays, e.g. tab.col%TYPE[]
> >
> > As a first step, deal only with [], such as
> > xxx.yyy%TYPE[]
> > xxx%TYPE[]
> >
> > It can be extended to support multi-dimensional and complex syntax in
> > the future.
> >
>

I did some deeper checks:

- I don't like the parser modification too much (I am sending my own
alternative implementation) - the SQL parser allows richer syntax, and
full functionality needs only a few lines more

- the original patch doesn't solve %ROWTYPE

(2023-11-20 10:04:36) postgres=# select * from foo;
┌────┬────┐
│ a  │ b  │
╞════╪════╡
│ 10 │ 20 │
│ 30 │ 40 │
└────┴────┘
(2 rows)

(2023-11-20 10:08:29) postgres=# do $$
declare v foo%rowtype[];
begin
  v := array(select row(a,b) from foo);
  raise notice '%', v;
end;
$$;
NOTICE:  {"(10,20)","(30,40)"}
DO

- the original patch doesn't solve type RECORD
the error message should be more intuitive; arrays of record type could
probably be supported, but that needs bigger research.

(2023-11-20 10:10:34) postgres=# do $$
declare r record; v r%type[];
begin
  v := array(select row(a,b) from foo);
  raise notice '%', v;
end;
$$;
ERROR:  syntax error at or near "%"
LINE 2: declare r record; v r%type[];
                             ^
CONTEXT:  invalid type name "r%type[]"

- missing documentation

- I don't like using the word "partitioned" in the regress test name
"partitioned_table". It is confusing.

Regards

Pavel



> >
> > --
> > Quan Zongliang
commit 3418f61191fe4b377717451b929cefb8028d3718
Author: ok...@github.com 
Date:   Mon Nov 20 08:49:41 2023 +0100

poc

diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index a341cde2c1..83e91d82d5 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -2095,6 +2095,29 @@ plpgsql_build_datatype(Oid typeOid, int32 typmod,
 	return typ;
 }
 
+/*
+ * Returns an array for type specified as argument.
+ */
+PLpgSQL_type *
+plpgsql_datatype_arrayof(PLpgSQL_type *dtype)
+{
+	Oid			array_typeid;
+
+	if (dtype->typisarray)
+		return dtype;
+
+	array_typeid = get_array_type(dtype->typoid);
+
+	if (!OidIsValid(array_typeid))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 errmsg("could not find array type for data type %s",
+						format_type_be(dtype->typoid))));
+
+	return plpgsql_build_datatype(array_typeid, dtype->atttypmod,
+								  dtype->collation, NULL);
+}
+
 /*
  * Utility subroutine to make a PLpgSQL_type struct given a pg_type entry
  * and additional details (see comments for plpgsql_build_datatype).
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 6a09bfdd67..4784e8fd5c 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -2789,7 +2789,7 @@ read_datatype(int tok)
 	StringInfoData ds;
 	char	   *type_name;
 	int			startlocation;
-	PLpgSQL_type *result;
+	PLpgSQL_type *result = NULL;
 	int			parenlevel = 0;
 
 	/* Should only be called while parsing DECLARE sections */
@@ -2817,15 +2817,11 @@ read_datatype(int tok)
 							   K_TYPE, "type"))
 			{
 				result = plpgsql_parse_wordtype(dtname);
-				if (result)
-					return result;
 			}
 			else if (tok_is_keyword(tok, &yylval,
 									K_ROWTYPE, "rowtype"))
 			{
 				result = plpgsql_parse_wordrowtype(dtname);
-				if (result)
-					return result;
 			}
 		}
 	}
@@ -2841,15 +2837,11 @@ read_datatype(int tok)
 							   K_TYPE, "type"))
 			{
 				result = plpgsql_parse_wordtype(dtname);
-				if (result)
-					return result;
 			}
 			else if (tok_is_keyword(tok, &yylval,
 									K_ROWTYPE, "rowtype"))
 			{
 				result = plpgsql_parse_wordrowtype(dtname);
-				if (result)
-					return result;
 			}
 		}
 	}
@@ -2865,19 +2857,58 @@ read_datatype(int tok)
 							   K_TYPE, "type"))
 			{
 				result = plpgsql_parse_cwordtype(dtnames);
-				if (result)
-					return result;
 			}
 			else if (tok_is_keyword(tok, &yylval,
 									K_ROWTYPE, "rowtype"))
 			{
 				result = plpgsql_parse_cwordrowtype(dtnames);
-				if (result)
-					return result;
 			}
 		}
 	}
 
+	/* Array declaration can follow, but we check it only for known type */
+	if (result)
+	{
+		bool		be_array = false;
+
+		tok = yylex();
+
+		/*
+		 * SQL syntax allows multiple [] [ iconst ], ARRAY, ARRAY [ ]
+		 * or ARRAY [ iconst ]. Should we support all? It is not too hard.
+		 */
+		if (tok_is_keyword(tok, &yylval,
+						   K_ARRAY, "array"))
+		{
+			be_array = true;
+			tok = yylex();
+		}
+
+		if (tok == '[')
+		{
+			be_array = true;
+
+			while (tok == '[')
+			{
+				tok = yylex();
+				if (tok == ICONST)
+					tok = yylex();
+
+				if (tok != ']')
+					yyerror("syntax error, expected \"]\"");
+
+				tok = yylex();
+			}
+		}
+
+		plpgsql_push_back_token(tok);
+
+		if (be_array)
+			return plpgsql_datatype_arrayof(result);

Re: Why is hot_standby_feedback off by default?

2023-11-20 Thread John Naylor
On Tue, Oct 24, 2023 at 3:42 AM Nathan Bossart  wrote:
>
> On Sun, Oct 22, 2023 at 12:07:59PM -0700, Andres Freund wrote:
> > Medium term, I think we need an approximate xid->"time of assignment"
> > mapping that's continually maintained on the primary. One of the things
> > that'd allow us to do is introduce a GUC to control the maximum effect of
> > hs_feedback on the primary, in a useful unit. Numbers of xids are not a
> > useful unit (100k xids is forever on some systems, a few minutes at best on
> > others, the rate is not necessarily that steady when plpgsql exception
> > handlers are used, ...)
> >
> > It'd be useful to have such a mapping for other features too. E.g.
> >
> >  - making it visible in pg_stat_activity how problematic a longrunning
> > xact is - a 3 day old xact that doesn't have an xid assigned and has a
> > recent xmin is fine, it won't prevent vacuum from doing things. But a
> > somewhat recent xact that still has a snapshot from before an old xact was
> > cancelled could be problematic.
> >
> > - turn pg_class.relfrozenxid into an understandable timeframe. It's a fair
> > bit of mental effort to classify "370M xids old" into problem/fine (it's
> > e.g. not a problem on a system with a high xid rate, on a big table that
> > takes a bit to vacuum).
> >
> > - using the mapping to compute an xid consumption rate IMO would be one
> > building block for smarter AV scheduling. Together with historical vacuum
> > runtimes it'd allow us to start vacuuming early enough to prevent hitting
> > thresholds, adapt pacing, prioritize between tables etc.
>
> Big +1 to all of this.

Sounds like a TODO?




Re: Assert failure on 'list_member_ptr(rel->joininfo, restrictinfo)'

2023-11-20 Thread Ashutosh Bapat
On Sun, Nov 19, 2023 at 6:48 AM Alexander Korotkov  wrote:
>
> On Wed, Nov 15, 2023 at 5:07 PM Alexander Korotkov  
> wrote:
> >
> > On Wed, Nov 15, 2023 at 8:02 AM Andres Freund  wrote:
> > > The kinda because there are callers to bms_(add|del)_members() that pass the
> > > same bms as a and b, which only works if the reallocation happens "late".
> >
> > +1,
> > Neat idea. I'm willing to work on this. Will propose the patch soon.
>
>
> It's here.  New REALLOCATE_BITMAPSETS forces bitmapset reallocation on
> each modification.  I also find it useful to add assert to all
> bitmapset functions on argument NodeTag.  This allows you to find
> access to hanging pointers earlier.

Creating separate patches for REALLOCATE_BITMAPSETS and
Assert(IsA(Bitmapset)) will make them easier to review. We will be able
to check whether all the places that require either of the fixes have
indeed been fixed, and fixed correctly. I kept switching back and forth.

>
> I had the feeling of falling into a rabbit hole while debugging all
> the cases of failure with this new option.  With the second patch
> regression tests pass.

I think this will increase memory consumption when planning queries
with partitioned tables (100s or 1000s of partitions). Have you tried
measuring the impact?

We should take a hit on memory consumption when correctness is
involved, but not all of these cases look like correctness problems.
For example, RelOptInfo::left_relids or SpecialJoinInfo::syn_lefthand
may not get modified after they are set. But just because
RelOptInfo::relids of a lower relation was assigned somewhere and later
got modified, these two get modified as well. The bms_copy() in
make_specialjoininfo may not be necessary. I haven't tried that myself,
so I may be wrong.

What might be useful is to mark a bitmapset as "final" once it's known
that it cannot change. e.g. RelOptInfo->relids once set never changes.
Each operation that modifies a Bitmapset throws an error/Asserts if
it's marked as "final", thus catching the places where a Bitmapset is
being modified when not intended. This will catch shared bitmapsets as
well. We could then apply bms_copy() in only those cases, as in the
sketch below.
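A minimal sketch of what I mean (the field and helper names here are
hypothetical, not existing bitmapset API):

/* Hypothetical sketch only: a "final" marker on Bitmapset. */
typedef struct Bitmapset
{
	NodeTag		type;
	bool		is_final;		/* assumed new field: true once immutable */
	int			nwords;
	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];
} Bitmapset;

/* Mark a set as immutable from this point on. */
static inline Bitmapset *
bms_mark_final(Bitmapset *a)
{
	if (a != NULL)
		a->is_final = true;
	return a;
}

/* To be called at the top of every modifying bms_* function. */
static inline void
bms_assert_mutable(const Bitmapset *a)
{
	Assert(a == NULL || !a->is_final);
}

Then e.g. RelOptInfo::relids could be passed through bms_mark_final()
once set, bms_add_member() and friends would start with
bms_assert_mutable(), and the assert would fire exactly at the places
where a shared or fixed set gets modified unintentionally.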

-- 
Best Wishes,
Ashutosh Bapat




Re: Synchronizing slots from primary to standby

2023-11-20 Thread Drouvot, Bertrand

Hi,

On 11/18/23 11:45 AM, Amit Kapila wrote:
> On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand
>  wrote:
>>
>> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:
>>> On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand
>>>  wrote:
>>>
>>> I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown
>>> slotsync worker and drop slots. There could be other reasons (other than
>>> promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code
>>> there. I thought if the intention is to stop slotsync workers on promotion,
>>> maybe FinishWalRecovery() is a better place to do it as it's indicating the end
>>> of recovery and XLogShutdownWalRcv is also called in it.
>>
>> I can see that slotsync_drop_initiated_slots() has been moved in FinishWalRecovery()
>> in v35. That looks ok.
>
> I was thinking what if we just ignore creating such slots (which
> require init state) in the first place? I think that can be
> time-consuming in some cases but it will reduce the complexity and we
> can always improve such cases later if we really encounter them in the
> real world. I am not very sure that added complexity is worth
> addressing this particular case, so I would like to know your and
> others' opinions.

I'm not sure I understand your point. Are you saying that we should not create
slots on the standby that are "currently" reported in an 'i' state? (so just
keep the 'r' and 'n' states?)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Perhaps a possible new feature to a future PostgreSQL release

2023-11-20 Thread Erki Eessaar
Hello

Let me assume that there is a table T with columns a, b, c, d, e, f, g, and h.

If one wants to select data from all the columns except d and e, then one has 
to write

SELECT a, b, c, f, g, h
FROM T;

instead of writing

SELECT ALL BUT (d, e)
FROM T;

or something similar (perhaps by using keywords EXCEPT or EXCLUDE).

The more columns a table has, the more column names one has to write.

There are systems that support this kind of shorthand syntax in SQL:

BigQuery: 
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select-modifiers

Databricks: 
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select.html#syntax

DuckDB: https://duckdb.org/docs/sql/query_syntax/select

Snowflake: https://stephenallwright.com/select-columns-except-snowflake/

I think that such syntax would be useful and if more and more DBMSs start to 
offer it, then perhaps one day it will be in the SQL standard as well.

What do you think, is it something that could be added to PostgreSQL?

People are interested in this feature. The following links are just some 
examples:
http://www.postgresonline.com/journal/archives/41-How-to-SELECT-ALL-EXCEPT-some-columns-in-a-table.html

https://stackoverflow.com/questions/729197/exclude-a-column-using-select-except-columna-from-tablea

https://dba.stackexchange.com/questions/1957/sql-select-all-columns-except-some

https://www.reddit.com/r/SQL/comments/15x97kw/sql_is_there_a_way_to_just_exclude_1_column_in/


Best regards
Erki Eessaar




Re: Perhaps a possible new feature to a future PostgreSQL release

2023-11-20 Thread Laurenz Albe
On Mon, 2023-11-20 at 09:52 +, Erki Eessaar wrote:
> Let me assume that there is a table T with columns a, b, c, d, e, f, g, and h.
> 
> If one wants to select data from all the columns except d and e, then one has 
> to write
> 
> SELECT a, b, c, f, g, h
> FROM T;
> 
> instead of writing 
> 
> SELECT ALL BUT (d, e)
> FROM T;
> 
> or something similar (perhaps by using keywords EXCEPT or EXCLUDE).

This has been discussed before (repeatedly); see for example
https://www.postgresql.org/message-id/flat/CANcm6wbR3EG7t-G%3DTxy64Yt8nR6YbpzFRuTewJQ%2BkCq%3DrZ8M2A%40mail.gmail.com

All previous attempts went nowhere.


> I think that such syntax would be useful and if more and more DBMSs start to
> offer it, then perhaps one day it will be in the SQL standard as well.

One of the reasons *against* the feature is that the SQL standard committee
might one day come up with a feature like that using a syntax that conflicts
with whatever we introduced on our own.

Yours,
Laurenz Albe




Re: should check collations when creating partitioned index

2023-11-20 Thread Peter Eisentraut

On 14.11.23 17:15, Tom Lane wrote:
> I don't love the patch details though.  It seems entirely wrong to check
> this before we check the opclass match.

Not sure why?  The order doesn't seem to matter?

> Also, in at least some cases
> the code presses on looking for another match if the current opclass
> doesn't match; you've broken such cases.

I see.  That means we shouldn't raise an error on a mismatch but just do

    if (key->partcollation[i] != collationIds[j])
        continue;

and then let the existing error under if (!found) complain.

I suppose we could move that into the

    if (get_opclass_opfamily_and_input_type(...))

block.  I'm not sure I see the difference.





Re: Use of backup_label not noted in log

2023-11-20 Thread Laurenz Albe
I can accept that adding log messages to back branches is ok.
Perhaps I am too nervous about things like that, because as an extension
developer I have been bitten too often by ABI breaks in minor releases
in the past.

On Mon, 2023-11-20 at 17:30 +0900, Michael Paquier wrote:
> +   if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
> +   ereport(LOG,
> +   (errmsg("continuing to start from base backup with redo LSN %X/%X",
> +   LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
> 
> "Continuing to start" sounds a bit weird to me, though, considering
> that there are a few LOGs that say "starting" when there is a signal
> file, but I don't have a better idea on top of my mind.  So that
> sounds OK here.

We can only reach that message in recovery or standby mode, right?
So why not write "continuing to recover from base backup"?


If we add a message for starting with "backup_label", shouldn't
we also add a corresponding message for starting from a checkpoint
found in the control file?  If you see that in a problem report,
you immediately know what is going on.
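Just as a sketch of what I have in mind, using the existing pg_control
field names (untested, and the exact placement in the startup code is
an open question):

if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
    ereport(LOG,
            (errmsg("continuing to recover from base backup with redo LSN %X/%X",
                    LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
else
    ereport(LOG,
            (errmsg("starting recovery from checkpoint at %X/%X",
                    LSN_FORMAT_ARGS(ControlFile->checkPoint))));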

Yours,
Laurenz Albe




Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans

2023-11-20 Thread Tomas Vondra



On 11/19/23 18:19, Jinjing Zhou wrote:
> Hi hackers, 
> 
> I hope this message finds you well. I am reaching out to seek guidance
> on a specific aspect of PostgreSQL's index scanning functionality.
> 
> I am currently working on a vector search extension for postgres, where
> I need to generate bitmaps based on filter conditions during an index
> scan. The goal is to optimize the query performance by efficiently
> identifying the rows that meet the given criteria.
> 
> The query plan looks like this
> 
> Index Scan using products_feature_idx on products  (cost=0.00..27.24
> rows=495 width=12)
>          Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector)
>          Filter: ((price > '0.2'::double precision) AND (price <=
> '0.7'::double precision))
> 
> 
> We have a custom index for the order by clause on the feature column.
> Now we want to utilize the index on other columns like price column. We
> want to access the bitmap of price column's filter condition in the
> feature column index. Is there any way I can achieve this goal?
> 
> Any help or guidance is appreciated!
> 

I suppose you'll need to give more details about what exactly are you
trying to achieve, what you tried, maybe some code examples, etc. Your
question is quite vague, and it's unclear what "bitmaps generated on
filter conditions" or "custom index" means.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Synchronizing slots from primary to standby

2023-11-20 Thread Amit Kapila
On Sat, Nov 18, 2023 at 4:15 PM Amit Kapila  wrote:
>
> On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand
>  wrote:
>
> More Review for v35-0002*
> 
>

More review of v35-0002*

1.
+/*
+ * Helper function to check if local_slot is present in remote_slots list.
+ *
+ * It also checks if logical slot is locally invalidated i.e. invalidated on
+ * the standby but valid on the primary server. If found so, it sets
+ * locally_invalidated to true.
+ */
+static bool
+slot_exists_in_list(ReplicationSlot *local_slot, List *remote_slots,
+ bool *locally_invalidated)

The name of the function is a bit misleading because it checks the
validity of the slot, not only whether it exists in the remote_slots
list. Would it be better to name it ValidateSyncSlot() or something
along those lines?

2.
+static long
+synchronize_slots(WalReceiverConn *wrconn)
{
...
+ /* Construct query to get slots info from the primary server */
+ initStringInfo(&s);
+ construct_slot_query(&s);
...
+ if (remote_slot->conflicting)
+ remote_slot->invalidated = get_remote_invalidation_cause(wrconn,
+ remote_slot->name);
...

+static ReplicationSlotInvalidationCause
+get_remote_invalidation_cause(WalReceiverConn *wrconn, char *slot_name)
{
...
+ appendStringInfo(&cmd,
+ "SELECT pg_get_slot_invalidation_cause(%s)",
+ quote_literal_cstr(slot_name));
+ res = walrcv_exec(wrconn, cmd.data, 1, slotRow);

Do we really need to query a second time to get the invalidation
cause? Can we adjust the slot query to get it in one round trip? I
think this may not optimize much because the patch uses the second
round trip only for invalidated slots, but it still looks odd. So
unless the query becomes too complicated, we should try to achieve it
in one round trip.
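For example, construct_slot_query() could presumably fetch the cause in
the same round trip; an untested sketch, reusing the
pg_get_slot_invalidation_cause() function the patch already relies on:

+ appendStringInfo(s,
+ "SELECT slot_name, plugin, confirmed_flush_lsn,"
+ " restart_lsn, catalog_xmin, two_phase, conflicting, database,"
+ " CASE WHEN conflicting THEN"
+ "   pg_get_slot_invalidation_cause(slot_name)"
+ " END AS invalidation_cause"
+ " FROM pg_catalog.pg_replication_slots"
+ " WHERE failover and sync_state != 'i'");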

3.
+static long
+synchronize_slots(WalReceiverConn *wrconn)
+{
...
...
+ /* The syscache access needs a transaction env. */
+ StartTransactionCommand();
+
+ /* Make things live outside TX context */
+ MemoryContextSwitchTo(oldctx);
+
+ /* Construct query to get slots info from the primary server */
+ initStringInfo(&s);
+ construct_slot_query(&s);
+
+ elog(DEBUG2, "slot-sync worker's query:%s \n", s.data);
+
+ /* Execute the query */
+ res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow);

It is okay to perform the above query execution outside the
transaction context, but I would like to know the reason for it. Do we
want to retain anything beyond the transaction context, or is there
some other reason to do this outside the transaction context?

4.
+static void
+construct_slot_query(StringInfo s)
+{
+ /*
+ * Fetch data for logical failover slots with sync_state either as
+ * SYNCSLOT_STATE_NONE or SYNCSLOT_STATE_READY.
+ */
+ appendStringInfo(s,
+ "SELECT slot_name, plugin, confirmed_flush_lsn,"
+ " restart_lsn, catalog_xmin, two_phase, conflicting, "
+ " database FROM pg_catalog.pg_replication_slots"
+ " WHERE failover and sync_state != 'i'");
+}

Why would the sync_state on the primary server be any valid value? I
thought it was set only on physical standby. I think it is better to
mention the reason for using the sync state and or failover flag in
the above comments. The current comment doesn't seem of much use as it
just states what is evident from the query.

5.
+ /*
+ * This check should never pass as on the primary server, we have waited
+ * for the standby's confirmation before updating the logical slot. But to
+ * take care of any bug in that flow, we should retain this check.
+ */
+ if (remote_slot->confirmed_lsn > WalRcv->latestWalEnd)
+ {
+ elog(LOG, "skipping sync of slot \"%s\" as the received slot-sync "
+ "LSN %X/%X is ahead of the standby position %X/%X",
+ remote_slot->name,
+ LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
+ LSN_FORMAT_ARGS(WalRcv->latestWalEnd));
+

This should be elog(ERROR, ..). Normally, we use elog(ERROR, ...) for
such unexpected cases. And, you don't need to explicitly mention the
last sentence in the comment: "But to take care of any bug in that
flow, we should retain this check.".

6.
+synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
+ bool *slot_updated)
{
...
+ if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn)
+ {
+ ereport(WARNING,
+ errmsg("not synchronizing slot %s; synchronization would"
+" move it backwards", remote_slot->name));

I think here elevel should be LOG because the user can't do much about
this. Do we use ';' at other places in the message? But when can we
hit this case? We can add some comments to state in which scenario
this is possible. OTOH, if this is a sort of can't-happen case and we
have kept it to avoid any sort of inconsistency, then we can probably
use elog(ERROR, .. with appropriate LSN locations, so that later the
problem could be debugged.

7.
+synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
+ bool *slot_updated)
{
...
+
+ StartTransactionCommand();
+
+ /* Make things live outside TX context */
+ MemoryContextSwitchTo(oldctx);
+
...

Simila

Re: Synchronizing slots from primary to standby

2023-11-20 Thread Amit Kapila
On Mon, Nov 20, 2023 at 3:17 PM Drouvot, Bertrand
 wrote:
>
> On 11/18/23 11:45 AM, Amit Kapila wrote:
> > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand
> >  wrote:
> >>
> >> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:
> >>> On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand 
> >>>  wrote:
> >>>
> >>> I feel the WaitForWALToBecomeAvailable may not be the best place to 
> >>> shutdown
> >>> slotsync worker and drop slots. There could be other reasons(other than
> >>> promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the 
> >>> code
> >>> there. I thought if the intention is to stop slotsync workers on 
> >>> promotion,
> >>> maybe FinishWalRecovery() is a better place to do it as it's indicating 
> >>> the end
> >>> of recovery and XLogShutdownWalRcv is also called in it.
> >>
> >> I can see that slotsync_drop_initiated_slots() has been moved in 
> >> FinishWalRecovery()
> >> in v35. That looks ok.
> >>>
> >
> > I was thinking what if we just ignore creating such slots (which
> > require init state) in the first place? I think that can be
> > time-consuming in some cases but it will reduce the complexity and we
> > can always improve such cases later if we really encounter them in the
> > real world. I am not very sure that added complexity is worth
> > addressing this particular case, so I would like to know your and
> > others' opinions.
> >
>
> I'm not sure I understand your point. Are you saying that we should not create
> slots on the standby that are "currently" reported in an 'i' state? (so just
> keep the 'r' and 'n' states?)
>

Yes.

-- 
With Regards,
Amit Kapila.




Re: Stop the search once replication origin is found

2023-11-20 Thread Amit Kapila
On Mon, Nov 20, 2023 at 2:36 PM Antonin Houska  wrote:
>
> Although it's not performance-critical, I think it just makes sense to break
> the loop in replorigin_session_setup() as soon as we've found the origin.
>

Your proposal sounds reasonable to me.

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] make async slave to wait for lsn to be replayed

2023-11-20 Thread Картышов Иван

Alexander, thank you for your review and for pointing out these issues.
According to them I made some fixes and rebased the whole patch.

But I can't repeat your ERROR, neither with hot_standby_feedback = on
nor with hot_standby_feedback = off:

master: create table test as (select i from generate_series(1,1) i);
slave conn1: select pg_wal_replay_pause();
master: delete from test;
master: vacuum test;
master: select pg_current_wal_lsn();
slave conn2: select pg_wait_lsn('the value from previous query'::pg_lsn, 0);
slave conn1: select pg_wal_replay_resume();
slave conn2: ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

Also I use a little hack to work outside the snapshot, similar to
SnapshotResetXmin.

Patch rebased and ready for review.


 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6156a..edbbe208cb 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1780,6 +1781,15 @@ PerformWalRecovery(void)
 break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00..7db208c726
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements wait lsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2023, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include 
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+state->lsn[i] < state->min_lsn.value)
+	

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-20 Thread Dilip Kumar
On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin  wrote:

> > On 20 Nov 2023, at 13:51, Dilip Kumar  wrote:
> >
> > 2) Do we really need one separate lwlock tranche for each SLRU?
> >
> > IMHO if we use the same lwlock tranche then the wait event will show
> > the same wait event name, right? And that would be confusing for the
> > user, whether we are waiting for Subtransaction or Multixact or
> > anything else.  Is my understanding not correct here?
>
> If we give the user multiple GUCs to tweak, I think we should also give them
> a way to understand which GUC to tweak when they observe wait times.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Adding facility for injection points (or probe points?) for more advanced tests

2023-11-20 Thread Ashutosh Bapat
On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier  wrote:
>
> Hi all,
>
> I don't remember how many times in the last few years when I've had to
> hack the backend to produce a test case that involves a weird race
> condition across multiple processes running in the backend, to be able
> to prove a point or just test a fix (one recent case: 2b8e5273e949).
> Usually, I come to hardcoding stuff for the following situations:
> - Trigger a PANIC, to force recovery.
> - A FATAL, to take down a session, or just an ERROR.
> - palloc() failure injection.
> - Sleep to slow down a code path.
> - Pause and release with condition variable.
>
> And, while that's helpful to prove a point on a thread, nothing comes
> out of it in terms of regression test coverage in the tree because
> these tests are usually too slow and expensive, as they usually rely
> on hardcoded timeouts.  So that's pretty much attempting to emulate
> what one would do with a debugger in a predictable way, without the
> manual steps because human hands don't scale well.
>
> The reason behind that is of course more advanced testing, to be able
> to expand coverage when we have weird and non-deterministic race
> issues to deal with, and the code becoming more complex every year
> makes that even harder.  Fault and failure injection in specific paths
> comes into mind, additionally, particularly if you manage complex
> projects based on Postgres.
>
> So, please find attached a patch set that introduces an in-core
> facility to be able to set what I'm calling here an "injection point",
> that consists of being able to register in shared memory a callback
> that can be run within a defined location of the code.  It means that
> it is not possible to trigger a callback before shared memory is set,
> but I've faced far more the case where I wanted to trigger something
> after shmem is set anyway.  Persisting an injection point across
> restarts is also possible by adding some through an extension's shmem
> hook, as we do for custom LWLocks for example, as long as a library is
> loaded.
>
> This will remind a bit of what Alexander Korotkov has proposed here:
> https://www.postgresql.org/message-id/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com
> Also, this is much closer to what Craig Ringer is mentioning here,
> where it is named probe points, but I am using a minimal design that
> allows to achieve the same:
> https://www.postgresql.org/message-id/CAPpHfdsn-hzneYNbX4qcY5rnwr-BA1ogOCZ4TQCKQAw9qa48kA%40mail.gmail.com
>
> A difference is that I don't really see a point in passing to the
> callback triggered an area of data coming from the hash table itself,
> as at the end a callback could just refer to an area in shared memory
> or a static set of variables depending on what it wants, with one or
> more injection points (say a location to set a state, and a second to
> check it).  So, at the end, the problem comes down in my opinion to
> two things:
> - Possibility to trigger a condition defined by some custom code, in
> the backend (core code or even out-of-core).
> - Possibility to define a location in the code where a named point
> would be checked.
>
> 0001 introduces three APIs to create, run, and drop injection points:
> +extern void InjectionPointCreate(const char *name,
> +InjectionPointCallback callback);
> +extern void InjectionPointRun(const char *name);
> +extern void InjectionPointDrop(const char *name);
>
> Then one just needs to add a macro like that to trigger the callback
> registered in the code to test:
> INJECTION_POINT_RUN("String");
> So the footprint in the core tree is not zero, but it is as minimal as
> it can be.
>
> I have added some documentation to explain that, as well.  I am not
> wedded to the name proposed in the patch, so if you feel there is
> better, feel free to propose ideas.

Actually with Attach and Detach terminology, INJECTION_POINT becomes
the place where we "declare" the injection point. So the documentation
needs to first explain INJECTION_POINT and then explain the other
operations.
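To illustrate, this is roughly the flow I'd expect the docs to walk
through (a hypothetical example: the point name is made up, and I'm
assuming the callback receives the point name, which 0001 doesn't
spell out):

/* 1. In core code, declare where the point fires: */
INJECTION_POINT_RUN("xact-before-commit");

/* 2. In a test module, define and register a callback for it: */
static void
injection_error(const char *name)
{
	elog(ERROR, "error triggered for injection point \"%s\"", name);
}

void
test_setup(void)
{
	InjectionPointCreate("xact-before-commit", injection_error);
}

/* 3. And drop it once the test is done: */
InjectionPointDrop("xact-before-commit");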

>
> This facility is hidden behind a specific configure/Meson switch,
> making it a no-op by default:
> --enable-injection-points
> -Dinjection_points={ true | false }

That's useful, but we will also see demand to enable those in
production (under controlled circumstances). But let the functionality
mature under a separate flag and developer builds before it is used in
production.

>
> 0002 is a test module to test these routines, that I have kept a
> maximum simple to ease review of the basics proposed here.  This could
> be extended further to propose more default modes with TAP tests on
> its own, as I don't see a real point in having the SQL bits or some
> common callbacks (like for the PANIC or the FATAL cases) in core.
>
> Thoughts and comments are welcome.

I think this is super useful functionality. Some comments here.

+/*
+ * Local cache of injection callbacks already loaded, stor

Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans

2023-11-20 Thread Matthias van de Meent
On Mon, 20 Nov 2023 at 09:30, Jinjing Zhou  wrote:
>
> Hi hackers,
>
> I hope this message finds you well. I am reaching out to seek guidance on a 
> specific aspect of PostgreSQL's index scanning functionality.
>
> I am currently working on a vector search extension for postgres, where I 
> need to generate bitmaps based on filter conditions during an index scan. The 
> goal is to optimize the query performance by efficiently identifying the rows 
> that meet the given criteria.
>
> The query plan looks like this
>
> Index Scan using products_feature_idx on products  (cost=0.00..27.24 rows=495 
> width=12)
>  Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector)
>  Filter: ((price > '0.2'::double precision) AND (price <= 
> '0.7'::double precision))
>
>
> We have a custom index for the order by clause on the feature column. Now we 
> want to utilize the index on other columns like price column. We want to 
> access the bitmap of price column's filter condition in the feature column 
> index. Is there any way I can achieve this goal?

If you mean "I'd like to use bitmaps generated by combining filter
results from index A, B, and C for (pre-)filtering the ordered index
lookups in index D",
then there is no current infrastructure to do this. Bitmap scans
currently generate a data structure that is not indexable, and can
thus not be used efficiently to push an index's generated bitmap into
another bitmap's scans.

There are efforts to improve the data structures we use for storing
TIDs during vacuum [0] which could extend to the TID bitmap structure,
but even then we'd need some significant effort to rewire Postgres'
internals to push down the bitmap filters; and that is even under the
assumption that pushing the bitmap down into the index AM is more
efficient than doing the merges above the index AM and then re-sorting
the data.

So, in short, it's not currently available in community PostgreSQL.
You could probably create a planner hook + custom executor node that
does this, but it won't be able to use much of the features available
inside PostgreSQL.
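For completeness, the hook registration part would look roughly like
the sketch below; everything interesting (building the bitmap from the
price quals and wiring it into a custom scan node) would live behind it
and is omitted. The names here are made up for illustration.

#include "postgres.h"

#include "fmgr.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

static planner_hook_type prev_planner_hook = NULL;

/* Sketch: let the regular planner run, then rewrite the plan tree,
 * e.g. replacing the index scan with a custom node that pushes a
 * filter bitmap into the vector index.  The rewrite itself is the
 * hard (and omitted) part. */
static PlannedStmt *
vector_filter_planner(Query *parse, const char *query_string,
					  int cursorOptions, ParamListInfo boundParams)
{
	PlannedStmt *stmt;

	if (prev_planner_hook)
		stmt = prev_planner_hook(parse, query_string,
								 cursorOptions, boundParams);
	else
		stmt = standard_planner(parse, query_string,
								cursorOptions, boundParams);

	/* ... inspect/replace nodes in stmt->planTree here ... */

	return stmt;
}

void
_PG_init(void)
{
	prev_planner_hook = planner_hook;
	planner_hook = vector_filter_planner;
}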

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/flat/CANWCAZbrZ58-w1W_3pg-0tOfbx8K41_n_03_0ndGV78hJWswBA%40mail.gmail.com




Flushing large data immediately in pqcomm

2023-11-20 Thread Melih Mutlu
Hi hackers

I've been looking into ways to reduce the overhead we're having in pqcomm
and I'd like to propose a small patch to modify how socket_putmessage works.

Currently socket_putmessage copies any input data into the pqcomm send
buffer (PqSendBuffer) and the size of this buffer is 8K. When the send
buffer gets full, it's flushed and we continue to copy more data into the
send buffer until we have no data left to be sent.
Since the send buffer is flushed whenever it's full, I think we are safe to
say that if the size of the input data is larger than the buffer size, which
is 8K, then the buffer will be flushed at least once (or more, depending on
the input size) to store all the input data.

Proposed change modifies socket_putmessage to send any data larger than
8K immediately without copying it into the send buffer. Assuming that the
send buffer would be flushed anyway due to reaching its limit, the patch
just gets rid of the copy part which seems unnecessary and sends data
without waiting.
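To make the idea concrete, here is a simplified sketch of the intended
internal_putbytes() behavior. This is not the exact patch;
internal_flush_buffer() is the assumed helper that writes out a
caller-supplied buffer, as suggested by the perf output further below.

/* Sketch: large payloads bypass the copy into PqSendBuffer. */
static int
internal_putbytes(const char *s, size_t len)
{
	while (len > 0)
	{
		size_t		amount;

		if (len >= (size_t) PqSendBufferSize)
		{
			/* Would fill the buffer anyway: flush what's pending and
			 * send the caller's data without the intermediate copy. */
			if (internal_flush())
				return EOF;
			return internal_flush_buffer(s, len);
		}

		/* Small writes keep the existing buffered path. */
		if (PqSendPointer >= PqSendBufferSize)
			if (internal_flush())
				return EOF;
		amount = PqSendBufferSize - PqSendPointer;
		if (amount > len)
			amount = len;
		memcpy(PqSendBuffer + PqSendPointer, s, amount);
		PqSendPointer += amount;
		s += amount;
		len -= amount;
	}
	return 0;
}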

This change affects places where pq_putmessage is used such as
pg_basebackup, COPY TO, walsender etc.

I did some experiments to see how the patch performs.
Firstly, I loaded ~5GB of data into a table [1], then ran "COPY test TO
STDOUT". Here are perf results of both the patch and HEAD:

HEAD:
-   94,13%  0,22%  postgres  postgres  [.] DoCopyTo
   - 93,90% DoCopyTo
      - 91,80% CopyOneRowTo
         + 47,35% CopyAttributeOutText
         - 26,49% CopySendEndOfRow
            - 25,97% socket_putmessage
               - internal_putbytes
                  - 24,38% internal_flush
                     + secure_write
                  + 1,47% memcpy (inlined)
         + 14,69% FunctionCall1Coll
         + 1,94% appendBinaryStringInfo
         + 0,75% MemoryContextResetOnly
      + 1,54% table_scan_getnextslot (inlined)

Patch:
-   94,40%  0,30%  postgres  postgres  [.] DoCopyTo
   - 94,11% DoCopyTo
      - 92,41% CopyOneRowTo
         + 51,20% CopyAttributeOutText
         - 20,87% CopySendEndOfRow
            - 20,45% socket_putmessage
               - internal_putbytes
                  - 18,50% internal_flush (inlined)
                       internal_flush_buffer
                     + secure_write
                  + 1,61% memcpy (inlined)
         + 17,36% FunctionCall1Coll
         + 1,33% appendBinaryStringInfo
         + 0,93% MemoryContextResetOnly
      + 1,36% table_scan_getnextslot (inlined)

The patch brings a ~5% gain in socket_putmessage.

Also timed pg_basebackup like:
time pg_basebackup -p 5432 -U replica_user -X none -c fast --no-manifest
-D test

HEAD:
real0m10,040s
user0m0,768s
sys 0m7,758s

Patch:
real0m8,882s
user0m0,699s
sys 0m6,980s

It seems ~11% faster in this specific case.

I'd appreciate any feedback/thoughts.


[1]
CREATE TABLE test(id int, name text, time TIMESTAMP);
INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100) AS
name, NOW() AS time FROM generate_series(1, 1) AS i;


Thanks,
-- 
Melih Mutlu
Microsoft


0001-Flush-large-data-immediately-in-pqcomm.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2023-11-20 Thread Drouvot, Bertrand




On 11/20/23 11:59 AM, Amit Kapila wrote:

On Mon, Nov 20, 2023 at 3:17 PM Drouvot, Bertrand
 wrote:


On 11/18/23 11:45 AM, Amit Kapila wrote:

On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand
 wrote:


On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:

On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand 
 wrote:

I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown
slotsync worker and drop slots. There could be other reasons(other than
promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code
there. I thought if the intention is to stop slotsync workers on promotion,
maybe FinishWalRecovery() is a better place to do it as it's indicating the end
of recovery and XLogShutdownWalRcv is also called in it.


I can see that slotsync_drop_initiated_slots() has been moved in 
FinishWalRecovery()
in v35. That looks ok.




I was thinking what if we just ignore creating such slots (which
require init state) in the first place? I think that can be
time-consuming in some cases but it will reduce the complexity and we
can always improve such cases later if we really encounter them in the
real world. I am not very sure that added complexity is worth
addressing this particular case, so I would like to know your and
others' opinions.



I'm not sure I understand your point. Are you saying that we should not create
slots on the standby that are "currently" reported in a 'i' state? (so just keep
the 'r' and 'n' states?)



Yes.



As far as the 'i' state goes, from what I see, it is currently useful for:

1. Cascading standby to not sync slots with state = 'i' from
the first standby.
2. Easily report Slots that did not catch up on the primary yet.
3. Avoid inactive slots to block "active" ones creation.

So not creating those slots should not be an issue for 1. (syncs are
not needed on a cascading standby, as the slots have not been created on the
first standby yet) but is an issue for 2. (unless we provide another way to
keep track of and report such slots) and 3. (as I think we would still need
to reserve WAL).

I've a question: we'd still need to reserve WAL for those slots, no?

If that's the case and we don't call ReplicationSlotCreate(), then
ReplicationSlotReserveWal() would not work, as MyReplicationSlot would be
NULL.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread David Steele

On 11/19/23 21:15, Michael Paquier wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way
when you sent v5.  Now added back in CC with the two latest patches
you've proposed attached.)


Ugh, I must have hit reply instead of reply all. It's a rookie error and 
you hate to see it.



Here is a short summary of what has been missed by the lists:
- I've commented that the patch should not create control files with a
size of 512B, return them in the fields of the SQL functions, or stream
them; it should just stick to 8kB.  If this is worth changing, it should
be applied consistently across the board, including initdb, and discussed
on its own thread.
- The backup-related fields in the control file are reset at the end
of recovery.  I've suggested to not do that to keep a trace of what
was happening during recovery.  The latest version of the patch resets
the fields.
- With the backup_label file gone, we lose some information in the
backups themselves, which is not good.  Instead, you have suggested an
approach where this data is added to the backup manifest, meaning that
no information would be lost, particularly useful for self-contained
backups.  The fields planned to be added to the backup manifest are:
-- The start and end time of the backup, the end timestamp being
useful to know when stop time can be used for PITR.
-- The backup label.
I've agreed that it may be the best thing to do at this end to not
lose any data related to the removal of the backup_label file.


This looks right to me.


On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote:

On 11/15/23 20:03, Michael Paquier wrote:

As the label is only an informational field, the parsing added to
pg_verifybackup is not really needed because it is used nowhere in the
validation process, so keeping the logic simpler would be the way to
go IMO.  This is contrary to the WAL range for example, where start
and end LSNs are used for validation with a pg_waldump command.
Robert, any comments about the addition of the label in the manifest?


I'm sure Robert will comment on this when he gets the time, but for now I
have backed off on passing the new info to pg_verifybackup and added
start/stop time.


FWIW, I'm OK with the bits for the backup manifest as presented.  So
if there are no remarks and/or no objections, I'd like to apply it, but
let's give some room to others to comment on it, as there's been a gap
in the emails exchanged on pgsql-hackers.  I hope that the summary
I've posted above covers everything.  So let's see about doing
something around the middle of next week.  With Thanksgiving in the
US, a lot of folks will not have the time to monitor what's happening
on this thread.


Timing sounds good to me.



+  The end time for the backup. This is when the backup was stopped in
+  PostgreSQL and represents the earliest time
+  that can be used for time-based Point-In-Time Recovery.

This one is actually a very good point.  We'd lost this capacity with
the backup_label file gone without the end timestamps in the control
file.


Yeah, the end time is very important for recovery scenarios. We 
definitely need that recorded somewhere.



I've noticed on the other thread the remark about being less
aggressive with the fields related to recovery in the control file, so
I assume that this patch should leave the fields be after the end of
recovery from the start and only rely on backupRecoveryRequired to
decide if the recovery should use the fields or not:
https://www.postgresql.org/message-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@pgmasters.net

+   ControlFile->backupCheckPoint = InvalidXLogRecPtr;
ControlFile->backupStartPoint = InvalidXLogRecPtr;
+   ControlFile->backupStartPointTLI = 0;
ControlFile->backupEndPoint = InvalidXLogRecPtr;
+   ControlFile->backupFromStandby = false;
ControlFile->backupEndRequired = false;

Still, I get the temptation of being consistent with the current style
on HEAD to reset everything, as well..


I'd rather reset everything for now (as we do now) and think about 
keeping these values as a separate patch. It may be that we don't want 
to keep all of them, or we need a separate flag to say recovery was 
completed. We are accumulating a lot of booleans here, maybe we need a 
state var (recoveryRequired, recoveryInProgress, recoveryComplete) and 
then define which other vars are valid in each state.
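
Purely to illustrate, the state variable could look something like this
(hypothetical names, not a concrete proposal for the field layout):

```c
/* Hypothetical replacement for the accumulating booleans. */
typedef enum BackupRecoveryState
{
    BACKUP_RECOVERY_NOT_REQUIRED,   /* no backup recovery pending */
    BACKUP_RECOVERY_REQUIRED,       /* backup fields must drive recovery */
    BACKUP_RECOVERY_IN_PROGRESS,    /* redo running, end point not reached */
    BACKUP_RECOVERY_COMPLETE        /* fields retained for forensics only */
} BackupRecoveryState;
```

Which of the other fields (backupStartPoint, backupEndPoint, and so on) are
meaningful would then be defined per state.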


Regards,
-David




Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

[Resending since I accidentally replied off-list]

On 11/18/23 17:49, Andres Freund wrote:

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:

What about adding it to the "redo starts at" message, something like

   redo starts at 12/12345678, taken from control file

or

   redo starts at 12/12345678, taken from backup label


I think it'd make sense to log use of backup_label earlier than that - the
locations from backup_label might end up not being available in the archive,
the primary or locally, and we'll error out with "could not locate a valid
checkpoint record".

I'd probably just do it within the if (read_backup_label()) block in
InitWalRecovery(), *before* the ReadCheckpointRecord().


Not enamored with the phrasing of the log messages, but here's a prototype:

When starting up with backup_label present:
LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN 
A/34100080 on timeline ID 1


I'd prefer something like:

LOG:  starting backup recovery with redo...


When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG:  continuing to start from base backup with redo LSN A/34100028
LOG:  entering standby mode
LOG:  redo starts at A/3954B958


And here:

LOG:  restarting backup recovery with redo...


Note that the LSN in the "continuing" case is the one the backup started at,
not where recovery will start.

I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.


I think the current message is sufficient, but what do you have in mind?


We are quite inconsistent about how we spell LSNs. Sometimes with LSN
preceding, sometimes not. Sometimes with (LSN). Etc.


Well, this could be improved in HEAD for sure.


I do like the idea of expanding the "redo starts at" message
though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
ControlFile->backupEndPoint would provide information about when the node
might become consistent.


+1


Playing around with this a bit, I'm wondering if we instead should remove that
message, and emit something more informative earlier on. If there's a problem,
you kinda want the information before we finally get to the loop in
PerformWalRecovery(). If e.g. there's no WAL you'll only get
LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

which is delightfully lacking in details.


I've been thinking about improving this myself. It would probably also 
help a lot to hint that restore_command may be missing or not returning 
results (but also not erroring). But there are a bunch of ways to get to 
this message so we'd need to be careful.
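
Something along these lines, for example (purely illustrative wording and
placement, using the existing ereport()/errhint() machinery; the real
message sites would need auditing first):

```c
/* Hypothetical sketch of a more helpful failure report. */
ereport(PANIC,
        (errmsg("could not locate a valid checkpoint record"),
         errhint("If you are restoring from a backup, make sure "
                 "restore_command is configured and actually returns "
                 "the requested WAL files.")));
```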



There also are some other oddities:

If the primary is down when starting up, and we'd need WAL from the primary
for the first record, the "redo start at" message is delayed until that
happens, because we emit the message not before we read the first record, but
after. That's just plain odd.


Agreed. Moving it up would be better.


And sometimes we'll start referencing the LSN at which we are starting
recovery before the "redo starts at" message. If e.g. we shut down
at a restart point, we'll emit

   LOG:  consistent recovery state reached at ...
before
   LOG:  redo starts at ...


Huh, I haven't seen that one. Definitely confusing.


But that's all clearly just material for HEAD.


Absolutely. I've been thinking about some of this as well, but want to 
see if we can remove the backup label first so we don't have to rework a 
bunch of stuff.


Of course, that shouldn't stop you from proceeding. I'm sure anything 
you are thinking of here could be adapted.


Regards,
-David




Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

On 11/20/23 06:35, Laurenz Albe wrote:


If we add a message for starting with "backup_label", shouldn't
we also add a corresponding message for starting from a checkpoint
found in the control file?  If you see that in a problem report,
you immediately know what is going on.


+1. It is easier to detect the presence of a message than the absence of 
one.


Regards,
-David




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-20 Thread Nazir Bilal Yavuz
Hi,

Thanks for the feedback.

On Mon, 20 Nov 2023 at 10:47, Michael Paquier  wrote:
>
> On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote:
> > There are some differences between pg_stat_wal and pg_stat_io while
> > collecting WAL stats. For example in the XLogWrite() function in the
> > xlog.c file, pg_stat_wal counts wal_writes as write system calls. This
> > is not something we want for pg_stat_io since pg_stat_io counts the
> > number of blocks rather than the system calls, so instead incremented
> > pg_stat_io by npages.
> >
> > Could that cause a problem since pg_stat_wal's behaviour will be
> > changed? Of course, as an alternative we could change pg_stat_io's
> > behaviour but in the end either pg_stat_wal's or pg_stat_io's
> > behaviour will be changed.
>
> Yep, that could be confusing for existing applications that track the
> information of pg_stat_wal.  The number of writes is not something
> that can be correctly shared between both.  The timings for the writes
> and the syncs could be shared at least, right?

Yes, the timings for the writes and the syncs should work. Another
question I have in mind is the pg_stat_reset_shared() function. When
we call it with 'io', it will reset pg_stat_wal's timings, and when we
call it with 'wal', it won't reset them, right?

>
> This slightly relates to pgstat_count_io_op_n() in your latest patch,
> where it feels a bit weird to see an update of
> PendingWalStats.wal_sync sit in the middle of a routine dedicated to
> pg_stat_io..  I am not completely sure what's the right balance here,
> but I would try to implement things so as pg_stat_io paths does not
> need to know about PendingWalStats.

Writes have a block vs. system-call differentiation, but syncs are counted
the same way in both views. Because of that I put PendingWalStats.wal_sync
into pg_stat_io, but I agree that it looks a bit weird.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans

2023-11-20 Thread Jinjing Zhou
Thanks a lot! This is exactly what I was asking about. We've tried the
CustomScan API at https://github.com/tensorchord/pgvecto.rs/pull/126, but hit
an error: "variable not found in subplan target list". We're still
investigating the root cause. Thanks for your guidance!

Best
Jinjing Zhou
> From: "Matthias van de Meent"
> Date:  Mon, Nov 20, 2023, 19:33
> Subject:  Re: Inquiry on Generating Bitmaps from Filter Conditions in Index 
> Scans
> To: "Jinjing Zhou"
> Cc: "pgsql-hackers@lists.postgresql.org"
> On Mon, 20 Nov 2023 at 09:30, Jinjing Zhou  wrote:
> >
> > Hi hackers,
> >
> > I hope this message finds you well. I am reaching out to seek guidance on a 
> > specific aspect of PostgreSQL's index scanning functionality.
> >
> > I am currently working on a vector search extension for postgres, where I 
> > need to generate bitmaps based on filter conditions during an index scan. 
> > The goal is to optimize the query performance by efficiently identifying 
> > the rows that meet the given criteria.
> >
> > The query plan looks like this
> >
> > Index Scan using products_feature_idx on products  (cost=0.00..27.24 
> > rows=495 width=12)
> >  Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector)
> >  Filter: ((price > '0.2'::double precision) AND (price <= 
> > '0.7'::double precision))
> >
> >
> > We have a custom index for the order by clause on the feature column. Now 
> > we want to utilize the index on other columns like price column. We want to 
> > access the bitmap of price column's filter condition in the feature column 
> > index. Is there any way I can achieve this goal?
> 
> If you mean "I'd like to use bitmaps generated by combining filter
> results from index A, B, and C for (pre-)filtering the ordered index
> lookups in index D",
> then there is no current infrastructure to do this. Bitmap scans
> currently generate a data structure that is not indexable, and can
> thus not be used efficiently to push an index's generated bitmap into
> another bitmap's scans.
> 
> There are efforts to improve the data structures we use for storing
> TIDs during vacuum [0] which could extend to the TID bitmap structure,
> but even then we'd need some significant effort to rewire Postgres'
> internals to push down the bitmap filters; and that is even under the
> assumption that pushing the bitmap down into the index AM is more
> efficient than doing the merges above the index AM and then re-sorting
> the data.
> 
> So, in short, it's not currently available in community PostgreSQL.
> You could probably create a planner hook + custom executor node that
> does this, but it won't be able to use much of the features available
> inside PostgreSQL.
> 
> Kind regards,
> 
> Matthias van de Meent
> 
> [0] 
> https://www.postgresql.org/message-id/flat/CANWCAZbrZ58-w1W_3pg-0tOfbx8K41_n_03_0ndGV78hJWswBA%40mail.gmail.com


Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans

2023-11-20 Thread Jinjing Zhou
Thanks. Our project is at https://github.com/tensorchord/pgvecto.rs. It
implements a custom index access method for vector similarity search, whose
`amgettuple` callback supports scan direction and provides candidates for the
ORDER BY clause.

We want to inject the filter condition, as a bitmap, into that amgettuple
loop instead of checking the returned tuples one by one, to accelerate the
whole process.
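
In case it helps to make this concrete, here is a rough sketch of the shape
we have in mind, where next_candidate() and filter_contains() are
placeholders for our own data structures:

```c
#include "postgres.h"
#include "access/relscan.h"
#include "access/sdir.h"
#include "storage/itemptr.h"

/* Placeholder for the extension's scan state. */
typedef struct VectorScanState
{
    void   *filter_bitmap;      /* bitmap built from the filter quals */
    /* ... candidate queue for the similarity search ... */
} VectorScanState;

static ItemPointer next_candidate(VectorScanState *state, ScanDirection dir);
static bool filter_contains(void *bitmap, ItemPointer tid);

/*
 * Sketch of an amgettuple callback that consults a pre-built filter
 * bitmap, so non-matching candidates are skipped inside the AM instead
 * of being rechecked one by one above it.
 */
static bool
vector_gettuple(IndexScanDesc scan, ScanDirection dir)
{
    VectorScanState *state = (VectorScanState *) scan->opaque;

    for (;;)
    {
        ItemPointer tid = next_candidate(state, dir);

        if (tid == NULL)
            return false;       /* no more candidates */
        if (!filter_contains(state->filter_bitmap, tid))
            continue;           /* filtered out, fetch the next one */

        scan->xs_heaptid = *tid;    /* return this candidate */
        return true;
    }
}
```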

Best
Jinjing Zhou
> From: "Tomas Vondra"
> Date:  Mon, Nov 20, 2023, 18:52
> Subject:  Re: Inquiry on Generating Bitmaps from Filter Conditions in Index 
> Scans
> To: "Jinjing Zhou", 
> "pgsql-hackers@lists.postgresql.org"
> On 11/19/23 18:19, Jinjing Zhou wrote:
> > Hi hackers, 
> > 
> > I hope this message finds you well. I am reaching out to seek guidance
> > on a specific aspect of PostgreSQL's index scanning functionality.
> > 
> > I am currently working on a vector search extension for postgres, where
> > I need to generate bitmaps based on filter conditions during an index
> > scan. The goal is to optimize the query performance by efficiently
> > identifying the rows that meet the given criteria.
> > 
> > The query plan looks like this
> > 
> > Index Scan using products_feature_idx on products  (cost=0.00..27.24
> > rows=495 width=12)
> >          Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector)
> >          Filter: ((price > '0.2'::double precision) AND (price <=
> > '0.7'::double precision))
> > 
> > 
> > We have a custom index for the order by clause on the feature column.
> > Now we want to utilize the index on other columns like price column. We
> > want to access the bitmap of price column's filter condition in the
> > feature column index. Is there any way I can achieve this goal?
> > 
> > Any help or guidance is appreciated!
> > 
> 
> I suppose you'll need to give more details about what exactly are you
> trying to achieve, what you tried, maybe some code examples, etc. Your
> question is quite vague, and it's unclear what "bitmaps generated on
> filter conditions" or "custom index" means.
> 
> regards
> 
> -- 
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


[PATCH] fix race condition in libpq (related to ssl connections)

2023-11-20 Thread Willi Mann

Hi,

I've found a race condition in libpq. It is about the initialization of
the my_bio_methods static variable in fe-secure-openssl.c, which is not
protected by any lock. The race condition may make the initialization of
the connection fail, and as an additional weird consequence, it might
cause OpenSSL to call close(0), so the client application's stdin gets
closed.

I've prepared a patch to protect the initialization of my_bio_methods
from the race. This is my first patch submission to the postgresql
project, so I hope I didn't miss anything. Any comments and suggestions
are of course very welcome.

I also prepared a testcase. In the testcase tarball, there is a patch
that adds sleeps at the right positions to make the close(0) problem
occur practically always. It also includes comments to explain how the
race can end up calling close(0).

Concerning the patch, it is only tested on Linux. I'm unsure whether the
simple static initialization of the mutex would also work on Windows
nowadays, or whether the more complicated initialization used for
ssl_config_mutex in the same source file needs to be used instead.
Let me know whether I should adapt that.

We discovered the problem with release 11.5, but the patch and the 
testcase are against the master branch.


Regards,
Willi

--
___

Dr. Willi Mann | Principal Software Engineer, Tech Lead PQL

Celonis SE | Theresienstrasse 6 | 80333 Munich, Germany
F: +4989416159679
w.m...@celonis.com | www.celonis.com | LinkedIn | Twitter | Xing

AG Munich HRB 225439 | Management: Martin Klenk, Bastian Nominacher, 
Alexander Rinke

From 721be8d6ea26d78fc086a2e85413858585ca9300 Mon Sep 17 00:00:00 2001
From: Willi Mann 
Date: Thu, 16 Nov 2023 11:50:27 +0100
Subject: [PATCH] libpq: Fix race condition in initialization of my_bio_methods
 static variable

---
 src/interfaces/libpq/fe-secure-openssl.c | 26 ++--
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index f1192d28f2..e8fa3eb403 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -81,6 +81,7 @@ static int	PQssl_passwd_cb(char *buf, int size, int rwflag, void *userdata);
 static int	my_sock_read(BIO *h, char *buf, int size);
 static int	my_sock_write(BIO *h, const char *buf, int size);
 static BIO_METHOD *my_BIO_s_socket(void);
+static void my_BIO_methods_init(void);
 static int	my_SSL_set_fd(PGconn *conn, int fd);
 
 
@@ -1882,8 +1883,8 @@ my_sock_write(BIO *h, const char *buf, int size)
 	return res;
 }
 
-static BIO_METHOD *
-my_BIO_s_socket(void)
+static void
+my_BIO_methods_init(void)
 {
 	if (!my_bio_methods)
 	{
@@ -1893,11 +1894,11 @@ my_BIO_s_socket(void)
 
 		my_bio_index = BIO_get_new_index();
 		if (my_bio_index == -1)
-			return NULL;
+			return;
 		my_bio_index |= (BIO_TYPE_DESCRIPTOR | BIO_TYPE_SOURCE_SINK);
 		my_bio_methods = BIO_meth_new(my_bio_index, "libpq socket");
 		if (!my_bio_methods)
-			return NULL;
+			return;
 
 		/*
 		 * As of this writing, these functions never fail. But check anyway,
@@ -1914,17 +1915,30 @@ my_BIO_s_socket(void)
 		{
 			BIO_meth_free(my_bio_methods);
 			my_bio_methods = NULL;
-			return NULL;
+			return;
 		}
 #else
 		my_bio_methods = malloc(sizeof(BIO_METHOD));
 		if (!my_bio_methods)
-			return NULL;
+			return;
 		memcpy(my_bio_methods, biom, sizeof(BIO_METHOD));
 		my_bio_methods->bread = my_sock_read;
 		my_bio_methods->bwrite = my_sock_write;
 #endif
 	}
+}
+
+static pthread_mutex_t my_BIO_methods_init_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+static BIO_METHOD *
+my_BIO_s_socket(void)
+{
+	if (pthread_mutex_lock(&my_BIO_methods_init_mutex) != 0)
+	{
+		return NULL;
+	}
+	my_BIO_methods_init();
+	pthread_mutex_unlock(&my_BIO_methods_init_mutex);
 	return my_bio_methods;
 }
 
-- 
2.39.2



testcase.tar.gz
Description: application/gzip


Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Robert Haas
On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier  wrote:
> (I am not exactly sure how, but we've lost pgsql-hackers on the way
> when you sent v5.  Now added back in CC with the two latest patches
> you've proposed attached.)
>
> Here is a short summary of what has been missed by the lists:
> - I've commented that the patch should not create control files with a
> size of 512B, return them in the fields of the SQL functions, or stream
> them; it should just stick to 8kB.  If this is worth changing, it should
> be applied consistently across the board, including initdb, and
> discussed on its own thread.
> - The backup-related fields in the control file are reset at the end
> of recovery.  I've suggested to not do that to keep a trace of what
> was happening during recovery.  The latest version of the patch resets
> the fields.
> - With the backup_label file gone, we lose some information in the
> backups themselves, which is not good.  Instead, you have suggested an
> approach where this data is added to the backup manifest, meaning that
> no information would be lost, particularly useful for self-contained
> backups.  The fields planned to be added to the backup manifest are:
> -- The start and end time of the backup, the end timestamp being
> useful to know when stop time can be used for PITR.
> -- The backup label.
> I've agreed that it may be the best thing to do at this end to not
> lose any data related to the removal of the backup_label file.

I think we need more votes to make a change this big. I have a
concern, which I think I've expressed before, that we keep whacking
around the backup APIs, and that has a cost which is potentially
larger than the benefits. The last time we changed the API, we changed
pg_stop_backup to pg_backup_stop, but this doesn't do that, and I
wonder if that's OK. Even if it is, do we really want to change this
API around again after such a short time?

That said, I don't have an intrinsic problem with moving this
information from the backup_label to the backup_manifest file since it
is purely informational. I do think there should perhaps be some
additions to the test cases, though.

I am concerned about the interaction of this proposal with incremental
backup. When you take an incremental backup, you get something that
looks a lot like a usable data directory but isn't. To prevent that
from causing avoidable disasters, the present version of the patch
adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the
backup_label. pg_combinebackup knows to look for those fields, and the
server knows that if they are present it should refuse to start. With
this change, though, I suppose those fields would end up in
pg_control. But that does not feel entirely great, because we have a
goal of keeping the amount of real data in pg_control below 512 bytes,
the traditional sector size, and this adds another 12 bytes of stuff
to that file that currently doesn't need to be there. I feel like
that's kind of a problem.

But my main point here is ... if we have a few more senior hackers
weigh in and vote in favor of this change, well then that's one thing.
But IMHO a discussion that's mostly between 2 people is not nearly a
strong enough consensus to justify this amount of disruption.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Use of backup_label not noted in log

2023-11-20 Thread Robert Haas
On Mon, Nov 20, 2023 at 5:35 AM Laurenz Albe  wrote:
> I can accept that adding log messages to back branches is ok.
> Perhaps I am too nervous about things like that, because as an extension
> developer I have been bitten too often by ABI breaks in minor releases
> in the past.

I think that adding a log message to the back branches would probably
make my life better not worse, because when people do strange things
and then send me the log messages to figure out what the heck
happened, it would be there, and I'd have a clue. However, the world
doesn't revolve around me. I can imagine users getting spooked by a
new message that they've never seen before, and I think that risk
should be considered. There are good reasons for keeping the
back-branches stable, and as you said before, this isn't a bug fix.

I do also think it is worth considering how this proposal interacts
with the proposal to remove backup_label. If that proposal goes
through, then this proposal is obsolete, I believe. But if this is a
good idea, does that mean that's not a good idea? Or would we try to
make the pg_control which that patch would drop in place have some
internal difference which we could use to drive a similar log message?
Maybe we should, because knowing whether or not the user followed the
backup procedure correctly would indeed be a big help and it would be
regrettable to gain that capability only to lose it again...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: should check collations when creating partitioned index

2023-11-20 Thread Tom Lane
Peter Eisentraut  writes:
> On 14.11.23 17:15, Tom Lane wrote:
>> I don't love the patch details though.  It seems entirely wrong to check
>> this before we check the opclass match.

> Not sure why?  The order doesn't seem to matter?

The case that was bothering me was if we had a non-collated type
versus a collated type.  That would result in throwing an error
about collation mismatch, when complaining about the opclass seems
more apropos.  However, if we do this:

> I see.  That means we shouldn't raise an error on a mismatch but just do
>  if (key->partcollation[i] != collationIds[j])
>  continue;

it might not matter much.

regards, tom lane




Re: On non-Windows, hard depend on uselocale(3)

2023-11-20 Thread Tom Lane
Thomas Munro  writes:
> If we are sure that we'll *never* want locale-aware printf-family
> functions (ie we *always* want "C" locale), then in the thought
> experiment above where I suggested we supply replacement _l()
> functions, we could just skip that for the printf family, but make
> that above comment actually true.  Perhaps with Ryu, but otherwise by
> punting to libc _l() or uselocale() save/restore.

It is pretty annoying that we've got that shiny Ryu code and can't
use it here.  From memory, we did look into that and concluded that
Ryu wasn't amenable to providing "exactly this many digits" as is
required by most variants of printf's conversion specs.  But maybe
somebody should go try harder.  (Worst case, you could do rounding
off by hand on the produced digit string, but that's ugly...)

regards, tom lane




Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

On 11/20/23 12:24, Robert Haas wrote:

On Mon, Nov 20, 2023 at 5:35 AM Laurenz Albe  wrote:

I can accept that adding log messages to back branches is ok.
Perhaps I am too nervous about things like that, because as an extension
developer I have been bitten too often by ABI breaks in minor releases
in the past.


I think that adding a log message to the back branches would probably
make my life better not worse, because when people do strange things
and then send me the log messages to figure out what the heck
happened, it would be there, and I'd have a clue. However, the world
doesn't revolve around me. I can imagine users getting spooked by a
new message that they've never seen before, and I think that risk
should be considered. There are good reasons for keeping the
back-branches stable, and as you said before, this isn't a bug fix.


Personally I think that the value of the information outweighs the 
weirdness of a new message appearing.



I do also think it is worth considering how this proposal interacts
with the proposal to remove backup_label. If that proposal goes
through, then this proposal is obsolete, I believe. 


Not at all. I don't even think the messages will need to be reworded, or 
not much since they don't mention backup_label.



But if this is a
good idea, does that mean that's not a good idea? Or would we try to
make the pg_control which that patch would drop in place have some
internal difference which we could use to drive a similar log message?


The recovery in pg_control patch has all the same recovery info stored, 
so similar (or the same) log message would still be appropriate.



Maybe we should, because knowing whether or not the user followed the
backup procedure correctly would indeed be a big help and it would be
regrettable to gain that capability only to lose it again...


The info is certainly valuable and we wouldn't lose it, unless there is 
something I'm not getting.


Regards,
-David




Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

2023-11-20 Thread Tom Lane
Richard Guo  writes:
> On Fri, Nov 17, 2023 at 11:38 AM Tom Lane  wrote:
>> That line of argument also leads to the conclusion that it'd be
>> okay to expose info about the ordering of the CTE result to the
>> upper planner.

> In the light of this conclusion, I had a go at propagating the pathkeys
> from CTEs up to the outer planner and came up with the attached.

Oh, nice!  I remembered we had code already to do this for regular
SubqueryScans, but I thought we'd need to do some refactoring to
apply it to CTEs.  I think you are right though that
convert_subquery_pathkeys can be used as-is.  Some thoughts:

* Do we really need to use make_tlist_from_pathtarget?  Why isn't
the tlist of the cteplan good enough (indeed, more so)?

* I don't love having this code assume that it knows how to find
the Path the cteplan was made from.  It'd be better to make
SS_process_ctes save that somewhere, maybe in a list paralleling
root->cte_plan_ids.

Alternatively: maybe it's time to do what the comments in
SS_process_ctes vaguely speculate about, and just save the Path
at that point, with construction of the plan left for createplan()?
That might be a lot of refactoring for not much gain, so not sure.

regards, tom lane




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread David Steele




On 11/20/23 12:11, Robert Haas wrote:

On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier  wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way
when you sent v5.  Now added back in CC with the two latest patches
you've proposed attached.)

Here is a short summary of what has been missed by the lists:
- I've commented that the patch should not create control files with a
size of 512B, return them in the fields of the SQL functions, or stream
them; it should just stick to 8kB.  If this is worth changing, it should
be applied consistently across the board, including initdb, and discussed
on its own thread.
- The backup-related fields in the control file are reset at the end
of recovery.  I've suggested to not do that to keep a trace of what
was happening during recovery.  The latest version of the patch resets
the fields.
- With the backup_label file gone, we lose some information in the
backups themselves, which is not good.  Instead, you have suggested an
approach where this data is added to the backup manifest, meaning that
no information would be lost, particularly useful for self-contained
backups.  The fields planned to be added to the backup manifest are:
-- The start and end time of the backup, the end timestamp being
useful to know when stop time can be used for PITR.
-- The backup label.
I've agreed that it may be the best thing to do at this end to not
lose any data related to the removal of the backup_label file.


I think we need more votes to make a change this big. I have a
concern, which I think I've expressed before, that we keep whacking
around the backup APIs, and that has a cost which is potentially
larger than the benefits. 


From my perspective it's not that big a change for backup software but 
it does bring a lot of benefits, including fixing an outstanding bug in 
Postgres, i.e. reading pg_control without getting a torn copy.



The last time we changed the API, we changed
pg_stop_backup to pg_backup_stop, but this doesn't do that, and I
wonder if that's OK. Even if it is, do we really want to change this
API around again after such a short time?


This is a good point. We could just rename again, but not sure what 
names to go for this time. OTOH if the backup software is selecting 
fields then they will get an error because the names have changed. If 
the software is grabbing fields by position then they'll get a 
valid-looking result (even if querying by position is a terrible idea).


Another thing we could do is explicitly error if we see backup_label in 
PGDATA during recovery. That's just a few lines of code so would not be 
a big deal to maintain. This error would only be visible on restore, so 
it presumes that backup software is being tested.
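
Something along these lines (illustrative only; wording TBD, and assuming
the existing BACKUP_LABEL_FILE define):

```c
/* Hypothetical sketch of the guard; not from an actual patch. */
struct stat st;

if (stat(BACKUP_LABEL_FILE, &st) == 0)
    ereport(FATAL,
            (errmsg("backup_label file is present but no longer supported"),
             errhint("Restore the version of pg_control saved by "
                     "pg_backup_stop() and remove backup_label.")));
```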


Maybe just a rename to something like pg_backup_begin/end would be the 
way to go.



That said, I don't have an intrinsic problem with moving this
information from the backup_label to the backup_manifest file since it
is purely informational. I do think there should perhaps be some
additions to the test cases, though.


A little hard to add to the tests, I think, since they are purely 
informational, i.e. not pushed up by the parser. Maybe we could just 
grep for the fields?



I am concerned about the interaction of this proposal with incremental
backup. When you take an incremental backup, you get something that
looks a lot like a usable data directory but isn't. To prevent that
from causing avoidable disasters, the present version of the patch
adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the
backup_label. pg_combinebackup knows to look for those fields, and the
server knows that if they are present it should refuse to start. With
this change, though, I suppose those fields would end up in
pg_control. But that does not feel entirely great, because we have a
goal of keeping the amount of real data in pg_control below 512 bytes,
the traditional sector size, and this adds another 12 bytes of stuff
to that file that currently doesn't need to be there. I feel like
that's kind of a problem.


I think these fields would be handled the same as the rest of the fields 
in backup_label: returned from pg_backup_stop() and also stored in 
backup_manifest. Third-party software can do as they like with them and 
pg_combinebackup can just read from backup_manifest.


As for the pg_control file -- it might be best to give it a different 
name for backups that are not essentially copies of PGDATA. On the other 
hand, pgBackRest has included pg_control in incremental backups since 
day one and we've never had a user mistakenly do a manual restore of one 
and cause a problem (though manual restores are not the norm). Still, 
probably can't hurt to be a bit careful.



But my main point here is ... if we have a few more senior hackers
weigh in and vote in favor of this change, well then that's one thing.
But IMHO a discussion that's mostly between 2 people is not nearly a
strong enough consensus to justify this amount of disruption.

Re: Use of backup_label not noted in log

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 11:24:25 -0500, Robert Haas wrote:
> I do also think it is worth considering how this proposal interacts
> with the proposal to remove backup_label. If that proposal goes
> through, then this proposal is obsolete, I believe.

I think it's the opposite, if anything. Today you can at least tell there was
use of a backup_label by looking for backup_label.old and you can verify
fairly easily in a restore script that backup_label is present. If we "just"
use pg_control, neither of those is as easy. I.e. log messages would be more
important, not less.  Depending on how things work out, we might need to
reformulate and/or move them a bit, but that doesn't seem like a big deal.


> But if this is a good idea, does that mean that's not a good idea? Or would
> we try to make the pg_control which that patch would drop in place have some
> internal difference which we could use to drive a similar log message?

I think we absolutely have to. If there's no way to tell whether an "external"
pg_backup_start/stop() procedure actually used the proper pg_control, it'd
make the situation substantially worse compared to today's, already bad,
situation.

Greetings,

Andres Freund




Re: Use of backup_label not noted in log

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 17:30:31 +0900, Michael Paquier wrote:
> On Sat, Nov 18, 2023 at 01:49:15PM -0800, Andres Freund wrote:
> > Note that the LSN in the "continuing" case is the one the backup started at,
> > not where recovery will start.
> > 
> > I've wondered whether it's worth also adding an explicit message just after
> > ReachedEndOfBackup(), but it seems far less urgent due to the existing
> > "consistent recovery state reached at %X/%X" message.
> 
> Upgrading the surrounding DEBUG1 to a LOG is another option, but I
> agree that I've seen less of that being an actual problem in the field
> compared to the famous I-removed-a-backup-label-and-I-m-still-up,
> until this user sees signs of corruption after recovery was finished,
> sometimes days after putting back an instance online.

"end of backup reached" could scare users, it doesn't obviously indicate
something "good". "completed backup recovery, started at %X/%X" or such would
be better imo.


> +   if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
> +   ereport(LOG,
> +   (errmsg("continuing to start from base backup with redo LSN %X/%X",
> +   LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
> 
> "Continuing to start" sounds a bit weird to me, though, considering
> that there are a few LOGs that say "starting" when there is a signal
> file, but I don't have a better idea on top of my mind.  So that
> sounds OK here.

I didn't like it much either - but I like David's proposal in his sibling
reply:

LOG:  starting backup recovery with redo LSN A/34100028, checkpoint LSN 
A/34100080 on timeline ID 1
LOG:  restarting backup recovery with redo LSN A/34100028
and adding the message from above:
LOG:  completing backup recovery with redo LSN A/34100028

Greetings,

Andres Freund




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Robert Haas
On Mon, Nov 20, 2023 at 12:54 PM David Steele  wrote:
> Another thing we could do is explicitly error if we see backup_label in
> PGDATA during recovery. That's just a few lines of code so would not be
> a big deal to maintain. This error would only be visible on restore, so
> it presumes that backup software is being tested.

I think that if we do decide to adopt this proposal, that would be a
smart precaution.

> A little hard to add to the tests, I think, since they are purely
> informational, i.e. not pushed up by the parser. Maybe we could just
> grep for the fields?

Hmm. Or should they be pushed up by the parser?

> I think these fields would be handled the same as the rest of the fields
> in backup_label: returned from pg_backup_stop() and also stored in
> backup_manifest. Third-party software can do as they like with them and
> pg_combinebackup can just read from backup_manifest.

I think that would be a bad plan, because this is critical
information, and a backup manifest is not a thing that you're required
to have. It's not a natural fit at all. We don't want to create a
situation where if you nuke the backup_manifest then the server
forgets that what it has is an incremental backup rather than a usable
data directory.

> We absolutely need more people to look at this and sign off. I'm glad
> they have not so far because it has allowed time to whack the patch
> around and get it into better shape.

Cool.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Hide exposed impl detail of wchar.c

2023-11-20 Thread Jubilee Young
On Fri, Nov 17, 2023 at 2:26 AM John Naylor  wrote:
>
> On Fri, Nov 17, 2023 at 5:54 AM Nathan Bossart  
> wrote:
> >
> > It looks like is_valid_ascii() was originally added to pg_wchar.h so that
> > it could easily be used elsewhere [0] [1], but that doesn't seem to have
> > happened yet.
> >
> > Would moving this definition to a separate header file be a viable option?
>
> Seems fine to me. (I believe the original motivation for making it an
> inline function was for in pg_mbstrlen_with_len(), but trying that
> hasn't been a priority.)

In that case, I took a look across the codebase and saw a
utils/ascii.h that doesn't
seem to have gotten much love, but I suppose one could argue that it's intended
to be a backend-only header file?

As the codebase is growing some enhanced UTF-8 support, you'll want somewhere
that contains the optimized US-ASCII routines because, as US-ASCII is a subset
of UTF-8 and often faster to handle, it's typical for such codepaths to look
like this:
```c
while (i < len && no_multibyte_chars) {
    i = i + ascii_op_version(i, buffer, &no_multibyte_chars);
}

while (i < len) {
    i = i + utf8_op_version(i, buffer);
}
```

So it should probably end up living somewhere near the UTF-8 support, and the
easiest way to make it not go into something pgrx currently includes would be
to make it a new header file, though there's a fair amount of API we don't
touch.

From the pgrx / Rust perspective, Postgres function calls are passed via a
callback to a "guard function" that guarantees that longjmp and setjmp don't
cause trouble (and makes sure we participate in that). So we only want to
call Postgres functions if we "can't replace" them, as the overhead is quite
high. That means UTF-8-per-se functions aren't very interesting to us, as the
Rust language already supports UTF-8, but we do benefit from access to
transcoding to/from UTF-8.

—Jubilee




Re: trying again to get incremental backup

2023-11-20 Thread Alvaro Herrera
On 2023-Nov-16, Robert Haas wrote:

> On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera  
> wrote:

> > I don't understand this point.  Currently, the protocol is that
> > UPLOAD_MANIFEST is used to send the manifest prior to requesting the
> > backup.  You seem to be saying that you're thinking of removing support
> > for UPLOAD_MANIFEST and instead just give the LSN as an option to the
> > BASE_BACKUP command?
> 
> I don't think I'd want to do exactly that, because then you could only
> send one LSN, and I do think we want to send a set of LSN ranges with
> the corresponding TLI for each. I was thinking about dumping
> UPLOAD_MANIFEST and instead having a command like:
> 
> INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698
> 
> The client would execute this command one or more times before
> starting an incremental backup.

That sounds good to me.  Not having to parse the manifest server-side
sounds like a win, as does saving the transfer, for the cases where the
manifest is large.

Is this meant to support multiple timelines each with non-overlapping
adjacent ranges, rather than multiple non-adjacent ranges?

Do I have it right that you want to rewrite this bit before considering
this ready to commit?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)




Re: Use of backup_label not noted in log

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote:
> On Mon, 2023-11-20 at 17:30 +0900, Michael Paquier wrote:
> > +   if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
> > +   ereport(LOG,
> > +   (errmsg("continuing to start from base backup with redo LSN %X/%X",
> > +   LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
> >
> > "Continuing to start" sounds a bit weird to me, though, considering
> > that there are a few LOGs that say "starting" when there is a signal
> > file, but I don't have a better idea on top of my mind.  So that
> > sounds OK here.
>
> We can only reach that message in recovery or standby mode, right?
> So why not write "continuing to recover from base backup"?

It can be reached without either too, albeit much less commonly.


> If we add a message for starting with "backup_label", shouldn't
> we also add a corresponding message for starting from a checkpoint
> found in the control file?  If you see that in a problem report,
> you immediately know what is going on.

Maybe - the reason I hesitate on that is that emitting an additional log
message when starting from a base backup just adds something once in the
lifetime of a node. Whereas emitting something at every start obviously doesn't
impose any limit.

You can also extrapolate from the message's absence that we started up without
backup_label; it's not like there would be a lot of messages in between
  "database system was interrupted; last ..."
and
  "starting backup recovery ..."
(commonly there would be no messages)

We can do more on HEAD of course, but we should be wary of just spamming the
log unnecessarily as well.


I guess we could add this message at the same time, including in the back
branches. Initially I thought that might be unwise, because replacing
elog(DEBUG1, "end of backup reached");
with a different message could theoretically cause issues, even if unlikely,
given that it's a DEBUG1 message.

But I think we actually want to emit the message a bit later, just *after* we
updated the control file, as that's the actually relevant piece after which we
won't go back to the "backup recovery" state.  I am somewhat agnostic about
whether we should add that in the back branches or not.


Here's the state with my updated patch, when starting up from a base backup:

LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 
64-bit
LOG:  listening on IPv6 address "::1", port 5441
LOG:  listening on IPv4 address "127.0.0.1", port 5441
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5441"
LOG:  database system was interrupted; last known up at 2023-11-20 10:55:49 PST
LOG:  starting recovery from base backup with redo LSN E/AFF07F20, checkpoint 
LSN E/B01B17F0, on timeline ID 1
LOG:  entering standby mode
LOG:  redo starts at E/AFF07F20
LOG:  completed recovery from base backup with redo LSN E/AFF07F20
LOG:  consistent recovery state reached at E/B420FC80


Besides the phrasing and the additional log message (I have no opinion about
whether it should be backpatched or not), I used %u for TimelineID as
appropriate, and added a comma before "on timeline".

Greetings,

Andres Freund
>From d80e55942a096d9f1ab10b84ab6f2a9d371cf88d Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sat, 18 Nov 2023 13:27:17 -0800
Subject: [PATCH v2] WIP: Log when starting up from a base backup

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp...@awork3.anarazel.de
Backpatch:
---
 src/backend/access/transam/xlogrecovery.c | 32 +++
 1 file changed, 32 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6156aa..9803b481118 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -603,6 +603,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		if (StandbyModeRequested)
 			EnableStandbyMode();
 
+		/*
+		 * Omitting backup_label when creating a new replica, PITR node etc.
+		 * unfortunately is a common cause of corruption. Logging that
+		 * backup_label was used makes it a bit easier to exclude that as the
+		 * cause of observed corruption.
+		 *
+		 * Do so before we try to read the checkpoint record (which can fail),
+		 * as otherwise it can be hard to understand why a checkpoint other
+		 * than ControlFile->checkPoint is used.
+		 */
+		ereport(LOG,
+				(errmsg("starting recovery from base backup with redo LSN %X/%X, checkpoint LSN %X/%X, on timeline ID %u",
+						LSN_FORMAT_ARGS(RedoStartLSN),
+						LSN_FORMAT_ARGS(CheckPointLoc),
+						CheckPointTLI)));
+
 		/*
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
@@ -742,6 +758,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 EnableStandbyMode();
 		}
 
+		/*
+		 * For the 

Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

On 11/20/23 14:27, Andres Freund wrote:

Hi,

On 2023-11-19 14:28:12 -0400, David Steele wrote:

On 11/18/23 17:49, Andres Freund wrote:

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
Not enamored with the phrasing of the log messages, but here's a prototype:

When starting up with backup_label present:
LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN 
A/34100080 on timeline ID 1


I'd prefer something like:

LOG:  starting backup recovery with redo...



When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG:  continuing to start from base backup with redo LSN A/34100028
LOG:  entering standby mode
LOG:  redo starts at A/3954B958


And here:

LOG:  restarting backup recovery with redo...


I like it.


Cool.


I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.


I think the current message is sufficient, but what do you have in mind?


Well, the consistency message is emitted after every restart, whereas a
single instance should only go through backup recovery once. So it seems
worthwhile to differentiate the two in log messages.


Ah, right. That works for me, then.

Regards,
-David




Re: trying again to get incremental backup

2023-11-20 Thread Robert Haas
On Mon, Nov 20, 2023 at 2:03 PM Alvaro Herrera  wrote:
> That sounds good to me.  Not having to parse the manifest server-side
> sounds like a win, as does saving the transfer, for the cases where the
> manifest is large.

OK. I'll look into this next week, hopefully.

> Is this meant to support multiple timelines each with non-overlapping
> adjacent ranges, rather than multiple non-adjacent ranges?

Correct. I don't see how non-adjacent LSN ranges could ever be a
useful thing, but adjacent ranges on different timelines are useful.

> Do I have it right that you want to rewrite this bit before considering
> this ready to commit?

For sure. I don't think this is the only thing that needs to be
revised before commit, but it's definitely *a* thing that needs to be
revised before commit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-11-20 Thread Daniel Verite
 Hi,

Here's a new version to improve the performance of FETCH_COUNT
and extend the cases when it can be used.

Patch 0001 adds a new mode in libpq to allow the app to retrieve
larger chunks of results than the single row of the row-by-row mode.
The maximum number of rows per PGresult is set by the user.

Patch 0002 uses that mode in psql and gets rid of the cursor
implementation as suggested upthread.

The performance numbers look good.
For a query retrieving 50M rows of about 200 bytes:
  select repeat('abc', 200) from generate_series(1,500)
/usr/bin/time -v psql -At -c $query reports these metrics
(medians of 5 runs):

  version  | fetch_count | clock_time | user_time | sys_time | max_rss_size (kB)
-----------+-------------+------------+-----------+----------+-------------------
 16-stable |           0 |       6.58 |      3.98 |     2.09 |           3446276
 16-stable |         100 |       9.25 |      4.10 |     1.90 |              8768
 16-stable |        1000 |      11.13 |      5.17 |     1.66 |              8904
 17-patch  |           0 |        6.5 |      3.94 |     2.09 |           3442696
 17-patch  |         100 |          5 |      3.56 |     0.93 |              4096
 17-patch  |        1000 |       6.48 |      4.00 |     1.55 |              4344

Interestingly, retrieving in chunks of 100 rows appears to be a bit faster
than the default single big chunk. It means that, independently of using less
memory, FETCH_COUNT implemented this way would be a performance enhancement
compared to both not using it and using it in v16 with the cursor
implementation.
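
For reference, client code would consume results in this mode along these
lines (a minimal sketch with error handling trimmed; the exact signature of
PQsetChunkedRowsMode() and the status of the final result are as I
understand them from patch 0001):

```c
#include <stdio.h>
#include <libpq-fe.h>

static void
print_first_column(PGconn *conn, const char *query)
{
    PGresult   *res;

    if (!PQsendQuery(conn, query))
        return;
    PQsetChunkedRowsMode(conn, 100);    /* at most 100 rows per PGresult */

    while ((res = PQgetResult(conn)) != NULL)
    {
        ExecStatusType st = PQresultStatus(res);

        /*
         * Intermediate chunks come back as PGRES_TUPLES_CHUNK; the final,
         * possibly empty, result as PGRES_TUPLES_OK (mirroring single-row
         * mode).
         */
        if (st == PGRES_TUPLES_CHUNK || st == PGRES_TUPLES_OK)
        {
            for (int i = 0; i < PQntuples(res); i++)
                puts(PQgetvalue(res, i, 0));
        }
        PQclear(res);
    }
}
```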


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 766bbe84def2db494f646caeaf29eefeba893c1a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Mon, 20 Nov 2023 17:24:55 +0100
Subject: [PATCH v4 1/2] Implement retrieval of results in chunks with libpq.

This mode is similar to the single-row mode except that chunks
of results contain up to N rows instead of a single row.
It is meant to reduce the overhead of the row-by-row allocations
for large result sets.
The mode is selected with PQsetChunkedRowsMode(int maxRows) and results
have the new status code PGRES_TUPLES_CHUNK.
---
 doc/src/sgml/libpq.sgml  |  96 +++--
 src/bin/pg_amcheck/pg_amcheck.c  |   1 +
 src/interfaces/libpq/exports.txt |   1 +
 src/interfaces/libpq/fe-exec.c   | 118 +--
 src/interfaces/libpq/libpq-fe.h  |   4 +-
 src/interfaces/libpq/libpq-int.h |   7 +-
 6 files changed, 183 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index ed88ac001a..8007bf67d8 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3537,7 +3537,20 @@ ExecStatusType PQresultStatus(const PGresult *res);
 The PGresult contains a single result 
tuple
 from the current command.  This status occurs only when
 single-row mode has been selected for the query
-(see ).
+(see ).
+   
+  
+ 
+
+ 
+  PGRES_TUPLES_CHUNK
+  
+   
+The PGresult contains several tuples
+from the current command. The count of tuples cannot exceed
+the maximum passed to .
+This status occurs only when the chunked mode has been selected
+for the query (see ).

   
  
@@ -5187,8 +5200,8 @@ PGresult *PQgetResult(PGconn *conn);
   
Another frequently-desired feature that can be obtained with
 and 
-   is retrieving large query results a row at a time.  This is discussed
-   in .
+   is retrieving large query results a limited number of rows at a time.  This 
is discussed
+   in .
   
 
   
@@ -5551,12 +5564,13 @@ int PQflush(PGconn *conn);
 
 
 
- To enter single-row mode, call PQsetSingleRowMode
- before retrieving results with PQgetResult.
- This mode selection is effective only for the query currently
- being processed. For more information on the use of
- PQsetSingleRowMode,
- refer to .
+ To enter single-row or chunked modes, call
+ respectively PQsetSingleRowMode
+ or PQsetChunkedRowsMode before retrieving results
+ with PQgetResult.  This mode selection is effective
+ only for the query currently being processed. For more information on the
+ use of these functions refer
+ to .
 
 
 
@@ -5895,10 +5909,10 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
   
  
 
- 
-  Retrieving Query Results Row-by-Row
+ 
+  Retrieving Query Results by chunks
 
-  
+  
libpq
single-row mode
   
@@ -5909,13 +5923,15 @@ UPDATE mytable SET x = x + 1 WHERE id = 42;
PGresult.  This can be unworkable for commands
that return a large number of rows.  For such cases, applications can use
 and  
in
-   single-row mode.  In this mode, the result row(s) are
-   returned to the application 

Re: Annoying build warnings from latest Apple toolchain

2023-11-20 Thread Robert Haas
On Sat, Oct 7, 2023 at 12:09 PM Tom Lane  wrote:
> Done that way.

Is there still outstanding work on this thread? Because I'm just now
using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this
kind of thing in a meson build:

[2264/2287] Linking target src/interfaces/ecpg/test/sql/parser
ld: warning: -undefined error is deprecated
ld: warning: ignoring duplicate libraries: '-lz'
[2266/2287] Linking target src/interfaces/ecpg/test/sql/insupd
ld: warning: -undefined error is deprecated
ld: warning: ignoring duplicate libraries: '-lz'
[2273/2287] Linking target src/interfaces/ecpg/test/sql/quote
ld: warning: -undefined error is deprecated
ld: warning: ignoring duplicate libraries: '-lz'
[2278/2287] Linking target src/interfaces/ecpg/test/sql/show
ld: warning: -undefined error is deprecated
ld: warning: ignoring duplicate libraries: '-lz'
[2280/2287] Linking target src/interfaces/ecpg/test/sql/sqlda
ld: warning: -undefined error is deprecated
ld: warning: ignoring duplicate libraries: '-lz'

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-20 Thread Robert Haas
On Mon, Nov 20, 2023 at 2:10 PM Robert Haas  wrote:
> > Is this meant to support multiple timelines each with non-overlapping
> > adjacent ranges, rather than multiple non-adjacent ranges?
>
> Correct. I don't see how non-adjacent LSN ranges could ever be a
> useful thing, but adjacent ranges on different timelines are useful.

Thinking about this a bit more, there are a couple of things we could
do here in terms of syntax. One idea is to give up on having a
separate MANIFEST-WAL-RANGE command for each range and instead just
cram everything into either a single command:

MANIFEST-WAL-RANGES {tli} {startlsn} {endlsn}...

Or even into a single option to the BASE_BACKUP command:

BASE_BACKUP yadda yadda INCREMENTAL 'tli@startlsn-endlsn,...'

Or, since we expect adjacent, non-overlapping ranges, you could even
arrange to elide the duplicated boundary LSNs, e.g.

MANIFEST-WAL-RANGES {{tli} {lsn}}... {final-lsn}

Or

BASE_BACKUP yadda yadda INCREMENTAL 'tli@lsn,...,final-lsn'

I'm not sure what's best here. Trying to trim out the duplicated
boundary LSNs feels a bit like rearrangement for the sake of
rearrangement, but maybe it isn't really.
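
(With made-up LSNs, a backup whose WAL crosses from timeline 1 to 2
would then read either

MANIFEST-WAL-RANGES 1 0/3000000 0/5000000 2 0/5000000 0/7000000

or, with the shared boundary elided,

MANIFEST-WAL-RANGES 1 0/3000000 2 0/5000000 0/7000000

i.e. it saves one LSN per timeline switch.)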

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

On 11/20/23 15:03, Andres Freund wrote:

On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote:


If we add a message for starting with "backup_label", shouldn't
we also add a corresponding message for starting from a checkpoint
found in the control file?  If you see that in a problem report,
you immediately know what is going on.


Maybe - the reason I hesitate on that is that emitting an additional log
message when starting from a base backup just adds something "once over the
lifetime of a node". Whereas emitting something every start obviously doesn't
impose any limit.


Hmm, yeah, that would be a bit much.


Here's the state with my updated patch, when starting up from a base backup:

LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 64-bit
LOG:  listening on IPv6 address "::1", port 5441
LOG:  listening on IPv4 address "127.0.0.1", port 5441
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5441"
LOG:  database system was interrupted; last known up at 2023-11-20 10:55:49 PST
LOG:  starting recovery from base backup with redo LSN E/AFF07F20, checkpoint LSN E/B01B17F0, on timeline ID 1
LOG:  entering standby mode
LOG:  redo starts at E/AFF07F20
LOG:  completed recovery from base backup with redo LSN E/AFF07F20
LOG:  consistent recovery state reached at E/B420FC80

Besides the phrasing and the additional log message (I have no opinion about
whether it should be backpatched or not), I used %u for TimelineID as
appropriate, and added a comma before "on timeline".


I still wonder if we need "base backup" in the messages? That sort of 
implies (at least to me) you used pg_basebackup but that may not be the 
case.


FWIW, I also prefer "backup recovery" over "recovery from backup". 
"recovery from backup" reads fine here, but if gets more awkward when 
you want to say something like "recovery from backup settings". In that 
case, I think "backup recovery settings" reads better. Not important for 
this patch, maybe, but the recovery in pg_control patch went the other 
way and I definitely think it makes sense to keep them consistent, 
whichever way we go.


Other than that, looks good for HEAD. Whether we back patch or not is 
another question, of course.


Regards,
-David





Re: Annoying build warnings from latest Apple toolchain

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 14:14:00 -0500, Robert Haas wrote:
> On Sat, Oct 7, 2023 at 12:09 PM Tom Lane  wrote:
> > Done that way.
> 
> Is there still outstanding work on this thread? Because I'm just now
> using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this
> kind of thing in a meson build:

Ventura? In that case I assume you installed newer developer tools? Because
otherwise I think we were talking about issues introduced in Sonoma.

Greetings,

Andres Freund




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread David Steele

On 11/20/23 14:44, Robert Haas wrote:

On Mon, Nov 20, 2023 at 12:54 PM David Steele  wrote:

Another thing we could do is explicitly error if we see backup_label in
PGDATA during recovery. That's just a few lines of code so would not be
a big deal to maintain. This error would only be visible on restore, so
it presumes that backup software is being tested.


I think that if we do decide to adopt this proposal, that would be a
smart precaution.


I'd be OK with it -- what do you think, Michael? Would this be enough 
that we would not need to rename the functions, or should we just go 
with the rename?



A little hard to add to the tests, I think, since they are purely
informational, i.e. not pushed up by the parser. Maybe we could just
grep for the fields?


Hmm. Or should they be pushed up by the parser?


We could do that. I started on that road, but it's a lot of code for 
fields that aren't used. I think it would be better if the parser also 
loaded a data structure that represented the manifest. Seems to me 
there's a lot of duplicated code between pg_verifybackup and 
pg_combinebackup the way it is now.



I think these fields would be handled the same as the rest of the fields
in backup_label: returned from pg_backup_stop() and also stored in
backup_manifest. Third-party software can do as they like with them and
pg_combinebackup can just read from backup_manifest.


I think that would be a bad plan, because this is critical
information, and a backup manifest is not a thing that you're required
to have. It's not a natural fit at all. We don't want to create a
situation where if you nuke the backup_manifest then the server
forgets that what it has is an incremental backup rather than a usable
data directory.


I can't see why a backup would continue to be valid without a manifest 
-- that's not very standard for backup software. If you have the 
critical info in backup_label, you can't afford to lose that, so why 
should backup_manifest be any different?


Regards,
-David




Re: Annoying build warnings from latest Apple toolchain

2023-11-20 Thread Robert Haas
On Mon, Nov 20, 2023 at 2:35 PM Andres Freund  wrote:
> > Is there still outstanding work on this thread? Because I'm just now
> > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this
> > kind of thing in a meson build:
>
> Ventura? In that case I assume you installed newer developer tools? Because
> otherwise I think we were talking about issues introduced in Sonoma.

I don't think I did anything special when installing developer tools.
xcode-select --version reports 2397, if that tells you anything.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Robert Haas
On Mon, Nov 20, 2023 at 2:41 PM David Steele  wrote:
> I can't see why a backup would continue to be valid without a manifest
> -- that's not very standard for backup software. If you have the
> critical info in backup_label, you can't afford to lose that, so why
> should backup_manifest be any different?

I mean, you can run pg_basebackup --no-manifest.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Parallel CREATE INDEX for BRIN indexes

2023-11-20 Thread Matthias van de Meent
On Wed, 8 Nov 2023 at 12:03, Tomas Vondra  wrote:
>
> Hi,
>
> here's an updated patch, addressing the review comments, and reworking
> how the work is divided between the workers & leader etc.
>
> 0001 is just v2, rebased to current master
>
> 0002 and 0003 address most of the issues, in particular it
>
>   - removes the unnecessary spool
>   - fixes bs_reltuples type to double
>   - a couple comments are reworded to be clearer
>   - changes loop/condition in brinbuildCallbackParallel
>   - removes asserts added for debugging
>   - fixes cast in comparetup_index_brin
>   - 0003 then simplifies comparetup_index_brin
>   - I haven't inlined the tuplesort_puttuple_common parameter
> (didn't seem worth it)

OK, thanks

> 0004 Reworks how the work is divided between workers and combined by the
> leader. It undoes the tableam.c changes that attempted to divide the
> relation into chunks matching the BRIN ranges, and instead merges the
> results in the leader (using the BRIN "union" function).

That's OK.

> I haven't done any indentation fixes yet.
>
> I did fairly extensive testing, using pageinspect to compare indexes
> built with/without parallelism. More testing is needed, but it seems to
> work fine (with other opclasses and so on).

After code-only review, here are some comments:

> +++ b/src/backend/access/brin/brin.c
> [...]
> +/* Magic numbers for parallel state sharing */
> +#define PARALLEL_KEY_BRIN_SHARED		UINT64CONST(0xA000000000000001)
> +#define PARALLEL_KEY_TUPLESORT		UINT64CONST(0xA000000000000002)

These shm keys use the same constants also in use in
access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
operations, I'd prefer if we didn't actively introduce conflicting
identifiers when we still have significant amounts of unused values
remaining.

> +#define PARALLEL_KEY_QUERY_TEXT		UINT64CONST(0xA000000000000003)

This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
more than brin's), backend/executor/execParallel.c
(0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
I've not checked that their uses are exactly the same, I'd expect at
least btree to match mostly, if not fully, 1:1).
I think we could probably benefit from a less ad-hoc sharing of query
texts. I don't think that needs to happen specifically in this patch,
but I think it's something to keep in mind in future efforts.

> +_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
> [...]
> +BrinSpool  *spool = state->bs_spool;
> [...]
> +if (!state)
> +return;

I think the assignment to spool should be moved to below this
condition, as _brin_begin_parallel calls this with state=NULL when it
can't launch parallel workers, which will cause issues here.
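
Concretely, I mean something like this (sketch):

_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
{
    BrinSpool  *spool;

    /* don't touch state before checking it; no workers -> nothing to do */
    if (!state)
        return;

    spool = state->bs_spool;
    ...
}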

> +state->bs_numtuples = brinshared->indtuples;

My IDE complains about bs_numtuples being an integer. This is a
pre-existing issue, but still valid: we can hit an overflow on tables
with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
but problematic nonetheless.

> + * FIXME This probably needs some memory management fixes - we're reading
> + * tuples from the tuplesort, we're allocating an empty tuple, and so on.
> + * Probably better to release this memory.

This should probably be resolved.

I also noticed that this is likely to execute `union_tuples` many
times when pages_per_range is coprime with the parallel table scan's
block stride (or when we for other reasons have many tuples returned
for each range); and this `union_tuples` internally allocates and
deletes its own memory context for its deserialization of the 'b'
tuple. I think we should just pass a scratch context instead, so that
we don't have the overhead of continuously creating then deleting the
same memory context.
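
Something like this, roughly (the extra union_tuples() parameter is
the proposed change, not what the patch has today):

    /* create once, before merging the per-worker results */
    MemoryContext scratch = AllocSetContextCreate(CurrentMemoryContext,
                                                  "brin union scratch",
                                                  ALLOCSET_DEFAULT_SIZES);

    /* then, for every range that needs merging */
    union_tuples(bdesc, a, b, scratch);
    MemoryContextReset(scratch);

    /* and delete once at the end */
    MemoryContextDelete(scratch);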

> +++ b/src/backend/catalog/index.c
> [...]
> -indexRelation->rd_rel->relam == BTREE_AM_OID)
> +(indexRelation->rd_rel->relam == BTREE_AM_OID ||
> + indexRelation->rd_rel->relam == BRIN_AM_OID))

I think this needs some more effort. I imagine a new
IndexAmRoutine->amcanbuildparallel is more appropriate than this
hard-coded list - external indexes may want to utilize the parallel
index creation planner facility, too.
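
That is, roughly (flag name invented here):

    /* new IndexAmRoutine member */
    bool        amcanbuildparallel; /* does AM support parallel builds? */

so that index.c can test the index's AM handler instead of hard-coding
AM OIDs:

    if (indexRelation->rd_indam->amcanbuildparallel)
        ...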


Some notes:
As a project PostgreSQL seems to be trying to move away from
hardcoding heap into everything in favour of the more AM-agnostic
'table'. I suggest replacing all mentions of "heap" in the arguments
with "table", to reduce the work future maintainers need to do to fix
this. Even when this AM is mostly targeted towards the heap AM, other
AMs (such as those I've heard of that were developed internally at
EDB) use the same block-addressing that heap does, and should thus be
compatible with BRIN. Thus, "heap" is not a useful name here.

There are 2 new mentions of "tuplestore" in the patch, while the
structure used is tuplesort: one on form_and_spill_tuple, and on

Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread David Steele

On 11/20/23 15:47, Robert Haas wrote:

On Mon, Nov 20, 2023 at 2:41 PM David Steele  wrote:

I can't see why a backup would continue to be valid without a manifest
-- that's not very standard for backup software. If you have the
critical info in backup_label, you can't afford to lose that, so why
should backup_manifest be any different?


I mean, you can run pg_basebackup --no-manifest.


Maybe this would be a good thing to disable for page incremental. With 
all the work being done by pg_combinebackup, it seems like it would be a 
good idea to be able to verify the final result?


I understand this is an option -- but does it need to be? What is the 
benefit of excluding the manifest?


Regards,
-David




Re: Annoying build warnings from latest Apple toolchain

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 14:46:13 -0500, Robert Haas wrote:
> On Mon, Nov 20, 2023 at 2:35 PM Andres Freund  wrote:
> > > Is there still outstanding work on this thread? Because I'm just now
> > > using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this
> > > kind of thing in a meson build:
> >
> > Ventura? In that case I assume you installed newer developer tools? Because
> > otherwise I think we were talking about issues introduced in Sonoma.
>
> I don't think I did anything special when installing developer tools.
> xcode-select --version reports 2397, if that tells you anything.

Odd then. My m1-mini running Ventura, also reporting 2397, doesn't show any of
those warnings. I did a CI run with Sonoma, and that does show them.

I'm updating said m1-mini to Sonoma now, but that will take until I have to
leave for an appointment.

Greetings,

Andres Freund




Re: Implement missing join selectivity estimation for range types

2023-11-20 Thread Schoemans Maxime
On 14/11/2023 20:46, Tom Lane wrote:
> I took a brief look through this very interesting work.  I concur
> with Tomas that it feels a little odd that range join selectivity
> would become smarter than scalar inequality join selectivity, and
> that we really ought to prioritize applying these methods to that
> case.  Still, that's a poor reason to not take the patch.

Indeed, we started with ranges as this was the simpler case (no MCV) and 
was the topic of a course project.
The idea is to later write a second patch that applies these ideas to 
scalar inequality while also handling MCV's correctly.

> I also agree with the upthread criticism that having two identical
> functions in different source files will be a maintenance nightmare.
> Don't do it.  When and if there's a reason for the behavior to
> diverge between the range and multirange cases, it'd likely be
> better to handle that by passing in a flag to say what to do.

The duplication is indeed not ideal. However, there are already 8 other 
duplicate functions between the two files.
I would thus suggest to leave the duplication in this patch and create a 
second one that removes all duplication from the two files, instead of 
just removing the duplication for our new function.
What are your thoughts on this? If we do this, should the function 
definitions go in rangetypes.h or should we create a new 
rangetypes_selfuncs.h header?

> But my real unhappiness with the patch as-submitted is the test cases,
> which require rowcount estimates to be reproduced exactly.

> We need a more forgiving test method. Usually the
> approach is to set up a test case where the improved accuracy of
> the estimate changes the planner's choice of plan compared to what
> you got before, since that will normally not be too prone to change
> from variations of a percent or two in the estimates.

I have changed the test method to produce query plans for a 3-way range 
join.
The plans for the different operators differ due to the computed 
selectivity estimation, which was not the case before this patch.

Regards,
Maxime Schoemans

diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index cefc4710fd..c670d225a0 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1335,3 +1335,558 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
 
 	return sum_frac;
 }
+
+/*
+ * This is a utility function used to estimate the join selectivity of
+ * range attributes using rangebound histogram statistics as described
+ * in this paper:
+ *
+ * Diogo Repas, Zhicheng Luo, Maxime Schoemans and Mahmoud Sakr, 2022
+ * Selectivity Estimation of Inequality Joins In Databases
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * The attributes being joined will be treated as random variables
+ * that follow a distribution modeled by a Probability Density Function (PDF).
+ * Let the two attributes be denoted X, Y.
+ * This function finds the probability P(X < Y).
+ * Note that the PDFs of the two variables can easily be obtained
+ * from their bounds histogram, respectively hist1 and hist2 .
+ *
+ * Let the PDF of X, Y be denoted as f_X, f_Y.
+ * The probability P(X < Y) can be formalized as follows:
+ * P(X < Y)= integral_-inf^inf( integral_-inf^y ( f_X(x) * f_Y(y) dx dy ) )
+ *= integral_-inf^inf( F_X(y) * f_Y(y) dy )
+ * where F_X(y) denote the Cumulative Distribution Function of X at y.
+ * Note that F_X is the selectivity estimation (non-join),
+ * which is implemented using the function calc_hist_selectivity_scalar.
+ *
+ * Now given the histograms of the two attributes X, Y, we note the following:
+ * - The PDF of Y is a step function
+ * (constant piece-wise, where each piece is defined in a bin of Y's histogram)
+ * - The CDF of X is linear piece-wise
+ *   (each piece is defined in a bin of X's histogram)
+ * This leads to the conclusion that their product
+ * (used to calculate the equation above) is also linear piece-wise.
+ * A new piece starts whenever either the bin of X or the bin of Y changes.
+ * By parallel scanning the two rangebound histograms of X and Y,
+ * we evaluate one piece of the result between every two consecutive rangebounds
+ * in the union of the two histograms.
+ *
+ * Given that the product F_X * f_y is linear in the interval
+ * between every two consecutive rangebounds, let them be denoted prev, cur,
+ * it can be shown that the above formula can be discretized into the following:
+ * P(X < Y) =
+ *   0.5 * sum_0^{n+m-1} ( ( F_X(prev) + F_X(cur) ) * ( F_Y(cur) - F_Y(prev) ) )
+ * where n, m are the lengths of the two histograms.
+ *
+ * As such, it is possible to fully compute the join selectivity
+ * as a summation of CDFs, iterating over the bounds of the two histograms.
+ * This maximizes the code reuse, since the CDF is computed using
+ * the calc_hist_selectivity_scalar function, which is the function used
+

Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 11:11:13 -0500, Robert Haas wrote:
> I think we need more votes to make a change this big. I have a
> concern, which I think I've expressed before, that we keep whacking
> around the backup APIs, and that has a cost which is potentially
> larger than the benefits.

+1.  The amount of whacking around in this area has been substantial, and it's
hard for operators to keep up. And realistically, with data sizes today, the
pressure to do basebackups with disk snapshots etc is not going to shrink.


Leaving that concern aside, I am still on the fence about this proposal. I
think it does decrease the chance of getting things wrong in the
streaming-basebackup case. But for external backups, it seems almost
universally worse (with the exception of the torn pg_control issue, that we
also can address otherwise):

It doesn't reduce the risk of getting things wrong, you can still omit placing
a file into the data directory and get silent corruption as a consequence. In
addition, it's harder to see when looking at a base backup whether the process
was right or not, because now the good and bad state look the same if you just
look on the filesystem level!

Then there's the issue of making ad-hoc debugging harder by not having a
human readable file with information anymore, including when looking at the
history, via backup_label.old.


Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup
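
Roughly (sketch; the field name is invented and it would only be set
in the copy of pg_control streamed to the client, not on the primary):

    /* new ControlFileData member */
    bool        backupLabelRequired;

    /* at the start of recovery */
    if (ControlFile->backupLabelRequired && !backup_label_exists)
        ereport(FATAL,
                (errmsg("this cluster was copied from a base backup, but backup_label is missing")));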

Greetings,

Andres Freund




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 15:56:19 -0400, David Steele wrote:
> I understand this is an option -- but does it need to be? What is the
> benefit of excluding the manifest?

It's not free to create the manifest, particularly if checksums are enabled.

Also, for external backups, there's no manifest...

- Andres




Re: PSQL error: total cell count of XXX exceeded

2023-11-20 Thread Alvaro Herrera
On 2023-Sep-12, Tom Lane wrote:

> I'm more than a bit skeptical about trying to do something about this,
> simply because this range of query result sizes is far past what is
> practical.  The OP clearly hasn't tested his patch on actually
> overflowing query results, and I don't care to either.

I think we're bound to hit this limit at some point in the future, and
it seems easy enough to solve.  I propose the attached, which is pretty
much what Hongxu last submitted, with some minor changes.

Having this make a difference requires some 128GB of RAM, so it's not a
piece of cake, but it's an amount that can be reasonably expected to be
physically installed in real machines nowadays.

(I first thought we could just use pg_mul_s32_overflow during
printTableInit and raise an error if that returns true, but that just
postpones the problem.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Subversion to GIT: the shortest path to happiness I've ever heard of
(Alexey Klyukin)
>From 75abe5a485532cbc03db1eade2e1a6099914f98f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 20 Nov 2023 17:35:18 +0100
Subject: [PATCH v5] Avoid overflow in fe_utils' printTable API

We were using 32-bit integer arithmetic to compute the total number of
cells, which can overflow when used with large resultsets.  We still
don't allow more than 4 billion rows, but now the total number of cells
can exceed that.

Author: Hongxu Ma 
Discussion: https://postgr.es/m/tybp286mb0351b057b101c90d7c1239e6b4...@tybp286mb0351.jpnp286.prod.outlook.com
---
 src/fe_utils/print.c | 22 ++
 src/include/fe_utils/print.h |  2 +-
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/src/fe_utils/print.c b/src/fe_utils/print.c
index 7af1ccb6b5..565cbc6d13 100644
--- a/src/fe_utils/print.c
+++ b/src/fe_utils/print.c
@@ -3172,6 +3172,8 @@ void
 printTableInit(printTableContent *const content, const printTableOpt *opt,
 			   const char *title, const int ncolumns, const int nrows)
 {
+	int64		total_cells;
+
 	content->opt = opt;
 	content->title = title;
 	content->ncolumns = ncolumns;
@@ -3179,7 +3181,8 @@ printTableInit(printTableContent *const content, const printTableOpt *opt,
 
 	content->headers = pg_malloc0((ncolumns + 1) * sizeof(*content->headers));
 
-	content->cells = pg_malloc0((ncolumns * nrows + 1) * sizeof(*content->cells));
+	total_cells = (int64) ncolumns * nrows;
+	content->cells = pg_malloc0((total_cells + 1) * sizeof(*content->cells));
 
 	content->cellmustfree = NULL;
 	content->footers = NULL;
@@ -3249,15 +3252,17 @@ void
 printTableAddCell(printTableContent *const content, char *cell,
   const bool translate, const bool mustfree)
 {
+	int64		total_cells;
+
 #ifndef ENABLE_NLS
 	(void) translate;			/* unused parameter */
 #endif
 
-	if (content->cellsadded >= content->ncolumns * content->nrows)
+	total_cells = (int64) content->ncolumns * content->nrows;
+	if (content->cellsadded >= total_cells)
 	{
-		fprintf(stderr, _("Cannot add cell to table content: "
-		  "total cell count of %d exceeded.\n"),
-content->ncolumns * content->nrows);
+		fprintf(stderr, _("Cannot add cell to table content: total cell count of %lld exceeded.\n"),
+(long long int) total_cells);
 		exit(EXIT_FAILURE);
 	}
 
@@ -3273,7 +3278,7 @@ printTableAddCell(printTableContent *const content, char *cell,
 	{
 		if (content->cellmustfree == NULL)
 			content->cellmustfree =
-pg_malloc0((content->ncolumns * content->nrows + 1) * sizeof(bool));
+pg_malloc0((total_cells + 1) * sizeof(bool));
 
 		content->cellmustfree[content->cellsadded] = true;
 	}
@@ -3341,9 +3346,10 @@ printTableCleanup(printTableContent *const content)
 {
 	if (content->cellmustfree)
 	{
-		int			i;
+		int64		total_cells;
 
-		for (i = 0; i < content->nrows * content->ncolumns; i++)
+		total_cells = (int64) content->ncolumns * content->nrows;
+		for (int64 i = 0; i < total_cells; i++)
 		{
 			if (content->cellmustfree[i])
 free(unconstify(char *, content->cells[i]));
diff --git a/src/include/fe_utils/print.h b/src/include/fe_utils/print.h
index 13ebb00362..a697d0ba8d 100644
--- a/src/include/fe_utils/print.h
+++ b/src/include/fe_utils/print.h
@@ -171,7 +171,7 @@ typedef struct printTableContent
 	const char **cells;			/* NULL-terminated array of cell content
  * strings */
 	const char **cell;			/* Pointer to the last added cell */
-	long		cellsadded;		/* Number of cells added this far */
+	int64		cellsadded;		/* Number of cells added this far */
 	bool	   *cellmustfree;	/* true for cells that need to be free()d */
 	printTableFooter *footers;	/* Pointer to the first footer */
 	printTableFooter *footer;	/* Pointer to the last added footer */
-- 
2.39.2



Re: Partial aggregates pushdown

2023-11-20 Thread Robert Haas
On Mon, Nov 13, 2023 at 3:26 AM fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> In postgres_fdw.sql, I have corrected the output format for floating point 
> numbers
> by extra_float_digits.

Looking at this, I find that it's not at all clear to me how the
partial aggregate function is defined. Let's look at what we have for
documentation:

+  
+   Paraemter AGGPARTIALFUNC optionally defines a
+   partial aggregate function used for partial aggregate pushdown; see
+for details.
+  

+   Partial aggregate function (zero if none).
+   See  for the definition
+   of partial aggregate function.

+   Partial aggregate pushdown is an optimization for queries that contains
+   aggregate expressions for a partitioned table across one or more remote
+   servers. If multiple conditions are met, partial aggregate function

+   When partial aggregate pushdown is used for aggregate expressions,
+   remote queries replace aggregate function calls with partial
+   aggregate function calls.  If the data type of the state value is not

But there's no definition of what the behavior of the function is
anywhere that I can see, not even in <sect2 id="partial-aggregate-pushdown">. Everywhere it only describes how the
partial aggregate function is used, not what it is supposed to do.

Looking at the changes in pg_aggregate.dat, it seems like the partial
aggregate function is a second aggregate defined in a way that mostly
matches the original, except that (1) if the original final function
would have returned a data type other than internal, then the final
function is removed; and (2) if the original final function would have
returned a value of internal type, then the final function is the
serialization function of the original aggregate. I think that's a
reasonable definition, but the documentation and code comments need to
be a lot clearer.
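
(Concretely, if I'm reading the catalog changes right: avg(int4)
transitions through an int8[] state and normally finishes with
int8_avg, so its partial counterpart simply drops the final function
and returns the int8[] state; avg(int8), whose state is internal,
instead gets int8_avg_serialize as its final function and returns
bytea.)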

I do have a concern about this, though. It adds a lot of bloat. It
adds a whole lot of additional entries to pg_aggregate, and every new
aggregate we add in the future will require a bonus entry for this,
and it needs a bunch of new pg_proc entries as well. One idea that
I've had in the past is to instead introduce syntax that just does
this, without requiring a separate aggregate definition in each case.
For example, maybe instead of changing string_agg(whatever) to
string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE
string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or
something. Then all aggregates could be treated in a generic way. I'm
not completely sure that's better, but I think it's worth considering.

I think that the control mechanism needs some thought. Right now,
there are two possible behaviors: either we assume that the local and
remote sides are the same unconditionally, or we assume that they're
the same if the remote side is a new enough version. I do like having
those behaviors available, but I wonder if we need to do something
better or different. What if somebody wants to push down a
non-built-in aggregate, for example? I realize that we don't have
great solutions to the problem of knowing which functions are
push-downable in general, and I don't know that partial aggregation
needs to be any better than anything else, but it's probably worth
comparing and contrasting the approach we take here with the
approaches we've taken in other, similar cases. From that point of
view, I think check_partial_aggregate_support is a novelty: we don't
do those kinds of checks in other cases, AFAIK. But on the other hand,
there is the 'extensions' argument to postgres_fdw.

I don't think the patch does a good job explaining why HAVING,
DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
shouldn't really be a problem, because HAVING is basically a WHERE
clause that occurs after aggregation is complete, and whether or not
the aggregation is safe shouldn't depend on what we're going to do
with the value afterward. The HAVING clause can't necessarily be
pushed to the remote side, but I don't see how or why it could make
the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
little trickier: if we pushed down DISTINCT, we'd still have to
re-DISTINCT-ify when combining locally, and if we pushed down ORDER
BY, we'd have to do a merge pass to combine the returned values unless
we could prove that the partitions were non-overlapping ranges that
would be visited in the correct order. Although that all sounds
doable, I think it's probably a good thing that the current patch
doesn't try to handle it -- this is complicated already. But it should
explain why it's not handling it and maybe even a bit about how it
could be handling in the future, rather than just saying "well, this
kind of thing is not safe." The trouble with that explanation is that
it does nothing to help the reader understand whether the thing in
question is *fundamentally* unsafe or whether we just don't have the
right code to make it work.

Typo: Paraemter

I'm so sorry to keep complaining about comm

Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-11-20 Thread Robert Haas
On Tue, Nov 14, 2023 at 11:21 PM Jeff Davis  wrote:
> After adding the search path cache (recent commit f26c2368dc) hopefully
> that helps to make the above suggestion more reasonable performance-
> wise. I think we can call that progress.

I agree. Not to burden you, but do you know what the overhead is now,
and do you have any plans to further reduce it? I don't believe that's
the only thing we ought to be doing here, necessarily, but it is one
thing that we definitely should be doing and probably the least
controversial.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Annoying build warnings from latest Apple toolchain

2023-11-20 Thread Tom Lane
Robert Haas  writes:
> Is there still outstanding work on this thread? Because I'm just now
> using a new MacBook (M2, Ventura 13.6.2) and I'm getting a lot of this
> kind of thing in a meson build:

13.6.2?  longfin's host is on 13.6.1, and the only thing Software
Update is showing me is an option to upgrade to Sonoma.  But anyway...

> [2264/2287] Linking target src/interfaces/ecpg/test/sql/parser
> ld: warning: -undefined error is deprecated
> ld: warning: ignoring duplicate libraries: '-lz'

Hmm ... I fixed these things in the autoconf build: neither my
buildfarm animals nor manual builds show any warnings.  I thought
the problems weren't there in the meson build.  Need to take another
look I guess.

regards, tom lane




Re: Annoying build warnings from latest Apple toolchain

2023-11-20 Thread Robert Haas
On Mon, Nov 20, 2023 at 3:53 PM Tom Lane  wrote:
> 13.6.2?  longfin's host is on 13.6.1, and the only thing Software
> Update is showing me is an option to upgrade to Sonoma.  But anyway...

Uh, I guess Apple made a special version just for me? That's
definitely what it says.

> > [2264/2287] Linking target src/interfaces/ecpg/test/sql/parser
> > ld: warning: -undefined error is deprecated
> > ld: warning: ignoring duplicate libraries: '-lz'
>
> Hmm ... I fixed these things in the autoconf build: neither my
> buildfarm animals nor manual builds show any warnings.  I thought
> the problems weren't there in the meson build.  Need to take another
> look I guess.

They're definitely there for me, and there are a whole lot of them. I
would have thought that if they were there for you in the meson build
you would have noticed, since ninja suppresses a lot of distracting
output that make prints.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread David G. Johnston
On Mon, Nov 20, 2023 at 1:37 PM Andres Freund  wrote:

>
> Given that, I wonder if what we should do is to just add a new field to
> pg_control that says "error out if backup_label does not exist", that we
> set
> when creating a streaming base backup
>
>
I thought this was DOA since we don't want to ever leave the cluster in a
state where a crash requires intervention to restart.  But I agree that it
is not possible to fool-proof against a naive backup that copies over the
pg_control file as-is, given that breaking the restart of a crashed cluster is off the table.

I agree that this works if the pg_control generated by stop backup produces
the line and we retain the label file as a separate and now mandatory
component to using the backup.

Or is the idea to make v17 error if it sees a backup label unless
pg_control has the feature flag field?  Which doesn't exist normally, does
in the basebackup version, and is removed once the backup is restored?

David J.


Re: Annoying build warnings from latest Apple toolchain

2023-11-20 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 20, 2023 at 3:53 PM Tom Lane  wrote:
>> 13.6.2?  longfin's host is on 13.6.1, and the only thing Software
>> Update is showing me is an option to upgrade to Sonoma.  But anyway...

> Uh, I guess Apple made a special version just for me? That's
> definitely what it says.

Might be for M-series only; longfin's host is still Intel.

>> Hmm ... I fixed these things in the autoconf build: neither my
>> buildfarm animals nor manual builds show any warnings.  I thought
>> the problems weren't there in the meson build.  Need to take another
>> look I guess.

> They're definitely there for me, and there are a whole lot of them. I
> would have thought that if they were there for you in the meson build
> you would have noticed, since ninja suppresses a lot of distracting
> output that make prints.

I'm generally still using autoconf, I only run meson builds when
somebody complains about them ;-).  But yeah, I see lots of
"ld: warning: -undefined error is deprecated" when I do that.
This seems to have been installed by Andres' 9a95a510a:

   ldflags += ['-isysroot', pg_sysroot]
+  # meson defaults to -Wl,-undefined,dynamic_lookup for modules, which we
+  # don't want because a) it's different from what we do for autoconf, b) it
+  # causes warnings starting in macOS Ventura
+  ldflags_mod += ['-Wl,-undefined,error']

The autoconf side seems to just be letting this option default.
I'm not sure what the default choice is, but evidently it's not
"-undefined error"?  Or were they stupid enough to not allow you
to explicitly select the default behavior?

Also, I *don't* see any complaints about duplicate libraries.
What build options are you using?

regards, tom lane




Re: PANIC serves too many masters

2023-11-20 Thread Jeff Davis
Hi,

On Sat, 2023-11-18 at 14:29 -0800, Andres Freund wrote:
> I don't quite know what we should do. But the current situation
> decidedly
> doesn't seem great.

Agreed. Better classification is nice, but it also requires more
discipline and it might not always be obvious which category something
fits in. What about an error loop resulting in:

  ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE exceeded")));

We'd want a core file, but I don't think we want to restart in that
case, right?


Also, can we do a change like this incrementally by updating a few
PANIC sites at a time? Is it fine to leave plain PANICs in place for
the foreseeable future, or do you want all of them to eventually move?

Regards,
Jeff Davis





Re: Annoying build warnings from latest Apple toolchain

2023-11-20 Thread Tom Lane
I wrote:
> The autoconf side seems to just be letting this option default.
> I'm not sure what the default choice is, but evidently it's not
> "-undefined error"?  Or were they stupid enough to not allow you
> to explicitly select the default behavior?

Seems we are not the only project having trouble with this:

https://github.com/mesonbuild/meson/issues/12450

I had not realized that Apple recently wrote themselves a whole
new linker, but apparently that's why all these deprecation
warnings are showing up.  It's not exactly clear whether
"deprecation" means they actually plan to remove the feature
later, or just that some bozo decided that explicitly specifying
the default behavior is bad style.

regards, tom lane




Re: PANIC serves too many masters

2023-11-20 Thread Tom Lane
Jeff Davis  writes:
> On Sat, 2023-11-18 at 14:29 -0800, Andres Freund wrote:
>> I don't quite know what we should do. But the current situation
>> decidedly
>> doesn't seem great.

> Agreed.

+1

> Better classification is nice, but it also requires more
> discipline and it might not always be obvious which category something
> fits in. What about an error loop resulting in:
>   ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE exceeded")));
> We'd want a core file, but I don't think we want to restart in that
> case, right?

Why not restart?  There's no strong reason to assume this will
repeat.

It might be worth having some independent logic in the postmaster
that causes it to give up after too many crashes in a row.  But with
many/most of these call sites, by definition we're not too sure what
is wrong.

> Also, can we do a change like this incrementally by updating a few
> PANIC sites at a time? Is it fine to leave plain PANICs in place for
> the foreseeable future, or do you want all of them to eventually move?

I'd be inclined to keep PANIC with its current meaning, and
incrementally change call sites where we decide that's not the
best behavior.  I think those will be a minority, maybe a small
minority.  (PANIC_EXIT had darn well better be a small minority.)

regards, tom lane




Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-11-20 Thread Jeff Davis
On Mon, 2023-11-20 at 15:52 -0500, Robert Haas wrote:
> I agree. Not to burden you, but do you know what the overhead is now,
> and do you have any plans to further reduce it? I don't believe
> that's
> the only thing we ought to be doing here, necessarily, but it is one
> thing that we definitely should be doing and probably the least
> controversial.

Running the simple test described here:

https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com

The baseline (no "SET search_path" clause on the function) is around
3800ms, and with the clause it shoots up to 8000ms. That's not good,
but it is down from about 12000ms before.

There are a few patches in the queue to bring it down further. Andres
and I were discussing some GUC hashtable optimizations here:

https://www.postgresql.org/message-id/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.ca...@j-davis.com

which will (if committed) bring it down into the mid 7s.

There are also a couple other patches I have here (and intend to commit
soon):

https://www.postgresql.org/message-id/e6fded24cb8a2c53d4ef069d9f69cc7baaafe9ef.camel%40j-davis.com

and those I think will get it into the mid 6s, perhaps a bit lower
when combined with the GUC hash table optimizations above.

So we are still looking at around 50% overhead for a simple function if
all this stuff gets committed. Not great, but a lot better than before.

Of course I welcome others to profile and see what they can do. There's
a setjmp() call, and a couple allocations, and maybe some other stuff
to look at. There are also higher-level ideas, like avoiding calling
into guc.c in some cases, but that starts to get tricky as you pointed
out:

https://www.postgresql.org/message-id/CA%2BTgmoa8uKQgak5wH0%3D7sL-ukqbwnCPMXA2ZW7Ccdt7tdNGkzg%40mail.gmail.com

It seems others are also interested in this problem, so I can put some
more effort in after this round of patches goes in. I don't have a
specific target other than "low enough overhead that we can reasonably
recommend it as a best practice in multi-user environments".

Regards,
Jeff Davis





Re: PSQL error: total cell count of XXX exceeded

2023-11-20 Thread Tom Lane
Alvaro Herrera  writes:
> I think we're bound to hit this limit at some point in the future, and
> it seems easy enough to solve.  I propose the attached, which is pretty
> much what Hongxu last submitted, with some minor changes.

This bit needs more work:

-   content->cells = pg_malloc0((ncolumns * nrows + 1) * sizeof(*content->cells));
+   total_cells = (int64) ncolumns * nrows;
+   content->cells = pg_malloc0((total_cells + 1) * sizeof(*content->cells));

You've made the computation of total_cells reliable, but there's
nothing stopping the subsequent computation of the malloc argument
from overflowing (especially on 32-bit machines).  I think we need
an explicit test along the lines of

if (total_cells >= SIZE_MAX / sizeof(*content->cells))
throw error;

(">=" allows not needing to add +1.)

Also, maybe total_cells should be uint64?  We don't want
negative values to pass this test.  Alternatively, add a separate
check that total_cells >= 0.

It should be sufficient to be paranoid about this in printTableInit,
since after that we know the product of ncolumns * nrows isn't
too big.
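
For concreteness, something like (sketch):

	uint64		total_cells;

	total_cells = (uint64) ncolumns * nrows;
	if (total_cells >= SIZE_MAX / sizeof(*content->cells))
	{
		fprintf(stderr, _("Cannot build table content: too many cells.\n"));
		exit(EXIT_FAILURE);
	}
	content->cells = pg_malloc0((total_cells + 1) * sizeof(*content->cells));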

The rest of this passes an eyeball check.

regards, tom lane




Re: PANIC serves too many masters

2023-11-20 Thread Jeff Davis
On Mon, 2023-11-20 at 17:12 -0500, Tom Lane wrote:
> I'd be inclined to keep PANIC with its current meaning, and
> incrementally change call sites where we decide that's not the
> best behavior.  I think those will be a minority, maybe a small
> minority.  (PANIC_EXIT had darn well better be a small minority.)

Is the error level the right way to express what we want to happen? It
seems like what we really want is to decide on the behavior, i.e.
restart or not, and generate core or not. That could be done a
different way, like:

  ereport(PANIC,
  (errmsg("could not locate a valid checkpoint record"),
   errabort(false),errrestart(false)));
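
(errabort() and errrestart() don't exist today, of course; the sketch
assumes hypothetical additions along the lines of

  extern int errabort(bool abort_process);   /* abort() and dump core? */
  extern int errrestart(bool restart);       /* postmaster crash-restart? */
)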

Regards,
Jeff Davis





Re: Partial aggregates pushdown

2023-11-20 Thread Bruce Momjian
On Mon, Nov 20, 2023 at 03:51:33PM -0500, Robert Haas wrote:
> On Mon, Nov 13, 2023 at 3:26 AM fujii.y...@df.mitsubishielectric.co.jp
>  wrote:
> > In postgres_fdw.sql, I have corrected the output format for floating point 
> > numbers
> > by extra_float_digits.
> 
> Looking at this, I find that it's not at all clear to me how the
> partial aggregate function is defined. Let's look at what we have for
> documentation:
> 
> +  
> +   Paraemter AGGPARTIALFUNC optionally defines a
> +   partial aggregate function used for partial aggregate pushdown; see
> +for details.
> +  
> 
> +   Partial aggregate function (zero if none).
> +   See  for the definition
> +   of partial aggregate function.
> 
> +   Partial aggregate pushdown is an optimization for queries that contains
> +   aggregate expressions for a partitioned table across one or more remote
> +   servers. If multiple conditions are met, partial aggregate function
> 
> +   When partial aggregate pushdown is used for aggregate expressions,
> +   remote queries replace aggregate function calls with partial
> +   aggregate function calls.  If the data type of the state value is not
> 
> But there's no definition of what the behavior of the function is
> anywhere that I can see, not even in <sect2 id="partial-aggregate-pushdown">. Everywhere it only describes how the
> partial aggregate function is used, not what it is supposed to do.

Yes, I had to figure that out myself, and I was wondering how much
detail to have in our docs vs README files vs. C comments.  I think we
should put more details somewhere.

> Looking at the changes in pg_aggregate.dat, it seems like the partial
> aggregate function is a second aggregate defined in a way that mostly
> matches the original, except that (1) if the original final function
> would have returned a data type other than internal, then the final
> function is removed; and (2) if the original final function would have
> returned a value of internal type, then the final function is the
> serialization function of the original aggregate. I think that's a
> reasonable definition, but the documentation and code comments need to
> be a lot clearer.

Agreed.  I wasn't sure enough about this to add it when I was reviewing
the patch.

> I do have a concern about this, though. It adds a lot of bloat. It
> adds a whole lot of additional entries to pg_aggregate, and every new
> aggregate we add in the future will require a bonus entry for this,
> and it needs a bunch of new pg_proc entries as well. One idea that
> I've had in the past is to instead introduce syntax that just does
> this, without requiring a separate aggregate definition in each case.
> For example, maybe instead of changing string_agg(whatever) to
> string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE
> string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or
> something. Then all aggregates could be treated in a generic way. I'm
> not completely sure that's better, but I think it's worth considering.

So use an SQL keyword to indicate a pushdown call?  We could then
automate the behavior rather than requiring special catalog functions?

> I think that the control mechanism needs some thought. Right now,
> there are two possible behaviors: either we assume that the local and
> remote sides are the same unconditionally, or we assume that they're
> the same if the remote side is a new enough version. I do like having
> those behaviors available, but I wonder if we need to do something
> better or different. What if somebody wants to push down a
> non-built-in aggregate, for example? I realize that we don't have

It does allow specification of extensions that can be pushed down.

> great solutions to the problem of knowing which functions are
> push-downable in general, and I don't know that partial aggregation
> needs to be any better than anything else, but it's probably worth
> comparing and contrasting the approach we take here with the
> approaches we've taken in other, similar cases. From that point of
> view, I think check_partial_aggregate_support is a novelty: we don't
> do those kinds of checks in other cases, AFAIK. But on the other hand,
> there is the 'extensions' argument to postgres_fdw.

Right.  I am not sure how to improve what the patch does.

> I don't think the patch does a good job explaining why HAVING,
> DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
> shouldn't really be a problem, because HAVING is basically a WHERE
> clause that occurs after aggregation is complete, and whether or not
> the aggregation is safe shouldn't depend on what we're going to do
> with the value afterward. The HAVING clause can't necessarily be
> pushed to the remote side, but I don't see how or why it could make
> the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
> little trickier: if we pushed down DISTINCT, we'd still have to
> re-DISTINCT-ify when combining locally, and if we pushed down ORDER

Re: Hide exposed impl detail of wchar.c

2023-11-20 Thread Nathan Bossart
On Mon, Nov 20, 2023 at 10:50:36AM -0800, Jubilee Young wrote:
> In that case, I took a look across the codebase and saw a
> utils/ascii.h that doesn't
> seem to have gotten much love, but I suppose one could argue that it's 
> intended
> to be a backend-only header file?

That might work.  It's not #included in very many files, so adding
port/simd.h shouldn't be too bad.  And ascii.h is also pretty inexpensive,
so including it in wchar.c seems permissible, too.  I'm not certain this
doesn't cause problems with libpgcommon, but I don't see why it would,
either.

> So it should probably end up living somewhere near the UTF-8 support, and
> the easiest way to make it not go into something pgrx currently
> includes would be
> to make it a new header file, though there's a fair amount of API we
> don't touch.

Does pgrx use ascii.h at all?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/common/wchar.c b/src/common/wchar.c
index fb9d9f5c85..fbac11deb4 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -13,6 +13,7 @@
 #include "c.h"
 
 #include "mb/pg_wchar.h"
+#include "utils/ascii.h"
 
 
 /*
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 29cd5732f1..80676d9e02 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -22,8 +22,6 @@
 #ifndef PG_WCHAR_H
 #define PG_WCHAR_H
 
-#include "port/simd.h"
-
 /*
  * The pg_wchar type
  */
@@ -722,71 +720,4 @@ extern int	mic2latin_with_table(const unsigned char *mic, unsigned char *p,
 extern WCHAR *pgwin32_message_to_UTF16(const char *str, int len, int *utf16len);
 #endif
 
-
-/*
- * Verify a chunk of bytes for valid ASCII.
- *
- * Returns false if the input contains any zero bytes or bytes with the
- * high-bit set. Input len must be a multiple of the chunk size (8 or 16).
- */
-static inline bool
-is_valid_ascii(const unsigned char *s, int len)
-{
-	const unsigned char *const s_end = s + len;
-	Vector8		chunk;
-	Vector8		highbit_cum = vector8_broadcast(0);
-#ifdef USE_NO_SIMD
-	Vector8		zero_cum = vector8_broadcast(0x80);
-#endif
-
-	Assert(len % sizeof(chunk) == 0);
-
-	while (s < s_end)
-	{
-		vector8_load(&chunk, s);
-
-		/* Capture any zero bytes in this chunk. */
-#ifdef USE_NO_SIMD
-
-		/*
-		 * First, add 0x7f to each byte. This sets the high bit in each byte,
-		 * unless it was a zero. If any resulting high bits are zero, the
-		 * corresponding high bits in the zero accumulator will be cleared.
-		 *
-		 * If none of the bytes in the chunk had the high bit set, the max
-		 * value each byte can have after the addition is 0x7f + 0x7f = 0xfe,
-		 * and we don't need to worry about carrying over to the next byte. If
-		 * any input bytes did have the high bit set, it doesn't matter
-		 * because we check for those separately.
-		 */
-		zero_cum &= (chunk + vector8_broadcast(0x7F));
-#else
-
-		/*
-		 * Set all bits in each lane of the highbit accumulator where input
-		 * bytes are zero.
-		 */
-		highbit_cum = vector8_or(highbit_cum,
- vector8_eq(chunk, vector8_broadcast(0)));
-#endif
-
-		/* Capture all set bits in this chunk. */
-		highbit_cum = vector8_or(highbit_cum, chunk);
-
-		s += sizeof(chunk);
-	}
-
-	/* Check if any high bits in the high bit accumulator got set. */
-	if (vector8_is_highbit_set(highbit_cum))
-		return false;
-
-#ifdef USE_NO_SIMD
-	/* Check if any high bits in the zero accumulator got cleared. */
-	if (zero_cum != vector8_broadcast(0x80))
-		return false;
-#endif
-
-	return true;
-}
-
 #endif			/* PG_WCHAR_H */
diff --git a/src/include/utils/ascii.h b/src/include/utils/ascii.h
index 630acd9bfd..7df024dad3 100644
--- a/src/include/utils/ascii.h
+++ b/src/include/utils/ascii.h
@@ -11,6 +11,74 @@
 #ifndef _ASCII_H_
 #define _ASCII_H_
 
+#include "port/simd.h"
+
 extern void ascii_safe_strlcpy(char *dest, const char *src, size_t destsiz);
 
+/*
+ * Verify a chunk of bytes for valid ASCII.
+ *
+ * Returns false if the input contains any zero bytes or bytes with the
+ * high-bit set. Input len must be a multiple of the chunk size (8 or 16).
+ */
+static inline bool
+is_valid_ascii(const unsigned char *s, int len)
+{
+	const unsigned char *const s_end = s + len;
+	Vector8		chunk;
+	Vector8		highbit_cum = vector8_broadcast(0);
+#ifdef USE_NO_SIMD
+	Vector8		zero_cum = vector8_broadcast(0x80);
+#endif
+
+	Assert(len % sizeof(chunk) == 0);
+
+	while (s < s_end)
+	{
+		vector8_load(&chunk, s);
+
+		/* Capture any zero bytes in this chunk. */
+#ifdef USE_NO_SIMD
+
+		/*
+		 * First, add 0x7f to each byte. This sets the high bit in each byte,
+		 * unless it was a zero. If any resulting high bits are zero, the
+		 * corresponding high bits in the zero accumulator will be cleared.
+		 *
+		 * If none of the bytes in the chunk had the high bit set, the max
+		 * value each byte can have after the addition is 0x7f + 0x7f = 0xfe,
+		 * and we don't need to worry about carrying over to the next byte. If
+		 * any input bytes did have the high bit set, it do
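
For reference, a caller-side sketch of how this helper is meant to be
used (the multiple-of-chunk-size requirement comes from the function's
own comment; the surrounding names are made up):

    #include "utils/ascii.h"

    /* check only the SIMD-sized prefix; the tail needs a byte-wise check */
    int     aligned = len - (len % sizeof(Vector8));

    if (aligned > 0 && is_valid_ascii(s, aligned))
        ...     /* fast path: prefix is known to be ASCII */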

Re: PANIC serves too many masters

2023-11-20 Thread Tom Lane
Jeff Davis  writes:
> Is the error level the right way to express what we want to happen? It
> seems like what we really want is to decide on the behavior, i.e.
> restart or not, and generate core or not. That could be done a
> different way, like:

>   ereport(PANIC,
>   (errmsg("could not locate a valid checkpoint record"),
>errabort(false),errrestart(false)));

Yeah, I was wondering about that too.  It feels to me that
PANIC_EXIT is an error level (even more severe than PANIC).
But maybe "no core dump please" should be conveyed separately,
since it's just a minor adjustment that doesn't fundamentally
change what happens.  It's plausible that you'd want a core,
or not want one, for different cases that all seem to require
PANIC_EXIT.

(Need a better name than PANIC_EXIT.  OMIGOD?)

regards, tom lane




Re: PANIC serves too many masters

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 17:55:32 -0500, Tom Lane wrote:
> Jeff Davis  writes:
> > Is the error level the right way to express what we want to happen? It
> > seems like what we really want is to decide on the behavior, i.e.
> > restart or not, and generate core or not. That could be done a
> > different way, like:
> 
> >   ereport(PANIC,
> >   (errmsg("could not locate a valid checkpoint record"),
> >errabort(false),errrestart(false)));
> 
> Yeah, I was wondering about that too.  It feels to me that
> PANIC_EXIT is an error level (even more severe than PANIC).
> But maybe "no core dump please" should be conveyed separately,
> since it's just a minor adjustment that doesn't fundamentally
> change what happens.

I guess I was thinking of an error level because that'd be easier to search
for in logs. It seems reasonable to want to specifically search for errors
that cause core dumps, since IMO they should all be "should never happen" kind
of paths.


> It's plausible that you'd want a core,
> or not want one, for different cases that all seem to require
> PANIC_EXIT.

I can't immediately think of a case where you'd want PANIC_EXIT but also want
a core dump? In my mental model to use PANIC_EXIT we'd need to have a decent
understanding that the situation isn't going to change after crash-restart -
in which case a core dump presumably isn't interesting?


> (Need a better name than PANIC_EXIT.  OMIGOD?)

CRITICAL?


I agree with the point made upthread that we'd want to leave PANIC around, it's
not realistic to annotate everything, and then there's obviously also
extensions (although I hope there aren't many PANICs in extensions).

If that weren't the case, something like this could make sense:

PANIC: crash-restart
CRITICAL: crash-shutdown
BUG: crash-restart, abort()

Greetings,

Andres Freund




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Michael Paquier
On Mon, Nov 20, 2023 at 11:11:13AM -0500, Robert Haas wrote:
> I think we need more votes to make a change this big. I have a
> concern, which I think I've expressed before, that we keep whacking
> around the backup APIs, and that has a cost which is potentially
> larger than the benefits. The last time we changed the API, we changed
> pg_stop_backup to pg_backup_stop, but this doesn't do that, and I
> wonder if that's OK. Even if it is, do we really want to change this
> API around again after such a short time?

Agreed.

> That said, I don't have an intrinsic problem with moving this
> information from the backup_label to the backup_manifest file since it
> is purely informational. I do think there should perhaps be some
> additions to the test cases, though.

Yep, cool.  Even if we decide to not go with what's discussed in this
patch, I think that's useful for some users at the end to get more
redundancy, as well.  And that's in a format easier to parse.

> I am concerned about the interaction of this proposal with incremental
> backup. When you take an incremental backup, you get something that
> looks a lot like a usable data directory but isn't. To prevent that
> from causing avoidable disasters, the present version of the patch
> adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the
> backup_label. pg_combinebackup knows to look for those fields, and the
> server knows that if they are present it should refuse to start. With
> this change, though, I suppose those fields would end up in
> pg_control. But that does not feel entirely great, because we have a
> goal of keeping the amount of real data in pg_control below 512 bytes,
> the traditional sector size, and this adds another 12 bytes of stuff
> to that file that currently doesn't need to be there. I feel like
> that's kind of a problem.

I don't recall one time where the addition of new fields to the
control file was easy to discuss because of its 512B hard limit.
Anyway, putting the addition aside for a second (I've not looked
at the incremental backup patch), does the removal of the backup_label
make the combine logic more complicated, or is it just a matter of moving
a chunk of code to do a control file lookup instead of parsing backup_label?
Making the information less readable is definitely an issue for me.  A
different alternative that I've mentioned upthread is to keep an
equivalent of the backup_label and rename it to something like
backup.debug or similar, with a name good enough to tell people that
we don't care about it being removed.
--
Michael


signature.asc
Description: PGP signature


Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 14:18:15 -0700, David G. Johnston wrote:
> On Mon, Nov 20, 2023 at 1:37 PM Andres Freund  wrote:
> 
> >
> > Given that, I wonder if what we should do is to just add a new field to
> > pg_control that says "error out if backup_label does not exist", that we
> > set
> > when creating a streaming base backup
> >
> >
> I thought this was DOA since we don't want to ever leave the cluster in a
> state where a crash requires intervention to restart.

I was trying to suggest that we'd set the field in-memory, when streaming out
a pg_basebackup style backup (by just replacing pg_control with an otherwise
identical file that has the flag set). So it'd not have any effect on the
primary.

Greetings,

Andres Freund




Re: Why is hot_standby_feedback off by default?

2023-11-20 Thread Andres Freund
On 2023-11-20 16:34:47 +0700, John Naylor wrote:
> Sounds like a TODO?

WFM. I don't personally use or update TODO, as I have my doubts about its
usefulness or state of maintenance. But please feel free to add this as a TODO
from my end...




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Michael Paquier
On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:
> Given that, I wonder if what we should do is to just add a new field to
> pg_control that says "error out if backup_label does not exist", that we set
> when creating a streaming base backup

That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?  I
don't quite see why it is different from what's proposed on this
thread, except that you would need to write not one file to the data
folder to store the backup label fields, but two, meaning that there's
a risk of more mistakes because a clean backup process would require
more steps. 

With the current position of the fields in ControlFileData, there are
three free bytes after backupEndRequired, so it is possible to add
that for free.  Now, would you actually need an extra field knowing
that backupStartPoint is around?
--
Michael


signature.asc
Description: PGP signature


Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:
> On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:
> > Given that, I wonder if what we should do is to just add a new field to
> > pg_control that says "error out if backup_label does not exist", that we set
> > when creating a streaming base backup
>
> That would mean that one still needs to take an extra step to update a
> control file with this byte set, which is something you had a concern
> with in terms of compatibility when it comes to external backup
> solutions because more steps are necessary to take a backup, no?

I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is present. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just use pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.
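
For illustration, a minimal sketch of the cross-check being suggested;
the field name backupLabelRequired and the messages are invented here,
not taken from a patch:

/*
 * Startup-time check: pg_basebackup would set the flag only in the copy
 * of pg_control it streams out, so a cluster restored from such a backup
 * refuses to start if backup_label was removed, while the primary and
 * external backups are unaffected.
 */
static void
check_backup_label_presence(bool have_backup_label)
{
	if (ControlFile->backupLabelRequired && !have_backup_label)
		ereport(FATAL,
				(errmsg("backup_label is required to recover from this backup"),
				 errhint("The file may have been removed after pg_basebackup.")));
}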

Greetings,

Andres Freund




Re: Hide exposed impl detail of wchar.c

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-16 17:11:03 -0800, Jubilee Young wrote:
> We don't directly `#include` C into Rust, but use libclang to preprocess and
> compile a wrapping C header into a list of symbols Rust will look for at link
> time. Our failure is in libclang and how we steer it:
> - The Clang-C API (libclang.so) cannot determine where system headers are.
> - A clang executable can determine where system headers are, but our bindgen
> may be asked to use a libclang.so without a matching clang executable!
> - This is partly because system packagers do not agree on what clang parts
> must be installed together, nor even on the clang directory's layout.
> - Thus, it is currently impossible to, given a libclang.so, determine with
> 100% accuracy where version-appropriate system headers are and include them,
> nor does it do so implicitly.

I remember battling this in the past, independent of rust :(


What I don't quite get is why SIMD headers are particularly more problematic
than others - there are other headers that are compiler specific?

Greetings,

Andres Freund




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-20 Thread Michael Paquier
On Mon, Nov 20, 2023 at 05:43:17PM +0300, Nazir Bilal Yavuz wrote:
> Yes, the timings for the writes and the syncs should work. Another
> question I have in mind is the pg_stat_reset_shared() function. When
> we call it with 'io' it will reset pg_stat_wal's timings and when we
> call it with 'wal' it won't reset them, right?

pg_stat_reset_shared() with a target is IMO a very edge case, so I'm
OK with the approach of resetting timings in pg_stat_wal even if 'io'
was implied because pg_stat_wal would feed partially from pg_stat_io.
I'd take that as a side-cost in favor of compatibility while making
the stats gathering cheaper overall.  I'm OK as well if people
counter-argue on this point, though that would mean keeping entirely
separate views with duplicated fields that serve the same purpose,
impacting all deployments because it would make the stats gathering
heavier for all.
--
Michael


signature.asc
Description: PGP signature


Re: POC, WIP: OR-clause support for indexes

2023-11-20 Thread Alena Rybakina

On 20.11.2023 11:52, Andrei Lepikhov wrote:
> Looking into the patch, I found some trivial improvements (see
> attachment).
> Also, it is not obvious that using a string representation of the
> clause as a hash table key is needed here. Also, by making a copy of
> the node in get_key_nconst_node(), you replace the location field,
> but in the case of a complex expression, you don't do the same with
> other nodes.
> I propose to generate an expression hash instead and prove the
> equality of two expressions by calling equal().


I was thinking about your last email and the possible error where the
location field may not be cleared in complex expressions. Unfortunately,
I didn't come up with any examples either, but I think I was able to
handle this with a special function that removes the location-related
patterns from the string. The alternative would be to walk the
expression tree itself, but I think that would be more invasive. In
addition, I have made changes related to the hash table: the key is now
of type int.


All changes are displayed in the attached 
v9-0001-Replace-OR-clause-to_ANY.diff.txt file.


I haven't measured it yet. But what do you think about these changes?


--
Regards,
Alena Rybakina
Postgres Professional
From 8e579b059ffd415fc477e0e8930e718084e58683 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Tue, 21 Nov 2023 03:22:13 +0300
Subject: [PATCH] Replace OR clauses with ANY expressions. Replace (X=N1)
 OR (X=N2) ... with X = ANY(N1, N2) at the optimizer stage, while we are
 still working with the expression tree. First, we do not attempt the
 transformation for "non-or" expressions or inequalities, and relations
 with "or" expressions are still created according to the same scenario.
 Second, we do not transform when there are fewer clauses than the
 configured or_transform_limit. Third, note that we consider "or"
 expressions only at the current level.

Authors: Alena Rybakina , Andrey Lepikhov 
Reviewed-by: Peter Geoghegan , Ranier Vilela 
Reviewed-by: Alexander Korotkov , Robert Haas 
---
 src/backend/parser/parse_expr.c   | 280 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/include/parser/parse_expr.h   |   1 +
 src/test/regress/expected/create_index.out| 115 +++
 src/test/regress/expected/guc.out |   3 +-
 src/test/regress/expected/join.out|  50 
 src/test/regress/expected/partition_prune.out | 156 ++
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/expected/tidscan.out |  17 ++
 src/test/regress/sql/create_index.sql |  32 ++
 src/test/regress/sql/join.sql |  10 +
 src/test/regress/sql/partition_prune.sql  |  22 ++
 src/test/regress/sql/tidscan.sql  |   6 +
 src/tools/pgindent/typedefs.list  |   1 +
 14 files changed, 703 insertions(+), 3 deletions(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 64c582c344c..290422005a3 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -18,6 +18,7 @@
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "common/hashfn.h"
 #include "commands/dbcommands.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -43,6 +44,7 @@
 
 /* GUC parameters */
 bool		Transform_null_equals = false;
+bool		enable_or_transformation = false;
 
 
 static Node *transformExprRecurse(ParseState *pstate, Node *expr);
@@ -98,7 +100,283 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname,
 			  Node *ltree, Node *rtree, int location);
 static Node *make_nulltest_from_distinct(ParseState *pstate,
 		 A_Expr *distincta, Node *arg);
+typedef struct OrClauseGroupEntry
+{
+	int32			hash_leftvar_key;
+
+	Node		   *node;
+	List		   *consts;
+	Oid				scalar_type;
+	Oid				opno;
+	Node 		   *expr;
+} OrClauseGroupEntry;
+
+/*
+ * TODO: consider different algorithm to manage complexity
+ * in the case of many different clauses,
+ * like Rabin-Karp or Boyer–Moore algorithms.
+ */
+static char *
+clear_patterns(const char *str, const char *start_pattern)
+{
+	int			i = 0;
+	int			j = 0;
+	int			start_pattern_len = strlen(start_pattern);
+	char	   *res = palloc0(sizeof(*res) * (strlen(str) + 1));
+
+	for (i = 0; str[i];)
+	{
+		if (i >= start_pattern_len && strncmp(&str[i - start_pattern_len],
+			  start_pattern,
+			  start_pattern_len) == 0)
+		{
+			while (str[i] && !(str[i] == '{' || str[i] == '}'))
+				i++;
+		}
+		if (str[i])
+			res[j++] = str[i++];
+	}
+
+	return res;
+}
+
+static int32
+get_key_nconst_node(Node *nconst_node)
+{
+	char *str = nodeToString(nconst_node);
+
+	str = clear_patterns(str, " :location");
+
+	if (str == NULL)
+		return 0;
+
+	return DatumGetInt32(hash_any((unsigned char *)str, strlen(str)));
+}
+
+static Node *
+transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
+{
+	List		   *or_list = NIL;
+	

Re: Faster "SET search_path"

2023-11-20 Thread Jeff Davis
On Thu, 2023-11-16 at 16:46 -0800, Jeff Davis wrote:
> While I considered OOM during hash key initialization, I missed some
> other potential out-of-memory hazards. Attached a fixup patch 0003,
> which re-introduces one list copy but it simplifies things
> substantially in addition to being safer around OOM conditions.

Committed 0003 fixup.

> > > 0004: Use the same cache to optimize check_search_path().

Committed 0004.

> > > 0005: Optimize cache for repeated lookups of the same value.

Will commit 0005 soon.

I also attached a trivial 0006 patch that uses SH_STORE_HASH. I wasn't
able to show much benefit, though, even when there's a bucket
collision. Perhaps there just aren't enough elements to matter -- I
suppose there would be a benefit if there are lots of unique
search_path strings, but that doesn't seem very plausible to me. If
someone thinks it's worth committing, then I will, but I don't see any
real upside or downside.

Regards,
Jeff Davis

From 5f41c0ecc602dd183b7f6e2f23cd28c9338b3c5b Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sun, 19 Nov 2023 14:47:04 -0800
Subject: [PATCH v10 2/2] Use SH_STORE_HASH for search_path cache.

Does not change performance in expected cases, but makes performance
more resilient in case of bucket collisions.

Discussion: https://postgr.es/m/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com
Discussion: https://postgr.es/m/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.ca...@j-davis.com
---
 src/backend/catalog/namespace.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 37a69e9023..0b6e0d711c 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -171,11 +171,10 @@ typedef struct SearchPathCacheEntry
 	List	   *oidlist;		/* namespace OIDs that pass ACL checks */
 	List	   *finalPath;		/* cached final computed search path */
 	Oid			firstNS;		/* first explicitly-listed namespace */
+	uint32		hash;			/* needed for simplehash */
 	bool		temp_missing;
 	bool		forceRecompute; /* force recompute of finalPath */
-
-	/* needed for simplehash */
-	char		status;
+	char		status;			/* needed for simplehash */
 }			SearchPathCacheEntry;
 
 /*
@@ -270,6 +269,8 @@ spcachekey_equal(SearchPathCacheKey a, SearchPathCacheKey b)
 #define SH_EQUAL(tb, a, b)		spcachekey_equal(a, b)
 #define SH_SCOPE		static inline
 #define SH_DECLARE
+#define SH_GET_HASH(tb, a)		a->hash
+#define SH_STORE_HASH
 #define SH_DEFINE
 #include "lib/simplehash.h"
 
-- 
2.34.1

From 46e09b225217bef79a57cc4f8450ed19be8f21ba Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 19 Oct 2023 15:48:16 -0700
Subject: [PATCH v10 1/2] Optimize SearchPathCache by saving the last entry.

Repeated lookups are common, so it's worth it to check the last entry
before doing another hash lookup.

Discussion: https://postgr.es/m/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com
---
 src/backend/catalog/namespace.c | 88 +
 1 file changed, 57 insertions(+), 31 deletions(-)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 5027efc91d..37a69e9023 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -241,7 +241,8 @@ static bool MatchNamedCall(HeapTuple proctup, int nargs, List *argnames,
  *
  * The search path cache is based on a wrapper around a simplehash hash table
  * (nsphash, defined below). The spcache wrapper deals with OOM while trying
- * to initialize a key, and also offers a more convenient API.
+ * to initialize a key, optimizes repeated lookups of the same key, and also
+ * offers a more convenient API.
  */
 
 static inline uint32
@@ -281,6 +282,7 @@ spcachekey_equal(SearchPathCacheKey a, SearchPathCacheKey b)
 #define SPCACHE_RESET_THRESHOLD		256
 
 static nsphash_hash * SearchPathCache = NULL;
+static SearchPathCacheEntry * LastSearchPathCacheEntry = NULL;
 
 /*
  * Create or reset search_path cache as necessary.
@@ -295,6 +297,7 @@ spcache_init(void)
 		return;
 
 	MemoryContextReset(SearchPathCacheContext);
+	LastSearchPathCacheEntry = NULL;
 	/* arbitrary initial starting size of 16 elements */
 	SearchPathCache = nsphash_create(SearchPathCacheContext, 16, NULL);
 	searchPathCacheValid = true;
@@ -307,12 +310,25 @@ spcache_init(void)
 static SearchPathCacheEntry *
 spcache_lookup(const char *searchPath, Oid roleid)
 {
-	SearchPathCacheKey cachekey = {
-		.searchPath = searchPath,
-		.roleid = roleid
-	};
+	if (LastSearchPathCacheEntry &&
+		LastSearchPathCacheEntry->key.roleid == roleid &&
+		strcmp(LastSearchPathCacheEntry->key.searchPath, searchPath) == 0)
+	{
+		return LastSearchPathCacheEntry;
+	}
+	else
+	{
+		SearchPathCacheEntry *entry;
+		SearchPathCacheKey cachekey = {
+			.searchPath = searchPath,
+			.roleid = roleid
+		};
+
+		entry = nsphash_lookup(SearchPathCache, cachekey);
 
-	return nsphash_lookup(SearchPathCache, cachekey);
+		LastSearchPath

Re: Adding facility for injection points (or probe points?) for more advanced tests

2023-11-20 Thread Michael Paquier
On Mon, Nov 20, 2023 at 04:53:45PM +0530, Ashutosh Bapat wrote:
> On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier  wrote:
>> I have added some documentation to explain that, as well.  I am not
>> wedded to the name proposed in the patch, so if you feel there is
>> better, feel free to propose ideas.
> 
> Actually with Attach and Detach terminology, INJECTION_POINT becomes
> the place where we "declare" the injection point. So the documentation
> needs to first explain INJECTION_POINT and then explain the other
> operations.

Sure.

>> This facility is hidden behind a specific configure/Meson switch,
>> making it a no-op by default:
>> --enable-injection-points
>> -Dinjection_points={ true | false }
> 
> That's useful, but we will also see demand to enable those in
> production (under controlled circumstances). But let the functionality
> mature under a separate flag and developer builds before used in
> production.

Hmm.  Okay.  I'd still keep that under a compile switch for now
anyway to keep a cleaner separation and avoid issues where it would be
possible to inject code in a production build.  Note that I was
planning to switch one of my buildfarm animals to use it should this
stuff make it into the tree.  And people would be free to use it if
they want.  If in production, that would be a risk, IMO.

> +/*
> + * Local cache of injection callbacks already loaded, stored in
> + * TopMemoryContext.
> + */
> +typedef struct InjectionPointArrayEntry
> +{
> + char name[INJ_NAME_MAXLEN];
> + InjectionPointCallback callback;
> +} InjectionPointArrayEntry;
> +
> +static InjectionPointArrayEntry *InjectionPointCacheArray = NULL;
> 
> Initial size of this array is small, but given the size of code in a
> given path to be tested, I fear that this may not be sufficient. I
> feel it's better to use hash table whose APIs are already available.

I've never seen in recent years a need for a given test to use more
than 4~5 points.  But I guess that you've seen more than that wanted
in a prod environment with a fork of Postgres?

I'm OK to switch that to use a hash table in the initial
implementation; even for a low number of entries, with points that are
not in hot code paths, that should be OK.  At least for my cases
related to testing that's OK.  Am I right to assume that you mean a
HTAB in TopMemoryContext?

> I find it pretty useless to expose that as a SQL API. Also adding it
> in tests would make it usable as an example. In this patchset
> INJECTION_POINT should be the only way to trigger an injection point.

That's useful to test the cache logic while providing simple coverage
for the core API, and that's cheap.  So I'd rather keep this test
module around with these tests.

> That also brings another question. The INJECTION_POINT needs to be added in 
> the
> core code but can only be used through an extension. I think there should be
> some rudimentary albeit raw test in core for this. Extension does some good
> things like provides built-in actions when the injection is triggered. So, we
> should keep those too.

Yeah, I'd like to agree with that but everything I've seen in the
recent years is that every setup finishes by having different
assumptions about what they want to do, so my intention is to trim
down the core of the patch to a bare acceptable minimum and then work
on top of that.

> I am glad that it covers the frequently needed injections error, notice and
> wait. If someone adds multiple injection points for wait and all of them are
> triggered (in different backends), test_injection_points_wake() would
> wake them all. When debugging cascaded functionality it's not easy to
> follow sequence add, trigger, wake for multiple injections. We should
> associate a conditional variable with the required injection points. Attach
> should create conditional variable in the shared memory for that injection
> point and detach should remove it. I see this being mentioned in the commit
> message, but I think it's something we need in the first version of "wait" to
> be useful.

More to the point, I actually disagree with that, because it could be
possible as well that the same condition variable is used across
multiple points.  At the end, IMHO, the central hash table should hold
zero meta-data associated to an injection point like a counter, an
elog, a condition variable, a sleep time, etc. or any combination of
all these ones, and should just know about how to load a callback with
a library path and a routine name.  I understand that this is leaving
the responsibility of what a callback should do down to the individual
developer implementing a callback into their own extension, where they
should be free to have conditions of their own.

Something that I agree would be very useful for the in-core APIs,
depending on the cases, is to be able to push some information to the
callback at runtime to let a callback decide what to do depending on a
running state, including a condition registered when a point was
attached.  
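
For readers following along, a rough sketch of how these pieces fit
together; the signatures are assumed from the patch fragments quoted
above, and the point name and callback body are invented:

/* 1. In a core code path: compiled out without --enable-injection-points,
 * and a no-op unless something is attached to the point. */
INJECTION_POINT("checkpoint-before-sync");

/* 2. In a test module: the callback plus its registration, typically
 * done from a SQL-callable setup function. */
static void
injection_error(const char *name)
{
	elog(ERROR, "error triggered for injection point \"%s\"", name);
}

InjectionPointAttach("checkpoint-before-sync",
					 "test_injection_points", "injection_error");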

Re: About #13489, array dimensions and CREATE TABLE ... LIKE

2023-11-20 Thread Bruce Momjian
On Fri, Sep  8, 2023 at 05:10:51PM -0400, Bruce Momjian wrote:
> I knew we only considered the array dimension sizes to be documentation
> _in_ the query, but I thought we at least properly displayed the number
> of dimensions specified at creation when we described the table in psql,
> but it seems we don't do that either.
> 
> A big question is why we even bother to record the dimensions in
> pg_attribute if is not accurate for LIKE and not displayed to the user
> in a meaningful way by psql.
> 
> I think another big question is whether the structure we are using to
> supply the column information to BuildDescForRelation is optimal.  The
> typmod that has to be found for CREATE TABLE uses:
> 
> typenameTypeIdAndMod(NULL, entry->typeName, &atttypid, &atttypmod);
> 
> which calls typenameTypeIdAndMod() -> typenameType() -> LookupTypeName()
> -> LookupTypeNameExtended() -> typenameTypeMod().  This seems very
> complicated because the ColumnDef, at least in the LIKE case,  already
> has the value.  Is there a need to revisit how we handle types in such
> cases?

(Bug report moved to hackers, previous bug reporters added CCs.)

I looked at this some more and found more fundamental problems.  We have
pg_attribute.attndims which does record the array dimensionality:

CREATE TABLE test (data integer, data_array integer[5][5]);

SELECT attndims
FROM pg_attribute
WHERE attrelid = 'test'::regclass AND
  attname = 'data_array';
 attndims
----------
        2

The first new problem I found is that we don't dump the dimensionality:

$ pg_dump test
...
CREATE TABLE public.test (
data integer,
--> data_array integer[]
);

and psql doesn't display the dimensionality:

\d test
   Table "public.test"
   Column   |   Type| Collation | Nullable | Default
+---+---+--+-
 data   | integer   |   |  |
-->  data_array | integer[] |   |  |

A report from 2015 notes that CREATE TABLE ... LIKE and CREATE TABLE
... AS don't propagate the dimensionality:


https://www.postgresql.org/message-id/flat/20150707072942.1186.98151%40wrigleys.postgresql.org

and this thread from 2018 supplied a fix:


https://www.postgresql.org/message-id/flat/7862e882-8b9a-0c8e-4a38-40ad374d3634%40brandwatch.com

though in my testing it only fixes LIKE, not CREATE TABLE ... AS.  This
report from April of this year also complains about LIKE:


https://www.postgresql.org/message-id/flat/ZD%2B14YZ4IUue8Rhi%40gendo.asyd.net

Here is the output from master for LIKE:

CREATE TABLE test2 (LIKE test);

SELECT attndims
FROM pg_attribute
WHERE attrelid = 'test2'::regclass AND
 attname = 'data_array';
 attndims
----------
-->     0

and this is the output for CREATE TABLE ... AS:

CREATE TABLE test3 AS SELECT * FROM test;

SELECT attndims
FROM pg_attribute
WHERE attrelid = 'test3'::regclass AND
  attname = 'data_array';
 attndims
----------
-->     0

The attached patch fixes pg_dump:

$ pg_dump test
...
CREATE TABLE public.test2 (
data integer,
--> data_array integer[][]
);

It uses repeat() at the SQL level rather than modifying format_type() at
the SQL or C level.  It seems format_type() is mostly used to get the
type name, e.g. int4[], rather than the column definition so I added
brackets at the call site.  I used a similar fix for psql output:

\d test
Table "public.test"
   Column   |Type | Collation | Nullable | Default
+-+---+--+-
 data   | integer |   |  |
-->  data_array | integer[][] |   |  |
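
The kind of change involved is roughly the following sketch (not the
actual patch; the query fragment and output column alias are assumptions
for illustration):

/*
 * In pg_dump's column query: append one bracket pair per dimension
 * beyond the single pair that format_type() already emits, so a
 * two-dimensional column comes out as integer[][].
 */
appendPQExpBufferStr(q,
					 "pg_catalog.format_type(a.atttypid, a.atttypmod) || "
					 "CASE WHEN a.attndims > 1 "
					 "THEN repeat('[]', a.attndims - 1) "
					 "ELSE '' END AS formatted_type");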


The 2018 patch from Alexey Bashtanov fixes the LIKE case:

CREATE TABLE test2 (LIKE test);

\d test2
   Table "public.test2"
   Column   |Type | Collation | Nullable | Default
+-+---+--+-
 data   | integer |   |  |
-->  data_array | integer[][] |   |  |

It does not fix CREATE TABLE ... AS because it looks like fixing that
would require adding an ndims column to Var for WITH NO DATA and adding
ndims to TupleDesc for WITH DATA.  I am not sure if that overhead is
warrented to fix this item.  I have added C comments where they should
be added.

I would like to apply this patch to master because I think our current
deficiencies in this area are unacceptable.  An alternate approach would
be to remove pg_attribute.attndims so we don't even try to preserve 
dimensionality.

-

Re: pg_upgrade and logical replication

2023-11-20 Thread Michael Paquier
On Mon, Nov 20, 2023 at 09:49:41AM +0530, Amit Kapila wrote:
> On Tue, Nov 14, 2023 at 7:21 AM vignesh C  wrote:
>> There are couple of things happening here: a) In the first part we
>> take care of setting subscription relation to SYNCDONE and dropping
>> the replication slot at publisher node, only if drop replication slot
>> is successful the relation state will be set to SYNCDONE , if drop
>> replication slot fails the relation state will still be in
>> FINISHEDCOPY. So if there is a failure in the drop replication slot we
>> will not have an issue as the tablesync worker will be in
>> FINISHEDCOPY state and this state is not allowed for upgrade. When the
>> state is in SYNCDONE the tablesync slot will not be present. b) In the
>> second part we drop the replication origin, even if there is a chance
>> that drop replication origin fails due to some reason, there will be
>> no problem as we do not copy the table sync replication origin to the
>> new cluster while upgrading. Since the table sync replication origin
>> is not copied to the new cluster there will be no replication origin
>> leaks.
> 
> And, this will work because in the SYNCDONE state, while removing the
> origin, we are okay with missing origins. It seems not copying the
> origin for tablesync workers in this state (SYNCDONE) relies on the
> fact that currently, we don't use those origins once the system
> reaches the SYNCDONE state but I am not sure it is a good idea to have
> such a dependency and that upgrade assuming such things doesn't seems
> ideal to me.

Hmm, yeah, you mean the replorigin_drop_by_name() calls in
tablesync.c.  I did not pay much attention about that in the code, but
your point sounds sensible.

(I have not been able to complete an analysis of the risks behind 's'
to convince myself that it is entirely safe, but leaks are scary as
hell if this gets automated across a large fleet of nodes..)

> Personally, I think allowing an upgrade in  'i'
> (initialize) state or 'r' (ready) state seems safe because in those
> states either slots/origins don't exist or are dropped. What do you
> think?

I share a similar impression about 's'.  From a design point of view,
making the conditions harder to reach in the first implementation
makes the user experience stricter, but that's safer regarding leaks
and it is still possible to relax these choices in the future
depending on the improvement pieces we are able to figure out.
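
To illustrate what the stricter condition amounts to, here is a sketch
of the kind of check pg_upgrade could run; the query and error text are
illustrative, not from the patch:

/* Fail the upgrade unless every subscription relation is in the 'i'
 * (init) or 'r' (ready) state, where no tablesync slot or origin can
 * be left behind. */
res = executeQueryOrDie(conn,
						"SELECT count(*) FROM pg_catalog.pg_subscription_rel "
						"WHERE srsubstate NOT IN ('i', 'r')");
if (strcmp(PQgetvalue(res, 0, 0), "0") != 0)
	pg_fatal("old cluster has subscription relations in states unsafe for upgrade");
PQclear(res);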
--
Michael


signature.asc
Description: PGP signature


Re: meson documentation build open issues

2023-11-20 Thread Andres Freund
Hi,

On 2023-11-20 08:27:48 +0100, Peter Eisentraut wrote:
> On 17.11.23 19:53, Andres Freund wrote:
> > I pushed the first two commits (the selinux stuff) and worked a bit more on
> > the subsequent ones.
> 
> Patches 0001 through 0004 look good to me.

Cool, I pushed them now.


> Some possible small tweaks in 0004:
> 
> +perl, '-ne', 'next if /^#/; print',
> 
> If you're going for super-brief mode, you could also use "perl -p" and drop
> the "print".

I thought this didn't add much, so I didn't go there.


> Put at least two spaces between the "columns" in targets-meson.txt:
> 
> +  doc/src/sgml/postgres-A4.pdf  Build documentation in PDF format, with
>^^

I did adopt this.


One remaining question is whether we should adjust install-doc-{html,man} to
be install-{html,man}, to match the docs targets.

Greetings,

Andres Freund




Re:Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?

2023-11-20 Thread yuansong
Thanks. After reconsideration, I realized that what I really want is for other
connections to remain unaffected when a process crashes. This is something that
a connection pool cannot solve.







At 2023-11-14 09:41:03, "Thomas wen"  wrote:

Hi yuansong
  there is a connection pool patch
(https://commitfest.postgresql.org/34/3043/), but it has been dormant for a few
years. You can check that patch for what you need.
From: yuansong 
Sent: November 13, 2023 17:13
To: Laurenz Albe 
Cc: pgsql-hackers@lists.postgresql.org 
Subject: Re:Re: How to solve the problem of one backend process crashing and causing 
other processes to restart?
 

Enhancing the overall fault tolerance of the entire system for this feature is 
quite important. No one can avoid bugs, and I don't believe that this approach 
is a more advanced one. It might be worth considering adding it to the roadmap 
so that interested parties can conduct relevant research.

The current major issue is that when one process crashes, resetting all 
connections has a significant impact on other connections. Is it possible to 
only disconnect the crashed connection and have the other connections simply 
roll back the current transaction without reconnecting? Perhaps this problem 
can be mitigated through the use of a connection pool.

If we want to implement this feature, would it be sufficient to clean up or 
restore the shared memory and disk changes caused by the crashed backend? 
Besides clearing various known locks, what else needs to be changed? Does 
anyone have any insights? Please help me. Thank you.

















At 2023-11-13 13:53:29, "Laurenz Albe"  wrote:
>On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:
>> yuansong  writes:
>> > In PostgreSQL, when a backend process crashes, it can cause other backend
>> > processes to also require a restart, primarily to ensure data consistency.
>> > I understand that the correct approach is to analyze and identify the
>> > cause of the crash and resolve it. However, it is also important to be
>> > able to handle a backend process crash without affecting the operation of
>> > other processes, thus minimizing the scope of negative impact and
>> > improving availability. To achieve this goal, could we mimic the Oracle
>> > process by introducing a "pmon" process dedicated to rolling back crashed
>> > process transactions and performing resource cleanup? I wonder if anyone
>> > has attempted such a strategy or if there have been previous discussions
>> > on this topic.
>> 
>> The reason we force a database-wide restart is that there's no way to
>> be certain that the crashed process didn't corrupt anything in shared
>> memory.  (Even with the forced restart, there's a window where bad
>> data could reach disk before we kill off the other processes that
>> might write it.  But at least it's a short window.)  "Corruption"
>> here doesn't just involve bad data placed into disk buffers; more
>> often it's things like unreleased locks, which would block other
>> processes indefinitely.
>> 
>> I seriously doubt that anything like what you're describing
>> could be made reliable enough to be acceptable.  "Oracle does
>> it like this" isn't a counter-argument: they have a much different
>> (and non-extensible) architecture, and they also have an army of
>> programmers to deal with minutiae like undoing resource acquisition.
>> Even with that, you'd have to wonder about the number of bugs
>> existing in such necessarily-poorly-tested code paths.
>
>Yes.
>I think that PostgreSQL's approach is superior: rather than investing in
>code to mitigate the impact of data corruption caused by a crash, invest
>in quality code that doesn't crash in the first place.
>
>Euphemistically naming a crash "ORA-600 error" seems to be part of
>their strategy.
>
>Yours,
>Laurenz Albe
>


Re: About #13489, array dimensions and CREATE TABLE ... LIKE

2023-11-20 Thread Tom Lane
Bruce Momjian  writes:
> I would like to apply this patch to master because I think our current
> deficiencies in this area are unacceptable.

I do not think this is a particularly good idea, because it creates
the impression in a couple of places that we track this data, when
we do not really do so to any meaningful extent.

> An alternate approach would
> be to remove pg_attribute.attndims so we don't even try to preserve 
> dimensionality.

I could get behind that, perhaps.  It looks like we're not using the
field in any meaningful way, and we could simplify TupleDescInitEntry
and perhaps some other APIs.

regards, tom lane




Re: About #13489, array dimensions and CREATE TABLE ... LIKE

2023-11-20 Thread Bruce Momjian
On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > I would like to apply this patch to master because I think our current
> > deficiencies in this area are unacceptable.
> 
> I do not think this is a particularly good idea, because it creates
> the impression in a couple of places that we track this data, when
> we do not really do so to any meaningful extent.

Okay, I thought we could get by without tracking the CREATE TABLE AS
case, but it is inconsistent.  My patch just makes it less
inconsistent.

> > An alternate approach would
> > be to remove pg_attribute.attndims so we don't even try to preserve 
> > dimensionality.
> 
> I could get behind that, perhaps.  It looks like we're not using the
> field in any meaningful way, and we could simplify TupleDescInitEntry
> and perhaps some other APIs.

So should I work on that patch or do you want to try?  I think we should
do something.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key

2023-11-20 Thread Jeff Davis
Patch attached.

The caller could do something similar, so this option is not necessary,
but it seems like it could be generally useful. It speeds things up for
the search_path cache (and is an alternative to another patch I have
that implements the same thing in the caller).

Thoughts?

Regards,
Jeff Davis

From b878af835da794f3384f870db57b34e236b1efba Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Mon, 20 Nov 2023 17:42:07 -0800
Subject: [PATCH] Add SH_OPTIMIZE_REPEAT option to simplehash.h.

Callers which expect to look up the same value repeatedly can specify
SH_OPTIMIZE_REPEAT. That option causes simplehash to quickly check if
the last entry returned has the same key, and return it immediately
if so (before calculating the hash).
---
 src/include/lib/simplehash.h | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index b1a3d7f927..9a3dbfecfa 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -49,6 +49,7 @@
  *	  - SH_EQUAL(table, a, b) - compare two table keys
  *	  - SH_HASH_KEY(table, key) - generate hash for the key
  *	  - SH_STORE_HASH - if defined the hash is stored in the elements
+ *	  - SH_OPTIMIZE_REPEAT - optimize for repeated lookups of the same key
  *	  - SH_GET_HASH(tb, a) - return the field to store the hash in
  *
  *	  The element type is required to contain a "status" member that can store
@@ -163,6 +164,11 @@ typedef struct SH_TYPE
 	/* hash buckets */
 	SH_ELEMENT_TYPE *data;
 
+#ifdef SH_OPTIMIZE_REPEAT
+	/* last element found */
+	SH_ELEMENT_TYPE *previous;
+#endif
+
 #ifndef SH_RAW_ALLOCATOR
 	/* memory context to use for allocations */
 	MemoryContext ctx;
@@ -777,7 +783,16 @@ restart:
 SH_SCOPE	SH_ELEMENT_TYPE *
 SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found)
 {
-	uint32		hash = SH_HASH_KEY(tb, key);
+	uint32		hash;
+
+#ifdef SH_OPTIMIZE_REPEAT
+	if (tb->previous != NULL &&
+		tb->previous->status == SH_STATUS_IN_USE &&
+		SH_EQUAL(tb, key, tb->previous->SH_KEY))
+		return tb->previous;
+#endif
+
+	hash = SH_HASH_KEY(tb, key);
 
 	return SH_INSERT_HASH_INTERNAL(tb, key, hash, found);
 }
@@ -815,7 +830,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 		Assert(entry->status == SH_STATUS_IN_USE);
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
+		{
+#ifdef SH_OPTIMIZE_REPEAT
+			tb->previous = entry;
+#endif
 			return entry;
+		}
 
 		/*
 		 * TODO: we could stop search based on distance. If the current
@@ -834,7 +854,16 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 SH_SCOPE	SH_ELEMENT_TYPE *
 SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key)
 {
-	uint32		hash = SH_HASH_KEY(tb, key);
+	uint32		hash;
+
+#ifdef SH_OPTIMIZE_REPEAT
+	if (tb->previous != NULL &&
+		tb->previous->status == SH_STATUS_IN_USE &&
+		SH_EQUAL(tb, key, tb->previous->SH_KEY))
+		return tb->previous;
+#endif
+
+	hash = SH_HASH_KEY(tb, key);
 
 	return SH_LOOKUP_HASH_INTERNAL(tb, key, hash);
 }
@@ -1152,6 +1181,7 @@ SH_STAT(SH_TYPE * tb)
 #undef SH_DEFINE
 #undef SH_GET_HASH
 #undef SH_STORE_HASH
+#undef SH_OPTIMIZE_REPEAT
 #undef SH_USE_NONDEFAULT_ALLOCATOR
 #undef SH_EQUAL
 
-- 
2.34.1
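
For reference, opting in would be a single extra define at the include
site -- a sketch reusing the search_path cache names from the
namespace.c excerpts earlier in this digest, with everything else
unchanged:

#define SH_PREFIX		nsphash
#define SH_ELEMENT_TYPE	SearchPathCacheEntry
#define SH_KEY_TYPE		SearchPathCacheKey
#define SH_KEY			key
#define SH_HASH_KEY(tb, key)	spcachekey_hash(key)
#define SH_EQUAL(tb, a, b)		spcachekey_equal(a, b)
#define SH_SCOPE		static inline
#define SH_OPTIMIZE_REPEAT		/* new: short-circuit repeated lookups */
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"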



Re: About #13489, array dimensions and CREATE TABLE ... LIKE

2023-11-20 Thread Tom Lane
Bruce Momjian  writes:
> On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote:
>> Bruce Momjian  writes:
>>> An alternate approach would
>>> be to remove pg_attribute.attndims so we don't even try to preserve 
>>> dimensionality.

>> I could get behind that, perhaps.  It looks like we're not using the
>> field in any meaningful way, and we could simplify TupleDescInitEntry
>> and perhaps some other APIs.

> So should I work on that patch or do you want to try?  I think we should
> do something.

Let's wait for some other opinions, first ...

regards, tom lane




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread Michael Paquier
On Mon, Nov 20, 2023 at 03:58:55PM -0800, Andres Freund wrote:
> I was thinking we'd just set it in the pg_basebackup style path, and we'd
> error out if it's set and backup_label is present. But we'd still use
> backup_label without the pg_control flag set.
>
> So it'd just provide a cross-check that backup_label was not removed for
> pg_basebackup style backup, but wouldn't do anything for external backups. But
> imo the proposal to just us pg_control doesn't actually do anything for
> external backups either - which is why I think my proposal would achieve as
> much, for a much lower price.

I don't see why not.  It does not increase the number of steps when
doing a backup, and backupStartPoint alone would not be able to offer
this much protection.
--
Michael


signature.asc
Description: PGP signature

