Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-22 Thread Alvaro Herrera
On 2022-Feb-22, Imseih (AWS), Sami wrote:

> On 13.5 a wal flush PANIC is encountered after a standby is promoted.
> 
> With debugging, it was found that when a standby skips a missing
> continuation record on recovery, the missingContrecPtr is not
> invalidated after the record is skipped. Therefore, when the standby
> is promoted to a primary it writes an overwrite_contrecord with an LSN
> of the missingContrecPtr, which is now in the past. On flush time,
> this causes a PANIC. From what I can see, this failure scenario can
> only occur after a standby is promoted.

Ooh, nice find and diagnosis.  I can confirm that the test fails as you
described without the code fix, and doesn't fail with it.

I attach the same patch, with the test file put in its final place
rather than as a patch.  Due to recent xlog.c changes this needs a bit of
work to apply to back branches; I'll see about getting it in all
branches soon.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg0.php
From 7948d1acdcd25b5b425c7804fad0d46cfb4b14b0 Mon Sep 17 00:00:00 2001
From: "Sami Imseih (AWS)" 
Date: Tue, 22 Feb 2022 19:09:36 +
Subject: [PATCH v2] Fix "missing continuation record" after standby promotion

Fix a condition where a recently promoted standby attempts to write an
OVERWRITE_CONTRECORD with an LSN of the previously read aborted record.

Author: Sami Imseih 
Discussion: https://postgr.es/m/44d259de-7542-49c4-8a52-2ab01534d...@amazon.com
---
 src/backend/access/transam/xlog.c |  16 ++-
 .../t/029_overwrite_contrecord_promotion.pl   | 111 ++
 2 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/029_overwrite_contrecord_promotion.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..62e942f41a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5423,11 +5423,25 @@ StartupXLOG(void)
 	 * made it through and start writing after the portion that persisted.
 	 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
 	 * we'll do as soon as we're open for writing new WAL.)
+	 *
+	 * If the last wal record is ahead of the missing contrecord, this is a
+	 * recently promoted primary and we should not write an overwrite
+	 * contrecord.
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
-		EndOfLog = missingContrecPtr;
+		if (endOfRecoveryInfo->lastRec < missingContrecPtr)
+		{
+			elog(DEBUG2, "setting end of wal to missing continuation record %X/%X",
+ LSN_FORMAT_ARGS(missingContrecPtr));
+			EndOfLog = missingContrecPtr;
+		}
+		else
+		{
+			elog(DEBUG2, "resetting aborted record");
+			abortedRecPtr = InvalidXLogRecPtr;
+		}
 	}
 
 	/*
diff --git a/src/test/recovery/t/029_overwrite_contrecord_promotion.pl b/src/test/recovery/t/029_overwrite_contrecord_promotion.pl
new file mode 100644
index 00..ea4ebb32c0
--- /dev/null
+++ b/src/test/recovery/t/029_overwrite_contrecord_promotion.pl
@@ -0,0 +1,111 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Tests for resetting the "aborted record" after a promotion.
+
+use strict;
+use warnings;
+
+use FindBin;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test: Create a physical replica that's missing the last WAL file,
+# then restart the primary to create a divergent WAL file and observe
+# that the replica resets the "aborted record" after a promotion.
+
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+# We need these settings for stability of WAL behavior.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+wal_keep_size = 1GB
+log_min_messages = DEBUG2
+));
+$node->start;
+
+$node->safe_psql('postgres', 'create table filler (a int, b text)');
+
+# Now consume all remaining room in the current WAL segment, leaving
+# space enough only for the start of a largish record.
+$node->safe_psql(
+	'postgres', q{
+DO $$
+DECLARE
+wal_segsize int := setting::int FROM pg_settings WHERE name = 'wal_segment_size';
+remain int;
+iters  int := 0;
+BEGIN
+LOOP
+INSERT into filler
+select g, repeat(md5(g::text), (random() * 60 + 1)::int)
+from generate_series(1, 10) g;
+
+remain := wal_segsize - (pg_current_wal_insert_lsn() - '0/0') % wal_segsize;
+IF remain < 2 * setting::int from pg_settings where name = 'block_size' THEN
+RAISE log 'exiting after % iterations, % bytes to end of WAL segment', iters, remain;
+EXIT;
+END IF;
+iters := iters + 1;
+END LOOP;
+END
+$$;
+});
+
+

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-02-23 Thread Alvaro Herrera
I think the change to ImmediateCheckpointRequested() makes no sense.
Before this patch, that function merely inquires whether there's an
immediate checkpoint queued.  After this patch, it ... changes a
progress-reporting flag?  I think it would make more sense to make the
progress-report flag change in whatever is the place that *requests* an
immediate checkpoint rather than here.

I think the use of capitals in CHECKPOINT and CHECKPOINTER in the
documentation is excessive.  (Same for terms such as MULTIXACT and
others in those docs; we typically use those in lowercase when
user-facing; and do we really use term CLOG anymore? Don't we call it
"commit log" nowadays?)

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)




Re: convert libpq uri-regress tests to tap test

2022-02-23 Thread Alvaro Herrera
On 2022-Feb-23, Andres Freund wrote:

> When verifying that the meson port actually runs all perl based tests I came
> across src/interfaces/libpq/test/regress.pl.  Instead of running tests yet
> another way, it seems better to convert it to a tap test.
> 
> I hope others agree?

WFM.

> Perhaps we should just rename src/test/modules/libpq_pipeline to
> src/test/modules/libpq and move uri-regress in there as well?

Well, building multiple binaries would require some trickery that might
be excessive for this little tool.  But +1 to that on principle.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)




Re: convert libpq uri-regress tests to tap test

2022-02-23 Thread Alvaro Herrera
On 2022-Feb-23, Andres Freund wrote:

> Separately: I don't really understand why we do the whole if USE_PGXS dance in
> src/test/modules?

Yeah, it seems a bit silly.  I'm not opposed to making these makefiles
unconditionally do the PGXS thing -- or the non-PGXS thing, if that
makes it easier to build multiple binaries.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La grandeza es una experiencia transitoria.  Nunca es consistente.
Depende en gran parte de la imaginación humana creadora de mitos"
(Irulan)




Re: Add id's to various elements in protocol.sgml

2022-02-24 Thread Alvaro Herrera
On 2022-Feb-24, Dagfinn Ilmari Mannsåker wrote:

> Peter Eisentraut  writes:

> > Is there a way to obtain those URLs other than going into the HTML
> > sources and checking if there is an anchor near where you want go?
> 
> I use the jump-to-anchor extension: https://github.com/brettz9/jump-to-anchor/
> 
> Some sites have javascript that adds a link next to the element that
> becomes visible when hovering, e.g. the NAME and other headings on
> https://metacpan.org/pod/perl.

Would it be possible to create such anchor links as part of the XSL
stylesheets for HTML?

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: support for MERGE

2022-02-27 Thread Alvaro Herrera
On 2022-Jan-28, Andres Freund wrote:

> Any chance you could split this into something more reviewable? The overall
> diff stats are: 102 files changed, 8589 insertions(+), 234 deletions(-); that's
> pretty hard to really review.  Incremental commits don't really help with that.

I'll work on splitting this next week.

> One thing from skimming: There's not enough documentation about the approach
> imo - it's a complicated enough feature to deserve more than the one paragraph
> in src/backend/executor/README.

Sure, I'll have a look at that.

> I'm still [1],[2] uncomfortable with execMerge.c kinda-but-not-really being an
> executor node.

I think we should make a decision on code arrangement here.  From my
perspective, MERGE isn't its own executor node; rather it's just another
"method" in ModifyTable.  Which makes sense, given that all it does is
call parts of INSERT, UPDATE, DELETE which are the other ModifyTable
methods.  Having a separate file doesn't strike me as great, but on the
other hand it's true that merely moving all the execMerge.c code into
nodeModifyTable.c makes the latter too large.  However I don't want to
create a .h file that means exposing all those internal functions to the
outside world.  My ideal would be to have each INSERT, UPDATE, DELETE,
MERGE as its own separate .c file, which would be #included from
nodeModifyTable.c.  We don't use that pattern anywhere though.  Any
opposition to that?  (The prototypes for each file would have to live in
nodeModifyTable.c itself.)


> > diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
> > index 1a9c1ac290..280ac40e63 100644
> > --- a/src/backend/commands/trigger.c
> > +++ b/src/backend/commands/trigger.c
> 
> This stuff seems like one candidate for splitting out.

Yeah, I had done that.  It's now posted as 0001.

> > +   /*
> > +* We maintain separate transition tables for UPDATE/INSERT/DELETE since
> > +* MERGE can run all three actions in a single statement. Note that UPDATE
> > +* needs both old and new transition tables whereas INSERT needs only new,
> > +* and DELETE needs only old.
> > +*/
> > +
> > +   /* "old" transition table for UPDATE, if any */
> > +   Tuplestorestate *old_upd_tuplestore;
> > +   /* "new" transition table for UPDATE, if any */
> > +   Tuplestorestate *new_upd_tuplestore;
> > +   /* "old" transition table for DELETE, if any */
> > +   Tuplestorestate *old_del_tuplestore;
> > +   /* "new" transition table INSERT, if any */
> > +   Tuplestorestate *new_ins_tuplestore;
> > +
> > TupleTableSlot *storeslot;  /* for converting to tuplestore's format */
> >  };
> 
> Do we need to do something slightly clever about the memory limits for the
> tuplestore? Each of them having a separate work_mem makes nodeModifyTable.c
> one memory hungry node (the worst of all maybe?).

Not sure how that would work.  The memory handling is inside
tuplestore.c IIRC ... I think that would require some largish
refactoring.  Merely dividing by four doesn't seem like a great answer
either.  Perhaps we could divide by two and be optimistic about it.

> > +   /*
> > +* Project the tuple.  In case of a partitioned table, the
> > +* projection was already built to use the root's descriptor,
> > +* so we don't need to map the tuple here.
> > +*/
> > +   actionInfo.actionState = action;
> > +   insert_slot = ExecProject(action->mas_proj);
> > +
> > +   (void) ExecInsert(mtstate, rootRelInfo,
> > + insert_slot, slot,
> > + estate, &actionInfo,
> > + mtstate->canSetTag);
> > +   InstrCountFiltered1(&mtstate->ps, 1);
> 
> What happens if somebody concurrently inserts a conflicting row?

An open question to which I don't have any good answer RN.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)




Re: support for MERGE

2022-03-12 Thread Alvaro Herrera
FYI I intend to get the ModifyTable split patch (0001+0003) pushed
hopefully on Tuesday or Wednesday next week, unless something really
ugly is found on it.

As for MERGE proper, I'm aiming at getting that one pushed on the week
starting March 21st, though of course I'll spend some more time on it
and will probably post further versions of it before that.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)




Re: Support logical replication of DDLs

2022-03-16 Thread Alvaro Herrera
Hello

I think this is a pretty interesting and useful feature.

Did you see some old code I wrote towards this goal?
https://www.postgresql.org/message-id/20150215044814.gl3...@alvh.no-ip.org
The intention was that DDL would produce some JSON blob that accurately
describes the DDL that was run; the caller can acquire that and use it
to produce working DDL that doesn't depend on runtime conditions.  There
was lots of discussion on doing things this way.  It was ultimately
abandoned, but I think it's valuable.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Porque francamente, si para saber manejarse a uno mismo hubiera que
rendir examen... ¿Quién es el machito que tendría carnet?"  (Mafalda)




Re: support for MERGE

2022-03-17 Thread Alvaro Herrera
On 2022-Mar-17, Alvaro Herrera wrote:

> I'll see what to do about Instrumentation->nfiltered{1,2,3} that was
> complained about by Andres upthread.  Maybe some additional macros will
> help.

This turns out not to be as simple as I expected, mostly because we want
to keep Instrumentation as a node-agnostic struct.  Things are already a
bit wonky with nfiltered/nfiltered2, and the addition of nfiltered3
makes things a lot uglier, and my impulse of using a union to separate
what is used for scans/joins vs. what is used for MERGE results in an
even more node-specific definition rather than the other way around.

Looking at the history I came across this older thread where this was
discussed
https://postgr.es/m/20180409215851.idwc75ct2bzi6tea@alvherre.pgsql
particularly this message from Robert,
https://www.postgresql.org/message-id/CA%2BTgmoaE3R6%3DV0G6zbht2L_LE%2BTsuYuSTPJXjLW%2B9_tpMGubZQ%40mail.gmail.com

I'll keep looking at this in the coming days, see if I can come up with
something sensible.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Y una voz del caos me habló y me dijo
"Sonríe y sé feliz, podría ser peor".
Y sonreí. Y fui feliz.
Y fue peor.




Re: a misbehavior of partition row movement (?)

2022-03-18 Thread Alvaro Herrera
I rebased this patch; v15 attached.  Other than fixing the (very large)
conflicts due to nodeModifyTable.c rework, the most important change is
moving GetAncestorResultRels into execMain.c and renaming it to have the
"Exec-" prefix.  The reason is that what this code is doing is affect
struct ResultRelInfo, which is owned by execMain.c, so it seemed bogus
to do that in nodeModifyTable.c and then let execMain.c's
ExecCloseResultRelations() do the cleanup.  I added a little commentary
in the latter routine too.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"(Andrew Morton)
From 345ed49718708d8ebde9e2dcf06bf963190bc5c8 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 18 Mar 2022 11:01:24 +0100
Subject: [PATCH v15] Enforce foreign key correctly during cross-partition
 updates

When an update on a partitioned table referenced in a foreign key
constraint causes a row to move from one partition to another, which
is implemented by deleting the old row from the source leaf partition
followed by inserting the new row into the destination leaf partition,
firing the foreign key triggers on that delete event can result in
surprising outcomes for those keys.  For example, a given foreign
key's delete trigger which implements the ON DELETE CASCADE clause of
that key will delete any referencing rows when triggerred for that
internal DELETE, although it should not, because the referenced row
is simply being moved from one partition of the referenced root
partitioned table into another, not being deleted from it.

This commit teaches trigger.c to skip queuing such delete trigger
events on the leaf partitions in favor of an UPDATE event fired on
the root target relation.  Doing so makes sense because both the
old and the new tuple "logically" belong to the root relation.

The after trigger event queuing interface now allows passing the
source and the destination partitions of a particular cross-partition
update when registering the update event for the root partitioned
table.  Along with the 2 ctids of the old and the new tuple, an after
trigger event now also stores the OIDs of those partitions. The tuples
fetched from the source and the destination partitions are converted
into the root table format before they are passed to the trigger
function.

The implementation currently has a limitation that only the foreign
keys pointing into the query's target relation are considered, not
those of its sub-partitioned partitions.  That seems like a
reasonable limitation, because it sounds rare to have distinct
foreign keys pointing into sub-partitioned partitions, but not into
the root table.

Author: Amit Langote
---
 doc/src/sgml/ref/update.sgml  |   7 +
 src/backend/commands/trigger.c| 389 +++---
 src/backend/executor/execMain.c   |  86 -
 src/backend/executor/execReplication.c|   5 +-
 src/backend/executor/nodeModifyTable.c| 155 -
 src/backend/utils/adt/ri_triggers.c   |   6 +
 src/include/commands/trigger.h|   8 +-
 src/include/executor/executor.h   |   4 +-
 src/include/nodes/execnodes.h |   6 +
 src/test/regress/expected/foreign_key.out | 204 +++-
 src/test/regress/sql/foreign_key.sql  | 135 +++-
 11 files changed, 925 insertions(+), 80 deletions(-)

diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 3fa54e5f70..3ba13010e7 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -316,6 +316,13 @@ UPDATE count
partition (provided the foreign data wrapper supports tuple routing), they
cannot be moved from a foreign-table partition to another partition.
   
+
+  
+   An attempt to move a row from one partition to another will fail if a
+   foreign key is found to directly reference a non-root partitioned table
+   in the partition tree, unless that table is also directly mentioned
+   in the UPDATE query.
+  
  
 
  
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index e08bd9a370..a9aa043981 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -95,10 +95,13 @@ static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
 	 Instrumentation *instr,
 	 MemoryContext per_tuple_context);
 static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
+  ResultRelInfo *src_partinfo,
+  ResultRelInfo *dst_partinfo,
   int event, bool row_trigger,
   TupleTableSlot *oldtup, TupleTableSlot *newtup,
   List *recheckIndexes, Bitmapset *modifiedCols,
-  TransitionCaptureState *transition_capture);
+  TransitionCaptureState *transition_capture,
+  bool is_crosspart_update);
 static void AfterTriggerEnlargeQueryState(void);
 static bool before_stmt_triggers_fired(Oi

Re: support for MERGE

2022-03-19 Thread Alvaro Herrera
On 2022-Mar-10, Simon Riggs wrote:

> Duplicate rows should produce a uniqueness violation error in one of
> the transactions, assuming there is a constraint to define the
> conflict. Without such a constraint there is no conflict.
> 
> Concurrent inserts are checked by merge-insert-update.spec, which
> correctly raises an ERROR in this case, as documented.

Agreed -- I think this is reasonable.

> Various cases were not tested in the patch - additional patch
> attached, but nothing surprising there.

Thank you, I've included this in all versions of the patch since you
sent it.

> ExecInsert() does not return from such an ERROR, so the code fragment
> appears correct to me.

I think trying to deal with it in a different way (namely: suspend
processing the inserting WHERE NOT MATCHED clause and switch to
processing the row using WHERE MATCHED clauses) would require us to use
speculative tokens, similar to the way INSERT ON CONFLICT does.  I'm not
sure we want to go there, but it seems okay to leave that for a later
patch.  Moreover, there would be no compatibility hit from doing so.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La persona que no quería pecar / estaba obligada a sentarse
 en duras y empinadas sillas/ desprovistas, por cierto
 de blandos atenuantes"  (Patricio Vogel)




Re: a misbehavior of partition row movement (?)

2022-03-19 Thread Alvaro Herrera
On 2022-Mar-18, Zhihong Yu wrote:

> +#define AFTER_TRIGGER_OFFSET   0x07FF  /* must be low-order bits */
> +#define AFTER_TRIGGER_DONE 0x8000
> +#define AFTER_TRIGGER_IN_PROGRESS  0x4000
> 
> Is it better if the order of AFTER_TRIGGER_DONE
> and AFTER_TRIGGER_IN_PROGRESS is swapped (for the ordinal values to be
> sequential) ?

They *are* sequential -- See
https://www.postgresql.org/message-id/202201172215.2tse3vjjgi2b%40alvherre.pgsql

> +#define AFTER_TRIGGER_CP_UPDATE0x0800
> 
> It would be better to add a comment for this constant, explaining what CP
> means (cross partition).

Sure.

> +   if (!partRel->rd_rel->relispartition)
> +   elog(ERROR, "cannot find ancestors of a non-partition result relation");
> 
> It would be better to include the relation name in the error message.

I don't think it matters.  We don't really expect to hit this.

> +   /* Ignore the root ancestor, because ...?? */
> 
> Please fill out the remainder of the comment.

I actually would like to know what's the rationale for this myself.
Amit?

> +   if (!trig->tgisclone &&
> +   RI_FKey_trigger_type(trig->tgfoid) == RI_TRIGGER_PK)
> +   {
> +   has_noncloned_fkey = true;
> 
> The variable says fkey, but the constant is not RI_TRIGGER_FK. Maybe add a
> comment explaining why.

Well, the constant is about the trigger *function*, not about any
constraint.  This code is testing "is this a noncloned trigger, and does
that trigger use an FK-related function?"  If you have a favorite
comment to include, I'm all ears.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html
From a71baa9ab81d6f9ed04ce2c37c86d806ef36aa8b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 18 Mar 2022 11:01:24 +0100
Subject: [PATCH v16] Enforce foreign key correctly during cross-partition
 updates

When an update on a partitioned table referenced in a foreign key
constraint causes a row to move from one partition to another, which
is implemented by deleting the old row from the source leaf partition
followed by inserting the new row into the destination leaf partition,
firing the foreign key triggers on that delete event can result in
surprising outcomes for those keys.  For example, a given foreign
key's delete trigger which implements the ON DELETE CASCADE clause of
that key will delete any referencing rows when triggered for that
internal DELETE, although it should not, because the referenced row
is simply being moved from one partition of the referenced root
partitioned table into another, not being deleted from it.

This commit teaches trigger.c to skip queuing such delete trigger
events on the leaf partitions in favor of an UPDATE event fired on
the root target relation.  Doing so makes sense because both the
old and the new tuple "logically" belong to the root relation.

The after trigger event queuing interface now allows passing the
source and the destination partitions of a particular cross-partition
update when registering the update event for the root partitioned
table.  Along with the 2 ctids of the old and the new tuple, an after
trigger event now also stores the OIDs of those partitions. The tuples
fetched from the source and the destination partitions are converted
into the root table format before they are passed to the trigger
function.

The implementation currently has a limitation that only the foreign
keys pointing into the query's target relation are considered, not
those of its sub-partitioned partitions.  That seems like a
reasonable limitation, because it sounds rare to have distinct
foreign keys pointing into sub-partitioned partitions, but not into
the root table.

Author: Amit Langote
---
 doc/src/sgml/ref/update.sgml  |   7 +
 src/backend/commands/trigger.c| 394 +++---
 src/backend/executor/execMain.c   |  86 -
 src/backend/executor/execReplication.c|   5 +-
 src/backend/executor/nodeModifyTable.c| 151 -
 src/backend/utils/adt/ri_triggers.c   |   6 +
 src/include/commands/trigger.h|   8 +-
 src/include/executor/executor.h   |   4 +-
 src/include/nodes/execnodes.h |   6 +
 src/test/regress/expected/foreign_key.out | 204 ++-
 src/test/regress/sql/foreign_key.sql  | 135 +++-
 11 files changed, 926 insertions(+), 80 deletions(-)

diff --git a/doc/src/sgml/ref/update.sgml b/doc/s

Re: a misbehavior of partition row movement (?)

2022-03-20 Thread Alvaro Herrera
On 2022-Mar-20, Amit Langote wrote:

> > On Sun, Mar 20, 2022 at 5:13 AM Alvaro Herrera wrote:
> > On 2022-Mar-18, Zhihong Yu wrote:

> > > +   if (!partRel->rd_rel->relispartition)
> > > +   elog(ERROR, "cannot find ancestors of a non-partition result relation");
> > >
> > > It would be better to include the relation name in the error message.
> >
> > I don't think it matters.  We don't really expect to hit this.
> 
> I tend to think maybe showing at least the OID in the error message
> doesn't hurt, but maybe we don't need to.

Since we don't even know of a situation in which this error message
would be raised, I'm hardly bothered by failing to print the OID.  If
any users complain, we can add more detail.

> I've fixed the comment as:
> 
> -   /* Ignore the root ancestor, because ...?? */
> +   /* Root ancestor's triggers will be processed. */

Okay, included that.

> A description of what we're looking for with this code is in the
> comment above the loop:
> 
> /*
>  * For any foreign keys that point directly into non-root ancestors of
>  * the source partition,...
> 
> So finding triggers in those non-root ancestors whose function is
> RI_TRIGGER_PK tells us that those relations have foreign keys pointing
> into it or that it is the PK table in that relationship.  Other than
> the comment, the code itself seems self-documenting with regard to
> what's being done (given the function/macro/variable naming), so maybe
> we're better off without additional commentary here.

Yeah, WFM.

> I've attached a delta patch on v16 for the above comment and a couple
> of other changes.

Merged that in, and pushed.  I made a couple of wording changes in
comments here and there as well.

I lament the fact that this fix is not going to hit Postgres 12-14, but
ratio of effort to reward seems a bit too high.  I think we could
backpatch the two involved commits if someone is motivated enough to
verify everything and come up with solutions for the necessary ABI
changes.

Thank you, Amit, for your perseverance in getting this bug fixed!  

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-03-21 Thread Alvaro Herrera
On 2022-Mar-04, Michael Paquier wrote:

> d6d317d as solved the issue of tablespace paths across multiple nodes
> with the new GUC called allow_in_place_tablespaces, and is getting
> successfully used in the recovery tests as of 027_stream_regress.pl.

OK, but that means that the test suite is now not backpatchable.  The
implication here is that either we're going to commit the fix without
any tests at all on older branches, or that we're going to fix it only
in branch master.  Are you thinking that it's okay to leave this bug
unfixed in older branches?  That seems embarrassing.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
 no me acuerdo." (Augusto Pinochet a una corte de justicia)




Re: Column Filtering in Logical Replication

2022-03-21 Thread Alvaro Herrera
Hello,

Please add me to the list of authors of this patch.  I made a large
number of nontrivial changes to it early on.  Thanks.  I have modified
the entry in the CF app (which sorts alphabetically, it was not my
intention to put my name first.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-03-21 Thread Alvaro Herrera
I had a look at this latest version of the patch, and found some things
to tweak.  Attached is v21 with three main changes from Kyotaro's v20:

1. the XLogFlush is only done if consistent state has not been reached.
As I understand, it's not needed in normal mode.  (In any case, if we do
call XLogFlush in normal mode, what it does is not advance the recovery
point, so the comment would be incorrect.)

2. use %u to print OIDs rather than %d

3. I changed the warning message wording to this:

+   ereport(WARNING,
+   (errmsg("skipping replay of database creation WAL record"),
+errdetail("The source database directory \"%s\" was not 
found.",
+  src_path),
+errhint("A future WAL record that removes the directory 
before reaching consistent mode is expected.")));

I also renamed the function XLogReportMissingDir to
XLogRememberMissingDir (which matches the "forget" part) and changed the
DEBUG2 messages in that function to DEBUG1 (all the calls in other
functions remain DEBUG2, because ISTM they are not as interesting).
Finally, I made the TAP test search the WARNING line in the log.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso"
 (Carlos Caszeli)
From 6a6fc73a93768a44ec026720c115f77c67d5cda2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 21 Mar 2022 12:34:34 +0100
Subject: [PATCH v21] Fix replay of create database records on standby

Crash recovery on standby may encounter missing directories when
replaying create database WAL records.  Prior to this patch, the
standby would fail to recover in such a case.  However, the
directories could be legitimately missing.  Consider a sequence of WAL
records as follows:

CREATE DATABASE
DROP DATABASE
DROP TABLESPACE

If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.

This patch adds a mechanism, similar to the invalid-page hash table, to
track missing directories during crash recovery.  If all the missing
directory references are matched with corresponding drop records at
the end of crash recovery, the standby can safely enter archive
recovery.

Diagnosed-by: Paul Guo 
Author: Paul Guo 
Author: Kyotaro Horiguchi 
Author: Asim R Praveen 
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27dn5xwj2p9-roh16e4...@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c   |   6 +
 src/backend/access/transam/xlogutils.c  | 159 +++-
 src/backend/commands/dbcommands.c   |  57 +++
 src/backend/commands/tablespace.c   |  17 +++
 src/include/access/xlogutils.h  |   4 +
 src/test/recovery/t/029_replay_tsp_drops.pl |  67 +
 src/tools/pgindent/typedefs.list|   2 +
 7 files changed, 311 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/029_replay_tsp_drops.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9feea3e6ec..f48d8d51fb 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2043,6 +2043,12 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
+		/*
+		 * Check if the XLOG sequence contained any unresolved references to
+		 * missing directories.
+		 */
+		XLogCheckMissingDirs();
+
 		reachedConsistency = true;
 		ereport(LOG,
 (errmsg("consistent recovery state reached at %X/%X",
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f186f..8c1b8216be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -54,6 +54,164 @@ bool		InRecovery = false;
 /* Are we in Hot Standby mode? Only valid in startup process, see xlogutils.h */
 HotStandbyState standbyState = STANDBY_DISABLED;
 
+
+/*
+ * If a create database WAL record is being replayed more than once during
+ * crash recovery on a standby, it is possible that either the tablespace
+ * directory or the template database directory is missing.  This happens when
+ * the directories are removed by replay of subsequent drop records.  Note
+ * that this problem happens only on standby and not on master.  On master, a
+ * checkpoint is created at the end of create database operation. On standby,
+ * however, such a strategy (creating restart points during replay) is not
+ * viable because it will slow down WAL replay.
+ *
+ * The alternative is to track references to each missing directory
+ * encountered when performing crash recovery in the following hash table.
+ * Similar to invalid page table above, the expectation is that each missing
+ * di

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-03-21 Thread Alvaro Herrera
On 2022-Mar-14, Robert Haas wrote:

> 2. Why not instead change the code so that the operation can succeed,
> by creating the prerequisite parent directories? Do we not have enough
> information for that? I'm not saying that we definitely should do it
> that way rather than this way, but I think we do take that approach in
> some cases.

It seems we can choose freely between these two implementations -- I
mean I don't see any upsides or downsides to either one.

The current one has the advantage that it never makes the datadir
"dirty", to use Kyotaro's term.  It verifies that the creation/drop form
a pair.  A possible downside is that if there's a bug, we could end up
with a spurious PANIC at the end of recovery, and no way to recover.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: LogwrtResult contended spinlock

2022-03-22 Thread Alvaro Herrera
On 2022-Mar-21, Andres Freund wrote:

> This patch currently doesn't apply: http://cfbot.cputube.org/patch_37_2716.log

Updated.

> Are you aiming this for v15? Otherwise I'd like to move the entry to the next
> CF. Marked as waiting-on-author.

I'd like to get 0001 pushed to pg15, yes.  I'll let 0002 sit here for
discussion, but I haven't seen any evidence that we need it.  If others
vouch for it, I can push that one too, but I'd rather have it be a
separate thing.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)
>From 4d5a8d915b497b79bbe62e18352f7b9d8c1b9bca Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v6 1/2] Change XLogCtl->LogwrtResult to use atomic ops
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, access to LogwrtResult is protected by a spinlock.  This
becomes severely contended in some scenarios, such as with a largish
replication flock: walsenders all calling GetFlushRecPtr repeatedly
cause the processor to heat up to the point where eggs can be fried on top.

This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult
with a struct containing a pair of atomically accessed variables.

In addition, this commit splits the process-local copy of these values
(kept in the freestanding LogwrtResult struct) into two separate
LogWriteResult and LogFlushResult.  This is not strictly necessary, but
it makes it clearer that these are updated separately, each on their own
schedule.

Author: Álvaro Herrera 
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 195 +++---
 src/include/port/atomics.h|  29 +
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 125 insertions(+), 100 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4ac3871c74..311a8a1192 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -285,16 +285,14 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * each member of LogwrtResult (LogWriteResult and LogFlushResult), each of
+ * which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -317,18 +315,18 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 of write position */
+	pg_atomic_uint64 Flush;		/* last byte + 1 of flush position */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
 	XLogRecPtr	Flush;			/* last byte + 1 to flush */
 } XLogwrtRqst;
 
-typedef struct XLogwrtResult
-{
-	XLogRecPtr	Write;			/* last byte + 1 written out */
-	XLogRecPtr	Flush;			/* last byte + 1 flushed */
-} XLogwrtResult;
-
 /*
  * Inserting to WAL is protected by a small fixed number of WAL insertion
  * locks. To insert to the WAL, you must hold one of the locks - it doesn't
@@ -480,6 +478,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -497,12 +496,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -622,10 

Re: SQL/JSON: JSON_TABLE

2022-03-22 Thread Alvaro Herrera
On 2022-Mar-22, Andrew Dunstan wrote:

> I'm planning on pushing the functions patch set this week and json-table
> next week.

I think it'd be a good idea to push the patches one by one and let a day
between each.  That would make it easier to chase buildfarm issues
individually, and make sure they are fixed before the next step.
Each patch in each series is already big enough.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: LogwrtResult contended spinlock

2022-03-22 Thread Alvaro Herrera
On 2022-Mar-22, Tom Lane wrote:

> I looked briefly at 0001, and I've got to say that I disagree with
> your decision to rearrange the representation of the local LogwrtResult
> copy.  It clutters the patch tremendously and makes it hard to
> understand what the actual functional change is.  Moreover, I'm
> not entirely convinced that it's a notational improvement in the
> first place.
> 
> Perhaps it'd help if you split 0001 into two steps, one to do the
> mechanical change of the representation and then a second patch that
> converts the shared variable to atomics.  Since you've moved around
> the places that read the shared variable, that part is subtler than
> one could wish and really needs to be studied on its own.

Hmm, I did it the other way around: first change to use atomics, then
the mechanical change.  I think that makes the usefulness of the change
more visible, because before the switch to atomics, using the combined
struct as a unit remains sensible.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")
>From dd9b53878faeedba921ea7027e98ddbb433e8647 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v7 1/3] Change XLogCtl->LogwrtResult to use atomic ops
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, access to LogwrtResult is protected by a spinlock.  This
becomes severely contended in some scenarios, such as with a largish
replication flock: walsenders all calling GetFlushRecPtr repeatedly
cause the processor to heat up to the point where eggs can be fried on top.

This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult
with a struct containing a pair of atomically accessed variables.

In addition, this commit splits the process-local copy of these values
(kept in the freestanding LogwrtResult struct) into two separate variables,
LogWriteResult and LogFlushResult.  This is not strictly necessary, but
it makes it clearer that these are updated separately, each on their own
schedule.

Author: Álvaro Herrera 
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 85 +++
 src/include/port/atomics.h| 29 +++
 src/tools/pgindent/typedefs.list  |  1 +
 3 files changed, 72 insertions(+), 43 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4ac3871c74..6f2eb494fe 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -285,16 +285,13 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -317,6 +314,12 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 written out */
+	pg_atomic_uint64 Flush;		/* last byte + 1 flushed */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
@@ -480,6 +483,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -497,12 +501,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtRes

Re: LogwrtResult contended spinlock

2022-03-22 Thread Alvaro Herrera
So I've been wondering about this block at the bottom of XLogWrite:

/*
 * Make sure that the shared 'request' values do not fall behind the
 * 'result' values.  This is not absolutely essential, but it saves some
 * code in a couple of places.
 */
{
SpinLockAcquire(&XLogCtl->info_lck);
if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write)
XLogCtl->LogwrtRqst.Write = LogwrtResult.Write;
if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush)
XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
SpinLockRelease(&XLogCtl->info_lck);
}

I just noticed that my 0001 makes the comment a lie: it is now quite
possible that 'result' is advanced beyond 'request'.  Before the patch
that never happened because they were both advanced in the region locked
by the spinlock.

I think we could still maintain this promise if we just moved this
entire block before the first pg_atomic_monotonic_advance_u64 setting
XLogCtl->LogwrtResult.Write.  Or we could halve the whole block, and put
one acquire/test/set/release stanza before each monotonic increase of
the corresponding variable.


However, I wonder if this is still necessary.  This code was added in
4d14fe0048c (March 2001) and while everything else was quite different
back then, this hasn't changed at all.  I can't quite figure out what
are those "couple of places" that would need additional code if this
block is just removed.  I tried running the tests (including
wal_consistency_checking), and nothing breaks.  Reading the code
surrounding the other accesses of XLogCtl->LogwrtRqst, there's nothing
that looks to me like it depends on these values not lagging behind
LogwrtResult.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it."   (New York Times, about Microsoft PowerPoint)




Re: Column Filtering in Logical Replication

2022-03-22 Thread Alvaro Herrera
On 2022-Mar-19, Tomas Vondra wrote:

> @@ -174,7 +182,13 @@ ALTER PUBLICATION noinsert SET (publish = 'update, 
> delete');
>
> Add some tables to the publication:
>  
> -ALTER PUBLICATION mypublication ADD TABLE users, departments;
> +ALTER PUBLICATION mypublication ADD TABLE users (user_id, firstname), 
> departments;
> +
> +
> +  
> +   Change the set of columns published for a table:
> +
> +ALTER PUBLICATION mypublication SET TABLE users (user_id, firstname, 
> lastname), TABLE departments;
>  
>  
>

Hmm, it seems to me that if you've removed the feature to change the set
of columns published for a table, then the second example should be
removed as well.

> +/*
> + * Transform the publication column lists expression for all the relations
> + * in the list.
> + *
> + * XXX The name is a bit misleading, because we don't really transform
> + * anything here - we merely check the column list is compatible with the
> + * definition of the publication (with publish_via_partition_root=false)
> + * we only allow column lists on the leaf relations. So maybe rename it?
> + */
> +static void
> +TransformPubColumnList(List *tables, const char *queryString,
> +bool pubviaroot)
> +{

I agree with renaming this function.  Maybe CheckPubRelationColumnList() ?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"This is a foot just waiting to be shot"(Andrew Dunstan)




Re: ExecRTCheckPerms() and many prunable partitions

2022-03-23 Thread Alvaro Herrera
On 2022-Mar-23, Amit Langote wrote:

> As the changes being made with the patch are non-trivial and the patch
> hasn't been reviewed very significantly since Alvaro's comments back
> in Sept 2021 which I've since addressed, I'm thinking of pushing this
> one into the version 16 dev cycle.

Let's not get ahead of ourselves.  The commitfest is not yet over.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)




Re: Column Filtering in Logical Replication

2022-03-23 Thread Alvaro Herrera
On 2022-Mar-23, Amit Kapila wrote:

> On Wed, Mar 23, 2022 at 12:54 AM Alvaro Herrera  
> wrote:
> >
> > On 2022-Mar-19, Tomas Vondra wrote:
> >
> > > @@ -174,7 +182,13 @@ ALTER PUBLICATION noinsert SET (publish = 'update, 
> > > delete');
> > >
> > > Add some tables to the publication:
> > >  
> > > -ALTER PUBLICATION mypublication ADD TABLE users, departments;
> > > +ALTER PUBLICATION mypublication ADD TABLE users (user_id, firstname), 
> > > departments;
> > > +
> > > +
> > > +  
> > > +   Change the set of columns published for a table:
> > > +
> > > +ALTER PUBLICATION mypublication SET TABLE users (user_id, firstname, 
> > > lastname), TABLE departments;
> > >  
> > >
> > >
> >
> > Hmm, it seems to me that if you've removed the feature to change the set
> > of columns published for a table, then the second example should be
> > removed as well.
> 
> As per my understanding, the removed feature is "Alter Publication ...
> Alter Table ...". The example here "Alter Publication ... Set Table
> .." should still work as mentioned in my email[1].

Ah, I see.  Yeah, that makes sense.  In that case, the leading text
seems a bit confusing.  I would suggest "Change the set of tables in the
publication, specifying a different set of columns for one of them:"

I think it would make the example more useful if the table for which the
columns are changing were a different one. Maybe do this:

Add some tables to the publication:
 
-ALTER PUBLICATION mypublication ADD TABLE users, departments;
+ALTER PUBLICATION mypublication ADD TABLE users (user_id, firstname), 
departments;
+
+
+  
+   Change the set of tables in the publication, keeping the column list
+   in the users table and specifying a different column list for the
+   departments table. Note that previously published tables not mentioned
+   in this command are removed from the publication:
+
+
+ALTER PUBLICATION mypublication SET TABLE users (user_id, firstname), TABLE 
departments (dept_id, deptname);
 

so that it is clear that if you want to keep the column list unchanged
in one table, you are forced to specify it again.

(Frankly, this ALTER PUBLICATION SET command seems pretty useless from a
user PoV.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Investigación es lo que hago cuando no sé lo que estoy haciendo"
(Wernher von Braun)




Re: prevent immature WAL streaming

2022-03-23 Thread Alvaro Herrera
On 2021-Sep-03, Alvaro Herrera wrote:

> The last commit is something I noticed in pg_rewind ...

I had missed this one; it's pushed now.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I can see support will not be a problem.  10 out of 10."(Simon Wittber)
  (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)




Re: [DOC] Document concurrent index builds waiting on each other

2020-11-30 Thread Alvaro Herrera
On 2020-Sep-30, Michael Paquier wrote:

> +  
> +   CREATE INDEX (including the 
> CONCURRENTLY
> +   option) commands are included when VACUUM calculates 
> what
> +   dead tuples are safe to remove even on tables other than the one being 
> indexed.
> +  
> FWIW, this is true as well for REINDEX CONCURRENTLY because both use
> the same code paths for index builds and validation, with basically
> the same waiting phases.  But is CREATE INDEX the correct place for
> that?  Wouldn't it be better to tell about such things on the VACUUM
> doc?

Yeah, I think it might be more sensible to document this in
maintenance.sgml, as part of the paragraph that discusses removing
tuples "to save space".  But placed inline with the rest of the flow,
it seems to distract from higher-level considerations, so I suggest
making it a footnote instead.

I'm not sure on the wording to use; what about this?

>From 6c9ad72e4e61dbf05f34146cb67706dd675a38f0 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 30 Nov 2020 18:50:06 -0300
Subject: [PATCH v5] Note CIC and RC in vacuum's doc

Per James Coleman
---
 doc/src/sgml/maintenance.sgml | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..d68d7f936e 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -159,7 +159,17 @@
 concurrency control (MVCC, see ): the row version
 must not be deleted while it is still potentially visible to other
 transactions. But eventually, an outdated or deleted row version is no
-longer of interest to any transaction. The space it occupies must then be
+longer of interest to any transaction.
+
+ 
+  Note that VACUUM needs to retain tuples that are
+  nominally visible to transactions running
+  CREATE INDEX CONCURRENTLY or
+  REINDEX CONCURRENTLY, even when run on tables
+  other than the tables being indexed.
+ 
+
+The space it occupies must then be
 reclaimed for reuse by new rows, to avoid unbounded growth of disk
 space requirements. This is done by running VACUUM.

-- 
2.20.1



Re: runtime error copying oids field

2020-11-30 Thread Alvaro Herrera
On 2020-Nov-30, Zhihong Yu wrote:

> This was the line runtime error was raised:
> 
> memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
> 
> From RelationBuildPartitionDesc we can see that:
> 
>   if (nparts > 0)
>   {
>   PartitionBoundInfo boundinfo;
>   int*mapping;
>   int next_index = 0;
> 
>   result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
> 
> The cause was that the oids field was not assigned, due to nparts being 0.
> This is verified by additional logging added just prior to the memcpy call.
> 
> I want to get the community's opinion on whether a null check should be
> added prior to the memcpy() call.

As far as I understand, we do want to avoid memcpy's of null pointers;
see [1].

In this case I think it'd be sane to skip the complete block, not just
the memcpy, something like

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca24620fd0..d35deb433a 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1163,15 +1163,17 @@ DefineIndex(Oid relationId,
 
if (partitioned)
{
+   PartitionDesc partdesc;
+
/*
 * Unless caller specified to skip this step (via ONLY), 
process each
 * partition to make sure they all contain a corresponding 
index.
 *
 * If we're called internally (no stmt->relation), recurse 
always.
 */
-   if (!stmt->relation || stmt->relation->inh)
+   partdesc = RelationGetPartitionDesc(rel);
+   if ((!stmt->relation || stmt->relation->inh) && 
partdesc->nparts > 0)
{
-   PartitionDesc partdesc = RelationGetPartitionDesc(rel);
int nparts = partdesc->nparts;
Oid*part_oids = palloc(sizeof(Oid) * 
nparts);
boolinvalidate_parent = false;

[1] 
https://www.postgresql.org/message-id/flat/20200904023648.GB3426768%40rfd.leadboat.com




Re: error_severity of brin work item

2020-11-30 Thread Alvaro Herrera
The more I look at this, the less I like it.  This would set a precedent
that any action that can be initiated from an autovac work-item has a
requirement of silently being discarded when it references a
non-existent relation.

I'd rather have the code that drops the index go through the list of
work-items and delete those that reference that relation.

I'm not sure if this is something that ought to be done in index_drop().
One objection might be that if the drop is rolled back, the work-items
are lost.  It's the easiest, though; and work-items are supposed to be
lossy anyway, and vacuum would fix the lack of summarization eventually.
So, not pretty, but not all that bad.  (Hopefully rolled-back drops are
not all that common.)




Re: Huge memory consumption on partitioned table with FKs

2020-11-30 Thread Alvaro Herrera
On 2020-Nov-26, Kyotaro Horiguchi wrote:

> This shares RI_ConstraintInfo cache by constraints that shares the
> same parent constraints. But you forgot that the cache contains some
> members that can differ among partitions.
> 
> Consider the case of attaching a partition that have experienced a
> column deletion.

I think this can be solved easily in the patch, by having
ri_BuildQueryKey() compare the parent's fk_attnums to the child's; if
they are equal then use the parent's constraint_id, otherwise use the
child constraint's.  That way, the cache entry is reused in the common
case where they are identical.

I would embed all this knowledge in ri_BuildQueryKey though, without
adding the new function ri_GetParentConstOid.  I don't think that
function adds meaningful abstraction value, and instead it would make what I
suggest more difficult.




Re: runtime error copying oids field

2020-12-01 Thread Alvaro Herrera
On 2020-Nov-30, Zhihong Yu wrote:

> Alvaro, et al:
> Please let me know how to proceed with the patch.
> 
> Running test suite with the patch showed no regression.

That's good to hear.  I'll get it pushed today.  Thanks for following
up.




Re: error_severity of brin work item

2020-12-01 Thread Alvaro Herrera
On 2020-Nov-30, Justin Pryzby wrote:

> On Mon, Nov 30, 2020 at 08:47:32PM -0300, Alvaro Herrera wrote:
> > The more I look at this, the less I like it.  This would set a precedent
> > that any action that can be initiated from an autovac work-item has a
> > requirement of silently being discarded when it references a
> > non-existent relation.
> 
> My original request was to change to INFO, which is what vacuum proper does at
> vacuum_open_relation().  I realize that still means that new work item 
> handlers
> would have to do the LockOid, try_open dance - maybe it could be factored out.

As I understand, INFO is not well suited to messages that are not going
to the client.  Anyway, my point is about the contortions that are
needed to support the case, rather than what exactly do we do when it
happens.

Retrospectively, it's strange that this problem (what happens when
indexes with pending work-items are dropped) hadn't manifested.  It
seems a pretty obvious one.

> > I'd rather have the code that drops the index go through the list of
> > work-items and delete those that reference that relation.
> > 
> > I'm not sure if this is something that ought to be done in index_drop();
> > One objection might be that if the drop is rolled back, the work-items
> > are lost.
> 
> Should it be done in an AtCommit hook or something like that ?

I didn't like this idea much on first read, on extensibility grounds,
but perhaps it's not so bad because we can generalize it whenever
there's pressure to add a second type of work-item (*if* that ever
happens).

I suppose the idea is that index_drop saves the index OID when a BRIN
index with autosummarization=on is dropped, and then the
AtCommit_WorkItems() call scans the items list and drops those that
match any OIDs in the list.  (It does require to be mindful of subxact
aborts, of course.)




Re: runtime error copying oids field

2020-12-01 Thread Alvaro Herrera
On 2020-Dec-01, Alvaro Herrera wrote:

> On 2020-Nov-30, Zhihong Yu wrote:
> 
> > Alvaro, et al:
> > Please let me know how to proceed with the patch.
> > 
> > Running test suite with the patch showed no regression.
> 
> That's good to hear.  I'll get it pushed today.  Thanks for following
> up.

Done, thanks for reporting this.




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-12-01 Thread Alvaro Herrera
Hi Justin,

Thanks for all the comments. I'll incorporate everything and submit an
updated version later.

On 2020-Nov-30, Justin Pryzby wrote:

> On Tue, Nov 03, 2020 at 08:56:06PM -0300, Alvaro Herrera wrote:

> > +++ b/src/bin/psql/describe.c
> > -   printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s"), 
> > parent_name,
> > - partdef);
> > +   printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s%s"), 
> > parent_name,
> > + partdef,
> > + strcmp(detached, "t") 
> > == 0 ? " DETACHED" : "");
> 
> The attname "detached" is a stretch of what's intuitive (it's more like
> "detachING" or half-detached).  But I think psql should for sure show 
> something
> more obvious to users.  Esp. seeing as psql output isn't documented.  Let's
> figure out what to show to users and then maybe rename the column that, too.

OK.  I agree that "being detached" is the state we want users to see, or
maybe "detach pending", or "unfinisheddetach" (ugh).  I'm not sure that
"pg_inherits.inhbeingdetached" is a great column name.  Opinions welcome.

> > +PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
> 
> Instead of finalize .. deferred ?  Or ??

Well, I'm thinking that this has to be a verb in the imperative mood.
The user is commanding the server to "finalize this detach operation".
I'm not sure that DEFERRED fits that grammatical role.  If there are
other ideas, let's discuss them.

ALTER TABLE tst DETACH PARTITION tst_1 FINALIZE  <-- decent
ALTER TABLE tst DETACH PARTITION tst_1 COMPLETE  <-- I don't like it
ALTER TABLE tst DETACH PARTITION tst_1 DEFERRED  <-- grammatically faulty?

> ATExecDetachPartition:
> Doesn't this need to lock the table before testing for default partition ?

Correct, it does.

> I ended up with apparently broken constraint when running multiple loops 
> around
> a concurrent detach / attach:
> 
> while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR VALUES 
> FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 CONCURRENTLY"; do :; 
> done&
> while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR VALUES 
> FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 CONCURRENTLY"; do :; 
> done&
> 
> "p1_check" CHECK (true)
> "p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2)

Not good.




Re: error_severity of brin work item

2020-12-01 Thread Alvaro Herrera
On 2020-Dec-01, Justin Pryzby wrote:

> This was an idea I made up - I don't know any of the details of this, but if
> you give a hint I could look at it more.  There'd (still) be a race window, 
> but
> I think that's ok.

See CommitTransaction() and friends, where AtEOXact_on_commit_actions()
and others are called.  You'd have to create a new routine (say
AtEOXact_Autovacuum or more specific AtEOXact_AutovacuumWorkItems), to
be called at the right places in xact.c.  Keep a global variable, say a
list of OIDs.  On subxact commit, the list is reassigned to its parent
transaction; on subxact abort, the list is discarded.  On top xact
commit, the list of OIDs is passed to some new routine in autovacuum.c
that scans the workitem array and deletes items as appropriate.

Not sure what's a good place for OIDs to be added to the list.  We don't
have AM-specific entry points for relation drop.  I think this is the
weakest point of this.


> Another idea is if perform_work_item() were responsible for discarding
> relations which disappear.  Currently it does this, which is racy since it
> holds no lock.

That has the property that it remains contained in autovacuum.c, but no
other advantages I think.




Re: room for improvement in amcheck btree checking?

2020-12-01 Thread Alvaro Herrera
On 2020-Dec-01, Mark Dilger wrote:

> 7) Run a SQL query that uses an index scan on the table and see that it 
> errors with something like:
> 
>ERROR:  could not read block 0 in file "base/13097/16391": read only 0 of 
> 8192 bytes
> 
> I found it surprising that even when precisely zero of the tids in the
> index exist in the table the index checks all come back clean.

Yeah, I've seen this kind of symptom in production databases (indexes
pointing to non-existent heap pages).

I think one useful cross-check that amcheck could do, is verify that if
a heap page is referenced from the index, then the heap page must exist.
Otherwise, it's a future index corruption of sorts: the old index
entries will point to the wrong new heap tuples as soon as the table
grows again.
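The core of such a cross-check is cheap: compare the block number of each index TID against the heap's current length in blocks. A simplified standalone sketch (amcheck itself would use ItemPointerGetBlockNumber() on each index tuple and RelationGetNumberOfBlocks() on the heap):

```c
typedef unsigned int BlockNumber;

/*
 * Return nonzero if an index entry points past the end of the heap:
 * tid_block is the block number taken from the index tuple's TID,
 * heap_nblocks is the heap's current length in blocks.
 */
static int
index_ref_is_dangling(BlockNumber tid_block, BlockNumber heap_nblocks)
{
    return tid_block >= heap_nblocks;
}
```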




Re: [DOC] Document concurrent index builds waiting on each other

2020-12-01 Thread Alvaro Herrera
On 2020-Nov-30, James Coleman wrote:

> On Mon, Nov 30, 2020 at 4:53 PM Alvaro Herrera  
> wrote:
> >
> > On 2020-Sep-30, Michael Paquier wrote:

> > Yeah, I think it might be more sensible to document this in
> > maintenance.sgml, as part of the paragraph that discusses removing
> > tuples "to save space".  But making it inline with the rest of the flow,
> > it seems to distract from higher-level considerations, so I suggest to
> > make it a footnote instead.
> 
> I have mixed feelings about wholesale moving it; users aren't likely
> to read the vacuum doc when considering how running CIC might impact
> their system, though I do understand why it otherwise fits there.

Makes sense.  ISTM that if we want to have a cautionary blurb in the CIC
docs, it should go in REINDEX CONCURRENTLY as well.

> > I'm not sure on the wording to use; what about this?
> 
> The wording seems fine to me.

Great, thanks.

> This is a replacement for what was 0002 earlier? And 0001 from earlier
> still seems to be a useful standalone patch?

0001 is the one that I got pushed yesterday, I think -- correct?
src/tools/git_changelog says:

Author: Alvaro Herrera 
Branch: master [58ebe967f] 2020-11-30 18:24:55 -0300
Branch: REL_13_STABLE [3fe0e7c3f] 2020-11-30 18:24:55 -0300
Branch: REL_12_STABLE [b2603f16a] 2020-11-30 18:24:55 -0300
Branch: REL_11_STABLE [ed9c9b033] 2020-11-30 18:24:55 -0300
Branch: REL_10_STABLE [d3bd36a63] 2020-11-30 18:24:55 -0300
Branch: REL9_6_STABLE [b3d33bf59] 2020-11-30 18:24:55 -0300
Branch: REL9_5_STABLE [968a537b4] 2020-11-30 18:24:55 -0300

Document concurrent indexes waiting on each other

Because regular CREATE INDEX commands are independent, and there's no
logical data dependency, it's not immediately obvious that transactions
held by concurrent index builds on one table will block the second phase
of concurrent index creation on an unrelated table, so document this
caveat.

Backpatch this all the way back.  In branch master, mention that only
some indexes are involved.

Author: James Coleman 
Reviewed-by: David Johnston 
Discussion: 
https://postgr.es/m/CAAaqYe994=purn8cjz4ueo_s-ffrr_3ogeryhtdghab2wg_...@mail.gmail.com





Re: wrong link in acronyms.sgml

2020-12-02 Thread Alvaro Herrera
On 2020-Dec-02, Erik Rijkers wrote:

> Hi
> 
> I just noticed that in
> 
>   https://www.postgresql.org/docs/13/acronyms.html
> (i.e., doc/src/sgml/acronyms.sgml)
> 
> there is under lemma 'HOT' a link with URL:
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/README.HOT;hb=HEAD
> 
> Is this deliberate? Surely pointing 13-docs into HEAD-docs is wrong?

Yeah, I noticed this while working on the glossary.

This link works:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/README.HOT;hb=REL_13_STABLE
but the problem with this one is that we'll have to update it once per
release.

> Otherwise it may be better to remove the link altogether and just have the
> acronym lemma: 'HOT' and explanation 'Heap-Only Tuple'.

Yeah, I don't think pointing to the source-code README file is a great
resource.  I think starting with 13 we should add an entry in the
glossary for Heap-Only Tuple, and make the acronym entry point to the
glossary.  In fact, we could do this with several other acronyms too,
such as DDL, DML, GEQO, LSN, MVCC, SQL, TID, WAL.  The links to external
resources can then be put in the glossary definition, if needed.





Re: convert elog(LOG) calls to ereport

2020-12-02 Thread Alvaro Herrera
On 2020-Dec-02, Peter Eisentraut wrote:

> There are a number of elog(LOG) calls that appear to be user-facing, so they
> should be ereport()s.  This patch changes them.  There are more elog(LOG)
> calls remaining, but they all appear to be some kind of debugging support.
> Also, I changed a few elog(FATAL)s that were nearby, but I didn't
> specifically look for them.

> - elog(LOG, "WSAIoctl(SIO_KEEPALIVE_VALS) failed: %ui",
> -  WSAGetLastError());
> + ereport(LOG,
> + (errmsg("WSAIoctl(SIO_KEEPALIVE_VALS) failed: 
> %ui",
> + WSAGetLastError(;

Please take the opportunity to move the flag name out of the message in
this one, also.  I do wonder if it'd be a good idea to move the syscall
name itself out of the message, too; that would reduce the number of
messages to translate roughly 50-fold, to just "%s(%s) failed: %m"
message per distinct syscall.
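A rough standalone sketch of that consolidation (the real change would go through ereport()/errmsg() and keep %m for errno formatting; format_syscall_failure is an invented name used here only so the sketch can run outside elog.c):

```c
#include <stdio.h>
#include <string.h>

/*
 * One shared message template: translators would see a single string,
 * "%s(%s) failed: %m", while the syscall and flag names are passed as
 * untranslated arguments.  The %m part is elided in this sketch.
 */
static void
format_syscall_failure(char *buf, size_t len,
                       const char *syscall, const char *arg)
{
    snprintf(buf, len, "%s(%s) failed", syscall, arg);
}
```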

Should fd.c messages do errcode_for_file_access() like elsewhere?

Overall, it looks good to me.

Thanks





Re: Huge memory consumption on partitioned table with FKs

2020-12-03 Thread Alvaro Herrera
Hello

I haven't followed this thread's latest posts, but I'm unclear on the
lifetime of the new struct that's being allocated in TopMemoryContext.
At what point are those structs freed?

Also, the comment that was in RI_ConstraintInfo now appears in
RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
undocumented.  What is the relationship between those two structs?  I
see that they have pointers to each other, but I think the relationship
should be documented more clearly.

Thanks!





Re: Autovacuum on partitioned table (autoanalyze)

2020-12-03 Thread Alvaro Herrera
Hello Yuzuko,

On 2020-Dec-02, yuzuko wrote:

> The problem Horiguchi-san mentioned is as follows:
> [explanation]

Hmm, I see.  So the problem is that if some ancestor is analyzed first,
then an analyze of one of its partitions will cause a redundant analyze
of the ancestor, because the number of tuples propagated from the
partition represents a set that had already been included in the
ancestor's analysis.

If the problem was just that, then I think it would be very simple to
solve: just make sure to sort the tables to vacuum so that all leaves
are vacuumed first, and then all ancestors, sorted from the bottom up.
Problem solved.

But I'm not sure that that's the whole story, for two reasons: one, two
workers can run simultaneously, where one analyzes the partition and the
other analyzes the ancestor.  Then the order is not guaranteed (and
each process gains nothing from remembering whether it already handled
a given table).  Second, manual analyzes can occur in any order.

Maybe it's more useful to think about this in terms of remembering that
partition P had changed_tuples set to N when we analyzed ancestor A.
Then, when we analyze partition P, we send the message listing A as
ancestor; on receipt of that message, we see M+N changed tuples in P,
but we know that we had already seen N, so we only record M.

I'm not sure how to implement this idea however, since on analyze of
ancestor A we don't have the list of partitions, so we can't know the N
for each partition.
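The arithmetic of that idea, in a miniature standalone sketch (invented names; the real state would have to live with the stats machinery, and as noted above obtaining N per partition is the open problem):

```c
/*
 * When ancestor A is analyzed, remember how many changed tuples of
 * partition P were included in that analysis (prev_seen = N).  When
 * P's own stats message later reports a cumulative count M+N, only
 * the delta M should be recorded against A.
 */
typedef struct PartitionAccounting
{
    long prev_seen;     /* N: changes included in A's last analyze */
} PartitionAccounting;

static void
on_ancestor_analyze(PartitionAccounting *acct, long cumulative_changes)
{
    acct->prev_seen = cumulative_changes;
}

static long
effective_changes(const PartitionAccounting *acct, long cumulative_changes)
{
    return cumulative_changes - acct->prev_seen;    /* M = (M+N) - N */
}
```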





Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2020-12-04 Thread Alvaro Herrera
On 2020-Dec-04, Dmitry Dolgov wrote:

> * This one is mostly for me to understand. There are couple of places
>   with a commentary that 'PROC_IN_SAFE_IC is not necessary, because the
>   transaction only takes a snapshot to do some catalog manipulation'.
>   But for some of them I don't immediately see in the relevant code
>   anything related to snapshots. E.g. one in DefineIndex is followed by
>   WaitForOlderSnapshots (which seems to only do waiting, not taking a
>   snapshot), index_set_state_flags and CacheInvalidateRelcacheByRelid.
>   Is taking a snapshot hidden somewhere there inside?

Well, they're actually going to acquire an Xid, not a snapshot, so the
comment is slightly incorrect; I'll fix it, thanks for pointing that
out.  The point stands: because those transactions are of very short
duration (at least of very short duration after the point where the XID
is obtained), it's not necessary to set the PROC_IN_SAFE_IC flag, since
it won't cause any disruption to other processes.

It is possible that I copied the wrong comment in DefineIndex.  (I only
noticed that one after I had mulled over the ones in
ReindexRelationConcurrently, each of which is skipping setting the flag
for slightly different reasons.)




Re: walsender bug: stuck during shutdown

2020-12-04 Thread Alvaro Herrera
On 2020-Nov-26, Fujii Masao wrote:

> Yes, so the problem here is that walsender goes into the busy loop
> in that case. Seems this happens only in logical replication walsender.
> In physical replication walsender, WaitLatchOrSocket() in WalSndLoop()
> seems to work as expected and prevent it from entering into busy loop
> even in that case.
> 
>   /*
>* If postmaster asked us to stop, don't wait anymore.
>*
>* It's important to do this check after the recomputation of
>* RecentFlushPtr, so we can send all remaining data before 
> shutting
>* down.
>*/
>   if (got_STOPPING)
>   break;
> 
> The above code in WalSndWaitForWal() seems to cause this issue. But I've
> not come up with idea about how to fix yet.

With DEBUG1 I observe that walsender is getting a lot of 'r' messages
(standby reply) with all zeroes:

2020-12-01 21:01:24.100 -03 [15307] DEBUG:  write 0/0 flush 0/0 apply 0/0

However, while doing that I also observed that if I do send some
activity to the logical replication stream, with the provided program,
it will *still* have the 'write' pointer set to 0/0, and the 'flush'
pointer has moved forward to what was sent.  I'm not clear on what
causes the write pointer to move forward in logical replication.

Still, the previously proposed patch does resolve the problem in either
case.




Re: POC: Better infrastructure for automated testing of concurrency issues

2020-12-04 Thread Alvaro Herrera
On 2020-Nov-25, Alexander Korotkov wrote:

> In the view of above, I'd like to propose a POC patch, which implements new
> builtin infrastructure for reproduction of concurrency issues in automated
> test suites.  The general idea is so-called "stop events", which are
> special places in the code, where the execution could be stopped on some
> condition.  Stop event also exposes a set of parameters, encapsulated into
> jsonb value.  The condition over stop event parameters is defined using
> jsonpath language.

+1 for the idea.  I agree we have a need for something on this area;
there are *many* scenarios currently untested because of the lack of
what you call "stop points".  I don't know if jsonpath is the best way
to implement it, but at least it is readily available and it seems a
decent way to go at it.





Re: Improper use about DatumGetInt32

2020-12-04 Thread Alvaro Herrera
On 2020-Dec-03, Peter Eisentraut wrote:

> On 2020-11-30 16:32, Alvaro Herrera wrote:
> > On 2020-Nov-30, Peter Eisentraut wrote:
> > 
> > > Patch updated this way.  I agree it's better that way.
> > 
> > Thanks, LGTM.
> 
> For a change like this, do we need to change the C symbol names, so that
> there is no misbehavior if the shared library is not updated at the same
> time as the extension is upgraded in SQL?

Good question.  One point is that since the changes are just in how we
read the values out of the Datum arguments, there's no actual ABI
change.  So if I understand correctly, there's no danger of a crash;
there's just a danger of misinterpreting a value.

I don't know if it's possible to determine (at function execution time)
that we're running with the old extension version; if so it might
suffice to throw a warning but still have the SQL function run the same
C function.

If we really think that we ought to differentiate, then we could do what
pg_stat_statement does, and have a separate C function that's called
with the obsolete signature (pg_stat_statements_1_8 et al).
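To illustrate the "no crash, just a misread value" point, here is a sketch with simplified stand-ins for Datum and the DatumGet macros (not the real definitions from postgres.h):

```c
#include <stdint.h>

typedef uintptr_t Datum;   /* stand-in for PostgreSQL's Datum */

/* Reading the same Datum with a mismatched width doesn't crash --
 * it just yields a truncated value when the payload exceeds 32 bits. */
static int32_t datum_as_int32(Datum d) { return (int32_t) d; }
static int64_t datum_as_int64(Datum d) { return (int64_t) d; }
```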




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-04 Thread Alvaro Herrera
On 2020-Dec-04, Michael Paquier wrote:

> VacuumOption does that since 6776142, and ClusterOption since 9ebe057,
> so switching ReindexOption to just match the two others still looks
> like the most consistent move.

9ebe057 goes to show why this is a bad idea, since it has this:

+typedef enum ClusterOption
+{
+   CLUOPT_RECHECK, /* recheck relation state */
+   CLUOPT_VERBOSE  /* print progress info */
+} ClusterOption;

and then you do things like

+   if ($2)
+   n->options |= CLUOPT_VERBOSE;

and then tests like

+   if ((options & VACOPT_VERBOSE) != 0)

Now if you were to ever define third and fourth values in that enum,
this would immediately start malfunctioning.
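A standalone demonstration of the hazard (hypothetical names, not the committed code): with consecutive enum values the first member is zero, so it can never be detected with a bitwise test, and further members would overlap combinations of earlier ones:

```c
/* consecutive enum values mistakenly used as bit flags */
typedef enum BadClusterOption
{
    BAD_CLUOPT_RECHECK,         /* = 0: (options & 0) is always false */
    BAD_CLUOPT_VERBOSE          /* = 1 */
} BadClusterOption;

/* the safe spelling: one distinct bit per option */
typedef enum GoodClusterOption
{
    GOOD_CLUOPT_RECHECK = 1 << 0,
    GOOD_CLUOPT_VERBOSE = 1 << 1
} GoodClusterOption;

static int
bad_has_recheck(int options)
{
    return (options & BAD_CLUOPT_RECHECK) != 0;     /* always false */
}

static int
good_has_recheck(int options)
{
    return (options & GOOD_CLUOPT_RECHECK) != 0;
}
```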

FWIW I'm with Peter on this.




Re: A few new options for CHECKPOINT

2020-12-04 Thread Alvaro Herrera
On the UI of this patch, you're proposing to add the option FAST.  I'm
not a fan of this option name and propose that (if we have it) we use
the name SPREAD instead (defaults to false).

Now we don't actually explain the term "spread" much in the documentation;
we just say "the writes are spread".  But it seems more natural to build
on that adjective rather than "fast/slow".


I think starting a spread checkpoint has some usefulness, if your
checkpoint interval is very large but your completion target is not very
close to 1.  In that case, you're expressing that you want a checkpoint
to start now and not impact production unduly, so that you know when it
finishes and therefore when is it a good time to start a backup.  (You
will still have some WAL to replay, but it won't be as much as if you
just ignored checkpoint considerations completely.)


On the subject of measuring replay times for backups taking while
pgbench is pounding the database, I think a realistic test does *not*
have pgbench running at top speed; rather you have some non-maximal
"-R xyz" option.  You would probably determine a value to use by running
without -R, observing what's a typical transaction rate, and using some
fraction (say, half) of that in the real run.




Re: Removal of operator_precedence_warning

2020-12-04 Thread Alvaro Herrera
On 2020-Dec-04, Tom Lane wrote:

> I think it's time for $SUBJECT.  We added this GUC in 9.5, which
> will be EOL by the time of our next major release, and it was never
> meant as more than a transitional aid.  Moreover, it's been buggy
> as heck (cf abb164655, 05104f693, 01e0cbc4f, 4cae471d1), and the
> fact that some of those errors went undetected for years shows that
> it's not really gotten much field usage.
> 
> Hence, I propose the attached.  Comments?

I wonder if it'd be fruitful to ask the submitters of those bugs about
their experiences with the feature.  Did they find it useful in finding
precedence problems in their code?  Did they experience other problems
that they didn't report?

Reading the reports mentioned in those commits, it doesn't look like any
of them were actually using the feature -- they all seem to have come
across the problems by accidents of varying nature.





Re: A few new options for CHECKPOINT

2020-12-04 Thread Alvaro Herrera
On 2020-Dec-04, Bossart, Nathan wrote:

> On 12/4/20, 1:47 PM, "Alvaro Herrera"  wrote:
> > On the UI of this patch, you're proposing to add the option FAST.  I'm
> > not a fan of this option name and propose that (if we have it) we use
> > the name SPREAD instead (defaults to false).
> >
> > Now we don't actually explain the term "spread" much in the documentation;
> > we just say "the writes are spread".  But it seems more natural to build
> > on that adjective rather than "fast/slow".
> 
> Here is a version of the patch that uses SPREAD instead of FAST.

WFM.

Instead of adding checkpt_option_list, how about utility_option_list?
It seems intended for reuse.




Re: Change definitions of bitmap flags to bit-shifting style

2020-12-05 Thread Alvaro Herrera
On 2020-Dec-05, Tom Lane wrote:

> FWIW, personally I'd vote for doing the exact opposite.  When you are
> debugging and examining the contents of a bitmask variable, it's easier to
> correlate a value like "0x03" with definitions made in the former style.
> Or at least I think so; maybe others see it differently.

The hexadecimal representation is more natural to me than bit-shifting,
so I would prefer to use that style too.  But maybe I'm trained to it
because of looking at t_infomask symbols constantly.
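The two styles define identical values, so the choice is purely about readability when correlating a printed mask with the definitions (the values here are illustrative, not the real t_infomask bits):

```c
/* hexadecimal style: a debugger value like 0x0005 maps straight back */
#define HEAP_FLAG_A      0x0001
#define HEAP_FLAG_B      0x0002
#define HEAP_FLAG_C      0x0004

/* bit-shifting style: the bit position is explicit instead */
#define HEAP_FLAG_A_ALT  (1 << 0)
#define HEAP_FLAG_B_ALT  (1 << 1)
#define HEAP_FLAG_C_ALT  (1 << 2)
```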




Re: A few new options for CHECKPOINT

2020-12-05 Thread Alvaro Herrera
On 2020-Dec-05, Stephen Frost wrote:

> So- just to be clear, CHECKPOINTs are more-or-less always happening in
> PG, and running this command might do something or might end up doing
> nothing depending on if a checkpoint is already in progress and this
> request just gets consolidated into an existing one, and it won't
> actually reduce the amount of WAL replay except in the case where
> checkpoint completion target is set to make a checkpoint happen in less
> time than checkpoint timeout, which ultimately isn't a great way to run
> the system anyway.

You keep making this statement, and I don't necessarily disagree, but if
that is the case, please explain why don't we have
checkpoint_completion_target set to 0.9 by default?  Should we change
that?

> Assuming we actually want to do this, which I still generally don't
> agree with since it isn't really clear if it'll actually end up doing
> something, or not, wouldn't it make more sense to have a command that
> just sits and waits for the currently running (or next) checkpoint to
> complete..?  For the use-case that was explained, at least, we don't
> actually need to cause another checkpoint to happen, we just want to
> know when a checkpoint has completed, right?

Yes, I agree that the use case for this is unclear.




Re: Huge memory consumption on partitioned table with FKs

2020-12-07 Thread Alvaro Herrera
On 2020-Dec-07, Amit Langote wrote:

> On Fri, Dec 4, 2020 at 12:05 PM Kyotaro Horiguchi
>  wrote:
> > > Also, the comment that was in RI_ConstraintInfo now appears in
> > > RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
> > > undocumented.  What is the relationship between those two structs?  I
> > > see that they have pointers to each other, but I think the relationship
> > > should be documented more clearly.
> >
> > I'm not sure the footprint of this patch is worth it, but here is a bit
> > more polished version.
> 
> I noticed that the foreign_key test fails and it may have to do with
> the fact that a partition's param info remains attached to the
> parent's RI_ConstraintInfo even after it's detached from the parent
> table using DETACH PARTITION.

I think this bit about splitting the struct is a distraction.  Let's get
a patch that solves the bug first, and then we can discuss what further
refinements we want to do.  I think we should get your patch in
CA+HiwqEOrfN9b=f3sdmyspgc4go-l_vmfhxllxvmmdp34e6...@mail.gmail.com
committed (which I have not read yet.)  Do you agree with this plan?




Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2020-12-07 Thread Alvaro Herrera
Does "ALTER TABLE ONLY parent" work correctly?  Namely, do not affect
existing partitions, but cause future partitions to acquire the new
setting.

This sounds very much related to previous discussion on REPLICA IDENTITY
not propagating to partitions; see
https://postgr.es/m/201902041630.gpadougzab7v@alvherre.pgsql
particularly Robert Haas' comments at
http://postgr.es/m/CA+TgmoYjksObOzY8b1U1=bsp_m+702eof42fqrtqtyio1nu...@mail.gmail.com
and downstream discussion from there.





Re: Commitfest statistics

2020-12-07 Thread Alvaro Herrera
On 2020-Dec-07, Tom Lane wrote:

> Anastasia Lubennikova  writes:

> > Firstly, we use it to track patches that we want to see in the nearest 
> > releases and concentrate our efforts on. And current CFM guideline [1] 
> > reflects this idea. It suggests, that after the commitfest closure date 
> > we relentlessly throw to RWF patches that got at least some feedback. To 
> > be honest, I was reluctant to return semi-ready patches, because it 
> > means that they will get lost somewhere in mailing lists. And it seems 
> > like other CFMs did the same.
> 
> Yeah, the aggressive policy suggested in "Sudden Death Overtime" is
> certainly not what's been followed lately.  I agree that that's
> probably too draconic.  On the other hand, if a patch sits in the
> queue for several CFs without getting committed, that suggests that
> maybe we ought to reject it on the grounds of "apparently nobody but
> the author cares about this".  That argument is easier to make for
> features than bug fixes of course, so maybe the policy needs to
> distinguish what kind of change is being considered.

Note that this checklist was written in 2013 and has never been updated
since then.  I think there is nothing in that policy that we do use.
I'm thinking that rather than try to fine-tune that document, we ought
to rewrite one from scratch.

For one thing, "a beer or three" only at end of CF is surely not
sufficient.




Re: range_agg

2020-12-07 Thread Alvaro Herrera
On 2020-Dec-08, Alexander Korotkov wrote:

> I also found a problem in multirange types naming logic.  Consider the
> following example.
> 
> create type a_multirange AS (x float, y float);
> create type a as range(subtype=text, collation="C");
> create table tbl (x __a_multirange);
> drop type a_multirange;
> 
> If you dump this database, the dump couldn't be restored.  The
> multirange type is named __a_multirange, because the type named
> a_multirange already exists.  However, it might appear that
> a_multirange type is already deleted.  When the dump is restored, a
> multirange type is named a_multirange, and the corresponding table
> fails to be created.  The same thing doesn't happen with arrays,
> because arrays are not referenced in dumps by their internal names.
> 
> I think we probably should add an option to specify multirange type
> names while creating a range type.  Then dump can contain exact type
> names used in the database, and restore wouldn't have a names
> collision.

Hmm, good point.  I agree that a dump must preserve the name, since once
created it is user-visible.  I had not noticed this problem, but it's
obvious in retrospect.

> In general, I wonder if we can make the binary format of multiranges
> more efficient.  It seems that every function involving multiranges
> starts with multirange_deserialize().  I think we can make functions like
> multirange_contains_elem() much more efficient.  Multirange is
> basically an array of ranges.  So we can pack it as follows.
> 1. Typeid and rangecount
> 2. Tightly packed array of flags (1-byte for each range)
> 3. Array of indexes of boundaries (4-byte for each range).  Or even
> better we can combine offsets and lengths to be compression-friendly
> like jsonb JEntry's do.
> 4. Boundary values
> Using this format, we can implement multirange_contains_elem(),
> multirange_contains_range() without deserialization and using binary
> search.  That would be much more efficient.  What do you think?

I also agree.  I spent some time staring at the I/O code a couple of
months back but was unable to focus on it for long enough.  I don't know
JEntry's format, but I do remember that the storage format for JSONB was
widely discussed back then; it seems wise to apply similar logic or at
least similar reasoning.




Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2020-12-08 Thread Alvaro Herrera
On 2020-Dec-08, tsunakawa.ta...@fujitsu.com wrote:

> From: Alvaro Herrera 
> > Does "ALTER TABLE ONLY parent" work correctly?  Namely, do not affect
> > existing partitions, but cause future partitions to acquire the new
> > setting.
> 
> Yes, it works correctly in the sense that ALTER TABLE ONLY on a
> partitioned table does nothing because it has no storage and therefore
> logged/unlogged has no meaning.

But what happens when you create another partition after you change the
"loggedness" of the partitioned table?

> > This sounds very much related to previous discussion on REPLICA IDENTITY
> > not propagating to partitions; see
> > https://postgr.es/m/201902041630.gpadougzab7v@alvherre.pgsql
> > particularly Robert Haas' comments at
> > http://postgr.es/m/CA+TgmoYjksObOzY8b1U1=BsP_m+702eOf42fqRtQTYio
> > 1nu...@mail.gmail.com
> > and downstream discussion from there.
> 
> There was a hot discussion.  I've read through it.  I hope I'm not
> doing strange in light of that discussion.

Well, my main take from that is we should strive to change all
subcommands together, in some principled way, rather than change only
one or some, in arbitrary ways.




Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2020-12-09 Thread Alvaro Herrera
On 2020-Dec-09, tsunakawa.ta...@fujitsu.com wrote:

> From: Alvaro Herrera 
> > But what happens when you create another partition after you change the
> > "loggedness" of the partitioned table?
> 
> The new partition will have a property specified when the user creates
> it.  That is, while the storage property of each storage unit
> (=partition) is basically independent, ALTER TABLE on a partitioned
> table is a convenient way to set the same property value to all its
> underlying storage units.

Well, that definition seems unfriendly to me.  I prefer the stance that
if you change the value for the parent, then future partitions inherit
that value.

> I got an impression from the discussion that some form of consensus on
> the principle was made and you were trying to create a fix for REPLICA
> IDENTITY.  Do you think the principle was unclear and we should state
> it first (partly to refresh people's memories)?

I think the principle was sound -- namely, that we should make all those
commands recurse by default, and for cases where it matters, the
parent's setting should determine the default for future children.

> I'm kind of for it, but I'm hesitant to create the fix for all ALTER
> actions, because it may take a lot of time and energy as you were
> afraid.  Can we define the principle, and then create individual fixes
> independently based on that principle?

That seems acceptable to me, as long as we get all changes in the same
release.  What we don't want (according to my reading of that
discussion) is to change the semantics of a few subcommands in one
release, and the semantics of a few other subcommands in another
release.




Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2020-12-09 Thread Alvaro Herrera
On 2020-Dec-09, Bharath Rupireddy wrote:

> I'm not sure how many more of such commands exist which require changes.

The other thread has a list.  I think it is complete, but if you want to
double-check, that would be helpful.

> How about doing it this way?
> 
> 1) Have a separate common thread listing the commands and specifying
> clearly the agreed behaviour for such commands.
> 2) Whenever the separate patches are submitted(hopefully) by the
> hackers for such commands, after the review we can make a note of that
> patch in the common thread.
> 3) Once the patches for all the listed commands are
> received(hopefully) , then they can be committed together in one
> release.

Sounds good.  I think this thread is a good place to collect those
patches, but if you would prefer to have a new thread, feel free to
start one (I'd suggest CC'ing me and Tsunakawa-san).




Re: Change default of checkpoint_completion_target

2020-12-10 Thread Alvaro Herrera
Howdy,

On 2020-Dec-10, Stephen Frost wrote:

> * Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> > On Tue, 2020-12-08 at 17:29 +, Bossart, Nathan wrote:
> > > +1 to setting checkpoint_completion_target to 0.9 by default.
> > 
> > +1 for changing the default or getting rid of it, as Tom suggested.
> 
> Attached is a patch to change it from a GUC to a compile-time #define
> which is set to 0.9, with accompanying documentation updates.

I think we should leave a doc stub or at least an index entry, to let
people know the GUC has been removed rather than just making it
completely invisible.  (Maybe piggyback on the stuff in [1]?)

[1] 
https://postgr.es/m/CAGRY4nyA=jmBNa4LVwgGO1GyO-RnFmfkesddpT_uO+3=mot...@mail.gmail.com





Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-11 Thread Alvaro Herrera
By the way--  What did you think of the idea of explicitly marking the
types used for bitmasks using types bits32 and friends, instead of plain
int, which is harder to spot?




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-15 Thread Alvaro Herrera
On 2020-Dec-12, Peter Eisentraut wrote:

> On 2020-12-11 21:27, Alvaro Herrera wrote:
> > By the way--  What did you think of the idea of explicitly marking the
> > types used for bitmasks using types bits32 and friends, instead of plain
> > int, which is harder to spot?
> 
> If we want to make it clearer, why not turn the thing into a struct, as in
> the attached patch, and avoid the bit fiddling altogether.

I don't like this idea too much, because adding an option causes an ABI
break.  I don't think we commonly add options in backbranches, but it
has happened.  The bitmask is much easier to work with in that regard.
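The concern in miniature (hypothetical types, not the actual patch): adding a struct member changes the struct's size, so an extension compiled against the old layout passes a too-small object to the new server, whereas adding a flag to an int-sized bitmask changes no type at all:

```c
/* release N */
typedef struct OptionsV1
{
    int flags;
} OptionsV1;

/* release N+1: a member was added in a back-branch fix -- ABI break */
typedef struct OptionsV2
{
    int flags;
    int added_option;
} OptionsV2;

/* the bitmask alternative: OPT_NEW added later, sizeof(int) unchanged */
#define OPT_VERBOSE (1 << 0)
#define OPT_NEW     (1 << 1)
```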





Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Alvaro Herrera
On 2020-Dec-23, Michael Paquier wrote:

>  bool
> -reindex_relation(Oid relid, int flags, int options)
> +reindex_relation(Oid relid, int flags, ReindexOptions *options)
>  {
>   Relationrel;
>   Oid toast_relid;

Wait a minute.  reindex_relation has 'flags' and *also* 'options' with
an embedded 'flags' member?  Surely that's not right.  I see that they
carry orthogonal sets of options ... but why aren't they a single
bitmask instead of two separate ones?  This looks weird and confusing.


Also: it seems a bit weird to me to put the flags inside the options
struct.  I would keep them separate -- so initially the options struct
would only have the tablespace OID, on API cleanliness grounds:

struct ReindexOptions
{
    Oid         tablespaceOid;
};
extern bool
reindex_relation(Oid relid, bits32 flags, ReindexOptions *options);

I guess you could argue that it's more performant to set up only two
arguments to the function call instead of three .. but I doubt that's
measurable for anything in DDL-land.

But also, are we really envisioning that these routines would have all
that many additional options?  Maybe it is sufficient to do just

extern bool
reindex_relation(Oid relid, bits32 flags, Oid tablespaceOid);




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Alvaro Herrera
On 2020-Dec-23, Justin Pryzby wrote:

> This was getting ugly:
> 
> extern void reindex_index(Oid indexId, bool skip_constraint_checks,
>   char relpersistence, int options, Oid 
> tablespaceOid);

Is this what I suggested?




Re: list of extended statistics on psql

2021-01-07 Thread Alvaro Herrera
On 2021-Jan-07, Tomas Vondra wrote:

> On 1/7/21 1:46 AM, Tatsuro Yamada wrote:

> > I overlooked the check for MCV in the logic building query
> > because I created the patch as a new feature on PG14.
> > I'm not sure whether we should do back patch or not. However, I'll
> > add the check on the next patch because it is useful if you decide to
> > do the back patch on PG10, 11, 12, and 13.
> 
> BTW perhaps a quick look at the other \d commands would show if there are
> precedents. I didn't have time for that.

Yes, we do promise that new psql works with older servers.

I think we would not backpatch any of this, though.

-- 
Álvaro Herrera




Re: Key management with tests

2021-01-07 Thread Alvaro Herrera
On 2021-Jan-07, Bruce Momjian wrote:

> All the tests pass now.  The current src/test directory is 19MB, and
> adding these tests takes it to 23MB, or a 20% increase.  That seems like
> a lot.  It is testing 128-bit and 256-bit keys --- should we do fewer
> tests, or just test 256, or use gzip to compress the tests by 50%? 
> (Does every platform have gzip?)

So the tests are about 95% of the patch ... do we really need that many
tests?

-- 
Álvaro Herrera




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-01-08 Thread Alvaro Herrera
On 2020-Dec-01, Alvaro Herrera wrote:

> On 2020-Nov-30, Justin Pryzby wrote:

> Thanks for all the comments. I'll incorporate everything and submit an
> updated version later.

Here's a rebased version 5, with the typos fixed.  More comments below.

> > The attname "detached" is a stretch of what's intuitive (it's more like
> > "detachING" or half-detached).  But I think psql should for sure show 
> > something
> > more obvious to users.  Esp. seeing as psql output isn't documented.  Let's
> > figure out what to show to users and then maybe rename the column that, too.
> 
> OK.  I agree that "being detached" is the state we want users to see, or
> maybe "detach pending", or "unfinisheddetach" (ugh).  I'm not sure that
> "pg_inherits.inhbeingdetached" is a great column name.  Opinions welcome.

I haven't changed this yet; I can't make up my mind about what I like
best.

Partition of: parent FOR VALUES IN (1) UNFINISHED DETACH
Partition of: parent FOR VALUES IN (1) UNDER DETACH
Partition of: parent FOR VALUES IN (1) BEING DETACHED

> > ATExecDetachPartition:
> > Doesn't this need to lock the table before testing for default partition ?
> 
> Correct, it does.

I failed to point out that by the time ATExecDetachPartition is called,
the relation has already been locked by the invoking ALTER TABLE support
code.

> > I ended up with apparently broken constraint when running multiple loops 
> > around
> > a concurrent detach / attach:
> > 
> > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR 
> > VALUES FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 CONCURRENTLY"; 
> > do :; done&
> > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR 
> > VALUES FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 CONCURRENTLY"; 
> > do :; done&
> > 
> > "p1_check" CHECK (true)
> > "p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2)
> 
> Not good.

Haven't had time to investigate this problem yet.

-- 
Álvaro Herrera
>From d21acb0e99ed3d296db70fed2d5d036d721d611b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 13 Jul 2020 20:15:30 -0400
Subject: [PATCH v5 1/2] Let ALTER TABLE exec routines deal with the relation

This means that ATExecCmd relies on AlteredRelationInfo->rel instead of
keeping the relation as a local variable; this is useful if the
subcommand needs to modify the relation internally.  For example, a
subcommand that internally commits its transaction needs this.
---
 src/backend/commands/tablecmds.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 993da56d43..a8528a3423 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -156,6 +156,8 @@ typedef struct AlteredTableInfo
 	Oid			relid;			/* Relation to work on */
 	char		relkind;		/* Its relkind */
 	TupleDesc	oldDesc;		/* Pre-modification tuple descriptor */
+	/* Transiently set during Phase 2, normally set to NULL */
+	Relation	rel;
 	/* Information saved by Phase 1 for Phase 2: */
 	List	   *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */
 	/* Information saved by Phases 1/2 for Phase 3: */
@@ -353,7 +355,7 @@ static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 	  AlterTableUtilityContext *context);
 static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			  AlterTableUtilityContext *context);
-static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
+static void ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 	  AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass,
 	  AlterTableUtilityContext *context);
 static AlterTableCmd *ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab,
@@ -4405,7 +4407,6 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 		{
 			AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);
 			List	   *subcmds = tab->subcmds[pass];
-			Relation	rel;
 			ListCell   *lcmd;
 
 			if (subcmds == NIL)
@@ -4414,10 +4415,10 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			/*
 			 * Appropriate lock was obtained by phase 1, needn't get it again
 			 */
-			rel = relation_open(tab->relid, NoLock);
+			tab->rel = relation_open(tab->relid, NoLock);
 
 			foreach(lcmd, subcmds)
-ATExecCmd(wqueue, tab, rel,
+ATExecCmd(wqueue, tab,
 		  castNode(AlterTableCmd, lfirst(lcmd)),
 		  lockmode, pass, context);
 
@@ -4429,7 +4430,11 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			if (pass == AT_PASS_ALTER_TYPE)
 ATPostAlterTypeCleanup(wqueue, tab, lockmode);
 
-			relation

Re: A failure of standby to follow timeline switch

2021-01-08 Thread Alvaro Herrera
Masao-san: Are you intending to act as committer for these?  Since the
bug is mine I can look into it, but since you already did all the
reviewing work, I'm good with you giving it the final push.

0001 looks good to me; let's get that one committed quickly so that we
can focus on the interesting stuff.  While the implementation of
find_in_log is quite dumb (not this patch's fault), it seems sufficient
to deal with small log files.  We can improve the implementation later,
if needed, but we have to get the API right on the first try.

0003: The fix looks good to me.  I verified that the test fails without
the fix, and it passes with the fix.


The test added in 0002 is a bit optimistic regarding timing, as well as
potentially slow; it loops 1000 times and sleeps 100 milliseconds each
time.  In a very slow server (valgrind or clobber_cache animals) this
could not be sufficient time, while on fast servers it may end up
waiting longer than needed.  Maybe we can do something like this:

for (my $i = 0; $i < 1000; $i++)
{
    my $current_log_size = determine_current_log_size();

    if ($node_standby_3->find_in_log(
            "requested WAL segment [0-9A-F]+ has already been removed",
            $logstart))
    {
        last;
    }
    elsif ($node_standby_3->find_in_log(
            "End of WAL reached on timeline",
            $logstart))
    {
        $success = 1;
        last;
    }
    $logstart = $current_log_size;

    # wait for the log file to grow before checking again
    while (determine_current_log_size() == $current_log_size)
    {
        usleep(10_000);
        # with a retry count?
    }
}

With test patch, make check PROVE_FLAGS="--timer" PROVE_TESTS=t/001_stream_rep.pl

ok 6386 ms ( 0.00 usr  0.00 sys +  1.14 cusr  0.93 csys =  2.07 CPU)
ok 6352 ms ( 0.00 usr  0.00 sys +  1.10 cusr  0.94 csys =  2.04 CPU)
ok 6255 ms ( 0.01 usr  0.00 sys +  0.99 cusr  0.97 csys =  1.97 CPU)

without test patch:

ok 4954 ms ( 0.00 usr  0.00 sys +  0.71 cusr  0.64 csys =  1.35 CPU)
ok 5033 ms ( 0.01 usr  0.00 sys +  0.71 cusr  0.73 csys =  1.45 CPU)
ok 4991 ms ( 0.01 usr  0.00 sys +  0.73 cusr  0.59 csys =  1.33 CPU)

-- 
Álvaro Herrera




Re: PATCH: Batch/pipelining support for libpq

2021-02-16 Thread Alvaro Herrera
On 2021-Feb-16, Craig Ringer wrote:

> FWIW I'm also thinking of revising the docs to mostly use the term
> "pipeline" instead of "batch". Use "pipelining and batching" in the chapter
> topic, and mention "batch" in the index, and add a para that explains how
> to run batches on top of pipelining, but otherwise use the term "pipeline"
> not "batch".

Hmm, this is a good point.  It means I have a lot of API renaming to do.
I'll get on it now and come back with a proposal.

-- 
Álvaro Herrera   Valdivia, Chile




Re: PATCH: Batch/pipelining support for libpq

2021-02-16 Thread Alvaro Herrera
Here's a new version, where I've renamed everything to "pipeline".  I
think the docs could use some additional tweaks now in order to make a
coherent story on pipeline mode, how it can be used in a batched
fashion, etc.

Here's the renames I applied.  It's mostly mechanical, except
PQbatchSendQueue is now PQsendPipeline:

PQBatchStatus -> PGpipelineStatus (enum)
PQBATCH_MODE_OFF -> PQ_PIPELINE_OFF
PQBATCH_MODE_ON -> PQ_PIPELINE_ON
PQBATCH_MODE_ABORTED -> PQ_PIPELINE_ABORTED
PQbatchStatus -> PQpipelineStatus (function)
PQenterBatchMode -> PQenterPipelineMode
PQexitBatchMode -> PQexitPipelineMode
PQbatchSendQueue -> PQsendPipeline
PGRES_BATCH_END -> PGRES_PIPELINE_END
PGRES_BATCH_ABORTED -> PGRES_PIPELINE_ABORTED

Also, PQbatchStatus(conn) returns enum PGpipelineStatus (it previously
returned int).

I'm tempted to rename PGASYNC_QUEUED to PGASYNC_PIPELINE_IDLE (not sure
if PGASYNC_PIPELINE_READY fits better with the existing one).


In pgbench, I changed the metacommands to be \startpipeline and
\endpipeline.  There's a failing Assert() there which I commented out;
it needs to be fixed.

-- 
Álvaro Herrera39°49'30"S 73°17'W
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index b7a82453f0..9f98257dbe 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3099,6 +3099,31 @@ ExecStatusType PQresultStatus(const PGresult *res);

   
  
+
+ 
+  PGRES_PIPELINE_END
+  
+   
+The PGresult represents the end of a pipeline.
+This status occurs only when pipeline mode has been selected.
+   
+  
+ 
+
+ 
+  PGRES_PIPELINE_ABORTED
+  
+   
+The PGresult represents a pipeline that's
+received an error from the server.  PQgetResult
+must be called repeatedly, and it will return this status code,
+until the end of the current pipeline, at which point it will return
+PGRES_PIPELINE_END and normal processing can
+resume.
+   
+  
+ 
+
 
 
 If the result status is PGRES_TUPLES_OK or
@@ -4857,6 +4882,494 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Pipeline Mode
+
+  
+   libpq
+   pipeline mode
+  
+
+  
+   pipelining
+   in libpq
+  
+
+  
+   batch mode
+   in libpq
+  
+
+  
+   libpq pipeline mode allows applications to
+   send a query without having to read the result of the previously
+   sent query.  Taking advantage of the pipeline mode, a client will wait
+   less for the server, since multiple queries/results can be
+   sent/received in a single network transaction.
+  
+
+  
+   While pipeline mode provides a significant performance boost, writing
+   clients using the pipeline mode is more complex because it involves
+   managing a queue of pending queries and finding which result
+   corresponds to which query in the queue.
+  
+
+  
+   Pipeline mode also generally consumes more memory on both the client and server,
+   though careful and aggressive management of the send/receive queue can mitigate
+   this.  This applies whether the connection is in blocking or non-blocking
+   mode.
+  
+
+  
+   Using Pipeline Mode
+
+   
+To issue pipelines the application must switch a connection into pipeline mode.
+Enter pipeline mode with PQenterPipelineMode
+or test whether pipeline mode is active with
+PQpipelineStatus.
+In pipeline mode, only asynchronous operations
+are permitted, and COPY is disallowed.
+Using any synchronous command execution functions,
+such as PQfn, or PQexec
+and its sibling functions, is an error condition.
+   
+
+   
+
+ It is best to use pipeline mode with libpq in
+ non-blocking mode. If used
+ in blocking mode it is possible for a client/server deadlock to occur.
+  
+   
+The client will block trying to send queries to the server, but the
+server will block trying to send results to the client from queries
+it has already processed. This only occurs when the client sends
+enough queries to fill its output buffer and the server's receive
+buffer before switching to processing input from the server,
+but it's hard to predict exactly when that will happen.
+   
+  
+
+   
+
+   
+Issuing Queries
+
+
+ After entering pipeline mode the application dispatches requests using
+ normal asynchronous libpq functions, such as:
+ PQsendQueryParams, PQsendPrepare,
+ PQsendQueryPrepared, PQsendDescribePortal,
+ and PQsendDescribePrepared.
+ The asynchronous requests are followed by a
+ PQsendPipeline
+ call to mark the end of the pipeline. The client need not
+ call PQgetResult immediately after
+ dispatching each operation.
+ Result processing
+ is handled separately.
+
+
+
+ The serve

Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Alvaro Herrera
Hello, thanks for looking at this patch.

On 2021-Feb-16, Zhihong Yu wrote:

> +   if (querymode == QUERY_SIMPLE)
> +   {
> +   commandFailed(st, "startpipeline", "cannot use pipeline mode
> with the simple query protocol");
> +   st->state = CSTATE_ABORTED;
> +   return CSTATE_ABORTED;
> 
> I wonder why the st->state is only assigned for this if block. The state is
> not set for other cases where CSTATE_ABORTED is returned.

Yeah, that's a simple oversight.  We don't need to set st->state,
because the caller sets it to the value we return.

> Should PQ_PIPELINE_OFF be returned for the following case ?
> 
> +PQpipelineStatus(const PGconn *conn)
> +{
> +   if (!conn)
> +   return false;

Yep.
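
Concretely, the guard would return the enum's "off" value rather than a
bare false, so the return type stays honest.  A self-contained sketch —
the types here are local stand-ins mirroring the names in the patch, not
the real libpq definitions:

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
    PQ_PIPELINE_OFF,
    PQ_PIPELINE_ON,
    PQ_PIPELINE_ABORTED
} PGpipelineStatus;

/* minimal stand-in for PGconn, just enough to carry the status */
typedef struct
{
    PGpipelineStatus pipelineStatus;
} FakeConn;

static PGpipelineStatus
pipeline_status_sketch(const FakeConn *conn)
{
    if (!conn)
        return PQ_PIPELINE_OFF;     /* not "false": keep the enum type */
    return conn->pipelineStatus;
}
```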

Thanks

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Alvaro Herrera
On 2021-Jan-21, Zhihong Yu wrote:

> It seems '\\gset or \\aset is not ' would correspond to the check more
> closely.
> 
> +   if (my_command->argc != 1)
> +   syntax_error(source, lineno, my_command->first_line,
> my_command->argv[0],
> 
> It is possible that my_command->argc == 0 (where my_command->argv[0]
> shouldn't be accessed) ?

No -- process_backslash_command reads the word and sets it as argv[0].

> +   appendPQExpBufferStr(&conn->errorMessage,
> + libpq_gettext("cannot queue commands
> during COPY\n"));
> +   return false;
> +   break;
> 
> Is the break necessary ? Similar comment for pqBatchProcessQueue().

Not necessary.  I removed them.

> +int
> +PQexitBatchMode(PGconn *conn)
> 
> Since either 0 or 1 is returned, maybe change the return type to bool.

I was kinda tempted to do that, but in reality a lot of libpq's API is
defined like that -- to return 0 (failure)/1 (success) as ints, not
bools.  For example see PQsendQuery().  I'm not inclined to do different
for these new functions.  (One curious case is PQsetvalue, which returns
int, and is documented as "returns zero for failure" and then uses
"return false").

> Also, the function operates on PGconn - should the function be static
> (pqBatchProcessQueue is) ?

I don't understand this suggestion.  How would an application call it,
if we make it static?

Thanks

-- 
Álvaro Herrera   Valdivia, Chile
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)




Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Alvaro Herrera
Hi,

On 2021-Feb-19, Zhihong Yu wrote:

> Hi,
> +static int pqBatchProcessQueue(PGconn *conn);
> 
> I was suggesting changing:
> 
> +int
> +PQexitBatchMode(PGconn *conn)
> 
> to:
> 
> +static int
> +PQexitBatchMode(PGconn *conn)

I understand that, but what I'm saying is that it doesn't work.
pqBatchProcessQueue can be static because it's only used internally in
libpq, but PQexitBatchMode is supposed to be called by the client
application.

-- 
Álvaro Herrera   Valdivia, Chile
"No necesitamos banderas
 No reconocemos fronteras"  (Jorge González)




Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Alvaro Herrera
On 2021-Feb-20, Craig Ringer wrote:

> On Wed, 17 Feb 2021, 07:13 Alvaro Herrera,  wrote:

> > I
> > think the docs could use some additional tweaks now in order to make a
> > coherent story on pipeline mode, how it can be used in a batched
> > fashion, etc.
> 
> I'll do that soon and send an update.

I started to do that, but was sidetracked having a look at error
handling while checking one of the claims.

So now it starts with this:

   
Issuing Queries


 After entering pipeline mode, the application dispatches requests using
 PQsendQueryParams,
 or its prepared-query sibling
 PQsendQueryPrepared.
 These requests are queued on the client-side until flushed to the server;
 this occurs when PQsendPipeline is used to
 establish a synchronization point in the pipeline,
 or when PQflush is called.
 The functions PQsendPrepare,
 PQsendDescribePrepared, and
 PQsendDescribePortal also work in pipeline mode.
 Result processing is described below.



 The server executes statements, and returns results, in the order the
 client sends them.  The server will begin executing the commands in the
 pipeline immediately, not waiting for the end of the pipeline.
 If any statement encounters an error, the server aborts the current
 transaction and skips processing commands in the pipeline until the
 next synchronization point established by 
PQsendPipeline.
 (This remains true even if the commands in the pipeline would rollback
 the transaction.)
 Query processing resumes after the synchronization point.


(Note I changed the wording that "the pipeline is ended by
PQsendPipeline" to "a synchronization point is established".  Is this
easily understandable?  On re-reading it, I'm not sure it's really an
improvement.)

BTW we don't explain why PQsendQuery doesn't work (to wit: because it
uses the "simple" query protocol and thus doesn't require the Sync message).
I think we should either document that, or change things so that
PQsendQuery no longer uses a 'Q' message when in pipeline mode; instead
use extended query protocol.  I don't see why we'd force people to use
PQsendQueryParams when not necessary.

BTW, in my experimentation, the sequence of PGresult that you get when
handling a pipeline with errors don't make a lot of sense.  I'll spend
some more time on it.

While at it, as for the PGresult sequence and NULL returns: I think for
PQexec and maybe PQsendQuery, it makes sense to loop until PQgetResult
returns NULL, because you never know how many results you are going to
get.  But for PQsendQueryParams, this is no longer true, because you
can't send multiple queries in one query string.  So the only way to get
multiple results is by using single-row mode.  But that already has its
own protocol for multiple results, namely to get a stream of
PGRES_SINGLE_TUPLE terminated by a zero-rows PGRES_TUPLES_OK.  So I'm
not sure there is a strong need for the mandatory NULL result.
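
The client-side complexity mentioned earlier — knowing which PGresult
belongs to which dispatched query — boils down to a FIFO queue that the
application pushes on send and pops on receive.  A self-contained sketch
with no libpq involved (all names hypothetical):

```c
#include <assert.h>
#include <string.h>

#define MAX_PENDING 16

/* stand-in for the per-query bookkeeping a pipeline-aware client keeps */
struct pipeline_queue
{
    const char *labels[MAX_PENDING];
    int         head;           /* next result to match */
    int         tail;           /* next free slot */
};

/* called when a query is dispatched */
static void
queue_push(struct pipeline_queue *pq, const char *label)
{
    pq->labels[pq->tail++ % MAX_PENDING] = label;
}

/* called once per arriving result: tells us which query produced it */
static const char *
queue_pop(struct pipeline_queue *pq)
{
    return pq->labels[pq->head++ % MAX_PENDING];
}
```

Results always arrive in dispatch order, so a plain FIFO suffices; an
aborted pipeline just means every entry popped between the error and the
next synchronization point is matched to an aborted-status result.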

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: Printing page request trace from buffer manager

2021-02-20 Thread Alvaro Herrera
On 2020-Dec-08, Irodotos Terpizis wrote:

> Initially, I modified the code within the BufferAlloc method in the
> bufmgr.c file,
> to log the pages that were requested and were already in the cache,
> the pages that were evicted and the pages that
> replaced them. However, I feel that this might not be the most optimal
> way, as the log file is a mess and
> it is hard to analyze. I was wondering if there is a more optimal way
> to achieve this.

Hi Irodotos,

Did you find an answer to this question?  Can you explain in what way
the log is a mess?

Maybe you need to start by defining how would you like the log file to
look.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"La conclusión que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusión de ellos" (Tanenbaum)




Re: Bizarre behavior of \w in a regular expression bracket construct

2021-02-21 Thread Alvaro Herrera
On 2021-Feb-21, Joel Jacobson wrote:

>regex| engine |deduced_ranges
> ++---
> ^([a-z])$  | pg | [a-z]
> ^([a-z])$  | pl | [a-z]
> ^([a-z])$  | v8 | [a-z]
> ^([\d-a])$ | pg |
> ^([\d-a])$ | pl | [-0-9a]
> ^([\d-a])$ | v8 | [-0-9a]
> ^([\w-;])$ | pg |
> ^([\w-;])$ | pl | [-0-9;A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
> ^([\w-;])$ | v8 | [-0-9;A-Z_a-z]
> ^([\w-_])$ | pg | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
> ^([\w-_])$ | pl | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
> ^([\w-_])$ | v8 | [-0-9A-Z_a-z]
> ^([\w])$   | pg | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
> ^([\w])$   | pl | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
> ^([\w])$   | v8 | [0-9A-Z_a-z]
> ^([\W])$   | pg |
> ^([\W])$   | pl | [\x01-/:-@[-^`{-©«-´¶-¹»-¿×÷]
> ^([\W])$   | v8 | [\x01-/:-@[-^`{-ÿ]
> ^([\w-a])$ | pg | [0-9A-Z_-zªµºÀ-ÖØ-öø-ÿ]
> ^([\w-a])$ | pl | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
> ^([\w-a])$ | v8 | [-0-9A-Z_a-z]

It looks like the interpretation of these other engines is that [\d-a]
is the set of \d, the literal character "-", and the literal character
"a".  In other words, the - preceded by \d or \w (or any other character
class, I guess?) loses its special meaning of identifying a character
range.

This one I didn't understand:
> ^([\W])$   | pg |

-- 
Álvaro Herrera   Valdivia, Chile
"Porque francamente, si para saber manejarse a uno mismo hubiera que
rendir examen... ¿Quién es el machito que tendría carnet?"  (Mafalda)




Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-02-23 Thread Alvaro Herrera
On 2021-Feb-05, Julien Rouhaud wrote:

> Thanks, that's way better, copied in v3.

Thank you, pushed.  The code changes are only relevant in master, but I
did back-patch the README.tuplock to all live branches.

> I'm still a bit worried about that description though, as that flag
> isn't consistently set for the FOR UPDATE case.  Well, to be more
> precise it's maybe consistently set when the hint bits are computed,
> but in some cases the flag is later cleared, so you won't reliably
> find it in the tuple.

Is that right?  I think compute_new_xmax_infomask would set the bit to
infomask2_old_tuple (its last output argument) if it's needed, and so
the bit would find its way to infomask2 eventually.  Do you have a case
where it doesn't achieve that?

-- 
Álvaro Herrera   Valdivia, Chile
"E pur si muove" (Galileo Galilei)




Re: Fix typo about generate_gather_paths

2021-02-23 Thread Alvaro Herrera
On 2020-Dec-22, Tomas Vondra wrote:

> Thanks. I started looking at this a bit more closely, and I think most of
> the changes are fine - the code was changed to call a different function,
> but the comments still reference generate_gather_paths().

Hi, this was forgotten.  It seemed better to fix at least some of the
wrong references than not do anything, so I pushed the parts that seemed
100% correct.  Regarding this one:

> The one exception seems to be create_ordered_paths(), because that comment
> also makes statements about what generate_gather_paths is doing. And some
> of it does not apply to generate_useful_gather_paths.
> For example it says it generates order-preserving Gather Merge paths, but
> generate_useful_gather_paths also generates paths with sorts (which are
> clearly not order-preserving).

I left this one out.  If Hou or Tomas want to propose/push a further
patch, that'd be great.

Thanks!

-- 
Álvaro Herrera39°49'30"S 73°17'W
"I'm always right, but sometimes I'm more right than other times."
  (Linus Torvalds)




Re: Bizarre behavior of \w in a regular expression bracket construct

2021-02-24 Thread Alvaro Herrera
On 2021-Feb-23, Tom Lane wrote:

> * Create infrastructure to allow treating \w as a character class
> in its own right.  (I did not expose [[:word:]] as a class name,
> though it would be a little more symmetric to do so; should we?)

Apparently [:word:] is a GNU extension (or at least a "bash-specific
character class"[1], though apparently Emacs also supports it); all the
others are mandated by POSIX[2].

[1] 
https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions
[2] 
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05

I think it'd be fine to expose [:word:] ...


> [1] https://www.regular-expressions.info/charclasssubtract.html

I had never heard of this subtraction thing.  Nightmarish and confusing
syntax, but useful.

> +Also, the character class shorthands \D
> +and \W will match a newline regardless of this mode.
> +(Before PostgreSQL 14, they did not match
> +newlines in newline-sensitive mode.)

This seems an acceptable change to me, but then I only work here.

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: [PATCH] pgbench: Remove ecnt, a member variable of CState

2021-02-26 Thread Alvaro Herrera
On 2021-Feb-26, Michael Paquier wrote:

> On Fri, Feb 26, 2021 at 05:39:31PM +0900, miyake_kouta wrote:
> > Also, the current pgbench's client abandons processing after hitting an
> > error, so this variable is not needed, I think.
> 
> Agreed.  Its last use was in 12788ae, as far as I can see.  So let's
> just cleanup that.

+1

-- 
Álvaro Herrera   Valdivia, Chile




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-02-26 Thread Alvaro Herrera
On 2021-Jan-10, Justin Pryzby wrote:

> On Fri, Jan 08, 2021 at 04:14:33PM -0300, Alvaro Herrera wrote:
> > > > I ended up with apparently broken constraint when running multiple 
> > > > loops around
> > > > a concurrent detach / attach:
> > > > 
> > > > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR 
> > > > VALUES FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 
> > > > CONCURRENTLY"; do :; done&
> > > > while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR 
> > > > VALUES FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 
> > > > CONCURRENTLY"; do :; done&
> > > > 
> > > > "p1_check" CHECK (true)
> > > > "p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2)
> > > 
> > > Not good.
> > 
> > Haven't had time to investigate this problem yet.
> 
> I guess it's because you committed the txn and released the lock in the
> middle of the command.

Hmm, but if we take this approach, then we're still vulnerable to the
problem that somebody can do DETACH CONCURRENTLY and cancel the wait (or
crash the server), then mess up the state before doing DETACH FINALIZE:
when they cancel the wait, the lock will be released.

I think the right fix is to disallow any action on a partition which is
pending detach other than DETACH FINALIZE.  (Didn't do that here.)

Here's a rebase to current sources; there are no changes from v5.

-- 
Álvaro Herrera   Valdivia, Chile
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)
>From d042b99ad3368ba0b658ac9260450ed8e39accea Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 13 Jul 2020 20:15:30 -0400
Subject: [PATCH v6 1/2] Let ALTER TABLE exec routines deal with the relation

This means that ATExecCmd relies on AlteredRelationInfo->rel instead of
keeping the relation as a local variable; this is useful if the
subcommand needs to modify the relation internally.  For example, a
subcommand that internally commits its transaction needs this.
---
 src/backend/commands/tablecmds.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b2457a6924..b990063d38 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -156,6 +156,8 @@ typedef struct AlteredTableInfo
 	Oid			relid;			/* Relation to work on */
 	char		relkind;		/* Its relkind */
 	TupleDesc	oldDesc;		/* Pre-modification tuple descriptor */
+	/* Transiently set during Phase 2, normally set to NULL */
+	Relation	rel;
 	/* Information saved by Phase 1 for Phase 2: */
 	List	   *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */
 	/* Information saved by Phases 1/2 for Phase 3: */
@@ -353,7 +355,7 @@ static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 	  AlterTableUtilityContext *context);
 static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			  AlterTableUtilityContext *context);
-static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
+static void ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 	  AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass,
 	  AlterTableUtilityContext *context);
 static AlterTableCmd *ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab,
@@ -4513,7 +4515,6 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 		{
 			AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);
 			List	   *subcmds = tab->subcmds[pass];
-			Relation	rel;
 			ListCell   *lcmd;
 
 			if (subcmds == NIL)
@@ -4522,10 +4523,10 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			/*
 			 * Appropriate lock was obtained by phase 1, needn't get it again
 			 */
-			rel = relation_open(tab->relid, NoLock);
+			tab->rel = relation_open(tab->relid, NoLock);
 
 			foreach(lcmd, subcmds)
-ATExecCmd(wqueue, tab, rel,
+ATExecCmd(wqueue, tab,
 		  castNode(AlterTableCmd, lfirst(lcmd)),
 		  lockmode, pass, context);
 
@@ -4537,7 +4538,11 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			if (pass == AT_PASS_ALTER_TYPE)
 ATPostAlterTypeCleanup(wqueue, tab, lockmode);
 
-			relation_close(rel, NoLock);
+			if (tab->rel)
+			{
+relation_close(tab->rel, NoLock);
+tab->rel = NULL;
+			}
 		}
 	}
 
@@ -4563,11 +4568,12 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
  * ATExecCmd: dispatch a subcommand to appropriate execution routine
  */
 static void
-ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
+ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 		  AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass,
 		  AlterTableUtilityContext *context)
 {
 	ObjectAddress

Re: [PATCH] Bug fix in initdb output

2021-03-01 Thread Alvaro Herrera
On 2021-Mar-01, Juan José Santamaría Flecha wrote:

> On Mon, Mar 1, 2021 at 5:52 AM Nitin Jadhav 
> wrote:
> 
> >
> >> Please share your thoughts on this. If we go ahead with this change,
> > then we need to back-patch. I would be happy to create those patches.
> 
> A full path works, even with the slashes. The committer will take care of
> back-patching, if needed. As for HEAD at least, this LGTM.

I don't get it.  I thought the Windows API accepted both forward slashes
and backslashes as path separators.  Did you try the command and see it
fail?

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Entristecido, Wutra (canción de Las Barreras)
echa a Freyr a rodar
y a nosotros al mar"




Re: [PATCH] Bug fix in initdb output

2021-03-01 Thread Alvaro Herrera
On 2021-Mar-01, Juan José Santamaría Flecha wrote:

> This is not a problem with the APi, but the shell. e.g. when using a CMD:
> 
> - This works:
> c:\>c:\Windows\System32\notepad.exe
> c:\>c:/Windows/System32/notepad.exe
> c:\>/Windows/System32/notepad.exe
> 
> - This doesn't:
> c:\>./Windows/System32/notepad.exe
> '.' is not recognized as an internal or external command,
> operable program or batch file.
> c:\>Windows/System32/notepad.exe
> 'Windows' is not recognized as an internal or external command,
> operable program or batch file.

Ah, so another way to fix it would be to make the path to pg_ctl be
absolute?

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: [PATCH] Bug fix in initdb output

2021-03-01 Thread Alvaro Herrera
On 2021-Mar-01, Juan José Santamaría Flecha wrote:

> On Mon, Mar 1, 2021 at 7:50 PM Alvaro Herrera 
> wrote:

> > Ah, so another way to fix it would be to make the path to pg_ctl be
> > absolute?
> 
> Yes, that's right. If you call initdb with an absolute path you won't see a
> problem.

So, is make_native_path a better fix than make_absolute_path?  (I find
it pretty surprising that initdb shows a relative path, but maybe that's
just me.)

-- 
Álvaro Herrera39°49'30"S 73°17'W
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it." (ncm, http://lwn.net/Articles/174769/)




Re: [PATCH] Bug fix in initdb output

2021-03-01 Thread Alvaro Herrera
On 2021-Mar-01, Juan José Santamaría Flecha wrote:

> Uhm, now that you point it out, an absolute path would make the message
> more consistent and reusable.

Well.  This code was introduced in a00c58314745, with discussion at
http://postgr.es/m/CAHeEsBeAe1FeBypT3E8R1ZVZU0e8xv3A-7BHg6bEOi=jzny...@mail.gmail.com
which did not touch on the point of the pg_ctl path being relative or
absolute.  The previous decision to use relative seems to have been made
here in commit ee814b4511ec, which was backed by this discussion
https://www.postgresql.org/message-id/flat/200411020134.52513.peter_e%40gmx.net

So I'm not sure that anybody would love me if I change it again to
absolute.

-- 
Álvaro Herrera   Valdivia, Chile
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)




Re: [PATCH] Bug fix in initdb output

2021-03-02 Thread Alvaro Herrera
On 2021-Mar-02, Nitin Jadhav wrote:

> > FWIW, I don't think that it is a good idea to come back to this
> > decision for *nix platforms, so I would let it as-is, and use
> > relative paths if initdb is called using a relative path.
> 
> The command to be displayed either in absolute path or relative path
> depends on the way the user is calling initdb. I agree with the above
> comment to keep this behaviour as-is, that is if the initdb is called
> using relative path, then it displays relative path otherwise absolute
> path.

Yeah.

> > However, if you can write something that makes the path printed
> > compatible for a WIN32 terminal while keeping the behavior
> > consistent with other platforms, people would have no objections to
> > that.
> 
> I feel the patch attached above handles this scenario.

Agreed.  I'll push the original patch then.  Thanks everybody for the
discussion.

-- 
Álvaro Herrera        39°49'30"S 73°17'W
"Porque francamente, si para saber manejarse a uno mismo hubiera que
rendir examen... ¿Quién es el machito que tendría carnet?"  (Mafalda)




Re: WIP: BRIN multi-range indexes

2021-03-02 Thread Alvaro Herrera
On 2021-Mar-03, Tomas Vondra wrote:

> 1) The 0001 patch allows passing of all scan keys to BRIN opclasses,
> which is needed for the minmax-multi to work. But it also modifies the
> existing opclasses (minmax and inclusion) to do this - but for those
> opclasses it does not make much difference, I think. Initially this was
> done because the patch did this for all opclasses, but then we added the
> detection based on number of parameters. So I wonder if we should just
> remove this, to make the patch a bit smaller. It'll also test the other
> code path (for opclasses without the last parameter).

I think it makes sense to just do them all in one pass.  I think trying
to keep the old way working just because it's how it was working
previously does not have much benefit.  I don't think we care about the
*patch* being small as much as the resulting *code* being as simple as
possible (but no simpler).


> 2) This needs bsearch_arg, but we only have that defined/used in
> extended_statistics_internal.h - it seems silly to include that here, as
> this has nothing to do with extended stats, so I simply added another
> prototype (which gets linked correctly). But, I suppose a better way
> would be to define our "portable" variant pg_bsearch_arg, next to
> pg_qsort etc.

Yeah, I think it makes sense to move bsearch_arg to a place where it's
more generally accessible (src/port/bsearch_arg.c I suppose), and make
both places use that.

-- 
Álvaro Herrera   Valdivia, Chile




Re: WIP: BRIN multi-range indexes

2021-03-02 Thread Alvaro Herrera
On 2021-Mar-03, Tomas Vondra wrote:

> That's kinda my point - I agree the size of the patch is not the primary
> concern, but it makes the minmax/inclusion code a bit more complicated
> (because they now have to loop over the keys), with very little benefit
> (there might be some speedup, but IMO it's rather negligible).

Yeah, OK.

> Alternatively we could simply remove the code supporting the old API
> with "consistent" functions without the additional parameter. But the
> idea was to seamlessly support existing opclasses / not breaking them
> unnecessarily (I know we don't guarantee that in major upgrades, but as
> they may not benefit from this, why break them?). It'd simplify the code
> in brin.c a little bit, but the opclasses a bit more complex.

Well, I doubt any opclass-support functions exist outside of core.
Or am I just outdated and we do know of some?

-- 
Álvaro Herrera   Valdivia, Chile
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)




Re: Huge memory consumption on partitioned table with FKs

2021-03-03 Thread Alvaro Herrera
On 2021-Mar-03, Amit Langote wrote:

> I don't know of any unaddressed comments on the patch, so I've marked
> the entry Ready for Committer.

Thanks, I'll look at it later this week.

-- 
Álvaro Herrera        39°49'30"S 73°17'W
#error "Operator lives in the wrong universe"
  ("Use of cookies in real-time system development", M. Gleixner, M. Mc Guire)




Re: PATCH: Batch/pipelining support for libpq

2021-03-03 Thread 'Alvaro Herrera'
On 2021-Mar-03, k.jami...@fujitsu.com wrote:

> I tried applying this patch to test it on top of Iwata-san's libpq trace log 
> [1].
> In my environment, the compiler complains.
> It seems that in libpqwalreceiver.c: libpqrcv_exec()
> the switch for PQresultStatus needs to handle the
> cases for PGRES_PIPELINE_END and PGRES_PIPELINE_ABORTED too.

Yeah, I noticed this too and decided to handle these cases as errors.

Here's v28, which includes that fix, and also gets rid of the
PGASYNC_QUEUED state; I instead made the rest of the code work using the
existing PGASYNC_IDLE state.  This allows us to get rid of a few places
where we had to report "internal error: IDLE state unexpected", which
was rather pointless.  With the current formulation, both pipeline mode
and normal mode share pretty much all state.

Also, I've renamed PGRES_PIPELINE_END to PGRES_PIPELINE_SYNC, because
that's closer to what it means: that result state by no means indicates
that pipeline mode has ended.  It only means that you called
PQsendPipeline(), so the server gives you a Sync message.

I'm much more comfortable with this version, so I'm marking the patch as
Ready for Committer in case anybody else wants to look at this before I
push it.

However: on errors, I think it's a bit odd that you get a NULL from
PQgetResult after getting PGRES_PIPELINE_ABORTED.  Maybe I should
suppress that.  I'm no longer wrestling with the NULL after
PGRES_PIPELINE_SYNC, however, because that's not critical: you can
simply not ask for it and things work fine.

(Also, the test pipeline.c program in src/test/modules/test_libpq has a
new "transaction" mode that is not exercised by the TAP program; I'll
plug that in before commit.)

-- 
Álvaro Herrera   Valdivia, Chile
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index cc20b0f234..4ea8cb8c93 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3199,6 +3199,32 @@ ExecStatusType PQresultStatus(const PGresult *res);

   
  
+
+ 
+  PGRES_PIPELINE_SYNC
+  
+   
+The PGresult represents a point in a
+pipeline where a synchronization point has been established.
+This status occurs only when pipeline mode has been selected.
+   
+  
+ 
+
+ 
+  PGRES_PIPELINE_ABORTED
+  
+   
+The PGresult represents a pipeline that's
+received an error from the server.  PQgetResult
+must be called repeatedly, and it will return this status code,
+until the end of the current pipeline, at which point it will return
+PGRES_PIPELINE_SYNC and normal processing can
+resume.
+   
+  
+ 
+
 
 
 If the result status is PGRES_TUPLES_OK or
@@ -4957,6 +4983,498 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Pipeline Mode
+
+  
+   libpq
+   pipeline mode
+  
+
+  
+   pipelining
+   in libpq
+  
+
+  
+   batch mode
+   in libpq
+  
+
+  
+   libpq pipeline mode allows applications to
+   send a query without having to read the result of the previously
+   sent query.  Taking advantage of the pipeline mode, a client will wait
+   less for the server, since multiple queries/results can be sent/
+   received in a single network transaction.
+  
+
+  
+   While pipeline mode provides a significant performance boost, writing
+   clients using the pipeline mode is more complex because it involves
+   managing a queue of pending queries and finding which result
+   corresponds to which query in the queue.
+  
+
+  
+   Pipeline mode also generally consumes more memory on both the client and server,
+   though careful and aggressive management of the send/receive queue can mitigate
+   this.  This applies whether or not the connection is in blocking or non-blocking
+   mode.
+  
+
+  
+   Using Pipeline Mode
+
+   
+To issue pipelines the application must switch a connection into pipeline mode.
+Enter pipeline mode with 
+or test whether pipeline mode is active with
+.
+In pipeline mode, only asynchronous operations
+are permitted, and COPY is disallowed.
+Using any synchronous command execution functions,
+such as PQfn, or PQexec
+and its sibling functions, is an error condition.
+   
+
+   
+
+ It is best to use pipeline mode with libpq in
+ non-blocking mode. If used
+ in blocking mode it is possible for a client/server deadlock to occur.
+  
+   
+The client will block trying to send queries to the server, but the
+server will block trying to send results to the client from queries
+it has already processed. This only occurs when the client sends
+enough queries to fill both its output buffer and the server's receive
+buffer before switching to processing input from the server,
+but it's hard to predict exactly when that will happen.
+   
+  
+ 

Re: PATCH: Batch/pipelining support for libpq

2021-03-03 Thread 'Alvaro Herrera'
On 2021-Mar-03, 'Alvaro Herrera' wrote:

> I'm much more comfortable with this version, so I'm marking the patch as
> Ready for Committer in case anybody else wants to look at this before I
> push it.

Actually, I just noticed a pretty serious problem, which is that when we
report an error, we don't set "conn->errQuery" to the current query but
to the *previous* query, leading to non-sensical error reports like

$ ./pipeline pipeline_abort
aborted pipeline... ERROR:  no existe la función no_such_function(integer)
LINE 1: INSERT INTO pq_pipeline_demo(itemno) VALUES ($1);
   ^
HINT:  Ninguna función coincide en el nombre y tipos de argumentos. Puede ser 
necesario agregar conversión explícita de tipos.
ok


(Sorry about the Spanish there, but the gist of the problem is that
we're positioning the error cursor on a query that succeeded prior to
the query that raised the error.)

This should obviously not occur.  I'm trying to figure out how to repair
it and not break everything again ...

-- 
Álvaro Herrera   Valdivia, Chile




Re: PATCH: Batch/pipelining support for libpq

2021-03-03 Thread 'Alvaro Herrera'
On 2021-Mar-03, 'Alvaro Herrera' wrote:

> This should obviously not occur.  I'm trying to figure out how to repair
> it and not break everything again ...

I think trying to set up the connection state so that the next query
appears in conn->last_query prior to PQgetResult being called again
leads to worse breakage.  The simplest fix seems to make fe-protocol3.c
aware that in this case, the query is in conn->cmd_queue_head instead,
as in the attached patch.

-- 
Álvaro Herrera        39°49'30"S 73°17'W
diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index 584922b340..b89f500902 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -962,9 +962,21 @@ pqGetErrorNotice3(PGconn *conn, bool isError)
 	 * Save the active query text, if any, into res as well; but only if we
 	 * might need it for an error cursor display, which is only true if there
 	 * is a PG_DIAG_STATEMENT_POSITION field.
+	 *
+	 * Note that in pipeline mode, we have not yet advanced the query pointer
+	 * to the next query, so we have to look at that.
 	 */
-	if (have_position && conn->last_query && res)
-		res->errQuery = pqResultStrdup(res, conn->last_query);
+	if (have_position && res)
+	{
+		if (conn->pipelineStatus != PQ_PIPELINE_OFF &&
+			conn->asyncStatus == PGASYNC_IDLE)
+		{
+			if (conn->cmd_queue_head)
+res->errQuery = pqResultStrdup(res, conn->cmd_queue_head->query);
+		}
+		else if (conn->last_query)
+			res->errQuery = pqResultStrdup(res, conn->last_query);
+	}
 
 	/*
 	 * Now build the "overall" error message for PQresultErrorMessage.


Re: PATCH: Batch/pipelining support for libpq

2021-03-03 Thread 'Alvaro Herrera'
On 2021-Mar-03, 'Alvaro Herrera' wrote:

> On 2021-Mar-03, 'Alvaro Herrera' wrote:
> 
> > This should obviously not occur.  I'm trying to figure out how to repair
> > it and not break everything again ...
> 
> I think trying to set up the connection state so that the next query
> appears in conn->last_query prior to PQgetResult being called again
> leads to worse breakage.  The simplest fix seems to make fe-protocol3.c
> aware that in this case, the query is in conn->cmd_queue_head instead,
> as in the attached patch.

I wonder if it would make sense to get rid of conn->last_query
completely and just rely always on conn->cmd_queue_head, where the
normal (non-pipeline) would use the first entry of the command queue.
That might end up being simpler than pipeline mode "pretending" to take
over ->last_query.

-- 
Álvaro Herrera        39°49'30"S 73°17'W
"La verdad no siempre es bonita, pero el hambre de ella sí"




Re: make coverage-html would fail within build directory separate from source tree

2021-03-04 Thread Alvaro Herrera
On 2021-Mar-04, walker wrote:

> Hi, hackers
> 
> During installation from source code, I created a build directory separate 
> from the source tree, and execute the following command in the build 
> directory:
> /home/postgres/postgresql-13.2/configure --enable-coverage
> make
> make check
> make coverage-html
> 
> 
> However, while executing make coverage-html, it failed with the following 
> error messages:
> /bin/lcov --gcov-tool /bin/gcov -q --no-external -c -i -d . -d 
> /home/postgres/postgresql-13.2/ -o lcov_base.info
> ...
> geninfo: ERROR: no .gcno files found in /home/postgres/postgresql-13.2/!
> make: *** [lcov_base.info] Error 255
> make: *** Deleting file 'lcov_base.info'

Hmm, it works fine for me.  config.log says I do this (in
/pgsql/build/master-coverage):

  $ /pgsql/source/REL_13_STABLE/configure --enable-debug --enable-depend 
--enable-cassert --enable-coverage 
--cache-file=/home/alvherre/run/pgconfig.master-coverage.cache 
--enable-thread-safety --enable-tap-tests --with-python --with-perl --with-tcl 
--with-openssl --with-libxml --with-tclconfig=/usr/lib/tcl8.6 
PYTHON=/usr/bin/python3 --prefix=/pgsql/install/master-coverage 
--with-pgport=55451

I do run "make install" too, though (and "make -C contrib install").
Not sure if that makes a difference.  
But for sure there are no .gcno files in the source dir -- they're all
in the build dir.


-- 
Álvaro Herrera   Valdivia, Chile
"La persona que no quería pecar / estaba obligada a sentarse
 en duras y empinadas sillas/ desprovistas, por cierto
 de blandos atenuantes"  (Patricio Vogel)




Re: make coverage-html would fail within build directory separate from source tree

2021-03-04 Thread Alvaro Herrera
On 2021-Mar-04, Alvaro Herrera wrote:

> On 2021-Mar-04, walker wrote:
> 
> > Hi, hackers
> > 
> > During installation from source code, I created a build directory separate 
> > from the source tree, and execute the following command in the build 
> > directory:
> > /home/postgres/postgresql-13.2/configure --enable-coverage
> > make
> > make check
> > make coverage-html
> > 
> > 
> > However, while executing make coverage-html, it failed with the following 
> > error messages:
> > /bin/lcov --gcov-tool /bin/gcov -q --no-external -c -i -d . -d 
> > /home/postgres/postgresql-13.2/ -o lcov_base.info
> > ...
> > geninfo: ERROR: no .gcno files found in /home/postgres/postgresql-13.2/!

"make coverage-html" outputs this: note that I get a WARNING about the
source directory rather than an ERROR:

/usr/bin/lcov --gcov-tool /usr/bin/gcov -q --no-external -c -i -d . -d 
/pgsql/source/REL_13_STABLE/ -o lcov_base.info
geninfo: WARNING: no .gcno files found in /pgsql/source/REL_13_STABLE/ - 
skipping!
/usr/bin/lcov --gcov-tool /usr/bin/gcov -q --no-external -c -d . -d 
/pgsql/source/REL_13_STABLE/ -o lcov_test.info
geninfo: WARNING: no .gcda files found in /pgsql/source/REL_13_STABLE/ - 
skipping!
rm -rf coverage
/usr/bin/genhtml -q --legend -o coverage --title='PostgreSQL 13.2' 
--num-spaces=4  lcov_base.info lcov_test.info
touch coverage-html-stamp

(In my system, /pgsql is a symlink to /home/alvhere/Code/pgsql/)

$ geninfo --version
geninfo: LCOV version 1.13


-- 
Álvaro Herrera   Valdivia, Chile




Re: make coverage-html would fail within build directory separate from source tree

2021-03-04 Thread Alvaro Herrera
On 2021-Mar-04, walker wrote:

> thanks for your reply. It is indeed the case that there are no .gcno
> files in the source tree directory; they're in the build tree
> directory, which results in failures.
> 
> 
> That's a bit weird.
> 
> 
> Add more detailed testing steps:
> mkdir build_dir
> cd build_dir
> /home/postgres/postgresql-13.2/configure --enable-coverage
> make
> make check
> make coverage-html

Hmm, my build dir is not inside the source dir -- is yours?  Maybe that
makes a difference?  Also, what version of lcov do you have?


-- 
Álvaro Herrera        39°49'30"S 73°17'W
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."  (Brian Kernighan)




Re: make coverage-html would fail within build directory separate from source tree

2021-03-04 Thread Alvaro Herrera
On 2021-Mar-04, walker wrote:

> The same, the build directory is outside the source tree.
> 
> 
> the version of lcov is 1.10

That seems *really* ancient.  Please try with a fresher version.


-- 
Álvaro Herrera        39°49'30"S 73°17'W
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep."(Robert Davidson)
   http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php




Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2021-03-04 Thread Alvaro Herrera
On 2021-Mar-04, Ibrar Ahmed wrote:

> The build is failing for this patch, can you please take a look at this?
> 
> https://cirrus-ci.com/task/4568547922804736
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.129221
> 
> 
> I am marking the patch "Waiting on Author"

I don't know why you chose to respond to such an old message, but here's
a rebase which also includes the workaround I suggested last night for
the problem of the outdated query printed by error messages, as well as
Justin's suggested fixes.  (I also made a few tweaks to the TAP test).

-- 
Álvaro Herrera   Valdivia, Chile
"No renuncies a nada. No te aferres a nada."




Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2021-03-04 Thread Alvaro Herrera
Apparently, the archives system or the commitfest system is not picking
up new messages to the thread, so the CF app is trying to apply a
very old patch version.  I'm not sure what's up with that.  Thomas, any
clues on where to look?

-- 
Álvaro Herrera   Valdivia, Chile
"Oh, great altar of passive entertainment, bestow upon me thy discordant images
at such speed as to render linear thought impossible" (Calvin a la TV)




Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2021-03-04 Thread Alvaro Herrera
> I think it's just because you forgot the patch.
> https://www.postgresql.org/message-id/20210304142627.GA5978%40alvherre.pgsql


-- 
Álvaro Herrera        39°49'30"S 73°17'W
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 0553279314..c16befa314 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3173,6 +3173,33 @@ ExecStatusType PQresultStatus(const PGresult *res);

   
  
+
+ 
+  PGRES_PIPELINE_SYNC
+  
+   
+The PGresult represents a
+synchronization point in pipeline mode, requested by 
+.
+This status occurs only when pipeline mode has been selected.
+   
+  
+ 
+
+ 
+  PGRES_PIPELINE_ABORTED
+  
+   
+The PGresult represents a pipeline that has
+received an error from the server.  PQgetResult
+must be called repeatedly, and each time it will return this status code
+until the end of the current pipeline, at which point it will return
+PGRES_PIPELINE_SYNC and normal processing can
+resume.
+   
+  
+ 
+
 
 
 If the result status is PGRES_TUPLES_OK or
@@ -4919,6 +4946,498 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Pipeline Mode
+
+  
+   libpq
+   pipeline mode
+  
+
+  
+   pipelining
+   in libpq
+  
+
+  
+   batch mode
+   in libpq
+  
+
+  
+   libpq pipeline mode allows applications to
+   send a query without having to read the result of the previously
+   sent query.  Taking advantage of the pipeline mode, a client will wait
+   less for the server, since multiple queries/results can be sent/
+   received in a single network transaction.
+  
+
+  
+   While pipeline mode provides a significant performance boost, writing
+   clients using the pipeline mode is more complex because it involves
+   managing a queue of pending queries and finding which result
+   corresponds to which query in the queue.
+  
+
+  
+   Pipeline mode also generally consumes more memory on both the client and server,
+   though careful and aggressive management of the send/receive queue can mitigate
+   this.  This applies whether or not the connection is in blocking or non-blocking
+   mode.
+  
+
+  
+   Using Pipeline Mode
+
+   
+To issue pipelines, the application must switch a connection into pipeline mode.
+Enter pipeline mode with 
+or test whether pipeline mode is active with
+.
+In pipeline mode, only asynchronous operations
+are permitted, and COPY is disallowed.
+Using any synchronous command execution functions,
+such as PQfn, or PQexec
+and its sibling functions, is an error condition.
+   
+
+   
+
+ It is best to use pipeline mode with libpq in
+ non-blocking mode. If used
+ in blocking mode it is possible for a client/server deadlock to occur.
+  
+   
+The client will block trying to send queries to the server, but the
+server will block trying to send results to the client from queries
+it has already processed. This only occurs when the client sends
+enough queries to fill both its output buffer and the server's receive
+buffer before it switches to processing input from the server,
+but it's hard to predict exactly when that will happen.
+   
+  
+
+   
+
+   
+Issuing Queries
+
+
+ After entering pipeline mode, the application dispatches requests using
+ , 
+ or its prepared-query sibling
+ .
+ These requests are queued on the client-side until flushed to the server;
+ this occurs when  is used to
+ establish a synchronization point in the pipeline,
+ or when  is called.
+ The functions ,
+ , and
+  also work in pipeline mode.
+ Result processing is described below.
+
+
+
+ The server executes statements, and returns results, in the order the
+ client sends them.  The server will begin executing the commands in the
+ pipeline immediately, not waiting for the end of the pipeline.
+ If any statement encounters an error, the server aborts the current
+ transaction and skips processing commands in the pipeline until the
+ next synchronization point established by PQsendPipeline.
+ (This remains true even if the commands in the pipeline would rollback
+ the transaction.)
+ Query processing resumes after the synchronization point.
+
+
+
+ It's fine for one operation to depend on the results of a
+ prior one; for example, one query may define a table that the next
+ query in the same pipeline uses. Similarly, an application may
+ create a named prepared statement and execute it with later
+ statements in the same pipeline.
+
+   
+
+   
+Processing Results
+
+
+ To process the result of one query in a pipeline, the 

Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2021-03-04 Thread Alvaro Herrera
v30 contains changes to hopefully make it build on MSVC.

-- 
Álvaro Herrera   Valdivia, Chile
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 0553279314..c16befa314 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3173,6 +3173,33 @@ ExecStatusType PQresultStatus(const PGresult *res);

   
  
+
+ 
+  PGRES_PIPELINE_SYNC
+  
+   
+The PGresult represents a
+synchronization point in pipeline mode, requested by 
+.
+This status occurs only when pipeline mode has been selected.
+   
+  
+ 
+
+ 
+  PGRES_PIPELINE_ABORTED
+  
+   
+The PGresult represents a pipeline that has
+received an error from the server.  PQgetResult
+must be called repeatedly, and each time it will return this status code
+until the end of the current pipeline, at which point it will return
+PGRES_PIPELINE_SYNC and normal processing can
+resume.
+   
+  
+ 
+
 
 
 If the result status is PGRES_TUPLES_OK or
@@ -4919,6 +4946,498 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Pipeline Mode
+
+  
+   libpq
+   pipeline mode
+  
+
+  
+   pipelining
+   in libpq
+  
+
+  
+   batch mode
+   in libpq
+  
+
+  
+   libpq pipeline mode allows applications to
+   send a query without having to read the result of the previously
+   sent query.  Taking advantage of the pipeline mode, a client will wait
+   less for the server, since multiple queries/results can be sent/
+   received in a single network transaction.
+  
+
+  
+   While pipeline mode provides a significant performance boost, writing
+   clients using the pipeline mode is more complex because it involves
+   managing a queue of pending queries and finding which result
+   corresponds to which query in the queue.
+  
+
+  
+   Pipeline mode also generally consumes more memory on both the client and server,
+   though careful and aggressive management of the send/receive queue can mitigate
+   this.  This applies whether or not the connection is in blocking or non-blocking
+   mode.
+  
+
+  
+   Using Pipeline Mode
+
+   
+To issue pipelines, the application must switch a connection into pipeline mode.
+Enter pipeline mode with 
+or test whether pipeline mode is active with
+.
+In pipeline mode, only asynchronous operations
+are permitted, and COPY is disallowed.
+Using any synchronous command execution functions,
+such as PQfn, or PQexec
+and its sibling functions, is an error condition.
+   
+
+   
+
+ It is best to use pipeline mode with libpq in
+ non-blocking mode. If used
+ in blocking mode it is possible for a client/server deadlock to occur.
+  
+   
+The client will block trying to send queries to the server, but the
+server will block trying to send results to the client from queries
+it has already processed. This only occurs when the client sends
+enough queries to fill both its output buffer and the server's receive
+buffer before it switches to processing input from the server,
+but it's hard to predict exactly when that will happen.
+   
+  
+
+   
+
+   
+Issuing Queries
+
+
+ After entering pipeline mode, the application dispatches requests using
+ , 
+ or its prepared-query sibling
+ .
+ These requests are queued on the client-side until flushed to the server;
+ this occurs when  is used to
+ establish a synchronization point in the pipeline,
+ or when  is called.
+ The functions ,
+ , and
+  also work in pipeline mode.
+ Result processing is described below.
+
+
+
+ The server executes statements, and returns results, in the order the
+ client sends them.  The server will begin executing the commands in the
+ pipeline immediately, not waiting for the end of the pipeline.
+ If any statement encounters an error, the server aborts the current
+ transaction and skips processing commands in the pipeline until the
+ next synchronization point established by PQsendPipeline.
+ (This remains true even if the commands in the pipeline would rollback
+ the transaction.)
+ Query processing resumes after the synchronization point.
+
+
+
+ It's fine for one operation to depend on the results of a
+ prior one; for example, one query may define a table that the next
+ query in the same pipeline uses. Similarly, an application may
+ create a named prepared statement and execute it with later
+ statements in the same pipeline.
+
+   
+
+   
+Processing Results
+
+
+ To process the result of one query in a pipeline, the application calls
+ PQgetResult repeatedly and handles each result
+ until PQgetResult re

Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2021-03-04 Thread Alvaro Herrera
On 2021-Mar-04, Alvaro Herrera wrote:

> v30 contains changes to hopefully make it build on MSVC.

Hm, that didn't work -- appveyor still says:

Project "C:\projects\postgresql\pgsql.sln" (1) is building 
"C:\projects\postgresql\pipeline.vcxproj" (75) on node 1 (default targets).
PrepareForBuild:
  Creating directory ".\Release\pipeline\".
  Creating directory ".\Release\pipeline\pipeline.tlog\".
InitializeBuildStatus:
  Creating ".\Release\pipeline\pipeline.tlog\unsuccessfulbuild" because 
"AlwaysCreate" was specified.
ClCompile:
  C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_amd64\CL.exe 
/c /Isrc/include /Isrc/include/port/win32 /Isrc/include/port/win32_msvc /Zi 
/nologo /W3 /WX- /Ox /D WIN32 /D _WINDOWS /D __WINDOWS__ /D __WIN32__ /D 
WIN32_STACK_RLIMIT=4194304 /D _CRT_SECURE_NO_DEPRECATE /D 
_CRT_NONSTDC_NO_DEPRECATE /D _MBCS /GF /Gm- /EHsc /MD /GS /fp:precise 
/Zc:wchar_t /Zc:forScope /Fo".\Release\pipeline\\" 
/Fd".\Release\pipeline\vc120.pdb" /Gd /TC /wd4018 /wd4244 /wd4273 /wd4102 
/wd4090 /wd4267 /errorReport:queue /MP src/test/modules/test_libpq/pipeline.c
  pipeline.c
src/test/modules/test_libpq/pipeline.c(11): fatal error C1083: Cannot open 
include file: 'libpq-fe.h': No such file or directory 
[C:\projects\postgresql\pipeline.vcxproj]
Done Building Project "C:\projects\postgresql\pipeline.vcxproj" (default 
targets) -- FAILED.
Project "C:\projects\postgresql\pgsql.sln" (1) is building 
"C:\projects\postgresql\test_parser.vcxproj" (76) on node 1 (default targets).

I think the problem is that the project is called pipeline and not test_libpq,
so there's no match in the name.  I'm going to rename the whole thing to
src/test/modules/libpq_pipeline/ and see if the msvc tooling likes that
better.


-- 
Álvaro Herrera        39°49'30"S 73°17'W



