Re: Catalog invalidations vs catalog scans vs ScanPgRelation()

2020-04-07 Thread Andres Freund
Hi,

Robert, Tom, it'd be great if you could look through this thread. I
think there's a problem here (and it has gotten worse after the
introduction of catalog snapshots). Both of you at least dabbled in
related code.


On 2020-02-29 12:17:07 -0800, Andres Freund wrote:
> On 2020-02-28 22:10:52 -0800, Andres Freund wrote:
> > So, um. What happens is that doDeletion() does a catalog scan, which
> > sets a snapshot. The results of that catalog scan are then used to
> > perform modifications. But at that point there's no guarantee that we
> > still hold *any* snapshot, as e.g. invalidations can trigger the catalog
> > snapshot being released.
> > 
> > I don't see how that's safe. Without ->xmin preventing that,
> > intermediate row versions that we did look up could just get vacuumed
> > away, and replaced with a different row. That does seem like a serious
> > issue?
> > 
> > I think there's likely a lot of places that can hit this? What makes it
> > safe for InvalidateCatalogSnapshot()->SnapshotResetXmin() to release
> > ->xmin when there previously has been *any* catalog access? Because in
> > contrast to normal table modifications, there's currently nothing at all
> > forcing us to hold a snapshot between catalog lookups and their
> > modifications?
> > 
> > Am I missing something? Or is this a fairly significant hole in our
> > arrangements?
> 
> I still think that's true.  In a first iteration I hacked around the
> problem by explicitly registering a catalog snapshot in
> RemoveTempRelations(). That *sometimes* allows us to get through the
> regression tests without the assertions triggering.

The attached two patches (they're not meant to be applied) reliably get
through the regression tests. But I suspect I'd have to at least do a
CLOBBER_CACHE_ALWAYS run to find all the actually vulnerable places.


> But I don't think that's good enough (even if we fixed the other
> potential crashes similarly). The only reason that avoids the asserts is
> because in nearly all other cases there's also a user snapshot that's
> pushed. But that pushed snapshot can have an xmin that's newer than the
> catalog snapshot, which means we're still in danger of tids from catalog
> scans being outdated.
> 
> My preliminary conclusion is that it's simply not safe to do
> SnapshotResetXmin() from within InvalidateCatalogSnapshot(),
> PopActiveSnapshot(), UnregisterSnapshotFromOwner() etc. Instead we need
> to defer the SnapshotResetXmin() call until at least
> CommitTransactionCommand()? Outside of that there ought (with exception
> of multi-transaction commands, but they have to be careful anyway) to be
> no "in progress" sequences of related catalog lookups/modifications.
> 
> Alternatively we could ensure that all catalog lookup/mod sequences
> ensure that the first catalog snapshot is registered. But that seems
> like a gargantuan task?

I also just noticed comments of this style in a few places:
 * Start a transaction so we can access pg_database, and get a snapshot.
 * We don't have a use for the snapshot itself, but we're interested in
 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 * for anything that reads heap pages, because HOT may decide to prune
 * them even if the process doesn't attempt to modify any tuples.)
followed by code like

StartTransactionCommand();
(void) GetTransactionSnapshot();

rel = table_open(DatabaseRelationId, AccessShareLock);
scan = table_beginscan_catalog(rel, 0, NULL);

which is not safe at all, unfortunately. The snapshot is not
pushed/active, therefore invalidations processed e.g. as part of the
table_open() could execute an InvalidateCatalogSnapshot(), which in turn
would remove the catalog snapshot from the pairing heap and call
SnapshotResetXmin().  And poof, the backend's xmin is gone.
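
For illustration, a minimal sketch (not the attached patch itself) of the
defensive shape such code needs, keeping a snapshot registered so that
SnapshotResetXmin() cannot release the backend's xmin mid-sequence:

	StartTransactionCommand();
	snapshot = RegisterSnapshot(GetTransactionSnapshot());

	rel = table_open(DatabaseRelationId, AccessShareLock);
	scan = table_beginscan_catalog(rel, 0, NULL);

	/* ... use the scan results, perform the dependent modifications ... */

	table_endscan(scan);
	table_close(rel, AccessShareLock);

	/* only now is it safe to let the xmin be released */
	UnregisterSnapshot(snapshot);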

Greetings,

Andres Freund
From 3b58990c088936122f38d855a5a3900602deacf7 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH 1/2] TMP: work around missing snapshot registrations.

This is just what's hit by the tests. It's not an actual fix.
---
 src/backend/catalog/namespace.c |  7 +++
 src/backend/catalog/pg_subscription.c   |  4 
 src/backend/commands/indexcmds.c|  9 +
 src/backend/commands/tablecmds.c|  8 
 src/backend/replication/logical/tablesync.c | 12 
 src/backend/replication/logical/worker.c|  4 
 src/backend/utils/time/snapmgr.c|  4 
 7 files changed, 48 insertions(+)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 2ec23016fe5..e4696d8d417 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -55,6 +55,7 @@
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/varlena.h"
 
@@ -4244,12 +4

Re: [HACKERS] Restricting maximum keep segments by repslots

2020-04-07 Thread Kyotaro Horiguchi
At Tue, 07 Apr 2020 12:09:05 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> > it seems to me that it suffices to check restart_lsn for being invalid
> > in the couple of places where the slot's owner advances (which is the
> > two auxiliary functions for ProcessStandbyReplyMessage).  I have done so
> > in the attached.  There are other places where the restart_lsn is set,
> > but those seem to be used only when the slot is created.  I don't think
> > we need to cover for those, but I'm not 100% sure about that.
> 
> StartLogicalReplication does
> "XLogBeginRead(..., MyReplicationSlot->data.restart_lsn)". If the
> restart_lsn is invalid, the following call to XLogReadRecord runs into
> an assertion failure.  Walsender (or StartLogicalReplication) should
> correctly reject reconnection from the subscriber if restart_lsn is
> invalid.
> 
> > However, the change in PhysicalConfirmReceivedLocation() breaks
> > the way slots work for pg_basebackup: apparently the slot is created
> > with a restart_lsn of Invalid and we only advance it the first time we
> > process a feedback message from pg_basebackup.  I have a vague feeling
> > that that's bogus, but I'll have to look at the involved code a little
> > bit more closely to be sure about this.
> 
> Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot?

I did that in the attached. The 'invalidated' flag is a shared-but-not-saved
member of a slot, initialized to false and then irreversibly changed to
true when the slot loses a required segment.

It is checked by the new function CheckReplicationSlotInvalidated() when
acquiring a slot and when updating a slot from a standby reply message.
This change stops the walsender without explicitly killing it, but I
didn't remove that code.
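
In rough outline, the check amounts to something like this (a sketch of
the idea; the exact code is in the attached patch):

	static void
	CheckReplicationSlotInvalidated(ReplicationSlot *slot)
	{
		bool		invalidated;

		SpinLockAcquire(&slot->mutex);
		invalidated = slot->invalidated;
		SpinLockRelease(&slot->mutex);

		if (invalidated)
			ereport(ERROR,
					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
					 errmsg("replication slot \"%s\" is invalidated",
							NameStr(slot->data.name))));
	}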

When a logical slot loses a segment, the publisher complains as follows:


[backend  ] LOG:  slot "s1" is invalidated at 0/370001C0 due to exceeding 
max_slot_wal_keep_size
[walsender] FATAL:  terminating connection due to administrator command

The subscriber tries to reconnect and that fails as follows:

[19350] ERROR:  replication slot "s1" is invalidated
[19352] ERROR:  replication slot "s1" is invalidated
...

If the publisher restarts, that message is not seen; the following
appears instead.

[19372] ERROR:  requested WAL segment 00010037 has already been 
removed

The check is done at ReplicationSlotAcquire, so some slot-related SQL
functions are affected as well.

=# select pg_replication_slot_advance('s1', '0/3700');
ERROR:  replication slot "s1" is invalidated

After restarting the publisher, the message changes to the same one seen
from the walsender.

=# select pg_replication_slot_advance('s1', '0/380001C0');
ERROR:  requested WAL segment pg_wal/00010037 has already been 
removed

Since I didn't touch restart_lsn at all, there is no fear of inadvertently
changing other behavior.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 3f81c5740ea3554835bbe794820624b56c9c3ea8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 7 Apr 2020 11:09:54 +0900
Subject: [PATCH] further change type 2

---
 src/backend/access/transam/xlog.c   | 17 +
 src/backend/replication/slot.c  | 59 -
 src/backend/replication/walsender.c |  2 +
 src/include/replication/slot.h  |  7 
 4 files changed, 68 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f28ffaab9..c5b96126ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9559,20 +9559,21 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 	XLByteToSeg(recptr, currSegNo, wal_segment_size);
 	segno = currSegNo;
 
+	/*
+	 * Calculate how many segments are kept by slots first, adjusting
+	 * for max_slot_wal_keep_size.
+	 */
 	keep = XLogGetReplicationSlotMinimumLSN();
-
-	/*
-	 * Calculate how many segments are kept by slots first.
-	 */
-	/* Cap keepSegs by max_slot_wal_keep_size */
 	if (keep != InvalidXLogRecPtr)
 	{
 		XLByteToSeg(keep, segno, wal_segment_size);
 
-		/* Reduce it if slots already reserves too many. */
+		/* Cap by max_slot_wal_keep_size ... */
 		if (max_slot_wal_keep_size_mb >= 0)
 		{
-			XLogRecPtr slot_keep_segs =
+			XLogRecPtr	slot_keep_segs;
+
+			slot_keep_segs =
 ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);
 
 			if (currSegNo - segno > slot_keep_segs)
@@ -9580,7 +9581,7 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 		}
 	}
 
-	/* but, keep at least wal_keep_segments segments if any */
+	/* but, keep at least wal_keep_segments if that's set */
 	if (wal_keep_segments > 0 && currSegNo - segno < wal_keep_segments)
 	{
 		/* avoid underflow, don't go below 1 */
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 86ddff8b9d..0a28b27607 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -277,6 +277,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	StrNCpy(NameStr(slot->data

Re: proposal \gcsv

2020-04-07 Thread Pavel Stehule
út 7. 4. 2020 v 2:28 odesílatel Tom Lane  napsal:

> Here's a WIP patch for the parenthesized-options route.
>
> I realized that if we make the options be single words in the form
> name=value, we can easily handle the shortcut forms with no value.
> So that's what this does.
>
> What this does *not* do is offer any solution to the question of
> how to put a right paren as the last character of a pset option
> value.  I don't really see any easy way to handle that, but maybe
> we can punt for now.
>
> Also no docs or test cases, but I see no point in putting effort into
> that in advance of consensus that this is what we want.
>
> 0001 is some save/restore infrastructure that we'd need for pretty
> much all of the proposals on the table, and then 0002 improves the
> command itself.
>

looks good

just a note on the syntax

your patch supports the syntax

(option1=value option2=value)

That looks a little bit inconsistent and unusual.

Wouldn't a comma-separated list be better?

(option1=value, option2=value)

Regards

Pavel

>
> regards, tom lane
>
>


Re: Don't try fetching future segment of a TLI.

2020-04-07 Thread Kyotaro Horiguchi
At Tue, 7 Apr 2020 12:15:00 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/04/07 4:04, David Steele wrote:
> > On 4/6/20 1:43 PM, Fujii Masao wrote:
> >>
> >>
> >> On 2020/03/19 22:22, Pavel Suderevsky wrote:
> >>> Hi,
> >>>
> >>> I've tested patch provided by Kyotaro and do confirm it fixes the
> >>> issue.
> >>
> >> The patch looks good to me. Attached is the updated version of the
> >> patch.
> >> I updated only comments.
> >>
> >> Barring any objection, I will commit this patch.
> > The patch looks good to me.
> > 
> >>> Any chance it will be merged to one of the next minor releases?
> >>
> >> This doesn't seem like a bug, so I'm thinking of merging this into
> >> the next *major* version release, i.e., v13.
> > Not a bug, perhaps, but I think we do consider back-patching
> > performance problems. The rise in S3 usage has just exposed how poorly
> > this code performs in high-latency environments.
> 
> I understood the situation and am fine to back-patch that. But I'm not
> sure
> if it's fair to do that. Maybe we need to hear more opinions about
> this?
> OTOH, feature freeze for v13 is today, so what about committing the
> patch
> in v13 at first, and then doing the back-patch after hearing opinions
> and
> receiving many +1?

+1 for committing only to v13 today, then back-patching if people want
and/or accept it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-07 Thread Masahiko Sawada
On Tue, 7 Apr 2020 at 02:40, Peter Geoghegan  wrote:
>
> On Mon, Apr 6, 2020 at 2:21 AM Amit Kapila  wrote:
> > AFAIU, it uses heapam_index_build_range_scan but for writing to index,
> > it doesn't use buffer manager.
>
> Right. It doesn't need to use the buffer manager to write to the
> index, unlike (say) GIN's CREATE INDEX.

Hmm, after more thought and testing, it seems to me that parallel
btree index creation uses the buffer manager while scanning the table in
parallel, i.e. in heapam_index_build_range_scan, which affects
shared_blks_xxx in pg_stat_statements. I've run some parallel create index
tests with the current HEAD and with the attached patch. The table has
44248 blocks.

HEAD, no workers:

-[ RECORD 1 ]---+--
total_plan_time | 0
total_plan_time | 0
shared_blks_hit | 148
shared_blks_read| 44281
total_read_blks | 44429
shared_blks_dirtied | 44261
shared_blks_written | 24644
wal_records | 71693
wal_num_fpw | 71682
wal_bytes   | 566815038

HEAD, 4 workers:

-[ RECORD 1 ]---+--
total_plan_time | 0
total_plan_time | 0
shared_blks_hit | 160
shared_blks_read| 8892
total_read_blks | 9052
shared_blks_dirtied | 8871
shared_blks_written | 5342
wal_records | 71693
wal_num_fpw | 71682
wal_bytes   | 566815038

The WAL usage statistics are good, but the buffer usage statistics seem
incorrect.

Patched, no workers:

-[ RECORD 1 ]---+--
total_plan_time | 0
total_plan_time | 0
shared_blks_hit | 148
shared_blks_read| 44281
total_read_blks | 44429
shared_blks_dirtied | 44261
shared_blks_written | 24843
wal_records | 71693
wal_num_fpw | 71682
wal_bytes   | 566815038

Patched, 4 workers:

-[ RECORD 1 ]---+--
total_plan_time | 0
total_plan_time | 0
shared_blks_hit | 172
shared_blks_read| 44282
total_read_blks | 44454
shared_blks_dirtied | 44261
shared_blks_written | 26968
wal_records | 71693
wal_num_fpw | 71682
wal_bytes   | 566815038

Buffer usage statistics seem correct. The small differences would be the
catalog lookups Peter mentioned.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


bufferusage_create_index_v4.patch
Description: Binary data


Re: [patch] Fix pg_checksums to allow checking of offline base backup directories

2020-04-07 Thread Michael Paquier
On Mon, Apr 06, 2020 at 01:26:17PM +0200, Michael Banck wrote:
> I think we can allow checking of base backups if we make sure
> backup_label exists in the data directory or am I missing something?
> I think we need to have similar checks about pages changed during base
> backup, so this patch ignores checksum failures between the checkpoint
> LSN and (as a reasonable upper bound) the last LSN of the last existing
> transaction log file. If no xlog files exist (the --wal-method=none
> case), the last LSN of the checkpoint WAL segment is taken.

Have you considered that backup_label files can exist in the data
directory of a live cluster?  That's not the case with pg_basebackup
or non-exclusive backups with the SQL interface, but that's possible
with the SQL interface and an exclusive backup running.

FWIW, my take on this matter is that you should consider checksum
verification as one step to check the sanity of a base backup, meaning
that you have to restore the base backup first, then let it reach its
consistent LSN position, and finally stop the cluster cleanly to make
sure that everything is safely flushed on disk and consistent.
Attempting to verify checksums from a raw base backup would most
likely lead to false positives, and my guess is that your patch has
issues in this area.  Hint at quick glance: the code path setting
insertLimitLSN, where you actually don't use any APIs from
xlogreader.h.
--
Michael


signature.asc
Description: PGP signature


Re: Don't try fetching future segment of a TLI.

2020-04-07 Thread Michael Paquier
On Tue, Apr 07, 2020 at 12:15:00PM +0900, Fujii Masao wrote:
> I understood the situation and am fine to back-patch that. But I'm not sure
> if it's fair to do that. Maybe we need to hear more opinions about this?
> OTOH, feature freeze for v13 is today, so what about committing the patch
> in v13 at first, and then doing the back-patch after hearing opinions and
> receiving many +1?

I have not looked at the patch, so I cannot say much about it.  But it
is annoying to fetch segments you are not going to need anyway if you
target recovery with a timeline older than the segments fetched, and
this has a cost when you pay for the bandwidth of your environment
with only one archive location.  So a backpatch sounds like a good
thing to do even if recovery is not broken per se, only slower.

Designing a TAP test for that is tricky, but you could look at the
logs of the backend to make sure that only the wanted segments are
fetched, with a central archive solution and multiple timelines
involved.  And costly it is.
--
Michael


signature.asc
Description: PGP signature


Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-07 Thread Amit Kapila
On Tue, Apr 7, 2020 at 1:30 PM Masahiko Sawada
 wrote:
>
> Buffer usage statistics seem correct. The small differences would be
> catalog lookups Peter mentioned.
>

Agreed, but can you check which part of the code does that lookup?  I want
to see if we can avoid counting that in the buffer usage stats, or at least
write a comment about it; otherwise, we might have to face this question
again and again.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: EINTR while resizing dsm segment.

2020-04-07 Thread Nicola Contu
So that seems to be a bug, correct?
Just to confirm, I am not using NFS, it is directly on disk.

Other than that, is there a particular option we can set in
postgresql.conf to mitigate the issue?

Thanks a lot for your help.


Il giorno sab 4 apr 2020 alle ore 02:49 Thomas Munro 
ha scritto:

> On Thu, Apr 2, 2020 at 9:25 PM Kyotaro Horiguchi
>  wrote:
> > I provided the subject, and added -hackers.
> >
> > > Hello,
> > > I am running postgres 11.5 and we were having issues with shared
> segments.
> > > So I increased the max_connection as suggested by you guys and reduced
> my
> > > work_mem to 600M.
> > >
> > > Right now instead, it is the second time I see this error :
> > >
> > > ERROR:  could not resize shared memory segment
> "/PostgreSQL.2137675995" to
> > > 33624064 bytes: Interrupted system call
> >
> > The function posix_fallocate is protected against EINTR.
> >
> > | do
> > | {
> > |   rc = posix_fallocate(fd, 0, size);
> > | } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
> >
> > But not for ftruncate and write. Don't we need to protect them from
> > EINTR as in the attached?
>
> We don't handle EINTR for write() generally because that's not
> supposed to be necessary on local files (local disks are not "slow
> devices", and we document that if you're using something like NFS you
> should use its "hard" mount option so that it behaves that way too).
> As for ftruncate(), you'd think it'd be similar, and I can't think of
> a more local filesystem than tmpfs (where POSIX shmem lives on Linux),
> but I can't seem to figure that out from reading man pages; maybe I'm
> reading the wrong ones.  Perhaps in low memory situations, an I/O wait
> path reached by ftruncate() can return EINTR here rather than entering
> D state (non-interruptable sleep) or restarting due to our SA_RESTART
> flag... anyone know?
>
> Another thought: is there some way for the posix_fallocate() retry
> loop to exit because (ProcDiePending || QueryCancelPending), but then
> for CHECK_FOR_INTERRUPTS() to do nothing, so that we fall through to
> reporting the EINTR?
>
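
For reference, the protection being discussed would mirror the existing
posix_fallocate() loop, applied to ftruncate(); a sketch of the shape of
such a fix, not the exact code in the attachment:

	int		rc;

	do
	{
		rc = ftruncate(fd, (off_t) size);
	} while (rc < 0 && errno == EINTR &&
			 !(ProcDiePending || QueryCancelPending));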


Re: adding partitioned tables to publications

2020-04-07 Thread Peter Eisentraut

On 2020-04-07 08:44, Amit Langote wrote:

I updated the patch to make the following changes:

* Rewrote the tests to match in style with those committed yesterday
* Renamed all variables that had pubasroot in it to have pubviaroot
instead to match the publication parameter
* Updated pg_publication catalog documentation


Thanks.  I have some further questions:

The change in nodeModifyTable.c to add CheckValidResultRel() is unclear. 
 It doesn't seem to do anything, and it's not clear how it's related to 
this patch.


The changes in GetRelationPublications() are confusing to me:

+   if (published_rels)
+   {
+   num = list_length(result);
+   for (i = 0; i < num; i++)
+   *published_rels = lappend_oid(*published_rels, relid);
+   }

This adds relid to the output list "num" times, where num is the number 
of publications found.  Shouldn't "i" be used in the loop somehow? 
Similarly later in the function.


The descriptions of the new fields in RelationSyncEntry don't seem to 
match the code accurately, or at least it's confusing. 
replicate_as_relid is always filled in with an ancestor, even if 
pubviaroot is not set.


I think the pubviaroot field is actually not necessary.  We only need 
replicate_as_relid.


There is a markup typo in logical-replication.sgml:



In pg_dump, you missed updating a branch for an older version.  See 
attached patch.


Also attached a patch to rephrase the psql output a bit to make it not 
so long.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 572549aa23d4e3fa2d1abc0733d33f28cb692c40 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 7 Apr 2020 10:56:11 +0200
Subject: [PATCH 1/2] fixup! Allow publishing partition changes via ancestors

---
 src/bin/pg_dump/pg_dump.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4d03608596..c579227b19 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3891,14 +3891,14 @@ getPublications(Archive *fout)
 	appendPQExpBuffer(query,
 					  "SELECT p.tableoid, p.oid, p.pubname, "
 					  "(%s p.pubowner) AS rolname, "
-					  "p.puballtables, p.pubinsert, p.pubupdate, p.pubdelete, p.pubtruncate, false as pubviaroot "
+					  "p.puballtables, p.pubinsert, p.pubupdate, p.pubdelete, p.pubtruncate, false AS pubviaroot "
 					  "FROM pg_publication p",
 					  username_subquery);
 	else
 		appendPQExpBuffer(query,
 						  "SELECT p.tableoid, p.oid, p.pubname, "
 						  "(%s p.pubowner) AS rolname, "
-						  "p.puballtables, p.pubinsert, p.pubupdate, p.pubdelete, false AS pubtruncate "
+						  "p.puballtables, p.pubinsert, p.pubupdate, p.pubdelete, false AS pubtruncate, false AS pubviaroot "
 						  "FROM pg_publication p",
 						  username_subquery);
 
-- 
2.26.0

From 3e12e0ff53e4d1aa5b78b6b7fa181e79ca280ef0 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 7 Apr 2020 10:58:55 +0200
Subject: [PATCH 2/2] fixup! Allow publishing partition changes via ancestors

---
 src/bin/psql/describe.c   |  4 +-
 src/test/regress/expected/publication.out | 72 +++
 2 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c731ed6322..f05e914b4d 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5741,7 +5741,7 @@ listPublications(const char *pattern)
 	if (pset.sversion >= 13)
 		appendPQExpBuffer(&buf,
 						  ",\n  pubviaroot AS \"%s\"",
-						  gettext_noop("Publishes Using Root Schema"));
+						  gettext_noop("Via root"));
 
 	appendPQExpBufferStr(&buf,
 						 "\nFROM pg_catalog.pg_publication\n");
@@ -5874,7 +5874,7 @@ describePublications(const char *pattern)
 	if (has_pubtruncate)
 		printTableAddHeader(&cont, gettext_noop("Truncates"), true, align);
 	if (has_pubviaroot)
-		printTableAddHeader(&cont, gettext_noop("Publishes Using Root Schema"), true, align);
+		printTableAddHeader(&cont, gettext_noop("Via root"), true, align);
 
printTableAddCell(&cont, PQgetvalue(res,

Re: [patch] Fix pg_checksums to allow checking of offline base backup directories

2020-04-07 Thread Michael Banck
Hi,

On Tuesday, 2020-04-07 at 17:07 +0900, Michael Paquier wrote:
> On Mon, Apr 06, 2020 at 01:26:17PM +0200, Michael Banck wrote:
> > I think we can allow checking of base backups if we make sure
> > backup_label exists in the data directory or am I missing something?
> > I think we need to have similar checks about pages changed during base
> > backup, so this patch ignores checksum failures between the checkpoint
> > LSN and (as a reasonable upper bound) the last LSN of the last existing
> > transaction log file. If no xlog files exist (the --wal-method=none
> > case), the last LSN of the checkpoint WAL segment is taken.
> 
> Have you considered that backup_label files can exist in the data
> directory of a live cluster?  That's not the case with pg_basebackup
> or non-exclusive backups with the SQL interface, but that's possible
> with the SQL interface and an exclusive backup running.

I see, that's what I was missing. I think it is unfortunate that
pg_control does not record an ongoing (base) backup in its state field or
elsewhere. Maybe one could look at the `BACKUP METHOD' field in
backup_label, which is (always?) `pg_start_backup' for non-exclusive
backups and `streamed' for pg_basebackup.
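
A minimal sketch of such a check (the helper is hypothetical, assuming the
usual backup_label line format):

	/* returns true if backup_label says the backup was taken by pg_basebackup */
	static bool
	backup_label_is_streamed(FILE *lf)
	{
		char		line[1024];

		while (fgets(line, sizeof(line), lf) != NULL)
		{
			if (strncmp(line, "BACKUP METHOD: ", 15) == 0)
				return strncmp(line + 15, "streamed", 8) == 0;
		}
		return false;
	}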

> FWIW, my take on this matter is that you should consider checksum
> verification as one step to check the sanity of a base backup, meaning
> that you have to restore the base backup first, then let it reach its
> consistent LSN position, and finally stop the cluster cleanly to make
> sure that everything is safely flushed on disk and consistent.

That's a full restore and it should certainly be encouraged that
organizations do full restore tests regularly, but (not only) if you
have lots of big instances, that is often not the case.

So I think making it easier to check plain base backups would be
helpful, even if some part of recently changed data might not get
checked.

> Attempting to verify checksums from a raw base backup would most
> likely lead to false positives, and my guess is that your patch has
> issues in this area.  Hint at quick glance: the code path setting
> insertLimitLSN where you actually don't use any APIs from
> xlogreader.h.

I evaluated using xlogreader to fetch the BACKUP STOP position from the
WAL but then discarded that for now as possibly being overkill and went
the route of a slightly larger upper bound by taking the following WAL
segment and not the BACKUP STOP position. But I can take a look at
implementing the more fine-grained method if needed.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID number: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz





Re: WAL usage calculation patch

2020-04-07 Thread Julien Rouhaud
On Tue, Apr 7, 2020 at 4:36 AM Amit Kapila  wrote:
>
> On Mon, Apr 6, 2020 at 7:58 PM Euler Taveira
>  wrote:
> >
> > On Mon, 6 Apr 2020 at 10:37, Julien Rouhaud  wrote:
> >>
> >> On Mon, Apr 06, 2020 at 10:12:55AM -0300, Euler Taveira wrote:
> >> > On Mon, 6 Apr 2020 at 00:25, Amit Kapila  wrote:
> >> >
> >> > >
> >> > > I have pushed pg_stat_statements and Explain related patches.  I am
> >> > > now looking into (auto)vacuum patch and have few comments.
> >> > >
> >> > > I wasn't paying much attention to this thread. May I suggest changing
> >> > wal_num_fpw to wal_fpw? wal_records and wal_bytes do not have a prefix
> >> > 'num'. It seems inconsistent to me.
> >> >
> >>
> >> If we want to be consistent, shouldn't we rename it to wal_fpws?  FTR
> >> I don't much like either version.
> >
> >
> > Since FPW is an acronym, plural form reads better when you are using 
> > uppercase (such as FPWs or FPW's); thus, I prefer singular form because 
> > parameter names are lowercase. Function description will clarify that this 
> > is "number of WAL full page writes".
> >
>
> I like Euler's suggestion to change wal_num_fpw to wal_fpw.  It is
> better if others who didn't like this name can also share their
> opinion now because changing multiple times the same thing is not a
> good idea.

+1

About Justin and your comments on the other thread:

On Tue, Apr 7, 2020 at 4:31 AM Amit Kapila  wrote:
>
> On Mon, Apr 6, 2020 at 10:04 PM Justin Pryzby  wrote:
> >
> > On Thu, Apr 02, 2020 at 08:29:31AM +0200, Julien Rouhaud wrote:
> > > > > "full page records" seems to be showing the number of full page
> > > > > images, not the record having full page images.
> > > >
> > > > I am not sure what exactly is a difference but it is the records
> > > > having full page images.  Julien correct me if I am wrong.
> >
> > > Obviously previous complaints about the meaning and parsability of
> > > "full page writes" should be addressed here for consistency.
> >
> > There's a couple of places that say "full page image records", which I
> > think is language you were trying to avoid.  It's the number of pages,
> > not the number of records, no?  I see explain and autovacuum say what I
> > think is wanted, but these say the wrong thing?  Find attached a
> > slightly larger patch.
> >
> > $ git grep 'image record'
> > contrib/pg_stat_statements/pg_stat_statements.c:int64   
> > wal_num_fpw;/* # of WAL full page image records generated */
> > doc/src/sgml/ref/explain.sgml:  number of records, number of full page 
> > image records and amount of WAL
> >
>
> Few comments:
> 1.
> - int64 wal_num_fpw; /* # of WAL full page image records generated */
> + int64 wal_num_fpw; /* # of WAL full page images generated */
>
> Let's change comment as " /* # of WAL full page writes generated */"
> to be consistent with other places like instrument.h.  Also, make a
> similar change at other places if required.

Agreed.  That's pg_stat_statements.c and instrument.h.  I'll send a
patch once we reach consensus with the rest of the comments.

> 2.
>
> -Total amount of WAL bytes generated by the statement
> +Total number of WAL bytes generated by the statement
>
>
> I feel the previous text was better as this field can give us the size
> of WAL with which we can answer "how much WAL data is generated by a
> particular statement?".  Julien, do you have any thoughts on this?

I also prefer "amount" as it feels more natural.  I'm not a native
english speaker though, so maybe I'm just biased.




[bug] Wrong bool value parameter

2020-04-07 Thread 曾文旌
Do we allow such a bool parameter value? This seems puzzling to me.

postgres=# create table t1(c1 int) with(autovacuum_enabled ='tr');
CREATE TABLE
postgres=# create table t2(c1 int) with(autovacuum_enabled ='fa');
CREATE TABLE
postgres=# \d+ t1
                                    Table "public.t1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 c1     | integer |           |          |         | plain   |              |
Access method: heap
Options: autovacuum_enabled=tr

postgres=# \d+ t2
                                    Table "public.t2"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 c1     | integer |           |          |         | plain   |              |
Access method: heap
Options: autovacuum_enabled=fa

I am trying to fix this in bug_boolrelopt.patch.

Wenjing
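
(For context on why 'tr' and 'fa' are accepted at all: parse_bool_with_len()
in src/backend/utils/adt/bool.c accepts any prefix of the accepted
spellings, roughly like this paraphrased sketch; the puzzle above is that
the raw, uncanonicalized string then gets stored in the reloptions:)

	switch (*value)
	{
		case 't':
		case 'T':
			/* any prefix of "true" matches: "t", "tr", "tru", "true" */
			if (pg_strncasecmp(value, "true", len) == 0)
			{
				*result = true;
				return true;
			}
			break;
		case 'f':
		case 'F':
			/* likewise any prefix of "false" */
			if (pg_strncasecmp(value, "false", len) == 0)
			{
				*result = false;
				return true;
			}
			break;
			/* ... similar cases for "yes", "no", "on", "off", "1", "0" ... */
	}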

bug_boolrelopt.patch
Description: Binary data


Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-07 Thread Masahiko Sawada
On Tue, 7 Apr 2020 at 17:42, Amit Kapila  wrote:
>
> On Tue, Apr 7, 2020 at 1:30 PM Masahiko Sawada
>  wrote:
> >
> > Buffer usage statistics seem correct. The small differences would be
> > catalog lookups Peter mentioned.
> >
>
> Agreed, but can you check which part of code does that lookup?  I want
> to see if we can avoid that from buffer usage stats or at least write
> a comment about it, otherwise, we might have to face this question
> again and again.

Okay, I'll check it.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Generate backup manifests for base backups, and validate them.

2020-04-07 Thread Peter Eisentraut

On 2020-04-03 21:07, Robert Haas wrote:

A new tool called pg_validatebackup can validate a backup against the
manifest.


In software engineering, "verify" and "validate" have standardized 
distinct meanings.  I'm not going to try to explain them here, but you 
can easily find them online.  I haven't formed an opinion on which one 
of them this tool is doing, but I notice that both the man page and the 
messages produced by the tool use the two terms seemingly 
interchangeably.  We should try to pick the correct term and use it 
consistently.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-07 Thread Amit Kapila
On Tue, Apr 7, 2020 at 7:56 AM Anna Akenteva  wrote:
>
> On 2020-04-07 00:58, Kartyshov Ivan wrote:
> > Ok, here is a new version of patch with single LSN and TIMEOUT.
>
> I had a look at the code and did some more code cleanup, with Ivan's
> permission.
> This is what I did:
> - Removed "WAIT FOR" command tag from cmdtaglist.h and renamed WaitStmt
> to WaitClause (since there's no standalone WAIT FOR command anymore)
> - Added _copyWaitClause() and _equalWaitClause()
> - Removed unused #include-s from utility.c
> - Adjusted tests and documentation
> - Fixed/added some code comments
>
> I have a couple of questions about WaitUtility() though:
> - When waiting forever (due to not specifying a timeout), isn't 60
> seconds too long of an interval to check for interrupts?
> - If we did specify a timeout, it might be a very long one. In this
> case, shouldn't we also make sure to wake up sometimes to check for
> interrupts?
>

Right, we should probably wait for at most 100ms before checking the
interrupts.  See the similar logic in pg_promote, where we wait for the
specified number of seconds.

> - Is it OK that specifying timeout = 0 (BEGIN WAIT FOR LSN ... TIMEOUT
> 0) is the same as not specifying timeout at all?
>

Yes, that sounds reasonable to me.

Review comments:
--
1.
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) normal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)

I don't see how this is implemented or called to handle any errors.
For example, in function WaitUtility, if the WaitLatch errors out due to
any error, then the event won't be deleted.  I think we can't assume
WaitLatch or any other code in this code path will never error out.
For example, WaitLatch->WaitEventSetWaitBlock() can error out.  Also, in
future we can add more code which can error out.

2.
+ /*
+ * If received an interruption from CHECK_FOR_INTERRUPTS,
+ * then delete the current event from array.
+ */
+ if (InterruptPending)
+ {
+ DeleteEvent();
+ ProcessInterrupts();
+ }

We generally do this type of handling via CHECK_FOR_INTERRUPTS.  One
reason is that it behaves slightly differently on Windows.  I am not
sure why we want to do it differently here; this looks quite ad hoc to
me and may not be correct.  If we handle this event in the error path,
then we might not need this special handling.

3.
+/*
+ * On WAIT use a latch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ * Returns 1 if LSN was reached and 0 otherwise.
+ */
+int
+WaitUtility(XLogRecPtr target_lsn, const float8 secs)

Isn't it better for the return value to be bool?  IOW, why does this
function need int as its return value?

4.
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)

This same define is used elsewhere in the code as well; maybe we can
define it in some central place and use it from there.
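
To illustrate the interrupt-handling concerns in points 1 and 2 together
with the 100ms polling suggested earlier: a conventional shape for such a
wait loop would be something like the following sketch, where
TargetLSNReached() is a hypothetical placeholder and PG_WAIT_EXTENSION
merely stands in for a proper wait event:

	for (;;)
	{
		int			rc;

		if (TargetLSNReached(target_lsn))	/* hypothetical helper */
			break;

		/* sleep at most 100ms so interrupts are noticed promptly */
		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   100L, PG_WAIT_EXTENSION);
		if (rc & WL_LATCH_SET)
			ResetLatch(MyLatch);

		CHECK_FOR_INTERRUPTS();
	}

Cleanup of the shared-memory entry would then belong in the error path
(e.g. a PG_ENSURE_ERROR_CLEANUP block) rather than in an ad hoc
InterruptPending check.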


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: WAL usage calculation patch

2020-04-07 Thread Peter Eisentraut

On 2020-04-07 04:12, Amit Kapila wrote:

On Mon, Apr 6, 2020 at 10:01 PM Justin Pryzby  wrote:


On Mon, Apr 06, 2020 at 05:01:30PM +0200, Peter Eisentraut wrote:

I noticed in some of the screenshots that were tweeted that for example in

 WAL:  records=1  bytes=56

there are two spaces between pieces of data.  This doesn't match the rest of
the EXPLAIN output.  Can that be adjusted?


We talked about that here:
https://www.postgresql.org/message-id/20200402054120.GC14618%40telsasoft.com



Yeah.  To recap briefly: the main reason was that one of the fields
(full page writes) already contained a single space, and we had prior
cases, as mentioned in Justin's email [1], where we use two spaces, which
led us to decide on two spaces in this case.


We also have existing cases for the other way:

actual time=0.050..0.052
Buffers: shared hit=3 dirtied=1

The cases mentioned by Justin are not formatted in a key=value format, 
so it's not quite the same, but it also raises the question why they are 
not.


Let's figure out a way to consolidate this without making up a third format.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Global temporary tables

2020-04-07 Thread Prabhat Sahu
>
> Thanks for the review.
> This parameter should support all the spellings of the bool type, like
> the parameter autovacuum_enabled.
> So I fixed this in global_temporary_table_v24-pg13.patch.
>

Thank you, Wenjing, for the new patch with the fix and the "VACUUM FULL
GTT" support.
I have verified the above issue; it is now resolved.

Please check the findings below on VACUUM FULL.

postgres=# create global temporary table  gtt(c1 int) on commit preserve
rows;
CREATE TABLE
postgres=# vacuum FULL ;
WARNING:  global temp table oldest FrozenXid is far in the past
HINT:  please truncate them or kill those sessions that use them.
VACUUM

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: EINTR while resizing dsm segment.

2020-04-07 Thread Nicola Contu
The only change we made on the disk is encryption at the OS level.
Not sure whether this could be related.

Il giorno mar 7 apr 2020 alle ore 10:58 Nicola Contu 
ha scritto:

> So that seems to be a bug, correct?
> Just to confirm, I am not using NFS, it is directly on disk.
>
> Other than that, is there a particular option we can set in
> postgresql.conf to mitigate the issue?
>
> Thanks a lot for your help.
>
>
> Il giorno sab 4 apr 2020 alle ore 02:49 Thomas Munro <
> thomas.mu...@gmail.com> ha scritto:
>
>> On Thu, Apr 2, 2020 at 9:25 PM Kyotaro Horiguchi
>>  wrote:
>> > I provided the subject, and added -hackers.
>> >
>> > > Hello,
>> > > I am running postgres 11.5 and we were having issues with shared
>> segments.
>> > > So I increased the max_connection as suggested by you guys and
>> reduced my
>> > > work_mem to 600M.
>> > >
>> > > Right now instead, it is the second time I see this error :
>> > >
>> > > ERROR:  could not resize shared memory segment
>> "/PostgreSQL.2137675995" to
>> > > 33624064 bytes: Interrupted system call
>> >
>> > The function posix_fallocate is protected against EINTR.
>> >
>> > | do
>> > | {
>> > |   rc = posix_fallocate(fd, 0, size);
>> > | } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
>> >
>> > But not for ftruncate and write. Don't we need to protect them from
>> > EINTR as in the attached?
>>
>> We don't handle EINTR for write() generally because that's not
>> supposed to be necessary on local files (local disks are not "slow
>> devices", and we document that if you're using something like NFS you
>> should use its "hard" mount option so that it behaves that way too).
>> As for ftruncate(), you'd think it'd be similar, and I can't think of
>> a more local filesystem than tmpfs (where POSIX shmem lives on Linux),
>> but I can't seem to figure that out from reading man pages; maybe I'm
>> reading the wrong ones.  Perhaps in low memory situations, an I/O wait
>> path reached by ftruncate() can return EINTR here rather than entering
>> D state (non-interruptable sleep) or restarting due to our SA_RESTART
>> flag... anyone know?
>>
>> Another thought: is there some way for the posix_fallocate() retry
>> loop to exit because (ProcDiePending || QueryCancelPending), but then
>> for CHECK_FOR_INTERRUPTS() to do nothing, so that we fall through to
>> reporting the EINTR?
>>
>


Re: EINTR while resizing dsm segment.

2020-04-07 Thread Thomas Munro
On Tue, Apr 7, 2020 at 8:58 PM Nicola Contu  wrote:
> So that seems to be a bug, correct?
> Just to confirm, I am not using NFS, it is directly on disk.
>
> Other than that, is there a particular option we can set in postgresql.conf
> to mitigate the issue?

Hi Nicola,

Yeah, I think it's a bug.  We're not sure exactly where yet.




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-07 Thread Amit Kapila
On Tue, Apr 7, 2020 at 5:56 AM Alexander Korotkov
 wrote:
>
> On Tue, Apr 7, 2020 at 12:58 AM Kartyshov Ivan
>  wrote:
> > On 2020-04-04 03:14, Alexander Korotkov wrote:
> > > I think that now we would be fine with single LSN and single TIMEOUT.
> > > In future we may add multiple LSNs/TIMEOUTs or/and support for
> > > expressions as LSNs/TIMEOUTs if we figure out it's necessary.
> > >
> > > I also think coupling waiting for an lsn with the beginning of a
> > > transaction is a good idea.  A separate WAIT FOR LSN statement called
> > > in the middle of a transaction looks problematic to me. Imagine we
> > > have RR isolation and have already acquired the snapshot.  Then our
> > > snapshot can block applying the wal records which we are waiting for.
> > > That would be an implicit deadlock.  It would be nice to avoid such
> > > deadlocks by design.
> > Ok, here is a new version of patch with single LSN and TIMEOUT.
>
> I think this is quite a small feature, which has already received a fair
> amount of review.  The last version is very trimmed down.  But I think it
> would be good to commit some very basic version, which is at least some
> progress in the area and could be extended in the future.  I'm going to
> pass through the code tomorrow and commit this unless I find major
> issues or somebody objects.
>

I have gone through this thread and skimmed through the patch, and I am
not sure we can say that this patch is ready to go.  First, I don't
think we have a consensus on the syntax being used in the patch
(various people didn't agree to LSN-specific syntax).  They wanted a
more generic syntax, and I see that we tried to implement it and it
turned out to be a bit complex.  But that doesn't mean we should just
give up on the idea and take the simplest approach, and that too
without broader agreement.  Second, from my quick review, it seems a
few things like error handling and interrupt checking need more work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Using the rr debugging tool to debug Postgres

2020-04-07 Thread Jehan-Guillaume de Rorthais
On Mon, 6 Apr 2020 10:38:31 -0700
Peter Geoghegan  wrote:

> rr is a tool that makes gdb much more useful by supporting recording
> and replaying of the program being debugged. I highly recommend trying
> rr if you're somebody that regularly uses gdb to debug Postgres. rr
> implements a gdbserver under the hood, so it's very easy to start
> using once you're already familiar with gdb.
> 
> I have written a Wiki page on how to use rr to record and replay
> Postgres executions:
> 
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Recording_Postgres_using_rr_Record_and_Replay_Framework

Thank you Peter!




Re: [Proposal] Global temporary tables

2020-04-07 Thread Erik Rijkers

On 2020-04-07 10:57, 曾文旌 wrote:


[global_temporary_table_v24-pg13.patch ]


Hi,

With gcc 9.3.0 (debian stretch), I see some low-key protests during the 
build:


index.c: In function ‘index_drop’:
index.c:2051:8: warning: variable ‘rel_persistence’ set but not used 
[-Wunused-but-set-variable]

 2051 |  char  rel_persistence;
  |^~~
storage_gtt.c: In function ‘gtt_force_enable_index’:
storage_gtt.c:1252:6: warning: unused variable ‘indexOid’ 
[-Wunused-variable]

 1252 |  Oid indexOid = RelationGetRelid(index);
  |  ^~~~
cluster.c: In function ‘copy_table_data’:
cluster.c:780:2: warning: this ‘if’ clause does not guard... 
[-Wmisleading-indentation]

  780 |  if (RELATION_IS_GLOBAL_TEMP(OldHeap));
  |  ^~
cluster.c:781:3: note: ...this statement, but the latter is misleadingly 
indented as if it were guarded by the ‘if’

  781 |   is_gtt = true;
  |   ^~
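
The last warning points at a real bug: the stray semicolon makes the if a
no-op, so is_gtt is set unconditionally.  Presumably the intent was:

	if (RELATION_IS_GLOBAL_TEMP(OldHeap))
		is_gtt = true;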


Erik




Re: proposal \gcsv

2020-04-07 Thread Isaac Morland
On Tue, 7 Apr 2020 at 03:30, Pavel Stehule  wrote:


> your patch supports syntax
>
> (option1=value option2=value)
>
> It looks little bit inconsistent and unusual
>
>>
It's the same as a connection string. Actually, maybe that's the key to
allowing parentheses, etc. in option values if needed - allow the same
single-quote quoting as in connection strings. Maybe even just call the
same code to do the parsing.


Re: WIP/PoC for parallel backup

2020-04-07 Thread Kashif Zeeshan
On Fri, Apr 3, 2020 at 3:01 PM Kashif Zeeshan <
kashif.zees...@enterprisedb.com> wrote:

> Hi Asif
>
> When a non-existent slot is used with a tablespace, the correct error is
> displayed, but the backup folder is not cleaned up, leaving a corrupt
> backup.
>
> Steps
> ===
>
> edb@localhost bin]$
> [edb@localhost bin]$ mkdir /home/edb/tbl1
> [edb@localhost bin]$ mkdir /home/edb/tbl_res
> [edb@localhost bin]$
> postgres=#  create tablespace tbl1 location '/home/edb/tbl1';
> CREATE TABLESPACE
> postgres=#
> postgres=# create table t1 (a int) tablespace tbl1;
> CREATE TABLE
> postgres=# insert into t1 values(100);
> INSERT 0 1
> postgres=# insert into t1 values(200);
> INSERT 0 1
> postgres=# insert into t1 values(300);
> INSERT 0 1
> postgres=#
>
>
> [edb@localhost bin]$
> [edb@localhost bin]$  ./pg_basebackup -v -j 2 -D
>  /home/edb/Desktop/backup/ -T /home/edb/tbl1=/home/edb/tbl_res -S test
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> pg_basebackup: write-ahead log start point: 0/2E28 on timeline 1
> pg_basebackup: starting background WAL receiver
> pg_basebackup: error: could not send replication command
> "START_REPLICATION": ERROR:  replication slot "test" does not exist
> pg_basebackup: backup worker (0) created
> pg_basebackup: backup worker (1) created
> pg_basebackup: write-ahead log end point: 0/2E000100
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: error: child thread exited with error 1
> [edb@localhost bin]$
>
> backup folder not cleaned
>
> [edb@localhost bin]$
> [edb@localhost bin]$
> [edb@localhost bin]$
> [edb@localhost bin]$ ls /home/edb/Desktop/backup
> backup_label  globalpg_dynshmem  pg_ident.conf  pg_multixact
>  pg_replslot  pg_snapshots  pg_stat_tmp  pg_tblspcPG_VERSION  pg_xact
> postgresql.conf
> base  pg_commit_ts  pg_hba.conf  pg_logical pg_notify
> pg_serialpg_stat   pg_subtrans  pg_twophase  pg_wal
>  postgresql.auto.conf
> [edb@localhost bin]$
>
>
>
>
> If the same case is executed without the parallel backup patch then the
> backup folder is cleaned after the error is displayed.
>
> [edb@localhost bin]$ ./pg_basebackup -v -D  /home/edb/Desktop/backup/ -T
> /home/edb/tbl1=/home/edb/tbl_res -S test999
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> pg_basebackup: write-ahead log start point: 0/2B28 on timeline 1
> pg_basebackup: starting background WAL receiver
> pg_basebackup: error: could not send replication command
> "START_REPLICATION": ERROR:  replication slot "test999" does not exist
> pg_basebackup: write-ahead log end point: 0/2B000100
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: error: child process exited with exit code 1
> *pg_basebackup: removing data directory " /home/edb/Desktop/backup"*
> pg_basebackup: changes to tablespace directories will not be undone
>


Hi Asif

A similar case is when the DB server is shut down while the parallel
backup is in progress: the correct error is displayed, but the backup
folder is not cleaned up, leaving a corrupt backup. I think one bug fix
will solve all these cases where cleanup is not done when a parallel
backup fails.

[edb@localhost bin]$
[edb@localhost bin]$
[edb@localhost bin]$  ./pg_basebackup -v -D  /home/edb/Desktop/backup/ -j 8
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/C128 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_57337"
pg_basebackup: backup worker (0) created
pg_basebackup: backup worker (1) created
pg_basebackup: backup worker (2) created
pg_basebackup: backup worker (3) created
pg_basebackup: backup worker (4) created
pg_basebackup: backup worker (5) created
pg_basebackup: backup worker (6) created
pg_basebackup: backup worker (7) created
pg_basebackup: error: could not read COPY data: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_basebackup: error: could not read COPY data: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[edb@localhost bin]$
[edb@localhost bin]$

When the same case is executed with pg_basebackup without the parallel
backup patch, proper cleanup is done.

[edb@localhost bin]$
[edb@localhost bin]$  ./pg_basebackup -v -D  /home/edb/Desktop/backup/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/C528 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_5590"
pg_b

Re: Don't try fetching future segment of a TLI.

2020-04-07 Thread David Steele



On 4/7/20 3:48 AM, Kyotaro Horiguchi wrote:

At Tue, 7 Apr 2020 12:15:00 +0900, Fujii Masao  
wrote in

This doesn't seem like a bug, so I'm thinking of merging this into the
next *major* version release, i.e., v13.

Not a bug, perhaps, but I think we do consider back-patching
performance problems. The rise in S3 usage has just exposed how poorly
this code performs in high-latency environments.


I understood the situation and am fine to back-patch that. But I'm not
sure
if it's fair to do that. Maybe we need to hear more opinions about
this?
OTOH, feature freeze for v13 is today, so what about committing the
patch
in v13 at first, and then doing the back-patch after hearing opinions
and
receiving many +1?


+1 for committing only to v13 today, then back-patching if people want
and/or accept it.


Definitely +1 for a commit today to v13. I certainly was not trying to 
hold that up.


--
-David
da...@pgmasters.net




Re: [Proposal] Global temporary tables

2020-04-07 Thread tushar

On 4/7/20 2:27 PM, 曾文旌 wrote:
VACUUM FULL on a GTT and CLUSTER on a GTT are already
supported in global_temporary_table_v24-pg13.patch.


Here, it is skipping the GTT:

postgres=# \c foo
You are now connected to database "foo" as user "tushar".
foo=# create global temporary table  g123( c1 int) ;
CREATE TABLE
foo=# \q
[tushar@localhost bin]$ ./vacuumdb --full  foo
vacuumdb: vacuuming database "foo"
WARNING:  skipping vacuum global temp table "g123" because storage is 
not initialized for current session


--
regards, tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: SLRU statistics

2020-04-07 Thread Kuntal Ghosh
Hello Tomas,

On Thu, Apr 2, 2020 at 5:59 PM Tomas Vondra
 wrote:
> Thank you for the patch, pushed.
>
In SimpleLruReadPage_ReadOnly, we first try to find the SLRU page in the
shared buffers under a shared lock, and then conditionally visit
SimpleLruReadPage if reading is necessary. IMHO, we should update
hit_count when we find the buffer directly in
SimpleLruReadPage_ReadOnly. Am I missing something?
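
In other words, the found-under-shared-lock path would bump the hit
counter, roughly like this (a sketch of the idea; the exact counter call
is whatever the SLRU stats patch introduced):

	/* in SimpleLruReadPage_ReadOnly(), inside the shared-lock search loop */
	if (shared->page_number[slotno] == pageno &&
		shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
		shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
	{
		SlruRecentlyUsed(shared, slotno);
		pgstat_count_slru_page_hit(...);	/* count the hit here */
		return slotno;
	}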

Attached a patch for the same.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Update-stats-in-SimpleLruReadPage_ReadOnly.patch
Description: Binary data


Re: Use compiler intrinsics for bit ops in hash

2020-04-07 Thread David Rowley
On Thu, 12 Mar 2020 at 22:59, John Naylor  wrote:
>
> On Thu, Mar 12, 2020 at 7:42 AM David Rowley  wrote:
> >
> > I don't think Jesse's proposed solution is that great due to the
> > additional function call overhead for pg_count_leading_zeros_32(). The
> > (num & (num - 1)) == 0 I imagine will perform better, but I didn't
> > test it.
>
> Right, I believe we've all landed on the same page about that. I see
> two ways of doing next_power_of_2_32 without an indirect function
> call, and leaving pg_leftmost_one_pos32 the same as it is now. I
> haven't measured either yet (or tested for that matter):

I've attached an updated patch.  It includes the modifications
mentioned above to pre-check for a power-of-2 number with the
bit-masking hack. I also renamed the functions to be more closely
aligned with the other functions in pg_bitutils.h.  I'm not convinced
pg_ceil_log2_* needs the word "ceil" in there.

I dropped the part of the patch that was changing longs to ints of a
known size. I went on and did some additional conversion in the 0003
patch. There are more lying around the code base, but I ended up
finding a bit more to fix up than I had thought I would, e.g. various
places that repalloc() inside a loop that multiplies the
allocation size by 2 each time.  The repalloc should be done at the
end, not during the loop (see the sketch below).  I thought I might
come back to those some time in the future.
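
To make that concrete, here is a sketch of the anti-pattern and of the fix
the new helpers enable (variable names hypothetical, not from the patch):

	/* anti-pattern: repalloc on every doubling, inside the loop */
	while (nitems > capacity)
	{
		capacity *= 2;
		items = repalloc(items, capacity * sizeof(items[0]));
	}

	/* better: compute the final capacity once, then repalloc once */
	if (nitems > capacity)
	{
		capacity = pg_nextpower2_32(nitems);
		items = repalloc(items, capacity * sizeof(items[0]));
	}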

Is anyone able to have a look at this?

David
From b69f994a8f284b7931f5383a8de323a2986d86e5 Mon Sep 17 00:00:00 2001
From: "dgrow...@gmail.com" 
Date: Tue, 7 Apr 2020 11:02:47 +1200
Subject: [PATCH v7 1/3] Add functions to calculate the next power of 2

There are many areas in the code where we need to determine the next
highest power of 2 of a given number.  We tend to always do that in an
ad-hoc way each time, generally with some tight for loop which performs a
bitshift left once per loop and goes until it finds a number above the
given number.

Here we add two generic functions which make use of the existing
pg_leftmost_one_pos* functions which, when available will allow us to
calculate the next power of 2 without any looping.

Here we don't add any code which uses these new functions. That will be
done in followup commits.
---
 src/include/port/pg_bitutils.h | 72 ++
 1 file changed, 72 insertions(+)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 498e532308..4ca92f076d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -129,6 +129,78 @@ pg_rightmost_one_pos64(uint64 word)
 #endif			/* HAVE__BUILTIN_CTZ */
 }
 
+/*
+ * pg_nextpower2_32
+ *		Returns the next highest power of 2 of 'num', or 'num', if it's
+ *		already a power of 2.
+ *
+ * 'num' mustn't be 0 or be above PG_UINT32_MAX / 2 + 1.
+ */
+static inline uint32
+pg_nextpower2_32(uint32 num)
+{
+	Assert(num > 0 && num <= PG_UINT32_MAX / 2 + 1);
+
+	/*
+	 * A power 2 number has only 1 bit set.  Subtracting 1 from such a number
+	 * will turn on all previous bits resulting in no common bits being set
+	 * between num and num-1.
+	 */
+	if ((num & (num - 1)) == 0)
+		return num;/* already power 2 */
+
+	return ((uint32) 1) << (pg_leftmost_one_pos32(num) + 1);
+}
+
+/*
+ * pg_nextpower2_64
+ *		Returns the next highest power of 2 of 'num', or 'num', if it's
+ *		already a power of 2.
+ *
+ * 'num' mustn't be 0 or be above PG_UINT64_MAX / 2  + 1.
+ */
+static inline uint64
+pg_nextpower2_64(uint64 num)
+{
+	Assert(num > 0 && num <= PG_UINT64_MAX / 2 + 1);
+
+	/*
+	 * A power-of-2 number has only 1 bit set.  Subtracting 1 from such a
+	 * number turns on all previous bits, resulting in no common bits being
+	 * set between num and num-1.
+	 */
+	if ((num & (num - 1)) == 0)
+		return num;				/* already power of 2 */
+
+	return ((uint64) 1) << (pg_leftmost_one_pos64(num) + 1);
+}
+
+/*
+ * pg_ceil_log2_32
+ *		Returns equivalent of ceil(log2(num))
+ */
+static inline uint32
+pg_ceil_log2_32(uint32 num)
+{
+	if (num < 2)
+		return 0;
+	else
+		return pg_leftmost_one_pos32(num - 1) + 1;
+}
+
+/*
+ * pg_ceil_log2_64
+ *		Returns equivalent of ceil(log2(num))
+ */
+static inline uint64
+pg_ceil_log2_64(uint64 num)
+{
+	if (num < 2)
+		return 0;
+	else
+		return pg_leftmost_one_pos64(num - 1) + 1;
+}
+
 /* Count the number of one-bits in a uint32 or uint64 */
 extern int	(*pg_popcount32) (uint32 word);
 extern int	(*pg_popcount64) (uint64 word);
-- 
2.25.1
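
(A usage sketch, not part of the patch, showing the kind of ad-hoc loop
these helpers replace; "needed" is an illustrative variable:)

	uint32		needed = 1000;	/* hypothetical required size */
	uint32		size;
	uint32		nbits;

	/* Before: the typical ad-hoc loop to round an allocation size up */
	size = 8;
	while (size < needed)
		size *= 2;

	/* After: no looping, one intrinsic-backed call */
	size = pg_nextpower2_32(Max(8, needed));	/* 1000 -> 1024 */

	/* And the ceil(log2()) variant, e.g. for sizing a hash table */
	nbits = pg_ceil_log2_32(needed);	/* 1000 -> 10, 1024 -> 10, 1025 -> 11 */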

From 63cc6c0d302bf4c0f6136ae237ce1b7acfa47db6 Mon Sep 17 00:00:00 2001
From: "dgrow...@gmail.com" 
Date: Tue, 7 Apr 2020 22:50:27 +1200
Subject: [PATCH v7 3/3] Modify additional power 2 calculations to use new
 helper functions

Second pass of modifying various places which obtain the next power
of 2 of a number, making them use the new functions added in
pg_bitutils instead.
---
 src/backend/access/gin/ginfast.c| 12 +++-
 src/backend/executor/nodeHash.c |  8 ++--
 src/backend/nodes/list.c| 1

Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-07 Thread Masahiko Sawada
On Tue, 7 Apr 2020 at 18:29, Masahiko Sawada
 wrote:
>
> On Tue, 7 Apr 2020 at 17:42, Amit Kapila  wrote:
> >
> > On Tue, Apr 7, 2020 at 1:30 PM Masahiko Sawada
> >  wrote:
> > >
> > > Buffer usage statistics seem correct. The small differences would be
> > > catalog lookups Peter mentioned.
> > >
> >
> > Agreed, but can you check which part of code does that lookup?  I want
> > to see if we can avoid that from buffer usage stats or at least write
> > a comment about it, otherwise, we might have to face this question
> > again and again.
>
> Okay, I'll check it.
>

I've checked the buffer usage differences during parallel btree index
creation.

TL;DR:

During tuple sorting, individual parallel workers read blocks of
pg_amproc and pg_amproc_fam_proc_index to get the sort support
function.  The call flow is like:

ParallelWorkerMain()
  _bt_parallel_scan_and_sort()
tuplesort_begin_index_btree()
  PrepareSortSupportFromIndexRel()
FinishSortSupportFunction()
  get_opfamily_proc()

The details are as follows.

I populated the test table with the following script:

create table test (c int) with (autovacuum_enabled = off, parallel_workers = 8);
insert into test select generate_series(1,1000);

and create index DDL is:

create index test_idx on test (c);

Before executing the test script, I put code at the following 4 places
which checks the buffer usage at that point, and calculated the
differences between points (a), (b) and (c).  For example, (b) shows
the number of blocks read or hit while scanning the heap and building
the index.

1. Before executing CREATE INDEX command (at pgss_ProcessUtility())
(a)
2. Before parallel create index (at _bt_begin_parallel())
(b)
3. After parallel create index, after accumulating workers' stats (at
_bt_end_parallel())
(c)
4. After executing CREATE INDEX command (at pgss_ProcessUtility())

And here are the results:

2 workers:
(a) hit: 107, read: 26
(b) hit: 12(=6+3+3), read: 44248(=15538+14453+14527)
(c) hit: 13, read: 2
total hit: 132, read:44276

4 workers:
(a) hit: 107, read: 26
(b) hit: 18(=6+3+3+3+3), read: 44248(=9368+8582+8544+9250+8504)
(c) hit: 13, read: 2
total hit: 138, read:44276

The table 'test' has 44276 blocks.

From the above results, the total number of blocks read (44248 blocks)
during parallel index creation is stable and equal to the number of
blocks in the test table.  And we can see that three extra blocks are
read per worker.  These three blocks are two for
pg_amproc_fam_proc_index and one for pg_amproc.  That is, individual
parallel workers access these relations to get the sort support
function.  The full backtrace is:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff779c561a libsystem_kernel.dylib`__select + 10
frame #1: 0x00010cc9f90d postgres`pg_usleep(microsec=2000)
at pgsleep.c:56:10
frame #2: 0x00010ca5a668
postgres`ReadBuffer_common(smgr=0x7fe872848f70,
relpersistence='p', forkNum=MAIN_FORKNUM, blockNum=3, mode=RBM_NORMAL,
strategy=0x, hit=0x7ffee363071b) at bufmgr.c:685:3
frame #3: 0x00010ca5a4b6
postgres`ReadBufferExtended(reln=0x00010d58f790,
forkNum=MAIN_FORKNUM, blockNum=3, mode=RBM_NORMAL,
strategy=0x) at bufmgr.c:628:8
frame #4: 0x00010ca5a397
postgres`ReadBuffer(reln=0x00010d58f790, blockNum=3) at
bufmgr.c:560:9
frame #5: 0x00010c67187e
postgres`_bt_getbuf(rel=0x00010d58f790, blkno=3, access=1) at
nbtpage.c:792:9
frame #6: 0x00010c670507
postgres`_bt_getroot(rel=0x00010d58f790, access=1) at
nbtpage.c:294:13
frame #7: 0x00010c679393
postgres`_bt_search(rel=0x00010d58f790, key=0x7ffee36312d0,
bufP=0x7ffee3631bec, access=1, snapshot=0x7fe8728388e0) at
nbtsearch.c:107:10
frame #8: 0x00010c67b489
postgres`_bt_first(scan=0x7fe86f814998, dir=ForwardScanDirection)
at nbtsearch.c:1355:10
frame #9: 0x00010c676869
postgres`btgettuple(scan=0x7fe86f814998, dir=ForwardScanDirection)
at nbtree.c:253:10
frame #10: 0x00010c6656ad
postgres`index_getnext_tid(scan=0x7fe86f814998,
direction=ForwardScanDirection) at indexam.c:530:10
frame #11: 0x00010c66585b
postgres`index_getnext_slot(scan=0x7fe86f814998,
direction=ForwardScanDirection, slot=0x7fe86f814880) at
indexam.c:622:10
frame #12: 0x00010c663eac
postgres`systable_getnext(sysscan=0x7fe86f814828) at genam.c:454:7
frame #13: 0x00010cc0be41
postgres`SearchCatCacheMiss(cache=0x7fe872818e80, nkeys=4,
hashValue=3052139574, hashIndex=6, v1=1976, v2=23, v3=23, v4=2) at
catcache.c:1368:9
frame #14: 0x00010cc0bced
postgres`SearchCatCacheInternal(cache=0x7fe872818e80, nkeys=4,
v1=1976, v2=23, v3=23, v4=2) at catcache.c:1299:9
frame #15: 0x00010cc0baa8
postgres`SearchCatCache4(cache=0x7fe872818e80, v1=1976, v2=23,
v3=23, v4=2) at catcache.c:1191:9
frame #16: 0x00010cc27c82 postgres`SearchSysCache4(

Re: Run-time pruning for ModifyTable

2020-04-07 Thread David Rowley
On Wed, 25 Mar 2020 at 15:37, Amit Langote  wrote:
> Actually, I was saying in that email that the update/delete planning
> overhaul being talked about will make the entirety of the
> functionality this patch is adding (the ModifyTable node being able
> to prune its subplans based on run-time parameter values) redundant.
> That's because, with the overhaul, there won't be multiple subplans
> under ModifyTable, only one, which would take care of any pruning
> that's necessary.

Thanks for explaining.  I've not read over any patch for that yet, so
wasn't aware of exactly what was planned.

With your explanation, I imagine some sort of Append / MergeAppend
that runs the query as if it were a SELECT, but with each
Append/MergeAppend subnode tagged somehow with an index indicating
which ModifyTable subnode it belongs to.  Basically, just one complete
plan, rather than a plan per ModifyTable subnode.

> What I did say in favor of this patch though is that it does not seem
> that invasive, so maybe okay to get in for v13.

Since it seems there's much less code that will be useful after the
rewrite than I thought, combined with the fact that I'm not entirely
sure of the best way to reuse the partitioned table's RelOptInfo from
the SELECT's PlannerInfo, I'm going to return this one with feedback.

David




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-07 Thread Anna Akenteva

On 2020-04-07 13:32, Amit Kapila wrote:
> First, I don't think we have a consensus on the syntax being used in
> the patch (various people didn't agree to LSN specific syntax).  They
> wanted a more generic syntax and I see that we tried to implement it
> and it turns out to be a bit complex but that doesn't mean we just
> give up on the idea and take the simplest approach and that too
> without a broader agreement.


Thank you for your comments!

Initially, the syntax used to be "WAITLSN", which confined us to only
waiting for LSNs and not anything else.  So we switched to "WAIT FOR
LSN", which would allow us to add variations like "WAIT FOR XID" or
"WAIT FOR COMMIT TOKEN" in the future if we wanted.  A few people
seemed to imply that this kind of syntax is expandable enough:


On 2018-02-01 14:47, Simon Riggs wrote:
> I agree that WAIT LSN is good syntax because this allows us to wait
> for something else in future.

On 2017-10-31 12:42:56, Ants Aasma wrote:
> For lack of a better proposal I would like something along the lines
> of:
>
> WAIT FOR state_id[, state_id] [ OPTIONS (..)]


As for giving up on waiting for multiple events: we can only wait for
LSNs at the moment, and there seems to be no point in waiting for
multiple LSNs at once, because it's equivalent to waiting for the
biggest LSN.  So we opted for a simpler grammar for now, only letting
the user specify one LSN and one TIMEOUT.  If in the future we allow
waiting for something else, like XIDs, we can expand the grammar as
needed.


What are your own thoughts on the syntax?

--
Anna Akenteva
Postgres Professional:
The Russian Postgres Company
http://www.postgrespro.com




Re: Make MemoryContextMemAllocated() more precise

2020-04-07 Thread David Rowley
On Tue, 7 Apr 2020 at 14:26, Jeff Davis  wrote:
>
> On Mon, 2020-04-06 at 23:39 +1200, David Rowley wrote:
> > 4. Do you think it would be better to have two separate functions for
> > MemoryContextCount(), a recursive version and a non-recursive
> > version.
>
> I could, but right now the only caller passes recurse=true, so I'm not
> really eliminating any code in that path by specializing the functions.
> Are you thinking about performance, or do you just think it would be
> better to have two entry points?

I was thinking partly of performance, by eliminating a branch, but
mostly it was down to ease of code reading.

I thought it was easier to read:
MemoryContextGetCountersRecurse(&counters); than
MemoryContextGetCounters(&counters, true); since I might need to go
see what "true" is for.

The non-recursive version, if we decide we need one, would likely just
be a one-line body function calling the implementation's getcounter
method.

David




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-07 Thread Alexander Korotkov
On Tue, Apr 7, 2020 at 3:07 PM Anna Akenteva  wrote:
> On 2017-10-31 12:42:56, Ants Aasma wrote:
> > For lack of a better proposal I would like something along the lines
> > of:
> > WAIT FOR state_id[, state_id] [ OPTIONS (..)]
>
> As for giving up on waiting for multiple events: we can only wait for
> LSNs at the moment, and there seems to be no point in waiting for
> multiple LSNs at once, because it's equivalent to waiting for the
> biggest LSN.  So we opted for a simpler grammar for now, only letting
> the user specify one LSN and one TIMEOUT.  If in the future we allow
> waiting for something else, like XIDs, we can expand the grammar as
> needed.

+1
In the latest version of the patch we have a very brief and simple
syntax allowing us to wait for a given LSN with a given timeout.  In
the future we can expand this syntax in different ways.  I don't see
that the current syntax limits us in any way.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Use compiler intrinsics for bit ops in hash

2020-04-07 Thread John Naylor
On Tue, Apr 7, 2020 at 7:41 PM David Rowley  wrote:

> I've attached an updated patch.  It includes the modifications
> mentioned above to pre-check for a power-of-2 number with the
> bit-masking hack.  I also renamed the functions to be more aligned
> with the other functions in pg_bitutils.h.  I'm not convinced
> pg_ceil_log2_* needs the word "ceil" in there.
>
> I dropped the part of the patch that was changing longs to ints of a
> known size, and went on to do some additional conversions in the 0003
> patch.  There are more lying around the code base, but I ended up
> finding a bit more to fix up than I had thought I would, e.g. various
> places that repalloc() inside a loop that multiplies the allocation
> size by 2 each time.  The repalloc should be done at the end, not
> during the loop.  I thought I might come back to those some time in
> the future.
>
> Is anyone able to have a look at this?
>
> David

Hi David,

Overall looks good to me. Just a couple of things I see:

It seems _hash_log2 is still in the tree, but has no callers?

- max_size = 8; /* semi-arbitrary small power of 2 */
- while (max_size < min_size + LIST_HEADER_OVERHEAD)
- max_size *= 2;
+ max_size = pg_nextpower2_32(Max(8, min_size + LIST_HEADER_OVERHEAD));

Minor nit: We might want to keep the comment that the number is
"semi-arbitrary" here as well.

- 'pg_waldump', 'scripts');
+ 'pg_validatebackup', 'pg_waldump', 'scripts');

This seems like a separate concern?

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
On 2020-04-07 05:15:03 -0700, Andres Freund wrote:
> Attached is a substantially polished version of my patches. Note that
> the first three patches, as well as the last, are not intended to be
> committed at this time / in this form - they're there to make testing
> easier.

I didn't actually attach that last not-to-be-committed patch... It's
just the pgbench patch that I had attached before (and started a thread
about).  Here it is again.
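
(For reference, the core pattern the patch below introduces: a pthread
barrier lines all threads up before the common start time is taken.  A
minimal standalone sketch, not taken from the patch:)

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t start_barrier;

static void *
worker(void *arg)
{
	(void) arg;

	/* ... per-thread connection setup would happen here ... */

	/* block until all NTHREADS workers have finished setting up */
	pthread_barrier_wait(&start_barrier);

	/* ... benchmark loop runs here, with a start time common to all ... */
	return NULL;
}

int
main(void)
{
	pthread_t	threads[NTHREADS];

	pthread_barrier_init(&start_barrier, NULL, NTHREADS);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	pthread_barrier_destroy(&start_barrier);
	return 0;
}
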
From 59a9a03da728d53364f9c3d6fe8b48e21697b93e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 12/12] WIP: pgbench

---
 src/bin/pgbench/pgbench.c | 107 +-
 1 file changed, 83 insertions(+), 24 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e99af801675..21d1ab2aac1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -310,6 +310,10 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+#ifdef ENABLE_THREAD_SAFETY
+pthread_barrier_t conn_barrier;
+#endif
+
 /*
  * Connection state machine states.
  */
@@ -5206,6 +5210,10 @@ printResults(StatsData *total, instr_time total_time,
 	tps_exclude = ntx /
 		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
 
+	//fprintf(stderr, "time: include: %f, exclude: %f, conn total: %f\n",
+	//		time_include, time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients),
+	//		INSTR_TIME_GET_DOUBLE(conn_total_time));
+
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -6126,26 +6134,14 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
-
-	/* set alarm if duration is specified. */
-	if (duration > 0)
-		setalarm(duration);
-
 	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
+	pthread_barrier_init(&conn_barrier, NULL, nthreads);
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
-
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-(int64) 100 * duration;
-
 		/* the first thread (i = 0) is executed by main thread */
 		if (i > 0)
 		{
@@ -6162,13 +6158,38 @@ main(int argc, char **argv)
 			thread->thread = INVALID_THREAD;
 		}
 	}
-#else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
-	/* compute when to stop */
+#endif			/* ENABLE_THREAD_SAFETY */
+
+#ifdef ENABLE_THREAD_SAFETY
+	/* wait till all threads started (threads wait in threadRun()) */
+	//fprintf(stderr, "andres: waiting for thread start: %u\n", threads[0].tid);
+	pthread_barrier_wait(&conn_barrier);
+#endif			/* ENABLE_THREAD_SAFETY */
+
+	/* get start up time */
+	INSTR_TIME_SET_CURRENT(start_time);
+
+	/* */
+	for (i = 0; i < nthreads; i++)
+	{
+		TState	   *thread = &threads[i];
+
+		thread->start_time = start_time;
+
+		/* compute when to stop */
+		if (duration > 0)
+			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
+(int64) 100 * duration;
+	}
+
+	/* set alarm if duration is specified. */
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 100 * duration;
-	threads[0].thread = INVALID_THREAD;
+		setalarm(duration);
+
+#ifdef ENABLE_THREAD_SAFETY
+	/* updated start time (threads wait in threadRun()) */
+	//fprintf(stderr, "andres: %u: waiting for start time\n", threads[0].tid);
+	pthread_barrier_wait(&conn_barrier);
 #endif			/* ENABLE_THREAD_SAFETY */
 
 	/* wait for threads and accumulate results */
@@ -6236,12 +6257,30 @@ threadRun(void *arg)
 	int			i;
 
 	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 100;
+	int64		thread_start;
+	int64		last_report;
+	int64		next_report;
 	StatsData	last,
 aggs;
 
+	/* wait till all threads started (main waits outside) */
+	if (thread->tid != 0)
+	{
+		//fprintf(stderr, "andres: %u: waiting for thread start\n", thread->tid);
+		pthread_barrier_wait(&conn_barrier);
+	}
+
+	/* wait for start time to be initialized (main waits outside) */
+	if (thread->tid != 0)
+	{
+		//fprintf(stderr, "andres: %u: waiting for start time\n", thread->tid);
+		pthread_barrier_wait(&conn_barrier);
+	}
+
+	thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
+	last_report = thread_start;
+	next_report = last_report + (int64) progress * 100;
+
 	/*
 	 * Initialize throttling rate target for all of the thread's clients.  It
 	 * might be a little more accurate to reset thread->start_time here too.
@@ -6288,7 +6327,27 @@ threadRun(void *arg)
 
 	/* time after thread and connections set up */
 	INSTR_TIME_SET

Re: Use compiler intrinsics for bit ops in hash

2020-04-07 Thread David Rowley
Hi John,

Thanks for having a look at this.

On Wed, 8 Apr 2020 at 00:16, John Naylor  wrote:
> Overall looks good to me. Just a couple things I see:
>
> It seems _hash_log2 is still in the tree, but has no callers?

Yeah, I left it in there since it was an external function.  Perhaps
we could rip it out and write something in the commit message that it
should be replaced with the newer functions.  Thinking of extension
authors here.

> - max_size = 8; /* semi-arbitrary small power of 2 */
> - while (max_size < min_size + LIST_HEADER_OVERHEAD)
> - max_size *= 2;
> + max_size = pg_nextpower2_32(Max(8, min_size + LIST_HEADER_OVERHEAD));
>
> Minor nit: We might want to keep the comment that the number is
> "semi-arbitrary" here as well.

I had dropped that as the 8 part was mentioned in the comment above:
"The minimum allocation is 8 ListCell units". I can put it back, I had
just thought it was overkill.

> - 'pg_waldump', 'scripts');
> + 'pg_validatebackup', 'pg_waldump', 'scripts');
>
> This seems like a separate concern?

That's required due to the #include "lib/simplehash.h" in
pg_validatebackup.c. I have to say, I didn't really take the time to
understand all the Perl code there, but without that change, I was
getting a link error when testing on Windows, and after I added
pg_validatebackup to that array, it worked.

David




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-07 Thread James Coleman
On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby  wrote:
>
> On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> > > I've pushed the first part of this patch series - I've reorganized it a
>
> I scanned through this again post-commit.  Find attached some suggestions.
>
> > Shouldn't non-text explain output always show both disk *and* mem,
> > including zeros?

Could you give more context on this? Is there a standard to follow?
Regular sort nodes only ever report one type, so there's not a good
parallel there.

> Should "Pre-sorted Groups:" be on a separate line ?
> | Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB 
> Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB

I'd originally had that, but Tomas wanted it to be more compact. It's
easy to adjust though if the consensus changes on that.

> And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
> Groups"?  I think you should maybe do that instead of the "semicolon
> separator".  I think "two spaces" makes sense, since the units are different,
> similar to hash buckets and normal sort node.
>
>  "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
>  appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
>
> Note, I made a similar comment regarding two spaces for explain(WAL) here:
> https://www.postgresql.org/message-id/20200402054120.GC14618%40telsasoft.com
>
> And Peter E seemed to dislike that, here:
> https://www.postgresql.org/message-id/ef8c966f-e50a-c583-7b1e-85de6f4ca0d3%402ndquadrant.com

I read through that subthread, and the ending seemed to be Peter
wanting things to be unified. Was there a conclusion beyond that?

> Also, you're showing:
> ExplainPropertyInteger("Maximum Sort Space Used", "kB",
> groupInfo->maxMemorySpaceUsed, es);
>
> But in show_hash_info() and show_hashagg_info(), and in your own text output,
> that's called "Peak":
> ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
> ExplainPropertyInteger("Peak Memory Usage", "kB",
> spacePeakKb, es);

Yes, that's a miss and should be fixed.

James




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-04-07 Thread Alexey Kondratov

On 2020-04-06 21:44, Justin Pryzby wrote:
> On Mon, Apr 06, 2020 at 08:43:46PM +0300, Alexey Kondratov wrote:
>> +/* XXX: reusing reindex_option_list */
>> +			| CLUSTER opt_verbose '(' reindex_option_list ')' qualified_name
>> cluster_index_specification
>>
>> Could we actually simply reuse vac_analyze_option_list?  At first
>> sight it does just the right thing, excepting the special handling of
>> the spelling ANALYZE/ANALYSE, but that does not seem to be a problem.
>
> Hm, do you mean to let cluster.c reject the other options like
> "analyze"?  I'm not sure why that would be better than reusing reindex?
> I think the suggestion will probably be to just copy+paste the reindex
> option list and rename it to cluster (possibly with the explanation
> that they're separate and independent and so their behavior shouldn't
> be tied together).

I mean to literally use vac_analyze_option_list for reindex and cluster
as well.  Please check the attached 0007.  Now vacuum, reindex and
cluster filter the options list and reject everything that is not
supported, so it seems completely fine to just reuse
vac_analyze_option_list, doesn't it?




>> ReindexRelationConcurrently is used for all cases, but it hits
>> different code paths in the case of database, table and index.  I
>> have not checked yet, but are you sure it is safe removing these
>> validations in the case of REINDEX CONCURRENTLY?
>
> You're right about the pg_global case, fixed.  System catalogs can't
> be reindexed CONCURRENTLY, so they're already caught by that check.
>
>>> XXX: for cluster/vacuum, it might be more friendly to check before
>>> clustering the table, rather than after clustering and re-indexing.
>>
>> Yes, I think it would be much more user-friendly.
>
> I realized it's not needed or useful to check indexes in advance of
> clustering, since 1) a mapped index will be on a mapped relation,
> which is already checked; 2) a system index will be on a system
> relation.  Right?

Yes, it seems that you are right.  I have tried to create a user index
on a system relation with allow_system_table_mods=1, but this new index
appeared to become a system index as well.  That way, we do not have to
check indexes in advance.



--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From afa132674d2b49328a5e95ecd2ff30d838f7ec6c Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 7 Apr 2020 13:42:42 +0300
Subject: [PATCH v18 7/7] Reuse vac_analyze_option_list for cluster and reindex

---
 src/backend/parser/gram.y | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3665ee8700..7821f07c40 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -512,10 +512,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	explain_option_list
 
 %type 	reindex_target_type reindex_target_multitable
-%type 		reindex_option_name
-%type 	reindex_option_arg
-%type 	reindex_option_list
-%type 	reindex_option_elem
 
 %type 	copy_generic_opt_arg copy_generic_opt_arg_list_item
 %type 	copy_generic_opt_elem
@@ -8428,7 +8424,7 @@ ReindexStmt:
 	n->params = NIL;
 	$$ = (Node *)n;
 }
-			| REINDEX '(' reindex_option_list ')' reindex_target_type opt_concurrently qualified_name
+			| REINDEX '(' vac_analyze_option_list ')' reindex_target_type opt_concurrently qualified_name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 	n->kind = $5;
@@ -8438,7 +8434,7 @@ ReindexStmt:
 	n->params = $3;
 	$$ = (Node *)n;
 }
-			| REINDEX '(' reindex_option_list ')' reindex_target_multitable opt_concurrently name
+			| REINDEX '(' vac_analyze_option_list ')' reindex_target_multitable opt_concurrently name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 	n->kind = $5;
@@ -8458,33 +8454,6 @@ reindex_target_multitable:
 			| SYSTEM_P{ $$ = REINDEX_OBJECT_SYSTEM; }
 			| DATABASE{ $$ = REINDEX_OBJECT_DATABASE; }
 		;
-reindex_option_list:
-			reindex_option_elem
-{
-	$$ = list_make1($1);
-}
-			| reindex_option_list ',' reindex_option_elem
-{
-	$$ = lappend($1, $3);
-}
-		;
-
-reindex_option_elem:
-			reindex_option_name reindex_option_arg
-{
-	$$ = makeDefElem($1, $2, @1);
-}
-		;
-
-reindex_option_name:
-			NonReservedWord			{ $$ = $1; }
-		;
-
-reindex_option_arg:
-			opt_boolean_or_string	{ $$ = (Node *) makeString($1); }
-			| NumericOnly			{ $$ = (Node *) $1; }
-			| /* EMPTY */		 	{ $$ = NULL; }
-		;
 
 /*
  *
@@ -10643,8 +10612,7 @@ ClusterStmt:
 	n->params = NIL;
 	$$ = (Node*)n;
 }
-/* XXX: reusing reindex_option_list */
-			| CLUSTER opt_verbose '(' reindex_option_list ')' qualified_name cluster_index_specification
+			| CLUSTER opt_verbose '(' vac_analyze_option_list ')' qualified_name cluster_index_specification
 {
 	C

Re: backup manifests and contemporaneous buildfarm failures

2020-04-07 Thread Andrew Dunstan
On Mon, Apr 6, 2020 at 1:18 AM Fabien COELHO  wrote:
>
>
> Hello,
>
> >> Do I need to precede those with some recursive chmod commands? Perhaps
> >> the client should refuse to run if there is still something left after
> >> these.
> >
> > I think the latter would be a very good idea, just so that this sort of
> > failure is less obscure.  Not sure about whether a recursive chmod is
> > really going to be worth the cycles.  (On the other hand, the normal
> > case should be that there's nothing there anyway, so maybe it's not
> > going to be costly.)
>
> Could it be a two-stage process to minimize cost but still be resilient?
>
>    rmtree($DIR);
>    if (-d $DIR) {
>        warn "could not remove $DIR";       # emit warning
>        chmodtree($DIR);                    # recursive chmod
>        rmtree($DIR);                       # rmtree again
>        die "could not remove $DIR" if (-d $DIR);   # emit error
>    }
>


I thought about doing that. However, it's not really necessary. In the
normal course of events these directories should have been removed at
the end of the previous run, so we're only dealing with exceptional
cases here.

cheers

andrew



-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Use compiler intrinsics for bit ops in hash

2020-04-07 Thread John Naylor
On Tue, Apr 7, 2020 at 8:26 PM David Rowley  wrote:
>
> Hi John,
>
> Thanks for having a look at this.
>
> On Wed, 8 Apr 2020 at 00:16, John Naylor  wrote:
> > Overall looks good to me. Just a couple things I see:
> >
> > It seems _hash_log2 is still in the tree, but has no callers?
>
> Yeah, I left it in there since it was an external function.  Perhaps
> we could rip it out and write something in the commit message that it
> should be replaced with the newer functions.  Thinking of extension
> authors here.

I'm not the best judge of where to draw the line for extensions, but
this function does have a name beginning with an underscore, which to
me is a red flag that it's internal in nature.

> > Minor nit: We might want to keep the comment that the number is
> > "semi-arbitrary" here as well.
>
> I had dropped that as the 8 part was mentioned in the comment above:
> "The minimum allocation is 8 ListCell units". I can put it back, I had
> just thought it was overkill.

Oh I see now, nevermind.

> > - 'pg_waldump', 'scripts');
> > + 'pg_validatebackup', 'pg_waldump', 'scripts');
> >
> > This seems like a separate concern?
>
> That's required due to the #include "lib/simplehash.h" in
> pg_validatebackup.c. I have to say, I didn't really take the time to
> understand all the Perl code there, but without that change, I was
> getting a link error when testing on Windows, and after I added
> pg_validatebackup to that array, it worked.

Hmm. Does pg_bitutils.h need something like this?

#ifndef FRONTEND
extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
#else
extern const uint8 pg_leftmost_one_pos[256];
extern const uint8 pg_rightmost_one_pos[256];
extern const uint8 pg_number_of_ones[256];
#endif

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-07 Thread Amit Kapila
On Tue, Apr 7, 2020 at 5:37 PM Anna Akenteva  wrote:
>
> On 2020-04-07 13:32, Amit Kapila wrote:
> > First, I don't
> > think we have a consensus on the syntax being used in the patch
> > (various people didn't agree to LSN specific syntax).  They wanted a
> > more generic syntax and I see that we tried to implement it and it
> > turns out to be a bit complex but that doesn't mean we just give up on
> > the idea and take the simplest approach and that too without a broader
> > agreement.
>
> Thank you for your comments!
>
> Initially, the syntax used to be "WAITLSN", which confined us to only
> waiting for LSNs and not anything else. So we switched to "WAIT FOR
> LSN", which would allow us to add variations like "WAIT FOR XID" or
> "WAIT FOR COMMIT TOKEN" in the future if we wanted. A few people seemed
> to imply that this kind of syntax is expandable enough:
>
> On 2018-02-01 14:47, Simon Riggs wrote:
> > I agree that WAIT LSN is good syntax because this allows us to wait
> > for something else in future.
>
> On 2017-10-31 12:42:56, Ants Aasma wrote:
> > For lack of a better proposal I would like something along the lines
> > of:
> > WAIT FOR state_id[, state_id] [ OPTIONS (..)]
>
> As for giving up on waiting for multiple events: we can only wait for
> LSNs at the moment, and there seems to be no point in waiting for
> multiple LSNs at once, because it's equivalent to waiting for the
> biggest LSN. So we opted for a simpler grammar for now, only letting
> the user specify one LSN and one TIMEOUT. If in the future we allow
> waiting for something else, like XIDs, we can expand the grammar as
> needed.
>
> What are your own thoughts on the syntax?
>

I don't know how users can specify the LSN value, but OTOH I could see
that if users can somehow provide the correct value of the commit LSN
they want to wait for, then it could work out.  It is possible that I
misread and we have a consensus on WAIT FOR LSN [option], because what
I saw Simon and Ants propose includes multiple states/events, and it
might be fine to have just one event for now.

Anyone else wants to share an opinion on syntax?

I think even if we are good with the syntax, I can see the code is not
completely ready to go, as mentioned in a few comments I raised.  I am
not sure if we want to commit it in the current form and then improve
it after feature freeze.  If it is possible to fix the loose ends
quickly and there are no more comments from anyone, then we might be
able to commit it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-04-07 Thread Ants Aasma
On Tue, 7 Apr 2020 at 08:24, vignesh C  wrote:
> Leader will create a circular queue
> and share it across the workers. The circular queue will be present in
> DSM. Leader will be using a fixed size queue to share the contents
> between the leader and the workers. Currently we will have 100
> elements present in the queue. This will be created before the workers
> are started and shared with the workers. The data structures that are
> required by the parallel workers will be initialized by the leader,
> the size required in dsm will be calculated and the necessary keys
> will be loaded in the DSM. The specified number of workers will then
> be launched. Leader will read the table data from the file and copy
> the contents to the queue element by element. Each element in the
> queue will have 64K size DSA. This DSA will be used to store tuple
> contents from the file. The leader will try to copy as much content as
> possible within one 64K DSA queue element. We intend to store at least
> one tuple in each queue element. There are some cases where the 64K
> space may not be enough to store a single tuple. Mostly in cases where
> the table has toast data present and the single tuple can be more than
> 64K size. In these scenarios we will extend the DSA space accordingly.
> We cannot change the size of the dsm once the workers are launched.
> Whereas in case of DSA we can free the dsa pointer and reallocate the
> dsa pointer based on the memory size required. This is the very reason
> for choosing DSA over DSM for storing the data that must be inserted
> into the relation.

I think the element-based approach and the requirement that all tuples
fit into the queue make things unnecessarily complex.  The approach I
detailed earlier allows for tuples to be bigger than the buffer.  In
that case a worker will claim the long tuple from the ring queue of
tuple start positions and start copying it into its local line_buf.
This can wrap around the buffer multiple times until the next start
position shows up.  At that point this worker can proceed with
inserting the tuple, and the next worker will claim the next tuple.

This way nothing needs to be resized, there is no risk of a file with
huge tuples running the system out of memory (because each element
would have to be reallocated to be huge), and the number of elements is
not something that has to be tuned.
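
(A minimal standalone sketch of that scheme; all names, sizes and the
synchronization are illustrative only, not from any patch.  The leader
would fill data/start_pos and bump ntuples; boundary handling for the
very last tuple is elided:)

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DATA_BUF_SIZE	(64 * 1024)
#define RING_SIZE		1024

typedef struct TupleRing
{
	char		data[DATA_BUF_SIZE];	/* raw input bytes; logically wraps */
	uint64_t	start_pos[RING_SIZE];	/* absolute offset of each tuple start */
	atomic_uint_fast64_t ntuples;		/* start positions published by leader */
	atomic_uint_fast64_t next_tuple;	/* next slot a worker may claim */
} TupleRing;

/*
 * Worker side: claim the next tuple and copy it into a private buffer,
 * wrapping around the data buffer as needed, so a single tuple may be
 * larger than any one contiguous chunk of the buffer.
 */
static bool
claim_next_tuple(TupleRing *ring, char *out, size_t outlen)
{
	uint64_t	slot = atomic_fetch_add(&ring->next_tuple, 1);

	/* Tuple 'slot' ends where tuple 'slot + 1' begins. */
	while (atomic_load(&ring->ntuples) <= slot + 1)
		;						/* in reality: sleep on a condition variable */

	uint64_t	begin = ring->start_pos[slot % RING_SIZE];
	uint64_t	end = ring->start_pos[(slot + 1) % RING_SIZE];

	for (uint64_t pos = begin; pos < end && pos - begin < outlen; pos++)
		out[pos - begin] = ring->data[pos % DATA_BUF_SIZE]; /* may wrap */

	return true;
}

The point being that only start offsets live in the ring; the bytes
themselves can be read straight into the shared buffer.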

> We had a couple of options for the way in which queue elements can be
> stored.
> Option 1:  Each element (DSA chunk) will contain tuples such that each
> tuple will be preceded by the length of the tuple.  So the tuples will
> be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2,
> tuple-2), ...  Or Option 2: Each element (DSA chunk) will contain only
> tuples (tuple-1), (tuple-2), ...  And we will have a second
> ring-buffer which contains a start-offset or length of each tuple.
> The old design used to generate one tuple of data and process tuple by
> tuple.  In the new design, the server will generate multiple tuples of
> data per queue element.  The worker will then process data tuple by
> tuple.  As we are processing the data tuple by tuple, I felt both of
> the options are almost the same.  However Design 1 was chosen over
> Design 2 as we can save up on some space that was required by another
> variable in each element of the queue.

With option 1 it's not possible to read input data directly into shared
memory, and there needs to be an extra memcpy in the time-critical
sequential flow of the leader.  With option 2 data could be read
directly into the shared memory buffer.  With future async I/O support,
reading and looking for tuple boundaries could be performed
concurrently.


Regards,
Ants Aasma
Cybertec




Re: backup manifests and contemporaneous buildfarm failures

2020-04-07 Thread Andrew Dunstan
On Tue, Apr 7, 2020 at 12:37 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > Taking stock of the situation this morning, most of the buildfarm is
> > now green. There are three failures, on eelpout (6 hours ago),
> > fairywren (17 hours ago), and hyrax (3 days, 7 hours ago).
>
> fairywren has now done this twice in the pg_validatebackupCheck step:
>
> exec failed: Bad address at 
> /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm 
> line 340.
>  at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm 
> line 340.
>
> I'm a tad suspicious that it needs another perl2host()
> somewhere, but the log isn't very clear as to where.
>
> More generally, I wonder if we ought to be trying to
> centralize those perl2host() calls instead of sticking
> them into individual test cases.
>
>


Not sure about that. I'll see if I can run it by hand and get some
more info. What's quite odd is that jacana (a very similar setup) is
passing this happily.

cheers

andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: proposal \gcsv

2020-04-07 Thread Pavel Stehule
On Tue, Apr 7, 2020 at 12:49, Isaac Morland  wrote:

> On Tue, 7 Apr 2020 at 03:30, Pavel Stehule 
> wrote:
>
>
>> your patch supports syntax
>>
>> (option1=value option2=value)
>>
>> It looks a little bit inconsistent and unusual
>>
>>>
> It's the same as a connection string. Actually, maybe that's the key to
> allowing parentheses, etc. in option values if needed - allow the same
> single-quote quoting as in connection strings. Maybe even just call the
> same code to do the parsing.
>

I don't think the connection string syntax should be used there.


Re: WAL usage calculation patch

2020-04-07 Thread Julien Rouhaud
On Tue, Apr 7, 2020 at 12:00 PM Peter Eisentraut
 wrote:
>
> On 2020-04-07 04:12, Amit Kapila wrote:
> > On Mon, Apr 6, 2020 at 10:01 PM Justin Pryzby  wrote:
> >>
> >> On Mon, Apr 06, 2020 at 05:01:30PM +0200, Peter Eisentraut wrote:
> >>> I noticed in some of the screenshots that were tweeted that for example in
> >>>
> >>>  WAL:  records=1  bytes=56
> >>>
> >>> there are two spaces between pieces of data.  This doesn't match the rest 
> >>> of
> >>> the EXPLAIN output.  Can that be adjusted?
> >>
> >> We talked about that here:
> >> https://www.postgresql.org/message-id/20200402054120.GC14618%40telsasoft.com
> >>
> >
> > Yeah.  Just to brief here, the main reason was that one of the fields
> > (full page writes) already had a single space and then we had prior
> > cases as mentioned in Justin's email [1] where we use two spaces which
> > lead us to decide using two spaces in this case.
>
> We also have existing cases for the other way:
>
>  actual time=0.050..0.052
>  Buffers: shared hit=3 dirtied=1
>
> The cases mentioned by Justin are not formatted in a key=value format,
> so it's not quite the same, but it also raises the question why they are
> not.
>
> Let's figure out a way to consolidate this without making up a third format.

The parsability problem Justin was mentioning is only due to "full
page writes", so we could use "full_page_writes" or "fpw" instead and
remove the extra spaces.  There would be a small discrepancy with the
verbose autovacuum log, but there are other differences already.

I'd be slightly in favor of "fpw" to be more concise.  Would that be ok?




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Jonathan S. Katz
On 4/7/20 8:15 AM, Andres Freund wrote:

> I think this is pretty close to being committable.
> 
> 
> But: This patch came in very late for v13, and it took me much longer to
> polish it up than I had hoped (partially distraction due to various bugs
> I found (in particular snapshot_too_old), partially covid19, partially
> "hell if I know"). The patchset touches core parts of the system. While
> both Thomas and David have done some review, they haven't for the latest
> version (mea culpa).
> 
> In many other instances I would say that the above suggests slipping to
> v14, given the timing.
> 
> The main reason I am considering pushing is that I think this patcheset
> addresses one of the most common critiques of postgres, as well as very
> common, hard to fix, real-world production issues. GetSnapshotData() has
> been a major bottleneck for about as long as I have been using postgres,
> and this addresses that to a significant degree.
> 
> A second reason I am considering it is that, in my opinion, the changes
> are not all that complicated and not even that large. At least not for a
> change to a problem that we've long tried to improve.

Even as recently as earlier this week there was a blog post making the
rounds about the pain points of running PostgreSQL with many
simultaneous connections.  Anything to help with that would go a long
way, and looking at the benchmarks you ran (at least with a quick,
non-thorough glance) this could and should be very positively impactful
to a *lot* of PostgreSQL users.

I can't comment on the "close to committable" aspect (at least not with
an informed, confident opinion), but if it is indeed close to
committable and you can put in the work to finish polishing (read: "bug
fixing" :-) and we have a plan both for testing and, if need be,
reverting, I would be okay with including it, for whatever my vote is
worth.  Is the timing / situation ideal?  No, but the way you describe
it, it sounds like there is enough that can be done to ensure it's
ready for Beta 1.

From an RMT standpoint, perhaps this is one of the "Recheck at Mid-Beta"
items, as well.

Thanks,

Jonathan





Re: proposal \gcsv

2020-04-07 Thread Tom Lane
Isaac Morland  writes:
> On Tue, 7 Apr 2020 at 03:30, Pavel Stehule  wrote:
>> your patch supports syntax
>> (option1=value option2=value)
>> It looks a little bit inconsistent and unusual

> It's the same as a connection string.

Yeah, I didn't exactly invent that out of nowhere.  There are a couple
of reasons not to add commas to the syntax:

* It would make comma be another character that's hard to put into
pset values in this context.  And unlike right paren, there's plenty
of reason to think people would wish to do that, eg \g (fieldsep=,) ...

* If we use commas then people would figure the spaces are optional
and would try to write things like \g (expanded,null=NULL) ...
That moves the goalposts quite a bit in terms of the code having
to pick apart strings, and it makes things a lot more ambiguous
than they were before --- notably, now '=' is *also* a character
that you can't readily write in a pset value.

> Actually, maybe that's the key to
> allowing parentheses, etc. in option values if needed - allow the same
> single-quote quoting as in connection strings. Maybe even just call the
> same code to do the parsing.

I don't think there is a lot of wiggle room to let \g have its own quoting
rules.  The psqlscanslash lexer has its own ideas about that, which we
cannot bypass without losing features.  An example is that people would
expect this to work:
\set myfmt '(expanded tuples_only)'
\g :myfmt somefile
So we can't just ask to snarf the input in OT_WHOLE_LINE mode and then
pick it apart locally in \g.  And having two levels of quoting rules
would be disastrous for usability.

The lexer does have the ability to report whether an argument was quoted,
but it doesn't seem to work quite the way we would want here; it actually
reports whether any part of the argument was quoted.  So if we tried to
make right paren recognition depend on that, this'd misbehave:
\g (fieldsep='|')

regards, tom lane




Re: proposal \gcsv

2020-04-07 Thread Pavel Stehule
On Tue, Apr 7, 2020 at 16:28, Tom Lane  wrote:

> Isaac Morland  writes:
> > On Tue, 7 Apr 2020 at 03:30, Pavel Stehule 
> wrote:
> >> your patch supports syntax
> >> (option1=value option2=value)
> >> It looks a little bit inconsistent and unusual
>
> > It's the same as a connection string.
>
> Yeah, I didn't exactly invent that out of nowhere.  There are a couple
> of reasons not to add commas to the syntax:
>
> * It would make comma be another character that's hard to put into
> pset values in this context.  And unlike right paren, there's plenty
> of reason to think people would wish to do that, eg \g (fieldsep=,) ...
>

ok, this is a valid argument


> * If we use commas then people would figure the spaces are optional
> and would try to write things like \g (expanded,null=NULL) ...
> That moves the goalposts quite a bit in terms of the code having
> to pick apart strings, and it makes things a lot more ambiguous
> than they were before --- notably, now '=' is *also* a character
> that you can't readily write in a pset value.
>
> > Actually, maybe that's the key to
> > allowing parentheses, etc. in option values if needed - allow the same
> > single-quote quoting as in connection strings. Maybe even just call the
> > same code to do the parsing.
>
> I don't think there is a lot of wiggle room to let \g have its own quoting
> rules.  The psqlscanslash lexer has its own ideas about that, which we
> cannot bypass without losing features.  An example is that people would
> expect this to work:
> \set myfmt '(expanded tuples_only)'
> \g :myfmt somefile
> So we can't just ask to snarf the input in OT_WHOLE_LINE mode and then
> pick it apart locally in \g.  And having two levels of quoting rules
> would be disastrous for usability.
>
> The lexer does have the ability to report whether an argument was quoted,
> but it doesn't seem to work quite the way we would want here; it actually
> reports whether any part of the argument was quoted.  So if we tried to
> make right paren recognition depend on that, this'd misbehave:
> \g (fieldsep='|')
>

ok, I have no objections.

Regards

Pavel


> regards, tom lane
>


Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-07 Thread Juan José Santamaría Flecha
On Tue, Apr 7, 2020 at 7:44 AM davinder singh 
wrote:

>
> You need to enable NLS support in the config file. Let me know if that
> answers your question.
>

Yes, I can see where this is coming from now, thanks.

> Currently it is using _create_locale, which behaves similarly to
> GetLocaleInfoEx, i.e. it also only checks the format; the only
> difference is that it accepts a bigger set.  I thought of using
> GetLocaleInfoEx for the fix because it behaves similarly to the
> existing code, and a similar issue was resolved earlier using the
> same approach.  I have attached the patch; let me know your thoughts.
>

You are right that the way to get the locale name information is
through GetLocaleInfoEx().
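
(As a minimal standalone sketch of the kind of probe involved; the
locale name and output handling are illustrative only:)

#include <windows.h>
#include <stdio.h>

int
main(void)
{
	WCHAR		name[LOCALE_NAME_MAX_LENGTH];

	/* GetLocaleInfoEx() returns 0 if the locale name is not recognized */
	if (GetLocaleInfoEx(L"de-DE", LOCALE_SNAME, name, LOCALE_NAME_MAX_LENGTH))
		wprintf(L"canonical name: %ls\n", name);
	else
		printf("not a recognized locale\n");
	return 0;
}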

A couple of comments on the patch:

* I think you could save a couple of lines of code, and make it
clearer, by merging both tests on _MSC_VER into a single #if ... #else
and leaving as common code:
+ }
+ else
+ return NULL;
+#endif /*_MSC_VER && _MSC_VER < 1900*/

* The logic on "defined(_MSC_VER) && (_MSC_VER >= 1900)" is expressed
as "_WIN32_WINNT >= 0x0600" in other parts of the code.  I would
recommend using the latter.

* This looks like a spurious change:
 - sizeof(iso_lc_messages), NULL);
+ sizeof(iso_lc_messages), NULL);

Regards,

Juan José Santamaría Flecha


Re: A bug when use get_bit() function for a long bytea string

2020-04-07 Thread Tom Lane
"movead...@highgo.ca"  writes:
> After several patch changes following hackers' proposals, I think it's
> ready to commit; can we commit it before the code freeze for pg-13?

It would be easier to get this done if you had addressed any of the
objections to the patch as given.  Integer-overflow handling is still
missing, and you still are assuming that it's okay to change catalog
entries in released branches.

regards, tom lane




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-07 Thread Tomas Vondra

On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote:
> On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita  wrote:
>> Attached is the original patch (0001) and one patch (0002) with
>> changes including those by Tomas and Ashutosh.
>
> I merged the patches into one and rebased it against HEAD.  Attached
> is a new version, in which I added the commit message as well.  Does
> that make sense?  If there are no objections, I'll commit the patch.

+1


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-07 Thread Tomas Vondra

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby  wrote:
>> On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
>>> I've pushed the first part of this patch series - I've reorganized it a
>>
>> I scanned through this again post-commit.  Find attached some suggestions.
>>
>> Shouldn't non-text explain output always show both disk *and* mem,
>> including zeros?
>
> Could you give more context on this? Is there a standard to follow?
> Regular sort nodes only ever report one type, so there's not a good
> parallel there.
>
>> Should "Pre-sorted Groups:" be on a separate line?
>> | Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB
>> Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB
>
> I'd originally had that, but Tomas wanted it to be more compact. It's
> easy to adjust though if the consensus changes on that.

I'm OK with changing the format if there's a consensus. The current
format seemed better to me, but I'm not particularly attached to it.

>> And, should it use two spaces before "Sort Method", "Memory" and
>> "Pre-sorted Groups"?  I think you should maybe do that instead of the
>> "semicolon separator".  I think "two spaces" makes sense, since the
>> units are different, similar to hash buckets and normal sort node.
>>
>>  "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
>>  appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
>>
>> Note, I made a similar comment regarding two spaces for explain(WAL) here:
>> https://www.postgresql.org/message-id/20200402054120.GC14618%40telsasoft.com
>>
>> And Peter E seemed to dislike that, here:
>> https://www.postgresql.org/message-id/ef8c966f-e50a-c583-7b1e-85de6f4ca0d3%402ndquadrant.com
>
> I read through that subthread, and the ending seemed to be Peter
> wanting things to be unified. Was there a conclusion beyond that?

Yeah, I don't think there was a clear consensus :-(

>> Also, you're showing:
>> ExplainPropertyInteger("Maximum Sort Space Used", "kB",
>> groupInfo->maxMemorySpaceUsed, es);
>>
>> But in show_hash_info() and show_hashagg_info(), and in your own text
>> output, that's called "Peak":
>> ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
>> ExplainPropertyInteger("Peak Memory Usage", "kB",
>> spacePeakKb, es);
>
> Yes, that's a miss and should be fixed.

Will fix.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-07 Thread Justin Pryzby
On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby  wrote:
> > On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> > > I've pushed the first part of this patch series - I've reorganized it a
> >
> > I scanned through this again post-commit.  Find attached some suggestions.
> >
> > Shouldn't non-text explain output always show both disk *and* mem, including
> > zeros ?
> 
> Could you give more context on this? Is there a standard to follow?
> Regular sort nodes only ever report one type, so there's not a good
> parallel there.

Right, I'm not sure either, since it seems to be a new case.  Maybe Tomas has a
strong intuition.

See at least the commit messages here:
3ec20c7091e97a554e7447ac2b7f4ed795631395
7d91b604d9b5d6ec8c19c57a9ffd2f27129cdd94
8ebb69f85445177575684a0ba5cfedda8d840a91

Maybe this one suggests that it should *not* be present unconditionally, but
only when that sort type is used?
4b234fd8bf21cd6f5ff44f1f1c613bf40860998d

Another thought: is checking if bytes>0 really a good way to determine
if a sort type was used?  It seems like checking a bit or a pointer
would be better.  I guess a size of 0 is unlikely, and it's ok at least
in text mode.

if (groupInfo->maxMemorySpaceUsed > 0)

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
...
> I read through that subthread, and the ending seemed to be Peter
> wanting things to be unified. Was there a conclusion beyond that?

This discussion is ongoing.  I think let's wait until that's settled before
addressing this more complex and even newer case.  We can add "explain, two
spaces and equals vs colon" to the "Open items" list if need be - I hope the
discussion will not delay the release.

-- 
Justin




Re: adding partitioned tables to publications

2020-04-07 Thread Amit Langote
Thanks for the review.

On Tue, Apr 7, 2020 at 6:01 PM Peter Eisentraut
 wrote:
> On 2020-04-07 08:44, Amit Langote wrote:
> > I updated the patch to make the following changes:
> >
> > * Rewrote the tests to match in style with those committed yesterday
> > * Renamed all variables that had pubasroot in it to have pubviaroot
> > instead to match the publication parameter
> > * Updated pg_publication catalog documentation
>
> Thanks.  I have some further questions:
>
> The change in nodeModifyTable.c to add CheckValidResultRel() is unclear.
>   It doesn't seem to do anything, and it's not clear how it's related to
> this patch.

CheckValidResultRel() checks that a replica identity is present for
replicating a given update/delete, which I think is better to perform
on the root table itself, rather than on some partition that would be
affected.  The latter already occurs by way of CheckValidResultRel()
being called on the partitions to be updated.  I think we get a more
helpful message if the root parent is flagged instead of a partition.

update prt1 set b = b + 1 where a = 578;
ERROR:  cannot update table "prt1" because it does not have a replica
identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.

vs.

-- checking the partition
update prt1 set b = b + 1 where a = 578;
ERROR:  cannot update table "prt1_p3" because it does not have a
replica identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.

I am okay with getting rid of the check on the root table if flagging
individual partitions seems good enough.

> The changes in GetRelationPublications() are confusing to me:
>
> +   if (published_rels)
> +   {
> +   num = list_length(result);
> +   for (i = 0; i < num; i++)
> +   *published_rels = lappend_oid(*published_rels, relid);
> +   }
>
> This adds relid to the output list "num" times, where num is the number
> of publications found.  Shouldn't "i" be used in the loop somehow?
> Similarly later in the function.

published_rels contains an *OID* for each publication that will be in
result.  Callers should iterate the two lists together: for each
publication found in result, the caller knows which relation it is
associated with from the OID at the same position in published_rels.
If publishing through an ancestor's publication, we need to know which
ancestor, hence the whole dance.

I have thought this to be a bit ugly before, but after having to
explain it, I think it's better to use some other approach for this.
I have updated the patch so that GetRelationPublications no longer
considers a relation's ancestors.  That way, it doesn't have to
second-guess what other information will be needed by the caller.

I hope that's clearer, now that all the logic is in one place, namely
get_rel_sync_entry().

> The descriptions of the new fields in RelationSyncEntry don't seem to
> match the code accurately, or at least it's confusing.
> replicate_as_relid is always filled in with an ancestor, even if
> pubviaroot is not set.

Given this confusion, I have changed how replicate_as_relid works so
that it's now always set -- if different from the relation's own OID,
the code for "publishing via root" kicks in in various places.

> I think the pubviaroot field is actually not necessary.  We only need
> replicate_as_relid.

Looking through the code, I agree.  I guess I only kept it around to
go with pubupdate, etc.

I guess it might also be a good idea to call it publish_as_relid
instead of replicate_as_relid for consistency.

> There is a markup typo in logical-replication.sgml:
>
>  

Oops, fixed.

> In pg_dump, you missed updating a branch for an older version.  See
> attached patch.
>
> Also attached a patch to rephrase the psql output a bit to make it not
> so long.

Thank you, merged.

Attached updated patch with above changes.

--
Thank you,

Amit Langote
EnterpriseDB: http://www.enterprisedb.com


v19-0001-Allow-publishing-partition-changes-via-ancestors.patch
Description: Binary data


Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
Comments:

In 0002, the comments in SnapshotSet() are virtually incomprehensible.
There's no commit message so the reasons for the changes are unclear.
But mostly looks unproblematic.

0003 looks like a fairly unrelated bug fix that deserves to be
discussed on the thread related to the original patch. Probably should
be an open item.

0004 looks fine.

Regarding 0005:

There's sort of a mix of terminology here: are we pruning tuples or
removing tuples or testing whether things are invisible? It would be
better to be more consistent.

+ * State for testing whether tuple versions may be removed. To improve
+ * GetSnapshotData() performance we don't compute an accurate value whenever
+ * acquiring a snapshot. Instead we compute boundaries above/below which we
+ * know that row versions are [not] needed anymore.  If at test time values
+ * falls in between the two, the boundaries can be recomputed (unless that
+ * just happened).

I don't like the wording here much. Maybe: State for testing whether
an XID is invisible to all current snapshots. If an XID precedes
maybe_needed_bound, it's definitely not visible to any current
snapshot. If it equals or follows definitely_needed_bound, that XID
isn't necessarily invisible to all snapshots. If it falls in between,
we're not sure. If, when testing a particular tuple, we see an XID
somewhere in the middle, we can try recomputing the boundaries to get
a more accurate answer (unless we've just done that). This is cheaper
than maintaining an accurate value all the time.
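
Spelled out as code, I mean something like this (hypothetical sketch;
recompute_horizons() is an invented placeholder):

    static bool
    xid_removable(InvisibleToEveryoneState *state, TransactionId xid)
    {
        if (TransactionIdPrecedes(xid, state->maybe_needed_bound))
            return true;    /* invisible to every current snapshot */
        if (TransactionIdFollowsOrEquals(xid, state->definitely_needed_bound))
            return false;   /* may still be visible to some snapshot */

        /* in between: recompute the boundaries, unless we just did */
        recompute_horizons(state);
        return TransactionIdPrecedes(xid, state->maybe_needed_bound);
    }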

There's also the problem that this sorta contradicts the comment for
definitely_needed_bound. There it says intermediate values needed to
be tested against the ProcArray, whereas here it says we need to
recompute the bounds. That's kinda confusing.

ComputedHorizons seems like a fairly generic name. I think there's
some relationship between InvisibleToEveryoneState and
ComputedHorizons that should be brought out more clearly by the naming
and the comments.

+ /*
+ * The value of ShmemVariableCache->latestCompletedFullXid when
+ * ComputeTransactionHorizons() held ProcArrayLock.
+ */
+ FullTransactionId latest_completed;
+
+ /*
+ * The same for procArray->replication_slot_xmin and.
+ * procArray->replication_slot_catalog_xmin.
+ */
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;

Department of randomly inconsistent names. In general I think it's
quite hard to grasp the relationship between the different fields in
ComputedHorizons.

+static inline bool OldSnapshotThresholdActive(void)
+{
+ return old_snapshot_threshold >= 0;
+}

Formatting.

+
+bool
+GinPageIsRecyclable(Page page)

Needs a comment. Or more than one.

- /*
- * If a transaction wrote a commit record in the gap between taking and
- * logging the snapshot then latestCompletedXid may already be higher than
- * the value from the snapshot, so check before we use the incoming value.
- */
- if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-   running->latestCompletedXid))
- ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
- Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
-
- LWLockRelease(ProcArrayLock);

This code got relocated so that the lock is released later, but you
didn't add any comments explaining why. Somebody will move it back and
then you'll yell at them for doing it wrong. :-)

+ /*
+ * Must have called GetOldestVisibleTransactionId() if using SnapshotAny.
+ * Shouldn't have for an MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+!TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == SnapshotAny || !anyvisible);

This looks like a gratuitous code relocation.

+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer,
TransactionId *dead_after)

I don't much like the name dead_after, but I don't have a better
suggestion, either.

- * Deleter committed, but perhaps it was recent enough that some open
- * transactions could still see the tuple.
+ * Deleter committed, allow caller to check if it was recent enough that
+ * some open transactions could still see the tuple.

I think you could drop this change.

+ /*
+ * State related to determining whether a dead tuple is still needed.
+ */
+ InvisibleToEveryoneState *vistest;
+ TimestampTz limited_oldest_ts;
+ TransactionId limited_oldest_xmin;
+ /* have we made removal decision based on old_snapshot_threshold */
+ bool limited_oldest_committed;

Would benefit from more comments.

+ * accuring to prstate->vistest, but that can be removed based on

Typo.

Generally, heap_prune_satisfies_vacuum looks pretty good. The
limited_oldest_committed naming is confusing, but the comments make it
a lot clearer.

+ * If oldest btpo.xact in the deleted pa

Re: WIP/PoC for parallel backup

2020-04-07 Thread Asif Rehman
Hi,

Thanks, Kashif and Rajkumar. I have fixed the reported issues.

I have added the shared state as previously described. The new grammar
changes are as follows:

START_BACKUP [LABEL ''] [FAST] [MAX_RATE %d]
- This will generate a unique backupid using pg_strong_random(16),
  hex-encode it, and return it as the result set.
- It will also create a shared state and add it to a hash table (see the
  sketch after the command list below). The hash table size is set to
  BACKUP_HASH_SIZE=10, but since the hash table can expand dynamically,
  I think that's a sufficient initial size. max_wal_senders is not used,
  because it can be set to quite a large value.

JOIN_BACKUP 'backup_id'
- finds 'backup_id' in the hash table and attaches it to the server process.


SEND_FILE '(' 'FILE' ')' [NOVERIFY_CHECKSUMS]
- renamed SEND_FILES to SEND_FILE
- removed START_WAL_LOCATION from this because 'startptr' is now
accessible through
  shared state.

There is no change in other commands:
STOP_BACKUP [NOWAIT]
LIST_TABLESPACES [PROGRESS]
LIST_FILES [TABLESPACE]
LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X']
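
As referenced above, a rough sketch of the shared-state hash table setup
(the struct and constant names follow this mail; the rest is the usual
shared-memory hash pattern and is illustrative only):

    static HTAB *BackupStateHash;

    static void
    InitBackupStateHash(void)
    {
        HASHCTL     info;

        memset(&info, 0, sizeof(info));
        info.keysize = NAMEDATALEN;     /* hex-encoded backup id */
        info.entrysize = sizeof(BackupSharedState);

        BackupStateHash = ShmemInitHash("parallel backup state",
                                        BACKUP_HASH_SIZE,   /* initial */
                                        BACKUP_HASH_SIZE,   /* max */
                                        &info,
                                        HASH_ELEM);
    }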

The current patches (v11) have been rebased to the latest master. The
backup manifest is enabled by default, so I have disabled it for
parallel backup mode and emit a warning so that the user is aware of it
and does not expect one in the backup.


On Tue, Apr 7, 2020 at 4:03 PM Kashif Zeeshan <
kashif.zees...@enterprisedb.com> wrote:

>
>
> On Fri, Apr 3, 2020 at 3:01 PM Kashif Zeeshan <
> kashif.zees...@enterprisedb.com> wrote:
>
>> Hi Asif
>>
>> When a non-existent slot is used with a tablespace, the correct error is
>> displayed, but then the backup folder is not cleaned up and a corrupt
>> backup is left behind.
>>
>> Steps
>> ===
>>
>> edb@localhost bin]$
>> [edb@localhost bin]$ mkdir /home/edb/tbl1
>> [edb@localhost bin]$ mkdir /home/edb/tbl_res
>> [edb@localhost bin]$
>> postgres=#  create tablespace tbl1 location '/home/edb/tbl1';
>> CREATE TABLESPACE
>> postgres=#
>> postgres=# create table t1 (a int) tablespace tbl1;
>> CREATE TABLE
>> postgres=# insert into t1 values(100);
>> INSERT 0 1
>> postgres=# insert into t1 values(200);
>> INSERT 0 1
>> postgres=# insert into t1 values(300);
>> INSERT 0 1
>> postgres=#
>>
>>
>> [edb@localhost bin]$
>> [edb@localhost bin]$  ./pg_basebackup -v -j 2 -D
>>  /home/edb/Desktop/backup/ -T /home/edb/tbl1=/home/edb/tbl_res -S test
>> pg_basebackup: initiating base backup, waiting for checkpoint to complete
>> pg_basebackup: checkpoint completed
>> pg_basebackup: write-ahead log start point: 0/2E28 on timeline 1
>> pg_basebackup: starting background WAL receiver
>> pg_basebackup: error: could not send replication command
>> "START_REPLICATION": ERROR:  replication slot "test" does not exist
>> pg_basebackup: backup worker (0) created
>> pg_basebackup: backup worker (1) created
>> pg_basebackup: write-ahead log end point: 0/2E000100
>> pg_basebackup: waiting for background process to finish streaming ...
>> pg_basebackup: error: child thread exited with error 1
>> [edb@localhost bin]$
>>
>> backup folder not cleaned
>>
>> [edb@localhost bin]$
>> [edb@localhost bin]$
>> [edb@localhost bin]$
>> [edb@localhost bin]$ ls /home/edb/Desktop/backup
>> backup_label  global        pg_dynshmem  pg_ident.conf  pg_multixact
>> pg_replslot   pg_snapshots  pg_stat_tmp  pg_tblspc      PG_VERSION
>> pg_xact       postgresql.conf
>> base          pg_commit_ts  pg_hba.conf  pg_logical     pg_notify
>> pg_serial     pg_stat       pg_subtrans  pg_twophase    pg_wal
>> postgresql.auto.conf
>> [edb@localhost bin]$
>>
>>
>>
>>
>> If the same case is executed without the parallel backup patch, then the
>> backup folder is cleaned up after the error is displayed.
>>
>> [edb@localhost bin]$ ./pg_basebackup -v -D  /home/edb/Desktop/backup/ -T
>> /home/edb/tbl1=/home/edb/tbl_res -S test999
>> pg_basebackup: initiating base backup, waiting for checkpoint to complete
>> pg_basebackup: checkpoint completed
>> pg_basebackup: write-ahead log start point: 0/2B28 on timeline 1
>> pg_basebackup: starting background WAL receiver
>> pg_basebackup: error: could not send replication command
>> "START_REPLICATION": ERROR:  replication slot "test999" does not exist
>> pg_basebackup: write-ahead log end point: 0/2B000100
>> pg_basebackup: waiting for background process to finish streaming ...
>> pg_basebackup: error: child process exited with exit code 1
>> *pg_basebackup: removing data directory " /home/edb/Desktop/backup"*
>> pg_basebackup: changes to tablespace directories will not be undone
>>
>
>
> Hi Asif
>
> A similar case is when the DB server is shut down while the parallel backup
> is in progress: the correct error is displayed, but then the backup folder
> is not cleaned up and a corrupt backup is left behind. I think one bug fix
> will solve all these cases where cleanup is not done when a parallel backup
> fails.
>
> [edb@localhost bin]$
> [edb@localhost bin]$
> [edb@localhost bin]$  ./pg_basebackup -v -D  /home/ed

Re: pgsql: Generate backup manifests for base backups, and validate them.

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 5:51 AM Peter Eisentraut
 wrote:
> On 2020-04-03 21:07, Robert Haas wrote:
> > A new tool called pg_validatebackup can validate a backup against the
> > manifest.
>
> In software engineering, "verify" and "validate" have standardized
> distinct meanings.  I'm not going to try to explain them here, but you
> can easily find them online.  I haven't formed an opinion on which one
> of them this tool is doing, but I notice that both the man page and the
> messages produced by the tool use the two terms seemingly
> interchangeably.  We should try to pick the correct term and use it
> consistently.

The tool is trying to make sure that we have the same backup that
we're supposed to have, and that the associated WAL is present and
sane. Looking at
https://en.wikipedia.org/wiki/Verification_and_validation, that sounds
more like verification than validation, but I confess that this
distinction is new to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Don't try fetching future segment of a TLI.

2020-04-07 Thread Fujii Masao




On 2020/04/07 20:21, David Steele wrote:


On 4/7/20 3:48 AM, Kyotaro Horiguchi wrote:

At Tue, 7 Apr 2020 12:15:00 +0900, Fujii Masao  
wrote in

This doesn't seem to be a bug, so I'm thinking of merging this into the
next *major* version release, i.e., v13.

Not a bug, perhaps, but I think we do consider back-patching
performance problems. The rise in S3 usage has just exposed how poorly
this code performs in high-latency environments.


I understood the situation and am fine with back-patching that. But I'm
not sure it's fair to do so. Maybe we need to hear more opinions about
this? OTOH, feature freeze for v13 is today, so what about committing
the patch to v13 first, and then doing the back-patch after hearing
opinions and receiving many +1s?


+1 for committing only to v13 today, then back-patching if people want
and/or accept it.


Definitely +1 for a commit today to v13. I certainly was not trying to hold
that up.

Pushed the patch to v13, as a first step!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: WIP/PoC for parallel backup

2020-04-07 Thread Jeevan Chalke
On Tue, Apr 7, 2020 at 10:14 PM Asif Rehman  wrote:

> Hi,
>
> Thanks, Kashif and Rajkumar. I have fixed the reported issues.
>
> I have added the shared state as previously described. The new grammar
> changes
> are as follows:
>
> START_BACKUP [LABEL ''] [FAST] [MAX_RATE %d]
> - This will generate a unique backupid using pg_strong_random(16),
>   hex-encode it, and return it as the result set.
> - It will also create a shared state and add it to a hash table. The
>   hash table size is set to BACKUP_HASH_SIZE=10, but since the hash
>   table can expand dynamically, I think that's a sufficient initial
>   size. max_wal_senders is not used, because it can be set to quite a
>   large value.
>
> JOIN_BACKUP 'backup_id'
> - finds 'backup_id' in hashtable and attaches it to server process.
>
>
> SEND_FILE '(' 'FILE' ')' [NOVERIFY_CHECKSUMS]
> - renamed SEND_FILES to SEND_FILE
> - removed START_WAL_LOCATION from this because 'startptr' is now
> accessible through
>   shared state.
>
> There is no change in other commands:
> STOP_BACKUP [NOWAIT]
> LIST_TABLESPACES [PROGRESS]
> LIST_FILES [TABLESPACE]
> LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X']
>
> The current patches (v11) have been rebased to the latest master. The
> backup manifest is enabled by default, so I have disabled it for
> parallel backup mode and emit a warning so that the user is aware of
> it and does not expect one in the backup.
>

So, are you working on making it work? I don't think a parallel backup
feature should be creating a backup with no manifest.


>
>
>
> --
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>

-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: pgsql: Generate backup manifests for base backups, and validate them.

2020-04-07 Thread David Steele

On 4/7/20 12:44 PM, Robert Haas wrote:

On Tue, Apr 7, 2020 at 5:51 AM Peter Eisentraut
 wrote:

On 2020-04-03 21:07, Robert Haas wrote:

A new tool called pg_validatebackup can validate a backup against the
manifest.


In software engineering, "verify" and "validate" have standardized
distinct meanings.  I'm not going to try to explain them here, but you
can easily find them online.  I haven't formed an opinion on which one
of them this tool is doing, but I notice that both the man page and the
messages produced by the tool use the two terms seemingly
interchangeably.  We should try to pick the correct term and use it
consistently.


The tool is trying to make sure that we have the same backup that
we're supposed to have, and that the associated WAL is present and
sane. Looking at
https://en.wikipedia.org/wiki/Verification_and_validation, that sounds
more like verification than validation, but I confess that this
distinction is new to me.


When I searched, I found two different definitions for validation and
verification: one for software development (as in the link above, and
what I think Peter meant) and another for data (see
https://en.wikipedia.org/wiki/Data_validation,
https://en.wikipedia.org/wiki/Data_verification,
https://www.differencebetween.com/difference-between-data-validation-and-vs-data-verification/)


It seems that validation vs. verification as defined in PMBOK (the former
sense) does not really apply here, though. That leaves only the latter
sense, which appears less well-documented but points to "verify" as the
better option.


Regards,
--
-David
da...@pgmasters.net




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-07 Thread Etsuro Fujita
Hi Tomas,

On Wed, Apr 8, 2020 at 12:15 AM Tomas Vondra
 wrote:
> On Mon, Apr 06, 2020 at 05:28:52PM +0900, Etsuro Fujita wrote:
> >On Sat, Apr 4, 2020 at 12:15 AM Etsuro Fujita  
> >wrote:
> >> Attached is the original patch (0001) and one patch (0002) with
> >> changes including those by Tomas and Ashutosh.
> >
> >I merged the patches into one and rebased it against HEAD.  Attached
> >is a new version, in which I added the commit message as well.  Does
> >that make sense?  If there are no objections, I’ll commit the patch.

> +1

Great!  It's midnight in Japan now, so I'll push the patch early in the morning.

Best regards,
Etsuro Fujita




Re: WIP/PoC for parallel backup

2020-04-07 Thread Asif Rehman
On Tue, Apr 7, 2020 at 10:03 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Tue, Apr 7, 2020 at 10:14 PM Asif Rehman 
> wrote:
>
>> Hi,
>>
>> Thanks, Kashif and Rajkumar. I have fixed the reported issues.
>>
>> I have added the shared state as previously described. The new grammar
>> changes
>> are as follows:
>>
>> START_BACKUP [LABEL ''] [FAST] [MAX_RATE %d]
>> - This will generate a unique backupid using pg_strong_random(16),
>>   hex-encode it, and return it as the result set.
>> - It will also create a shared state and add it to a hash table. The
>>   hash table size is set to BACKUP_HASH_SIZE=10, but since the hash
>>   table can expand dynamically, I think that's a sufficient initial
>>   size. max_wal_senders is not used, because it can be set to quite a
>>   large value.
>>
>> JOIN_BACKUP 'backup_id'
>> - finds 'backup_id' in hashtable and attaches it to server process.
>>
>>
>> SEND_FILE '(' 'FILE' ')' [NOVERIFY_CHECKSUMS]
>> - renamed SEND_FILES to SEND_FILE
>> - removed START_WAL_LOCATION from this because 'startptr' is now
>> accessible through
>>   shared state.
>>
>> There is no change in other commands:
>> STOP_BACKUP [NOWAIT]
>> LIST_TABLESPACES [PROGRESS]
>> LIST_FILES [TABLESPACE]
>> LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X']
>>
>> The current patches (v11) have been rebased to the latest master. The
>> backup manifest is enabled by default, so I have disabled it for
>> parallel backup mode and emit a warning so that the user is aware of
>> it and does not expect one in the backup.
>>
>
> So, are you working on making it work? I don't think a parallel backup
> feature should be creating a backup with no manifest.
>

I will; however, parallel backup is already quite a large patch, so I
think we should first agree on the current work before adding backup
manifest and progress-reporting support.


--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


Re: proposal \gcsv

2020-04-07 Thread Tom Lane
Here's a more fully fleshed-out patch, with documentation and some
test cases.  (0001 patch is identical to last time.)

Considering this is the last day before v13 feature freeze, should
I push this, or sit on it till v14?  I feel reasonably good that we
have a nice feature definition here, but it's awfully late in the
cycle to be designing features.

regards, tom lane

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index db31fa8..1055af5 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -160,8 +160,8 @@ static void minimal_error_message(PGresult *res);
 
 static void printSSLInfo(void);
 static void printGSSInfo(void);
-static bool printPsetInfo(const char *param, struct printQueryOpt *popt);
-static char *pset_value_string(const char *param, struct printQueryOpt *popt);
+static bool printPsetInfo(const char *param, printQueryOpt *popt);
+static char *pset_value_string(const char *param, printQueryOpt *popt);
 
 #ifdef WIN32
 static void checkWin32Codepage(void);
@@ -1302,7 +1302,11 @@ exec_command_g(PsqlScanState scan_state, bool active_branch, const char *cmd)
 		}
 		free(fname);
 		if (strcmp(cmd, "gx") == 0)
-			pset.g_expanded = true;
+		{
+			/* save settings, then force expanded = 1 */
+			pset.gsavepopt = savePsetInfo(&pset.popt);
+			pset.popt.topt.expanded = 1;
+		}
 		status = PSQL_CMD_SEND;
 	}
 	else
@@ -3785,6 +3789,17 @@ _unicode_linestyle2string(int linestyle)
 /*
  * do_pset
  *
+ * Performs the assignment "param = value", where value could be NULL;
+ * for some params that has an effect such as inversion, for others
+ * it does nothing.
+ *
+ * Adjusts the state of the formatting options at *popt.  (In practice that
+ * is always pset.popt, but maybe someday it could be different.)
+ *
+ * If successful and quiet is false, then invokes printPsetInfo() to report
+ * the change.
+ *
+ * Returns true if successful, else false (eg for invalid param or value).
  */
 bool
 do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
@@ -4109,9 +4124,11 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
 	return true;
 }
 
-
+/*
+ * printPsetInfo: print the state of the "param" formatting parameter in popt.
+ */
 static bool
-printPsetInfo(const char *param, struct printQueryOpt *popt)
+printPsetInfo(const char *param, printQueryOpt *popt)
 {
 	Assert(param != NULL);
 
@@ -4292,6 +4309,77 @@ printPsetInfo(const char *param, struct printQueryOpt *popt)
 	return true;
 }
 
+/*
+ * savePsetInfo: make a malloc'd copy of the data in *popt.
+ *
+ * Possibly this should be somewhere else, but it's a bit specific to psql.
+ */
+printQueryOpt *
+savePsetInfo(const printQueryOpt *popt)
+{
+	printQueryOpt *save;
+
+	save = (printQueryOpt *) pg_malloc(sizeof(printQueryOpt));
+
+	/* Flat-copy all the scalar fields, then duplicate sub-structures. */
+	memcpy(save, popt, sizeof(printQueryOpt));
+
+	/* topt.line_style points to const data that need not be duplicated */
+	if (popt->topt.fieldSep.separator)
+		save->topt.fieldSep.separator = pg_strdup(popt->topt.fieldSep.separator);
+	if (popt->topt.recordSep.separator)
+		save->topt.recordSep.separator = pg_strdup(popt->topt.recordSep.separator);
+	if (popt->topt.tableAttr)
+		save->topt.tableAttr = pg_strdup(popt->topt.tableAttr);
+	if (popt->nullPrint)
+		save->nullPrint = pg_strdup(popt->nullPrint);
+	if (popt->title)
+		save->title = pg_strdup(popt->title);
+
+	/*
+	 * footers and translate_columns are never set in psql's print settings,
+	 * so we needn't write code to duplicate them.
+	 */
+	Assert(popt->footers == NULL);
+	Assert(popt->translate_columns == NULL);
+
+	return save;
+}
+
+/*
+ * restorePsetInfo: restore *popt from the previously-saved copy *save,
+ * then free *save.
+ */
+void
+restorePsetInfo(printQueryOpt *popt, printQueryOpt *save)
+{
+	/* Free all the old data we're about to overwrite the pointers to. */
+
+	/* topt.line_style points to const data that need not be duplicated */
+	if (popt->topt.fieldSep.separator)
+		free(popt->topt.fieldSep.separator);
+	if (popt->topt.recordSep.separator)
+		free(popt->topt.recordSep.separator);
+	if (popt->topt.tableAttr)
+		free(popt->topt.tableAttr);
+	if (popt->nullPrint)
+		free(popt->nullPrint);
+	if (popt->title)
+		free(popt->title);
+
+	/*
+	 * footers and translate_columns are never set in psql's print settings,
+	 * so we needn't write code to duplicate them.
+	 */
+	Assert(popt->footers == NULL);
+	Assert(popt->translate_columns == NULL);
+
+	/* Now we may flat-copy all the fields, including pointers. */
+	memcpy(popt, save, sizeof(printQueryOpt));
+
+	/* Lastly, free "save" ... but its sub-structures now belong to popt. */
+	free(save);
+}
 
 static const char *
 pset_bool_string(bool val)
@@ -4339,7 +4427,7 @@ pset_quoted_string(const char *str)
  * output that produces the correct setting when fed back into \pset.
  */
 static char *
-pset_value_string(const char *param, struct pr

Re: SLRU statistics

2020-04-07 Thread Tomas Vondra

On Tue, Apr 07, 2020 at 05:01:37PM +0530, Kuntal Ghosh wrote:

Hello Tomas,

On Thu, Apr 2, 2020 at 5:59 PM Tomas Vondra
 wrote:

Thank you for the patch, pushed.


In SimpleLruReadPage_ReadOnly, we first try to find the SLRU page in
the shared buffer under a shared lock, then conditionally call
SimpleLruReadPage if a read is necessary. IMHO, we should update
hit_count if we find the buffer in SimpleLruReadPage_ReadOnly
directly. Am I missing something?

Attached a patch for the same.



Yes, I think that's correct - without this we fail to account for
(possibly) a quite significant number of hits. Thanks for the report,
I'll get this pushed later today.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: proposal \gcsv

2020-04-07 Thread Pavel Stehule
On Tue, Apr 7, 2020 at 19:27, Tom Lane  wrote:

> Here's a more fully fleshed-out patch, with documentation and some
> test cases.  (0001 patch is identical to last time.)
>
> Considering this is the last day before v13 feature freeze, should
> I push this, or sit on it till v14?  I feel reasonably good that we
> have a nice feature definition here, but it's awfully late in the
> cycle to be designing features.
>

I am for pushing it to v13. This feature should not break anything, and
there is a lot of time to finish the details.

Regards

Pavel



> regards, tom lane
>
>


Re: WIP/PoC for parallel backup

2020-04-07 Thread Robert Haas
On Fri, Apr 3, 2020 at 4:46 AM Asif Rehman  wrote:
> Non-parallel backup already does the early error checking. I only intended
> to make parallel behave the same as non-parallel here. So, I agree with
> you that the behavior of parallel backup should be consistent with the
> non-parallel one.  Please see the code snippet below from
> basebackup.c:sendDir()

Oh, OK. So then we need to preserve that behavior, I think. Sorry, I
didn't realize the check was happening there.

> I am thinking of the following struct for shared state:
>> typedef struct
>> {
>> char backupid[NAMEDATALEN];
>> XLogRecPtr startptr;
>> slock_t lock;
>> int64 throttling_counter;
>> bool backup_started_in_recovery;
>> } BackupSharedState;

Looks broadly reasonable. Can anything other than lock and
throttling_counter change while it's running? If not, how about using
pg_atomic_uint64 for the throttling counter, and dropping lock? If
that gets too complicated it's OK to keep it as you have it.
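
For illustration, a sketch of that variant (same fields as your proposal,
with the counter made atomic; the pg_atomic_* calls are the existing
port/atomics.h primitives):

    typedef struct
    {
        char             backupid[NAMEDATALEN];
        XLogRecPtr       startptr;
        pg_atomic_uint64 throttling_counter;   /* no slock_t needed */
        bool             backup_started_in_recovery;
    } BackupSharedState;

    /* at setup */
    pg_atomic_init_u64(&state->throttling_counter, 0);

    /* in the workers, instead of a spinlock-protected update */
    pg_atomic_fetch_add_u64(&state->throttling_counter, nbytes);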

> The shared state structure entries would be maintained by a shared hash table.
> There will be one structure per parallel backup. Since a single parallel 
> backup
> can engage more than one wal sender, so I think max_wal_senders might be a 
> little
> too much; perhaps max_wal_senders/2 since there will be at least 2 connections
> per parallel backup? Alternatively, we can set a new GUC that defines the 
> maximum
> number of for concurrent parallel backups i.e. ‘max_concurent_backups_allowed 
> = 10’
> perhaps, or we can make it user-configurable.

I don't think you need a hash table. Linear search should be fine. And
I see no point in dividing max_wal_senders by 2 either. The default is
*10*. You'd need to increase that by more than an order of magnitude
for a hash table to be needed, and more than that for the shared
memory consumption to matter.
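
Something like this would do (sketch only; the array, its in_use flag,
and the function name are all hypothetical):

    static BackupSharedState *
    FindBackupState(const char *backupid)
    {
        for (int i = 0; i < max_wal_senders; i++)
        {
            BackupSharedState *state = &BackupStateArray[i];

            if (state->in_use && strcmp(state->backupid, backupid) == 0)
                return state;
        }
        return NULL;
    }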

> The key would be “backupid=hex_encode(pg_random_strong(16))”

wfm

> Progress Reporting:
> Although I think we should add progress-reporting for parallel backup as a
> separate patch. The relevant entries for progress-reporting such as
> ‘backup_total’ and ‘backup_streamed’ would be then added to this structure
> as well.

I mean, you can separate it for review if you wish, but it would need
to be committed together.

> START_BACKUP [LABEL ''] [FAST]
>   - returns startptr, tli, backup_label, unique_backup_id

OK. But what if I want to use this interface for a non-parallel backup?

> STOP_BACKUP [NOWAIT]
>   - returns startptr, tli, backup_label

I don't think it makes sense for STOP_BACKUP to return the same values
that START_BACKUP already returned. Presumably STOP_BACKUP should
return the end LSN. It could also return the backup label and
tablespace map files, as the corresponding SQL function does, unless
there's some better way of returning those in this case.

> JOIN_BACKUP ‘unique_backup_id’
>   - attaches a shared state identified by ‘unique_backup_id’ to a backend 
> process.

OK.

> LIST_TABLESPACES [PROGRESS]

OK.

> LIST_FILES [TABLESPACE]

OK.

> LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X']

Why not just LIST_WAL_FILES 'startptr' 'endptr'?

> SEND_FILE '(' FILE ')' [NOVERIFY_CHECKSUMS]

Why parens? That seems useless.

Maybe it would make sense to have SEND_DATA_FILE 'datafilename' and
SEND_WAL_FILE 'walfilename' as separate commands. But not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP/PoC for parallel backup

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 1:25 PM Asif Rehman  wrote:
> I will; however, parallel backup is already quite a large patch, so I
> think we should first agree on the current work before adding backup
> manifest and progress-reporting support.

It's going to be needed for commit, but it may make sense for us to do
more review of what you've got here before we worry about it.

I'm gonna try to find some time for that as soon as I can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

Thanks for the review!


On 2020-04-07 12:41:07 -0400, Robert Haas wrote:
> In 0002, the comments in SnapshotSet() are virtually incomprehensible.
> There's no commit message so the reasons for the changes are unclear.
> But mostly looks unproblematic.

I was planning to drop that patch pre-commit, at least for now.  I think
there are a few live bugs here, but they're all older. I did send a few
emails about the class of problem; unfortunately it has been a fairly
one-sided conversation so far ;)

https://www.postgresql.org/message-id/20200407072418.ccvnyjbrktyi3rzc%40alap3.anarazel.de


> 0003 looks like a fairly unrelated bug fix that deserves to be
> discussed on the thread related to the original patch. Probably should
> be an open item.

There was some discussion in a separate thread:
https://www.postgresql.org/message-id/20200406025651.fpzdb5yyb7qyhqko%40alap3.anarazel.de
The only reason for including it in this patch stack is that I can't
really exercise the patchset without the fix (it's a bit sad that this
issue went unnoticed for months before I found it as part of the
development of this patch).

Think I'll push a minimal version now, and add an open item.


> 
> Regarding 0005:
> 
> There's sort of a mix of terminology here: are we pruning tuples or
> removing tuples or testing whether things are invisible? It would be
> better to be more consistent.
> 
> + * State for testing whether tuple versions may be removed. To improve
> + * GetSnapshotData() performance we don't compute an accurate value whenever
> + * acquiring a snapshot. Instead we compute boundaries above/below which we
> + * know that row versions are [not] needed anymore.  If at test time values
> + * falls in between the two, the boundaries can be recomputed (unless that
> + * just happened).
> 
> I don't like the wording here much. Maybe: State for testing whether
> an XID is invisible to all current snapshots. If an XID precedes
> maybe_needed_bound, it's definitely not visible to any current
> snapshot. If it equals or follows definitely_needed_bound, that XID
> isn't necessarily invisible to all snapshots. If it falls in between,
> we're not sure. If, when testing a particular tuple, we see an XID
> somewhere in the middle, we can try recomputing the boundaries to get
> a more accurate answer (unless we've just done that). This is cheaper
> than maintaining an accurate value all the time.

I'll incorporate that, thanks.


> There's also the problem that this sorta contradicts the comment for
> definitely_needed_bound. There it says intermediate values needed to
> be tested against the ProcArray, whereas here it says we need to
> recompute the bounds. That's kinda confusing.

For me those are the same: computing an accurate bound means visiting the
procarray. But I'll rephrase.


> ComputedHorizons seems like a fairly generic name. I think there's
> some relationship between InvisibleToEveryoneState and
> ComputedHorizons that should be brought out more clearly by the naming
> and the comments.

I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
much... But I find it hard to come up with something that's meaningfully
better.

I'll add a comment.


> + /*
> + * The value of ShmemVariableCache->latestCompletedFullXid when
> + * ComputeTransactionHorizons() held ProcArrayLock.
> + */
> + FullTransactionId latest_completed;
> +
> + /*
> + * The same for procArray->replication_slot_xmin and.
> + * procArray->replication_slot_catalog_xmin.
> + */
> + TransactionId slot_xmin;
> + TransactionId slot_catalog_xmin;
> 
> Department of randomly inconsistent names. In general I think it's
> quite hard to grasp the relationship between the different fields in
> ComputedHorizons.

What's the inconsistency? The dropped replication_ vs dropped FullXid
postfix?



> +
> +bool
> +GinPageIsRecyclable(Page page)
> 
> Needs a comment. Or more than one.

Well, I started to write one a couple of times. But it's really just
moving the pre-existing code from the macro into a function, and there
weren't any comments around *any* of it before. All my comment attempts
were basically just restating the code in so many words, or would have
required more work than I saw justified in the context of just moving
code.


> - /*
> - * If a transaction wrote a commit record in the gap between taking and
> - * logging the snapshot then latestCompletedXid may already be higher than
> - * the value from the snapshot, so check before we use the incoming value.
> - */
> - if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
> -   running->latestCompletedXid))
> - ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
> -
> - Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
> -
> - LWLockRelease(ProcArrayLock);
> 
> This code got relocated so that the lock is released later, but you
> didn't add any comments explaining why. Somebody will move it back and
> then you'll yell at them for doing it wrong. :-)

Re: SyncRepLock acquired exclusively in default configuration

2020-04-07 Thread Ashwin Agrawal
On Mon, Apr 6, 2020 at 2:14 PM Andres Freund  wrote:

> > How about we change it to this ?
>
> Hm. Better. But I think it might need at least a compiler barrier /
> volatile memory load?  Unlikely here, but otherwise the compiler could
> theoretically just stash the variable somewhere locally (it's not likely
> to be a problem because it'd not be long ago that we acquired an lwlock,
> which is a full barrier).
>

That's the part I am not fully sure about. But reading the comment above
SyncRepUpdateSyncStandbysDefined(), it seems fine.

> Bring back the check which existed based on GUC but instead of just
> blindly
> > returning based on just GUC not being set, check
> > WalSndCtl->sync_standbys_defined. Thoughts?
>
> Hm. Is there any reason not to just check
> WalSndCtl->sync_standbys_defined? rather than both !SyncStandbysDefined()
> and WalSndCtl->sync_standbys_defined?
>

Agree, just checking for WalSndCtl->sync_standbys_defined seems fine.

I wasn't fully thinking there, as I got distracted by whether a lock
would be required for reading it. If a lock were required, then checking
the GUC first would have been better, but it seems one is not required.
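
For the archives, the shape being discussed is roughly this (sketch; the
volatile cast is one conservative way to force a fresh read, per the
barrier concern above):

    /* in SyncRepWaitForLSN(), before touching SyncRepLock */
    if (!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
        return;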


Default setting for enable_hashagg_disk

2020-04-07 Thread Jeff Davis
This is just a placeholder thread for an open item that I'm adding to
the Open Items list. We can make a decision later.

Now that we have Disk-based Hash Aggregation, there are a lot more
situations where the planner can choose HashAgg. The
enable_hashagg_disk GUC, if set to true, chooses HashAgg based on
costing. If false, it only generates a HashAgg path if it thinks it
will fit in work_mem, similar to the old behavior (though it will now
spill to disk if the planner was wrong about it fitting in work_mem).
The current default is true.

I expect this to be a win in a lot of cases, obviously. But as with any
planner change, it will be wrong sometimes. We may want to be
conservative and set the default to false depending on the experience
during beta. I'm inclined to leave it as true for now though, because
that will give us better information upon which to base any decision.

Regards,
Jeff Davis






Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
More review, since it sounds like you like it:

0006 - Boring. But I'd probably make this move both xmin and xid back,
with related comment changes; see also next comment.

0007 -

+ TransactionId xidCopy; /* this backend's xid, a copy of this proc's
+ProcGlobal->xids[] entry. */

Can we please NOT put Copy into the name like that? Pretty please?

+ int pgxactoff; /* offset into various ProcGlobal-> arrays
+ * NB: can change any time unless locks held!
+ */

I'm going to add the helpful comment "NB: can change any time unless
locks held!" to every data structure in the backend that is in shared
memory and not immutable. No need, of course, to mention WHICH
locks...

On a related note, PROC_HDR really, really, really needs comments
explaining the locking regimen for the new xids field.

+ ProcGlobal->xids[pgxactoff] = InvalidTransactionId;

Apparently this array is not dense in the sense that it excludes
unused slots, but comments elsewhere don't seem to entirely agree.
Maybe the comments discussing how it is "dense" need to be a little
more precise about this.

+ for (int i = 0; i < nxids; i++)

I miss my C89. Yeah, it's just me.

- if (!suboverflowed)
+ if (suboverflowed)
+ continue;
+

Do we really need to do this kind of diddling in this patch? I mean
yes to the idea, but no to things that are going to make it harder to
understand what happened if this blows up.

+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;

  /* ProcGlobal */
  size = add_size(size, sizeof(PROC_HDR));
- /* MyProcs, including autovacuum workers and launcher */
- size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
- /* AuxiliaryProcs */
- size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
- /* Prepared xacts */
- size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
- /* ProcStructLock */
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));

This seems like a bad idea. If we establish a precedent that it's OK
to have sizing routines that don't use add_size() and mul_size(),
people are going to cargo cult that into places where there is more
risk of overflow than there is here.

You've got a bunch of different places that talk about the new PGXACT
array and they are somewhat redundant yet without saying exactly the
same thing every time either. I think that needs cleanup.

One thing I didn't see is any clear discussion of what happens if the
two copies of the XID information don't agree with each other. That
should be added someplace, either in an appropriate code comment or in
a README or something. I *think* both are protected by the same locks,
but there's also some unlocked access to those structure members, so
it's not entirely a slam dunk.

...Robert




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 10:51:12 -0700, Andres Freund wrote:
> > +void AssertTransactionIdMayBeOnDisk(TransactionId xid)
> > 
> > Formatting.
> > 
> > + * Assert that xid is one that we could actually see on disk.
> > 
> > I don't know what this means. The whole purpose of this routine is
> > very unclear to me.
> 
> It's intended to be a double check against

forgetting things...? Err:

It is intended to make it easier to detect cases where the passed
TransactionId is not safe against wraparound. If there is protection
against wraparound, then the xid

a) may never be older than ShmemVariableCache->oldestXid (since
   otherwise the rel/datfrozenxid could not have advanced past the xid,
   and because oldestXid is what prevents ->nextFullXid from
   advancing far enough to cause a wraparound)

b) cannot be >= ShmemVariableCache->nextFullXid. If it is, it cannot
   recently have come from GetNewTransactionId(), and thus there is no
   anti-wraparound protection either.

As full wraparounds are painful to exercise in testing,
AssertTransactionIdMayBeOnDisk() is intended to make it easier to detect
potential hazards.
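
In code form, the check amounts to roughly the following (simplified
sketch; the real assertion also has to exempt permanent XIDs and map
nextFullXid down to a 32-bit value carefully):

    static void
    AssertTransactionIdMayBeOnDisk(TransactionId xid)
    {
        /* a) not older than the wraparound-protected horizon */
        Assert(TransactionIdFollowsOrEquals(xid,
                                            ShmemVariableCache->oldestXid));
        /* b) not beyond what GetNewTransactionId() has handed out */
        Assert(TransactionIdPrecedes(xid,
               XidFromFullTransactionId(ShmemVariableCache->nextFullXid)));
    }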

The reason for the *OnDisk naming is that [oldestXid, nextFullXid) is
the appropriate check for values actually stored in tables. There could,
and probably should, be a narrower assertion ensuring that a xid is
protected against being pruned away (i.e. a PGPROC's xmin covering it).

The reason for being concerned enough in the new code to add the new
assertion helper (as well as a major motivating reason for making the
horizons 64 bit xids) is that it's much harder to ensure that "global
xmin" style horizons don't wrap around. By definition they include other
backends' ->xmin, and those can be released without a lock at any
time. As a lot of wraparound issues are triggered by very long-running
transactions, it is not even unlikely to hit such problems: At some
point somebody is going to kill that old backend and ->oldestXid will
advance very quickly.

There is a lot of code that is pretty unsafe around wraparounds... They
are getting easier and easier to hit on a regular schedule in production
(plenty of databases hit wraparounds multiple times a week). And I
think we as PG developers often don't quite take that into
account.


Does that make some sense? Do you have a better suggestion for a name?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Peter Geoghegan
On Tue, Apr 7, 2020 at 11:28 AM Andres Freund  wrote:
> There is a lot of code that is pretty unsafe around wraparounds... They
> are getting easier and easier to hit on a regular schedule in production
> (plenty of databases hit wraparounds multiple times a week). And I
> think we as PG developers often don't quite take that into
> account.

It would be nice if there was high level documentation on wraparound
hazards. Maybe even a dedicated README file.

-- 
Peter Geoghegan




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 2:28 PM Andres Freund  wrote:
> Does that make some sense? Do you have a better suggestion for a name?

I think it makes sense. I have two basic problems with the name. The
first is that "on disk" doesn't seem to be a very clear way of
describing what you're actually checking here, and it definitely
doesn't refer to an existing concept which sophisticated hackers can
be expected to understand. The second is that "may" is ambiguous in
English: it can either mean that something is permissible ("Johnny,
you may go to the bathroom") or that we do not have certain knowledge
of it ("Johnny may be in the bathroom"). When it is followed by "be",
it usually has the latter sense, although there are exceptions (e.g.
"She may be discharged from the hospital today if she wishes, but we
recommend that she stay for another day"). Consequently, I found that
use of "may be" in this context wicked confusing. What came to mind
was:

bool
RobertMayBeAGiraffe(void)
{
return true; // i mean, i haven't seen him since last week, so who knows?
}

So I suggest a name with "Is" or no verb, rather than one with
"MayBe." And I suggest something else instead of "OnDisk," e.g.
AssertTransactionIdIsInUsableRange() or
TransactionIdIsInAllowableRange() or
AssertTransactionIdWraparoundProtected(). I kind of like that last
one, but YMMV.

I wish to clarify that in sending these review emails I am taking no
position on whether or not it is prudent to commit any or all of them.
I do not think we can rule out the possibility that they will Break
Things, but neither do I wish to be seen as That Guy Who Stands In The
Way of Important Improvements. Time does not permit me a detailed
review anyway. So, these comments are provided in the hope that they
may be useful but without endorsement or acrimony. If other people
want to endorse or, uh, acrimoniate, based on my comments, that is up
to them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 1:51 PM Andres Freund  wrote:
> > ComputedHorizons seems like a fairly generic name. I think there's
> > some relationship between InvisibleToEveryoneState and
> > ComputedHorizons that should be brought out more clearly by the naming
> > and the comments.
>
> I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
> much... But I find it hard to come up with something that's meaningfully
> better.

It would help to stick XID in there, like ComputedXIDHorizons. What I
find really baffling is that you seem to have two structures in the
same file that have essentially the same purpose, but the second one
(ComputedHorizons) has a lot more stuff in it. I can't understand why.

> What's the inconsistency? The dropped replication_ vs dropped FullXid
> postfix?

Yeah, just having the member names be randomly different between the
structs. Really harms greppability.

> > Generally, heap_prune_satisfies_vacuum looks pretty good. The
> > limited_oldest_committed naming is confusing, but the comments make it
> > a lot clearer.
>
> I didn't like _committed much either. But couldn't come up with
> something short. _relied_upon?

oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed?

> It's just adjusting for the changed name of latestCompletedXid to
> latestCompletedFullXid, as part of widening it to 64bits.  I'm not
> really a fan of adding that to the variable name, but surrounding code
> already did it (cf VariableCache->nextFullXid), so I thought I'd follow
> suit.

Oops, that was me misreading the diff. Sorry for the noise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 14:28:09 -0400, Robert Haas wrote:
> More review, since it sounds like you like it:
>
> 0006 - Boring. But I'd probably make this move both xmin and xid back,
> with related comment changes; see also next comment.
>
> 0007 -
>
> + TransactionId xidCopy; /* this backend's xid, a copy of this proc's
> +ProcGlobal->xids[] entry. */
>
> Can we please NOT put Copy into the name like that? Pretty please?

Do you have a suggested naming scheme? Something indicating that it's
not the only place that needs to be updated?


> + int pgxactoff; /* offset into various ProcGlobal-> arrays
> + * NB: can change any time unless locks held!
> + */
>
> I'm going to add the helpful comment "NB: can change any time unless
> locks held!" to every data structure in the backend that is in shared
> memory and not immutable. No need, of course, to mention WHICH
> locks...

I think it's more on-point here, because we need to hold either of the
locks* even for a backend's own status, which one could reasonably
expect to be safe to at least inspect. E.g. looking at
ProcGlobal->xids[MyProc->pgxactoff]
doesn't look suspicious, but it could very well return another backend's
xid if neither ProcArrayLock nor XidGenLock is held (because a
ProcArrayRemove() could have changed pgxactoff when a previous entry was
removed).

*see comment at PROC_HDR:

 *
 * Adding/Removing an entry into the procarray requires holding *both*
 * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
 * needed because the dense arrays (see below) are accessed from
 * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
 * further contention by both using one lock. Adding/Removing a procarray
 * entry is much less frequent.
 */
typedef struct PROC_HDR
{
	/* Array of PGPROC structures (not including dummies for prepared txns) */
	PGPROC	   *allProcs;

	/*
	 * Arrays with per-backend information that is hotly accessed, indexed by
	 * PGPROC->pgxactoff. These are in separate arrays for three reasons:
	 * First, to allow for as tight loops accessing the data as possible.
	 * Second, to prevent updates of frequently changing data from
	 * invalidating cachelines shared with less frequently changing data.
	 * Third to condense frequently accessed data into as few cachelines as
	 * possible.
	 *
	 * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), those
	 * copies are used to provide the contents of the dense data, and will be
	 * transferred by ProcArrayAdd() while it already holds ProcArrayLock.
	 */

there's also

 * The various *Copy fields are copies of the data in ProcGlobal arrays that
 * can be accessed without holding ProcArrayLock / XidGenLock (see PROC_HDR
 * comments).


I had a more explicit warning/explanation about the dangers of accessing
the arrays without locks, but apparently went too far when reducing
duplicated comments.
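
To make the hazard concrete (illustrative sketch):

    /* not safe: pgxactoff may change under us */
    xid = ProcGlobal->xids[MyProc->pgxactoff];

    /* safe: the offset cannot change while either lock is held */
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    xid = ProcGlobal->xids[MyProc->pgxactoff];
    LWLockRelease(ProcArrayLock);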


> On a related note, PROC_HDR really, really, really needs comments
> explaining the locking regimen for the new xids field.


I'll expand the above, in particular highlighting the danger of
pgxactoff changing.


> + ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
>
> Apparently this array is not dense in the sense that it excludes
> unused slots, but comments elsewhere don't seem to entirely agree.

What do you mean by "unused slots"? Backends that committed?

Dense is intended to describe that the arrays only contain currently
"live" entries. I.e. that the first procArray->numProcs entries in each
array have the data for all procs (including prepared xacts) that are
"connected".  This is extending the concept that already existed for
procArray->pgprocnos.

Whereas the PGPROC/PGXACT arrays have "unused" entries interspersed.

This is what previously led to the slow loop in GetSnapshotData(),
where we had to iterate over PGXACTs via an indirection in
procArray->pgprocnos. I.e. to only look at in-use PGXACTs we had to go
through allProcs[arrayP->pgprocnos[i]], which is, uh, suboptimal for
a performance-critical routine holding a central lock.
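
In code, the difference is roughly (sketch):

    /* before: an indirection on every iteration */
    for (int i = 0; i < arrayP->numProcs; i++)
    {
        int     pgprocno = arrayP->pgprocnos[i];
        PGXACT *pgxact = &allPgXact[pgprocno];
        TransactionId xid = pgxact->xid;
        /* ... */
    }

    /* after: one tight scan over a contiguous, dense xid array */
    for (int i = 0; i < arrayP->numProcs; i++)
    {
        TransactionId xid = ProcGlobal->xids[i];
        /* ... */
    }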

I'll try to expand the comments around "dense"; let me know if you have
a better descriptor.


> Maybe the comments discussing how it is "dense" need to be a little
> more precise about this.
>
> + for (int i = 0; i < nxids; i++)
>
> I miss my C89. Yeah, it's just me.

Oh, dear god. I hate declaring variables like 'i' at function scope. The
bug that haunted me the longest in the development of this patch was in
XidCacheRemoveRunningXids, where there are both i and j, and a macro
XidCacheRemove(i), but the macro gets passed j as i.


> - if (!suboverflowed)
> + if (suboverflowed)
> + continue;
> +
>
> Do we really need to do this kind of diddling in this patch? I mean
> yes to the idea, but no to things that are going to make it harder to
> understand what happened if this blows up.

I can try to reduce those

Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
0008 -

Here again, I greatly dislike putting Copy in the name. It makes
little sense to pretend that either is the original and the other is
the copy. You just have the same data in two places. If one of them is
more authoritative, the place to explain that is in the comments, not
by elongating the structure member name and supposing anyone will be
able to make something of that.

+ *
+ * XXX: That's why this is using vacuumFlagsCopy.

I am not sure there's any problem with the code that needs fixing
here, so I might think about getting rid of this XXX. But this gets
back to my complaint about the locking regime being unclear. What I
think you need to do here is rephrase the previous paragraph so that
it explains the reason for using this copy a bit better. Like "We read
the copy of vacuumFlags from PGPROC rather than visiting the copy
attached to ProcGlobal because we can do that without taking a lock.
See fuller explanation in ." Or whatever.

0009, 0010 -

I think you've got this whole series of things divided up too finely.
Like, 0005 feels like the meat of it, and that has a bunch of things
in it that could plausibly be separated out as separate commits. 0007
also seems to do more than one kind of thing (see my comment regarding
moving some of that into 0006). But whacking everything around like a
crazy man in 0005 and a little more in 0007 and then doing the
following cleanup in these little tiny steps seems pretty lame.
Separating 0009 from 0010 is maybe the clearest example of that, but
IMHO it's pretty unclear why both of these shouldn't be merged with
0008.

To be clear, I exaggerate for effect. 0005 is not whacking everything
around like a crazy man. But it is a non-minimal patch, whereas I
consider 0009 and 0010 to be sub-minimal.

My comments on the Copy naming apply here as well. I am also starting
to wonder why exactly we need two copies of all this stuff. Perhaps
I've just failed to absorb the idea for having read the patch too
briefly, but I think that we need to make sure that it's super-clear
why we're doing that. If we just needed it for one field because
$REASONS, that would be one thing, but if we need it for all of them
then there must be some underlying principle here that needs a good
explanation in an easy-to-find and centrally located place.

0011 -

+ * Number of top-level transactions that completed in some form since the
+ * start of the server. This currently is solely used to check whether
+ * GetSnapshotData() needs to recompute the contents of the snapshot, or
+ * not. There are likely other users of this.  Always above 1.

Does it only count XID-bearing transactions? If so, best mention that.
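
For what it's worth, the intended use seems to be roughly this
(hypothetical sketch; the snapshot-side field name is invented):

    /* in GetSnapshotData(), before rebuilding the snapshot contents */
    if (snapshot->snapXactCompletionCount ==
        ShmemVariableCache->xactCompletionCount)
        return snapshot;    /* nothing completed since last time: reuse */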

+ * transactions completed since the last GetSnapshotData()..

Too many periods.

+ /* Same with CSN */
+ ShmemVariableCache->xactCompletionCount++;

If I didn't know that CSN stood for commit sequence number from
reading years of mailing list traffic, I'd be lost here. So I think
this comment shouldn't use that term.

+GetSnapshotDataFillTooOld(Snapshot snapshot)

Uh... no clue what's going on here. Granted the code had no comments
in the old place either, so I guess it's not worse, but even the name
of the new function is pretty incomprehensible.

+ * Helper function for GetSnapshotData() that check if the bulk of the

checks

+ * the fields that need to change and returns true. false is returned
+ * otherwise.

Otherwise, it returns false.

+ * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go

I know what it means to re-enter a building, but I don't know what it
means to re-enter the snapshot's xmin.

This whole comment seems a bit murky.

...Robert




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 14:51:52 -0400, Robert Haas wrote:
> On Tue, Apr 7, 2020 at 2:28 PM Andres Freund  wrote:
> > Does that make some sense? Do you have a better suggestion for a name?
> 
> I think it makes sense. I have two basic problems with the name. The
> first is that "on disk" doesn't seem to be a very clear way of
> describing what you're actually checking here, and it definitely
> doesn't refer to an existing concept which sophisticated hackers can
> be expected to understand. The second is that "may" is ambiguous in
> English: it can either mean that something is permissible ("Johnny,
> you may go to the bathroom") or that we do not have certain knowledge
> of it ("Johnny may be in the bathroom"). When it is followed by "be",
> it usually has the latter sense, although there are exceptions (e.g.
> "She may be discharged from the hospital today if she wishes, but we
> recommend that she stay for another day"). Consequently, I found that
> use of "may be" in this context wicked confusing.

Well, it *is* only a vague test :). It shouldn't ever have a false
positive, but there's plenty of room for false negatives (if wrapped
around far enough).


> So I suggest a name with "Is" or no verb, rather than one with
> "MayBe." And I suggest something else instead of "OnDisk," e.g.
> AssertTransactionIdIsInUsableRange() or
> TransactionIdIsInAllowableRange() or
> AssertTransactionIdWraparoundProtected(). I kind of like that last
> one, but YMMV.

Makes sense - but they all seem to express a bit more certainty than I
think the test actually provides.

I explicitly did not want to have (and added a comment to that effect)
something like TransactionIdIsInAllowableRange(), because there can never
be a safe use of its return value, as far as I can tell.

The "OnDisk" was intended to clarify that the range it verifies is
whether it'd be ok for the xid to have been found in a relation.

Greetings,

Andres Freund




Re: Index Skip Scan

2020-04-07 Thread Dmitry Dolgov
> On Mon, Apr 06, 2020 at 06:31:08PM +0000, Floris Van Nee wrote:
>
> > Hm, I wasn't aware about this one, thanks for bringing this up. Btw,
> > Floris, I would appreciate if in the future you can make it more
> > visible that changes you suggest contain some fixes. E.g. it wasn't
> > clear for me from your previous email that that's the case, and it
> > doesn't make sense to pull into different direction when we're trying
> > to achieve the same goal :)
>
> I wasn't aware that this particular case could be triggered before I saw
> Dilip's email, otherwise I'd have mentioned it here of course. It's just
> that because my patch handles filter conditions in general, it works for
> this case too.

Oh, then fortunately I just had the wrong impression, sorry, and thanks
for the clarification :)

> > > > In the patch I posted a week ago these cases are all handled
> > > > correctly, as it introduces this extra logic in the Executor.
> > >
> > > Okay, So I think we can merge those fixes in Dmitry's patch set.
> >
> > I'll definitely take a look at suggested changes in filtering part.
>
> It may be possible to just merge the filtering part into your patch, but I'm 
> not entirely sure. Basically you have to pull the information about skipping 
> one level up, out of the node, into the generic IndexNext code.

I was actually thinking more about just preventing skip scan in this
situation, which, if I'm not mistaken, could be solved by inspecting
qual conditions to figure out if they're covered by the index -
something like in the attachments (this implementation is actually too
restrictive in this sense and will not allow e.g. expressions; that's
why I haven't bumped the patch set version for it - soon I'll post an
extended version).
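
Roughly this kind of check, as a sketch (simplified: it ignores
expression index columns, which is exactly why expressions aren't
allowed in this form, and it doesn't verify varno):

    static bool
    quals_covered_by_index(IndexOptInfo *index, List *quals)
    {
        /* assumes no aggregates/placeholders appear in index quals */
        List       *vars = pull_var_clause((Node *) quals, 0);
        ListCell   *lc;

        foreach(lc, vars)
        {
            Var        *var = (Var *) lfirst(lc);
            bool        found = false;

            for (int i = 0; i < index->nkeycolumns; i++)
            {
                if (index->indexkeys[i] == var->varattno)
                {
                    found = true;
                    break;
                }
            }

            if (!found)
                return false;
        }

        return true;
    }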

Other than that, to summarize the current open points for future readers
(this thread somehow became quite big):

* Making UniqueKeys usage more generic to allow using skip scan for more
  use cases (hopefully it was covered by the v33, but I still need a
  confirmation from David, like blinking twice or something).

* Suspicious performance difference between different type of workload,
  mentioned by Tomas (unfortunately I had no chance yet to investigate).

* Thinking about supporting conditions that are not covered by the index,
  to make skipping more flexible (one of the potential next steps in the
  future, as suggested by Floris).
From 15989c5250214fea8606a56afd1eeaf760b8723e Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 24 Mar 2020 17:04:32 +0100
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c   |  14 +++
 src/backend/nodes/print.c  |  39 +++
 src/backend/optimizer/path/Makefile|   3 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/indxpath.c  |  41 
 src/backend/optimizer/path/pathkeys.c  |  71 +++--
 src/backend/optimizer/path/uniquekey.c | 136 +
 src/backend/optimizer/plan/planagg.c   |   1 +
 src/backend/optimizer/plan/planmain.c  |   1 +
 src/backend/optimizer/plan/planner.c   |  37 ++-
 src/backend/optimizer/util/pathnode.c  |  46 +++--
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/pathnodes.h  |  19 
 src/include/nodes/print.h  |   1 +
 src/include/optimizer/pathnode.h   |   2 +
 src/include/optimizer/paths.h  |  11 ++
 16 files changed, 408 insertions(+), 23 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)

Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 15:03:46 -0400, Robert Haas wrote:
> On Tue, Apr 7, 2020 at 1:51 PM Andres Freund  wrote:
> > > ComputedHorizons seems like a fairly generic name. I think there's
> > > some relationship between InvisibleToEveryoneState and
> > > ComputedHorizons that should be brought out more clearly by the naming
> > > and the comments.
> >
> > I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
> > much... But I find it hard to come up with something that's meaningfully
> > better.
> 
> It would help to stick XID in there, like ComputedXIDHorizons. What I
> find really baffling is that you seem to have two structures in the
> same file that have essentially the same purpose, but the second one
> (ComputedHorizons) has a lot more stuff in it. I can't understand why.

ComputedHorizons are the various "accurate" horizons computed by
ComputeTransactionHorizons(). That's used to determine a horizon for
vacuuming (via GetOldestVisibleTransactionId()) and other similar use
cases.

The various InvisibleToEveryoneState variables contain the boundary
based horizons, and are updated / initially filled by
GetSnapshotData(). When a tested value falls between the boundaries,
we update the approximate boundaries using
ComputeTransactionHorizons(). That briefly makes the boundaries in
the InvisibleToEveryoneState accurate - but future GetSnapshotData()
calls will increase the definitely_needed_bound (if transactions
committed since).
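
In rough pseudo-C, the way the boundaries are meant to be used (all
names here illustrative, not the patch's):

    if (TransactionIdPrecedes(xid, state->definitely_removable_bound))
        return true;            /* cheap boundary-based answer */
    if (TransactionIdFollowsOrEquals(xid, state->definitely_needed_bound))
        return false;

    /* in between: pay the cost of ComputeTransactionHorizons() once */
    RefreshBoundsFromAccurateHorizons(state);   /* hypothetical helper */
    return TransactionIdPrecedes(xid, state->definitely_removable_bound);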

The ComputedHorizons fields could instead just be pointer based
arguments to ComputeTransactionHorizons(), but that seems clearly
worse.

I'll change ComputedHorizons's comment to say that it's the result of
ComputeTransactionHorizons(), not the "state".


> > What's the inconsistency? The dropped replication_ vs dropped FullXid
> > postfix?
> 
> Yeah, just having the member names be randomly different between the
> structs. Really harms greppability.

The long names make it hard to keep line lengths under control, in
particular when also involving the long macro names for TransactionId /
FullTransactionId comparators...


> > > Generally, heap_prune_satisfies_vacuum looks pretty good. The
> > > limited_oldest_committed naming is confusing, but the comments make it
> > > a lot clearer.
> >
> > I didn't like _committed much either. But couldn't come up with
> > something short. _relied_upon?
> 
> oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed?

Will go for old_snapshot_limit_used, and rename the other variables to
old_snapshot_limit_ts, old_snapshot_limit_xmin.

Greetings,

Andres Freund




Re: Binary support for pgoutput plugin

2020-04-07 Thread Dave Cramer
On Fri, 3 Apr 2020 at 16:44, Dave Cramer  wrote:

>
>
> On Fri, 3 Apr 2020 at 03:43, Petr Jelinek  wrote:
>
>> Hi,
>>
>> On 08/03/2020 00:18, Dave Cramer wrote:
>> > On Fri, 6 Mar 2020 at 08:54, Petr Jelinek wrote:
>> >
>> >     Hi Dave,
>> >
>> >     On 29/02/2020 18:44, Dave Cramer wrote:
>> >      >
>> >      >
>> >      > rebased and removed the catversion bump.
>> >
>> >     Looked into this and it generally seems okay, but I do have one
>> >     gripe here:
>> >
>> >      > +        tuple->values[i].data = palloc(len + 1);
>> >      > +        /* and data */
>> >      > +        pq_copymsgbytes(in, tuple->values[i].data, len);
>> >      > +        tuple->values[i].len = len;
>> >      > +        tuple->values[i].cursor = 0;
>> >      > +        tuple->values[i].maxlen = len;
>> >      > +        /* not strictly necessary but the docs say it is required */
>> >      > +        tuple->values[i].data[len] = '\0';
>> >      > +        break;
>> >      > +    }
>> >      > +    case 't':   /* text formatted value */
>> >      > +    {
>> >      > +        tuple->changed[i] = true;
>> >      > +        int len = pq_getmsgint(in, 4);  /* read length */
>> >      >
>> >      >          /* and data */
>> >      > -        tuple->values[i] = palloc(len + 1);
>> >      > -        pq_copymsgbytes(in, tuple->values[i], len);
>> >      > -        tuple->values[i][len] = '\0';
>> >      > +        tuple->values[i].data = palloc(len + 1);
>> >      > +        pq_copymsgbytes(in, tuple->values[i].data, len);
>> >      > +        tuple->values[i].data[len] = '\0';
>> >      > +        tuple->values[i].len = len;
>> >
>> >     The cursor should be set to 0 in the text formatted case too if this is
>> >     how we chose to encode data.
>> >
>> >     However I am not quite convinced I like the StringInfoData usage here.
>> >     Why not just change the struct to include additional array of lengths
>> >     rather than replacing the existing values array with StringInfoData
>> >     array, that seems generally both simpler and should have smaller memory
>> >     footprint too, no?
>> >
>> > Can you explain this a bit more? I don't really see a huge difference in
>> > memory usage.
>> > We still need length and the data copied into LogicalRepTupleData when
>> > sending the data in binary, no?
>> >
>>
>> Yes but we don't need to have fixed sized array of 1600 elements that
>> contains maxlen and cursor positions of the StringInfoData struct which
>> we don't use for anything AFAICS.
>>
>
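To spell out the two layouts being compared (a sketch only: field names
assumed, with MaxTupleAttributeNumber standing in for the fixed array
size):

    /* as in the posted patch: a full StringInfoData per column */
    typedef struct LogicalRepTupleData
    {
        StringInfoData values[MaxTupleAttributeNumber]; /* data, len, maxlen, cursor */
        bool           changed[MaxTupleAttributeNumber];
    } LogicalRepTupleData;

    /* suggested alternative: plain pointers plus a parallel lengths array */
    typedef struct LogicalRepTupleDataAlt
    {
        char       *values[MaxTupleAttributeNumber];
        int         lengths[MaxTupleAttributeNumber];
        bool        changed[MaxTupleAttributeNumber];
    } LogicalRepTupleDataAlt;

The second variant drops the per-column maxlen/cursor bookkeeping that
nothing reads.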
New patch that fixes a number of errors in the check for validity,
reduces the memory usage by dynamically allocating the changed data, and
collapses the changed and binary arrays into a single format array.

Dave Cramer



0001-First-pass-at-working-code-without-subscription-opti.patch
Description: Binary data


0002-add-binary-column-to-pg_subscription.patch
Description: Binary data


0003-document-new-binary-option-for-CREATE-SUBSCRIPTION.patch
Description: Binary data


0005-Remove-C99-declaration-and-code.patch
Description: Binary data


0004-get-relid-inside-of-logical_read_insert.patch
Description: Binary data


0006-make-sure-the-subcription-is-marked-as-binary.patch
Description: Binary data


0007-check-that-the-subscriber-is-compatible-with-the-pub.patch
Description: Binary data


0008-Changed-binary-and-changed-to-format-and-use-existin.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-07 Thread Anna Akenteva

On 2020-04-07 12:58, Amit Kapila wrote:


Review comments:
1.
+static void
+DeleteEvent(void)
I don't see how this is implemented or called to handle any errors.

2.
+ if (InterruptPending)
+ {
+ DeleteEvent();
+ ProcessInterrupts();
+ }
We generally do this type of handling via CHECK_FOR_INTERRUPTS.

3.
+int
+WaitUtility(XLogRecPtr target_lsn, const float8 secs)
Isn't it better to have a return value as bool?

4.
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)
This same define is used elsewhere in the code as well; maybe we can
define it in some central place and use it.


Thank you for your review!
Ivan and I have worked on the patch and tried to address your comments:

0. Now we wake up at least every 100ms to check for interrupts (sketched
below, after these lists).
1. Now we call DeleteWaitedLSN() from
ProcessInterrupts()=>LockErrorCleanup(). It seems that we can only exit
the WAIT cycle improperly due to interrupts, so this should be enough (?)

2. Now we use CHECK_FOR_INTERRUPTS() instead of ProcessInterrupts()
3. Now WaitUtility() returns bool rather than int
4. Now GetNowFloat() is only defined at one place in the code

What we changed additionally:
- Prohibited using WAIT FOR LSN on master
- Added more tests
- Checked the code with pgindent and adjusted pgindent/typedefs.list
- Changed min_lsn's type to pg_atomic_uint64 + fixed how we work with 
mutex
- Code cleanup in wait.[c|h]: cleaned up #include-s, gave better names 
to functions, changed elog() to ereport()
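
A minimal sketch of the shape of the waiting loop from point 0 (the
helper names and the wait event are placeholders, not the actual patch
code):

    for (;;)
    {
        if (TargetLSNWasReplayed(target_lsn))   /* hypothetical helper */
            break;
        if (TimeoutExpired(deadline))           /* hypothetical helper */
            return false;

        /* sleep at most 100ms so pending interrupts are noticed promptly */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         100L, PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }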


--
Anna Akenteva
Postgres Professional:
The Russian Postgres Company
http://www.postgrespro.com

diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index c23bbfb4e71..19a33d7d8fb 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ WAIT FOR LSN lsn_value [TIMEOUT number_of_milliseconds ] ]
 
 where transaction_mode is one of:
 
@@ -63,6 +63,17 @@ BEGIN [ WORK | TRANSACTION ] [ transaction_mode
was executed.
   
+
+  
+   The WAIT FOR clause allows to wait for the target log
+   sequence number (LSN) to be replayed on standby before
+   starting the transaction in PostgreSQL databases
+   with master-standby asynchronous replication. Wait time can be limited by
+   specifying a timeout, which is measured in milliseconds and must be a positive
+   integer. If LSN was not reached before timeout, transaction
+   doesn't begin. Waiting can be interrupted using Ctrl+C, or
+   by shutting down the postgres server.
+  
  
 
  
@@ -146,6 +157,10 @@ BEGIN;
different purpose in embedded SQL. You are advised to be careful
about the transaction semantics when porting database applications.
   
+
+  
+   There is no WAIT FOR clause in the SQL standard.
+  
  
 
  
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index d6cd1d41779..c9f70d2709a 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] [ WAIT FOR LSN lsn_value [TIMEOUT number_of_milliseconds ] ]
 
 where transaction_mode is one of:
 
@@ -40,6 +40,17 @@ START TRANSACTION [ transaction_mode was executed. This is the same
as the  command.
   
+
+  
+   The WAIT FOR clause allows to wait for the target log
+   sequence number (LSN) to be replayed on standby before
+   starting the transaction in PostgreSQL databases
+   with master-standby asynchronous replication. Wait time can be limited by
+   specifying a timeout, which is measured in milliseconds and must be a positive
+   integer. If LSN was not reached before timeout, transaction
+   doesn't begin. Waiting can be interrupted using Ctrl+C, or
+   by shutting down the postgres server.
+  
  
 
  
@@ -78,6 +89,10 @@ START TRANSACTION [ transaction_mode
 
+  
+   There is no WAIT FOR clause in the SQL standard.
+  
+
   
See also the compatibility section of .
   
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index abf954ba392..4c7eb0cb219 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_database.h"
 #include "commands/progress.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/controldata_utils.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
@@ -7332,6 +7333,15 @@ StartupXLOG(void)
 	break;
 }
 
+/*
+ * If we replayed an LSN that someone was waiting for,
+ * set latches in shared memory array to notify the waiter.
+ */
+if (XLogCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+{
+	WaitSetLatch(XLogCtl->lastReplayedEndRecPtr);
+}
+
 /* Else, try to fetch the next WAL record */
 record = Re

Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 3:31 PM Andres Freund  wrote:
> Well, it *is* only a vague test :). It shouldn't ever have a false
> positive, but there's plenty chance for false negatives (if wrapped
> around far enough).

Sure, but I think you get my point. Asserting that something "might
be" true isn't much of an assertion. Saying that it's in the correct
range is not to say there can't be a problem - but we're saying that
it IS in the expected range, not that it may or may not be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BUG #16346: pg_upgrade fails on a trigger with a comment

2020-04-07 Thread Hamid Akhtar
As you have mentioned, I have verified that indeed commit 4c40b27b broke
this.

This particular commit moves restoration of materialized views and event
triggers to the last phase. Perhaps comments should also be moved to this
phase, as there may be comments on either of these types of objects.

Attached is a patch that resolves this issue. I've verified that it
resolves the upgrade (and restore) issue introduced by 4c40b27b. I'll
test this patch in a little more detail tomorrow.

On Mon, Apr 6, 2020 at 8:26 PM PG Bug reporting form 
wrote:

> The following bug has been logged on the website:
>
> Bug reference:  16346
> Logged by:  Alexander Lakhin
> Email address:  exclus...@gmail.com
> PostgreSQL version: 12.2
> Operating system:   Ubuntu 18.04
> Description:
>
> When using pg_upgrade on a database with the following contents:
> CREATE FUNCTION public.test_event_trigger() RETURNS event_trigger
> LANGUAGE plpgsql
> AS $$
> BEGIN
> RAISE NOTICE 'test_event_trigger: % %', tg_event, tg_tag;
> END
> $$;
>
> CREATE EVENT TRIGGER regress_event_trigger3 ON ddl_command_start
>EXECUTE PROCEDURE public.test_event_trigger();
>
> COMMENT ON EVENT TRIGGER regress_event_trigger3 IS 'test comment';
>
> I get:
> Restoring global objects in the new cluster ok
> Restoring database schemas in the new cluster
>   postgres
> *failure*
>
> Consult the last few lines of "pg_upgrade_dump_14174.log" for
> the probable cause of the failure.
> Failure, exiting
>
> pg_upgrade_dump_14174.log contains:
> command: "/src/postgres/tmp_install/usr/local/pgsql/bin/pg_restore" --host
> /src/postgres --port 50432 --username postgres --clean --create
> --exit-on-error --verbose --dbname template1 "pg_upgrade_dump_14174.custom"
> >> "pg_upgrade_dump_14174.log" 2>&1
> pg_restore: connecting to database for restore
> pg_restore: dropping DATABASE PROPERTIES postgres
> pg_restore: dropping DATABASE postgres
> pg_restore: creating DATABASE "postgres"
> pg_restore: connecting to new database "postgres"
> pg_restore: connecting to database "postgres" as user "postgres"
> pg_restore: creating COMMENT "DATABASE "postgres""
> pg_restore: creating DATABASE PROPERTIES "postgres"
> pg_restore: connecting to new database "postgres"
> pg_restore: connecting to database "postgres" as user "postgres"
> pg_restore: creating pg_largeobject "pg_largeobject"
> pg_restore: creating FUNCTION "public.test_event_trigger()"
> pg_restore: creating COMMENT "EVENT TRIGGER "regress_event_trigger3""
> pg_restore: while PROCESSING TOC:
> pg_restore: from TOC entry 3705; 0 0 COMMENT EVENT TRIGGER
> "regress_event_trigger3" postgres
> pg_restore: error: could not execute query: ERROR:  event trigger
> "regress_event_trigger3" does not exist
> Command was: COMMENT ON EVENT TRIGGER "regress_event_trigger3" IS 'test
> comment';
>
> It looks like the commit 4c40b27b broke this.
>
>

-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


upgrade_comment_fix_bug16346.patch
Description: Binary data


Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 3:24 PM Andres Freund  wrote:
> > 0007 -
> >
> > + TransactionId xidCopy; /* this backend's xid, a copy of this proc's
> > +ProcGlobal->xids[] entry. */
> >
> > Can we please NOT put Copy into the name like that? Pretty please?
>
> Do you have a suggested naming scheme? Something indicating that it's
> not the only place that needs to be updated?

I don't think trying to indicate that in the structure member names is
a useful idea. I think you should give them the same names, maybe with
an "s" to pluralize the copy hanging off of ProcGlobal, and put a
comment that says something like:

We keep two copies of each of the following three fields. One copy is
here in the PGPROC, and the other is in a more densely-packed array
hanging off of PGXACT. Both copies of the value must always be updated
at the same time and under the same locks, so that it is always the
case that MyProc->xid == ProcGlobal->xids[MyProc->pgprocno] and
similarly for vacuumFlags and WHATEVER. Note, however, that the arrays
attached to ProcGlobal only contain entries for PGPROC structures that
are currently part of the ProcArray (i.e. there is currently a backend
for that PGPROC). We use those arrays when STUFF and the copies in the
individual PGPROC when THINGS.
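
As an illustration of that invariant (a sketch only; pgxactoff is the
index the patch's comments use):

    /* both copies change together, under the same lock */
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    MyProc->xid = newXid;
    ProcGlobal->xids[MyProc->pgxactoff] = newXid;
    LWLockRelease(ProcArrayLock);

    /* ...so, for any PGPROC currently in the ProcArray: */
    Assert(MyProc->xid == ProcGlobal->xids[MyProc->pgxactoff]);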

> I think it's more on-point here, because we need to hold either of the
> locks* even, for changes to a backend's own status that one reasonably
> could expect would be safe to at least inspect.

It's just too brief and obscure to be useful.

> > + ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
> >
> > Apparently this array is not dense in the sense that it excludes
> > unused slots, but comments elsewhere don't seem to entirely agree.
>
> What do you mean with "unused slots"? Backends that committed?

Backends that have no XID. You mean, I guess, that it is "dense" in
the sense that only live backends are in there, not "dense" in the
sense that only active write transactions are in there. It would be
nice to nail that down better; the wording I suggested above might
help.

> > + uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + 
> > max_prepared_xacts;
> >
> >   /* ProcGlobal */
> >   size = add_size(size, sizeof(PROC_HDR));
> > - /* MyProcs, including autovacuum workers and launcher */
> > - size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
> > - /* AuxiliaryProcs */
> > - size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
> > - /* Prepared xacts */
> > - size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
> > - /* ProcStructLock */
> > + size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
> >
> > This seems like a bad idea. If we establish a precedent that it's OK
> > to have sizing routines that don't use add_size() and mul_size(),
> > people are going to cargo cult that into places where there is more
> > risk of overflow than there is here.
>
> Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
> would overflow due to too big MaxBackends or max_prepared_xacts? The
> multiplication itself is still protected by add_size(). It didn't seem
> correct to use add_size for the TotalProcs addition, since that's not
> really a size. And since the limit for procs is much lower than
> UINT32_MAX...

I'm concerned that there are 0 uses of add_size in any shared-memory
sizing function, and I think it's best to keep it that way. If you
initialize TotalProcs = add_size(MaxBackends,
add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts)) then I'm happy. I
think it's a desperately bad idea to imagine that we can dispense with
overflow checks here and be safe. It's just too easy for that to
become false due to future code changes, or get copied to other places
where it's unsafe now.
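
Spelled out, that would be something like:

    Size        TotalProcs = add_size(MaxBackends,
                                      add_size(NUM_AUXILIARY_PROCS,
                                               max_prepared_xacts));

    size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));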

> > You've got a bunch of different places that talk about the new PGXACT
> > array and they are somewhat redundant yet without saying exactly the
> > same thing every time either. I think that needs cleanup.
>
> Could you point out a few of those comments, I'm not entirely sure which
> you're talking about?

+ /*
+ * Also allocate a separate arrays for data that is frequently (e.g. by
+ * GetSnapshotData()) accessed from outside a backend.  There is one entry
+ * in each for every *live* PGPROC entry, and they are densely packed so
+ * that the first procArray->numProc entries are all valid.  The entries
+ * for a PGPROC in those arrays are at PGPROC->pgxactoff.
+ *
+ * Note that they may not be accessed without ProcArrayLock held! Upon
+ * ProcArrayRemove() later entries will be moved.
+ *
+ * These are separate from the main PGPROC array so that the most heavily
+ * accessed data is stored contiguously in memory in as few cache lines as
+ * possible. This provides significant performance benefits, especially on
+ * a multiprocessor system.
+ */

+ * Arrays with per-backend information that is hotly accessed, indexed by
+ * PGPROC->pgxactoff. These are in separate arrays for three reasons:
+ * First, to 

RE: Index Skip Scan

2020-04-07 Thread Floris Van Nee
> 
> * Suspicious performance difference between different type of workload,
>   mentioned by Tomas (unfortunately I had no chance yet to investigate).
> 

His benchmark results indeed most likely point to multiple comparisons being 
done. Since the most likely place where these occur is _bt_readpage, I suspect 
this is called multiple times. Looking at your patch, I think that's indeed the 
case. For example, suppose a page contains [1,2,3,4,5] and the planner makes a 
complete misestimation and chooses a skip scan here. First call to _bt_readpage 
will compare every tuple on the page already and store everything in the 
workspace, which will now contain [1,2,3,4,5]. However, when a skip is done the 
elements on the page (not the workspace) are compared to find the next one. 
Then, another _bt_readpage is done, starting at the new offnum. So we'll 
compare every tuple (except 1) on the page again. Workspace now contains 
[2,3,4,5]. Next tuple we'll end up with [3,4,5] etc. So tuple 5 actually gets 
compared 5 times in _bt_readpage alone.

-Floris





Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 15:26:36 -0400, Robert Haas wrote:
> 0008 -
> 
> Here again, I greatly dislike putting Copy in the name. It makes
> little sense to pretend that either is the original and the other is
> the copy. You just have the same data in two places. If one of them is
> more authoritative, the place to explain that is in the comments, not
> by elongating the structure member name and supposing anyone will be
> able to make something of that.

Ok.



> 0009, 0010 -
> 
> I think you've got this whole series of things divided up too finely.
> Like, 0005 feels like the meat of it, and that has a bunch of things
> in it that could plausible be separated out as separate commits. 0007
> also seems to do more than one kind of thing (see my comment regarding
> moving some of that into 0006). But whacking everything around like a
> crazy man in 0005 and a little more in 0007 and then doing the
> following cleanup in these little tiny steps seems pretty lame.
> Separating 0009 from 0010 is maybe the clearest example of that, but
> IMHO it's pretty unclear why both of these shouldn't be merged with
> 0008.

I found it a *lot* easier to review / evolve them this way. I e.g. had
an earlier version in which the subxid part of the change worked
substantially differently (it tried to elide the overflowed bool, by
defining -1 as the indicator for overflows), and it'd have been way harder
to change that if I didn't have a patch with *just* the subxid changes.

I'd not push them separated by time, but I do think it'd make sense to
push them as separate commits. I think it's easier to review them in
case of a bug in a separate area.


> My comments on the Copy naming apply here as well. I am also starting
> to wonder why exactly we need two copies of all this stuff. Perhaps
> I've just failed to absorb the idea for having read the patch too
> briefly, but I think that we need to make sure that it's super-clear
> why we're doing that. If we just needed it for one field because
> $REASONS, that would be one thing, but if we need it for all of them
> then there must be some underlying principle here that needs a good
> explanation in an easy-to-find and centrally located place.

The main reason is that we want to be able to cheaply check the current
state of the variables (mostly when checking a backend's own state). We
can't access the "dense" ones without holding a lock, but we e.g. don't
want to make ProcArrayEndTransactionInternal() take a lock just to check
if vacuumFlags is set.

It turns out to also be good for performance to have the copy, for
another reason: The "dense" arrays share cachelines with other
backends. That's worth it because it allows GetSnapshotData(),
by far the most frequent operation, to touch fewer cache lines. But it also
means that it's more likely that a backend's "dense" array entry isn't
in a local cpu cache (it'll be pulled out of there when modified in
another backend). In many cases we don't need the shared entry at commit
etc time though, we just need to check if it is set - and most of the
time it won't be.  The local entry allows checking that cheaply.

Basically it makes sense to access the PGPROC variable when checking a
single backend's data, especially when we have to look at the PGPROC for
other reasons already.  It makes sense to look at the "dense" arrays if
we need to look at many / most entries, because we then benefit from the
reduced indirection and better cross-process cacheability.
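
As a sketch of the two access patterns (member and bound names
approximate, not the patch's exact code):

    /* own state, e.g. at commit: read the PGPROC copy, no lock needed */
    bool    is_vacuum = (MyProc->vacuumFlags & PROC_IN_VACUUM) != 0;

    /* many/most entries, e.g. GetSnapshotData(): walk the dense array */
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    for (int i = 0; i < procArray->numProcs; i++)   /* bound name assumed */
    {
        TransactionId xid = ProcGlobal->xids[i];

        /* ... accumulate into the snapshot ... */
    }
    LWLockRelease(ProcArrayLock);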


> 0011 -
> 
> + * Number of top-level transactions that completed in some form since the
> + * start of the server. This currently is solely used to check whether
> + * GetSnapshotData() needs to recompute the contents of the snapshot, or
> + * not. There are likely other users of this.  Always above 1.
> 
> Does it only count XID-bearing transactions? If so, best mention that.

Oh, good point.


> +GetSnapshotDataFillTooOld(Snapshot snapshot)
> 
> Uh... no clue what's going on here. Granted the code had no comments
> in the old place either, so I guess it's not worse, but even the name
> of the new function is pretty incomprehensible.

It fills the old_snapshot_threshold fields of a Snapshot.


> + * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go
> 
> I know what it means to re-enter a building, but I don't know what it
> means to re-enter the snapshot's xmin.

Re-entering it into the procarray, thereby preventing rows that the
snapshot could see from being removed.

> This whole comment seems a bit murky.

How about:
/*
 * If the current xactCompletionCount is still the same as it was at the
 * time the snapshot was built, we can be sure that rebuilding the
 * contents of the snapshot the hard way would result in the same 
snapshot
 * contents:
 *
 * As explained in transam/README, the set of xids considered running by
 * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
 * con

Re: FETCH FIRST clause WITH TIES option

2020-04-07 Thread Alvaro Herrera
Pushed, with some additional changes.

So *of course* when I add tests to verify the ruleutils output, I find a
case that does not work properly -- meaning, you get a view that pg_dump
emits in a way that won't be accepted:

CREATE VIEW limit_thousand_v_3 AS SELECT thousand FROM onek WHERE thousand < 995
ORDER BY thousand FETCH FIRST NULL ROWS WITH TIES;

note the "NULL" there.  ruleutils would gladly print this out as:

View definition:
 SELECT onek.thousand
   FROM onek
  WHERE onek.thousand < 995
  ORDER BY onek.thousand
 FETCH FIRST NULL::integer ROWS WITH TIES;

which is then not accepted.

The best fix I could come up with for this was to reject a bare NULL in the
limit clause.  It's a very stupid fix, because you can still give it a
NULL, using
CREATE VIEW limit_thousand_v_3 AS SELECT thousand FROM onek WHERE thousand < 995
ORDER BY thousand FETCH FIRST (NULL+1) ROWS WITH TIES;
and the like.  But when ruleutils gets this, it will add the parens,
which will magically make it work.

It turns out that the SQL standard is much more limited in what it will
accept there.  But our grammar (what we'll accept for the ancient LIMIT
clause) is very lenient -- it'll take just any expression.  I thought
about reducing that to NumericOnly for FETCH FIRST .. WITH TIES, but
then I have to pick: 1) gram.y fails to compile because of a
reduce/reduce conflict, or 2) also restricting FETCH FIRST .. ONLY to
NumericOnly.  Neither of those seemed very palatable.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: A bug when use get_bit() function for a long bytea string

2020-04-07 Thread Tom Lane
I wrote:
> It would be easier to get this done if you had addressed any of the
> objections to the patch as given.  Integer-overflow handling is still
> missing, and you still are assuming that it's okay to change catalog
> entries in released branches.

Since we are hard upon the feature freeze deadline, I took it on myself
to split this apart.  As far as I can see, the only part we really want
to back-patch is the adjustment of the range-limit comparisons in
byteaGetBit and byteaSetBit to use int64 arithmetic, so they don't
go wacko when the input bytea exceeds 256MB.  The other changes are
not live bugs because in current usage the estimated result size of
an encoding or decoding transform couldn't exceed 4 times 1GB.
Hence it won't overflow size_t even on 32-bit machines, thus the
check in palloc() is sufficient to deal with overlength values.
But it's worth making those changes going forward, I suppose,
in case somebody wants to deal with longer strings someday.
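
For the archives, the essence of the comparison fix is this (a
simplified sketch, not the committed diff verbatim):

    int64       len = VARSIZE_ANY_EXHDR(v);

    /* len is int64, so len * 8 cannot overflow for any valid bytea */
    if (n < 0 || n >= len * 8)
        ereport(ERROR,
                (errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
                 errmsg("index %d out of valid range, 0..%lld",
                        n, (long long) (len * 8 - 1))));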

There were some other minor problems too, but I think I fixed
everything.

regards, tom lane




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 16:13:07 -0400, Robert Haas wrote:
> On Tue, Apr 7, 2020 at 3:24 PM Andres Freund  wrote:
> > > + ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
> > >
> > > Apparently this array is not dense in the sense that it excludes
> > > unused slots, but comments elsewhere don't seem to entirely agree.
> >
> > What do you mean with "unused slots"? Backends that committed?
> 
> Backends that have no XID. You mean, I guess, that it is "dense" in
> the sense that only live backends are in there, not "dense" in the
> sense that only active write transactions are in there.

Correct.

I tried the "only active write transaction" approach, btw, and had a
hard time making it scale well (due to the much more frequent moving of
entries at commit/abort time).  If we were to go to a 'only active
transactions' array at some point we'd imo still need pretty much all
the other changes made here - so I'm not worried about it for now.


> > > + uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + 
> > > max_prepared_xacts;
> > >
> > >   /* ProcGlobal */
> > >   size = add_size(size, sizeof(PROC_HDR));
> > > - /* MyProcs, including autovacuum workers and launcher */
> > > - size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
> > > - /* AuxiliaryProcs */
> > > - size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
> > > - /* Prepared xacts */
> > > - size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
> > > - /* ProcStructLock */
> > > + size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
> > >
> > > This seems like a bad idea. If we establish a precedent that it's OK
> > > to have sizing routines that don't use add_size() and mul_size(),
> > > people are going to cargo cult that into places where there is more
> > > risk of overflow than there is here.
> >
> > Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
> > would overflow due to too big MaxBackends or max_prepared_xacts? The
> > multiplication itself is still protected by add_size(). It didn't seem
> > correct to use add_size for the TotalProcs addition, since that's not
> > really a size. And since the limit for procs is much lower than
> > UINT32_MAX...
> 
> I'm concerned that there are 0 uses of add_size in any shared-memory
> sizing function, and I think it's best to keep it that way.

I can't make sense of that sentence?


We already have code like this, and have for a long time:
/* Size of the ProcArray structure itself */
#define PROCARRAY_MAXPROCS  (MaxBackends + max_prepared_xacts)

adding NUM_AUXILIARY_PROCS doesn't really change that, does it?


> If you initialize TotalProcs = add_size(MaxBackends,
> add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts)) then I'm happy.

Will do.


Greetings,

Andres Freund




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-04-07 Thread Justin Pryzby
On Tue, Apr 07, 2020 at 03:40:18PM +0300, Alexey Kondratov wrote:
> On 2020-04-06 21:44, Justin Pryzby wrote:
> > On Mon, Apr 06, 2020 at 08:43:46PM +0300, Alexey Kondratov wrote:
> > > 
> > > +/* XXX: reusing reindex_option_list */
> > > + | CLUSTER opt_verbose '(' reindex_option_list ')' 
> > > qualified_name
> > > cluster_index_specification
> > > 
> > > Could we actually simply reuse vac_analyze_option_list? From the first
> > > sight it does just the right thing, excepting the special handling of
> > > spelling ANALYZE/ANALYSE, but it does not seem to be a problem.
> > 
> > Hm, do you mean to let cluster.c reject the other options like "analyze" ?
> > I'm not sure why that would be better than reusing reindex?  I think the
> > suggestion will probably be to just copy+paste the reindex option list and
> > rename it to cluster (possibly with the explanation that they're separate
> > and independent and so their behavior shouldn't be tied together).
> 
> I mean to literally use vac_analyze_option_list for reindex and cluster as
> well. Please, check attached 0007. Now, vacuum, reindex and cluster filter
> options list and reject everything that is not supported, so it seems
> completely fine to just reuse vac_analyze_option_list, doesn't it?

It's fine with me :)

Possibly we could rename vac_analyze_option_list as generic_option_list.

I'm resending the patchset like that, and fixed cluster/index to handle not
just "VERBOSE" but "verbose OFF", rather than just ignoring the argument.

That's the last known issue with the patch.  I doubt anyone will elect to pick
it up in the next 8 hours, but I think it's in very good shape for v14 :)

BTW, if you resend a *.patch or *.diff file to a thread, it's best to also
include all the previous patches.  Otherwise the CF bot is likely to complain
that the patch "doesn't apply", or else it'll only test the one patch instead
of the whole series.
http://cfbot.cputube.org/alexey-kondratov.html

-- 
Justin
From 0846a245af84adfd0114251be062d55538f7a802 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 2 Apr 2020 14:20:11 -0500
Subject: [PATCH v19 1/5] tab completion for reindex(verbose)..

..which was added at ecd222e77 for v9.5.

This handles "verbose" itself as well as the following word.

Separate commit as this could be backpatched to v12 (but backpatching further
is less trivial, due to improvements added at 121213d9d).
---
 src/bin/psql/tab-complete.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 0e7a373caf..e49463f11b 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3414,7 +3414,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("DATA");
 
 /* REINDEX */
-	else if (Matches("REINDEX"))
+	else if (Matches("REINDEX") || Matches("REINDEX", "(*)"))
 		COMPLETE_WITH("TABLE", "INDEX", "SYSTEM", "SCHEMA", "DATABASE");
 	else if (Matches("REINDEX", "TABLE"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexables,
@@ -3436,6 +3436,17 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH_QUERY(Query_for_list_of_schemas);
 	else if (Matches("REINDEX", "SYSTEM|DATABASE", "CONCURRENTLY"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_databases);
+	else if (HeadMatches("REINDEX", "(*") &&
+			 !HeadMatches("REINDEX", "(*)"))
+	{
+		/*
+		 * This fires if we're in an unfinished parenthesized option list.
+		 * get_previous_words treats a completed parenthesized option list as
+		 * one word, so the above test is correct.
+		 */
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("VERBOSE");
+	}
 
 /* SECURITY LABEL */
 	else if (Matches("SECURITY"))
-- 
2.17.0

From 59a5f725a216be03c05e54c61cd587bc222c3de4 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 27 Mar 2020 17:50:46 -0500
Subject: [PATCH v19 2/5] Change REINDEX/CLUSTER to accept an option list..

..like reindex (CONCURRENTLY)

Change docs in the style of VACUUM.  See also: 52dcfda48778d16683c64ca4372299a099a15b96
---
 doc/src/sgml/ref/cluster.sgml| 19 +-
 doc/src/sgml/ref/reindex.sgml| 43 +---
 src/backend/commands/cluster.c   | 23 -
 src/backend/commands/indexcmds.c | 41 --
 src/backend/nodes/copyfuncs.c|  2 ++
 src/backend/nodes/equalfuncs.c   |  2 ++
 src/backend/parser/gram.y| 35 +++---
 src/backend/tcop/utility.c   | 37 ---
 src/bin/psql/tab-complete.c  | 23 +
 src/include/commands/cluster.h   |  3 ++-
 src/include/commands/defrem.h|  7 +++---
 src/include/nodes/parsenodes.h   |  2 ++
 12 files changed, 169 insertions(+), 68 deletions(-)

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 4da60d8d56..9980c522f2 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sg

implicit declaration of datumIsEqual in parse_coerce.c

2020-04-07 Thread Tomas Vondra

Hi,

Not sure what changed, but I'm seeing this failure:

parse_coerce.c: In function ‘coerce_type’:
parse_coerce.c:345:9: warning: implicit declaration of function ‘datumIsEqual’ 
[-Wimplicit-function-declaration]
  345 |if (!datumIsEqual(newcon->constvalue, val2, false, newcon->constlen))
  | ^~~~

Not sure if this because of compiler version (I'm on gcc 9.2.1) or
something else - I don't see any obvious changes to relevant parts of
the code, but I haven't dug too much.

Simply including 'utils/datum.h' resolves the issue.
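
That is:

    #include "utils/datum.h"        /* for datumIsEqual() */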


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-07 Thread Alexander Korotkov
On Tue, Apr 7, 2020 at 10:58 PM Anna Akenteva  wrote:
> Thank you for your review!
> Ivan and I have worked on the patch and tried to address your comments:

I've pushed this.  I promise to do careful post-commit review and
resolve any issues arising.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: implicit declaration of datumIsEqual in parse_coerce.c

2020-04-07 Thread Tom Lane
Tomas Vondra  writes:
> Not sure what changed, but I'm seeing this failure:

> parse_coerce.c: In function ‘coerce_type’:
> parse_coerce.c:345:9: warning: implicit declaration of function 
> ‘datumIsEqual’ [-Wimplicit-function-declaration]
>345 |if (!datumIsEqual(newcon->constvalue, val2, false, 
> newcon->constlen))
>| ^~~~

That's inside "#ifdef RANDOMIZE_ALLOCATED_MEMORY", which probably
explains why most of us aren't seeing it.  My guess is somebody
removed an #include without realizing that this chunk of code
needed it.

regards, tom lane




Re: proposal \gcsv

2020-04-07 Thread Tom Lane
Pavel Stehule  writes:
> út 7. 4. 2020 v 19:27 odesílatel Tom Lane  napsal:
>> Considering this is the last day before v13 feature freeze, should
>> I push this, or sit on it till v14?  I feel reasonably good that we
>> have a nice feature definition here, but it's awfully late in the
>> cycle to be designing features.

> I am for pushing to v13. This feature should not to break any, and there is
> lot of time to finish details.

Hearing no objections, pushed.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-04-07 Thread Justin Pryzby
On Tue, Apr 07, 2020 at 11:20:46AM -0700, Jeff Davis wrote:
> The enable_hashagg_disk GUC, if set to true, chooses HashAgg based on
> costing. If false, it only generates a HashAgg path if it thinks it will fit
> in work_mem, similar to the old behavior (though it will now spill to disk if
> the planner was wrong about it fitting in work_mem).  The current default is
> true.

Are there any other GUCs that behave like that?  It's confusing to me when I
see "Disk Usage: ... kB", despite setting it to "disable", and without the
usual disable_cost.  I realize that postgres chose the plan on the hypothesis
that it would *not* exceed work_mem, and that spilling to disk is considered
preferable to ignoring the setting, and that "going back" to planning phase
isn't a possibility.

template1=# explain (analyze, costs off, summary off) SELECT a, COUNT(1) FROM 
generate_series(1,999999) a GROUP BY 1 ;
 HashAggregate (actual time=1370.945..2877.250 rows=999999 loops=1)
   Group Key: a
   Peak Memory Usage: 5017 kB
   Disk Usage: 22992 kB
   HashAgg Batches: 84
   ->  Function Scan on generate_series a (actual time=314.507..741.517 
rows=999999 loops=1)

A previous version of the docs said this, which I thought was confusing, and 
you removed it.
But I guess this is the behavior it was trying to .. explain.

+  enable_hashagg_disk (boolean)
+... This only affects the planner choice;
+execution time may still require using disk-based hash
+aggregation. The default is on.

I suggest that should be reworded and then re-introduced, unless there's some
further behavior change allowing the previous behavior of
might-exceed-work-mem.

"This setting determines whether the planner will elect to use a hash plan
which it expects will exceed work_mem and spill to disk.  During execution,
hash nodes which exceed work_mem will spill to disk even if this setting is
disabled.  To avoid spilling to disk, either increase work_mem (or set
enable_hashagg=off)."

For sure the release notes should recommend re-calibrating work_mem.

-- 
Justin




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-04-07 Thread Alvaro Herrera
On 2020-Apr-07, Kyotaro Horiguchi wrote:

> > Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot?
> 
> I did that in the attached. The invalidated flag is a shared-but-not-saved
> member of a slot, initialized to false and then irreversibly changed to
> true when the slot loses a required segment.
> 
> It is checked by the new function CheckReplicationSlotInvalidated() at
> slot acquisition and when updating a slot from a standby reply message.
> This change stops the walsender without explicitly killing it, but I
> didn't remove that code.

This change didn't work well with my proposed change to make
checkpointer acquire slots before marking them invalid.  When I
incorporated your patch in the last version I posted yesterday, there
was a problem that when checkpointer attempted to acquire the slot, it
would fail with "the slot is invalidated"; also if you try to drop the
slot, it would obviously fail.  I think it would work to remove the
SlotIsInvalidated check from the Acquire routine, and instead move it to
the routines that need it (ie. not the InvalidateObsolete one, and also
not the routine to drop slots).
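
Something like this is what I mean (a sketch; the check is named after
your patch's flag, the rest is illustrative rather than actual code):

    /* let Acquire succeed even for an invalidated slot... */
    ReplicationSlotAcquire(slotname, false);

    /* ...and only the callers that can't cope check for themselves */
    if (SlotIsInvalidated(MyReplicationSlot))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("replication slot \"%s\" has been invalidated",
                        slotname)));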

I pushed version 26, with a few further adjustments.

I think what we have now is sufficient, but if you want to attempt this
"invalidated" flag on top of what I pushed, be my guest.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Internal key management system

2020-04-07 Thread Cary Huang
Hello



Thanks a lot for the patch. In terms of functionality, the patch provides 
very straightforward key management functionality. In terms of 
documentation, I think the patch is still lacking some pieces of 
information that prevent people from fully understanding how KMS works, 
how it can be used, and why (at least that is the impression I got from the 
zoom meeting recordings :p). I spent some time today revisiting the 
key-management documentation in the patch and rephrased and restructured it 
based on my current understanding of the latest KMS design. I mentioned all 
3 application level keys that we have agreed on, and emphasized explaining 
the SQL level encryption key because that is the key that can be used right 
now. For the block and WAL level keys, we can add more information here once 
they are actually used in the TDE development.



Please see below the KMS documentation that I have revised; I hope it will 
be clearer and make it easier for people to understand KMS. Feel free to 
make adjustments. Please note that we used the terms "wrap" and "unwrap" a 
lot in our past discussions. Originally we used the terms within a context 
involving key encryption keys (KEK). For example, "KMS wraps a master key 
with KEK". Later, we used the same terms in a context involving encrypting 
a user secret / password. For example, "KMS wraps a user secret with the 
SQL key". In my opinion, both make sense, but it may be confusing to have 
the same term used differently. So in my revision below, the terms "wrap" 
and "unwrap" refer to encrypting or decrypting a user secret / password, as 
they are used in pg_wrap() and pg_unwrap(). I use the terms "encapsulate" 
and "restore" when a KEK is used to encrypt or decrypt a key.







Chapter 32: Encryption Key Management 

--


PostgreSQL supports an internal Encryption Key Management System, which is 
designed to manage the life cycles of cryptographic keys within the 
PostgreSQL system. This includes dealing with their generation, storage, 
usage and rotation.



Encryption Key Management is enabled when PostgreSQL is built with 
--with-openssl and a cluster passphrase command is specified during initdb. 
The cluster passphrase provided by the --cluster-passphrase-command option 
during initdb and the one generated by cluster_passphrase_command in 
postgresql.conf must match; otherwise, the database cluster will not start 
up.



32.1 Key Generations and Derivations

--



When the cluster_passphrase_command option is specified to initdb, the 
process will derive the cluster passphrase into a Key Encryption Key (KEK) 
and an HMAC Key using a key derivation protocol before the actual 
generation of the application level cryptographic keys.



-Key Encryption Key (KEK)

KEK is primarily used to encapsulate or restore a given application level 
cryptographic key



-HMAC Key

HMAC key is used to compute the HASH of a given application level cryptographic 
key for integrity check purposes



These 2 keys are not stored physically within the PostgreSQL cluster as they 
are designed to be derived from the correctly configured cluster passphrase.



The Encryption Key Management System currently manages 3 application level 
cryptographic keys that have different purposes and usages within the 
PostgreSQL system. These are generated using pg_strong_random() after the 
KEK and HMAC key derivation during the initdb process.



The 3 keys are:



-SQL Level Key

SQL Level Key is used to wrap and unwrap a user secret / passphrase via 
pg_wrap() and pg_unwrap() SQL functions. These 2 functions are designed to be 
used in conjunction with the cryptographic functions provided by pgcrypto 
extension to perform column level encryption/decryption without having to 
supply a clear text user secret or passphrase that is required by many pgcrypto 
functions as input. Please refer to [Wrap and Unwrap User Secret section] for 
usage examples.



-Block Level Key

Block Level Key is primarily used to encrypt / decrypt buffers as part of the 
Transparent Data Encryption (TDE) feature



-WAL Level Key

WAL Level Key is primarily used to encrypt / decrypt WAL files as part of the 
Transparent Data Encryption (TDE) feature



The 3 application level keys above will be encapsulated and hashed using the 
KEK and HMAC key mentioned above before they are physically stored to the 
pg_cryptokeys directory within the cluster.



32.2. Key Initialization

-



When a PostgreSQL cluster with encryption key management enabled is started, 
the cluster_passphrase_command parameter in postgresql.conf will be evaluated 
and the cluster passphrase will be derived into the KEK and HMAC Key in the 
same way as during initdb.



After that, the 3 encapsulated application level cryptographic keys will be 
retrieved from pg_cryptokeys directory to be restored and integrity-checked by 
the key management system using 

Re: WAL usage calculation patch

2020-04-07 Thread Justin Pryzby
On Tue, Apr 07, 2020 at 12:00:29PM +0200, Peter Eisentraut wrote:
> We also have existing cases for the other way:
> 
> actual time=0.050..0.052
> Buffers: shared hit=3 dirtied=1
> 
> The cases mentioned by Justin are not formatted in a key=value format, so
> it's not quite the same, but it also raises the question why they are not.
> 
> Let's figure out a way to consolidate this without making up a third format.

So this re-raises my suggestion here to use colons, Title Case Field Names, and
"Size: ..kB" rather than "bytes=":
|https://www.postgresql.org/message-id/20200403054451.GN14618%40telsasoft.com

As I see it, the sort/hashjoin style is being used for cases with fields with
different units:

   Sort Method: quicksort  Memory: 931kB
   Buckets: 1024  Batches: 1  Memory Usage: 16kB

..which is distinguished from the case where the units are the same, like
buffers (hit=Npages read=Npages dirtied=Npages written=Npages).

Note, as of 1f39bce021, we have hashagg_disk, which looks like this:

template1=# explain analyze SELECT a, COUNT(1) FROM generate_series(1,999999) a 
GROUP BY 1 ORDER BY 1;
...
   ->  HashAggregate  (cost=1499.99..1501.99 rows=200 width=12) (actual 
time=166.883..280.943 rows=999999 loops=1)
 Group Key: a
 Peak Memory Usage: 4913 kB
 Disk Usage: 1848 kB
 HashAgg Batches: 8

Incremental sort adds yet another variation, which I've mentioned in that thread.
I'm hoping to come to some resolution here, first.
https://www.postgresql.org/message-id/20200407042521.GH2228%40telsasoft.com

-- 
Justin



